Following the consolidation and cleanup of CPU capacity in [1], this series
reworks how the scheduler gets the pressures on CPUs. We need to take into
account all pressures applied by cpufreq on the compute capacity of a CPU
for dozens of ms or more, and not only the cpufreq cooling device or HW
mitigations. We split the pressure applied on the CPU's capacity into two parts:
- one from cpufreq and freq_qos
- one from HW high-frequency mitigation.
The next step will be to add a dedicated interface for long-standing
capping of the CPU capacity (i.e. for seconds or more), like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this series, but as a temporary pressure, which is not always
the best choice when we know that it will last for seconds or more.
[1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
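To give a rough idea of the cpufreq part (patch 1), the update boils down to
something like the sketch below. This is only a simplified illustration of the
intent, not the exact patch; arch_scale_freq_ref() is the reference-frequency
helper introduced in [1], and the read-side helper name is illustrative:

static DEFINE_PER_CPU(unsigned long, cpufreq_pressure);

/*
 * Convert the current frequency cap of a policy (cpufreq and freq_qos)
 * into a loss of compute capacity and publish it per CPU for the
 * scheduler.
 */
static void cpufreq_update_pressure(struct cpufreq_policy *policy)
{
	unsigned long max_capacity, pressure;
	u32 max_freq;
	int cpu;

	cpu = cpumask_first(policy->related_cpus);
	max_capacity = arch_scale_cpu_capacity(cpu);
	max_freq = arch_scale_freq_ref(cpu);

	/* No pressure when the policy is not capped below its reference max */
	if (max_freq <= policy->max)
		pressure = 0;
	else
		pressure = max_capacity -
			   mult_frac(max_capacity, policy->max, max_freq);

	for_each_cpu(cpu, policy->related_cpus)
		WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
}

/* read side used by the scheduler (name illustrative) */
unsigned long cpufreq_get_pressure(int cpu)
{
	return READ_ONCE(per_cpu(cpufreq_pressure, cpu));
}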
Change since v2:
- Rework cpufreq_update_pressure()

Change since v1:
- Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
- Fix typos and comments
- Mark sched_thermal_decay_shift boot param as deprecated
Vincent Guittot (5):
cpufreq: Add a cpufreq pressure feedback for the scheduler
sched: Take cpufreq feedback into account
thermal/cpufreq: Remove arch_update_thermal_pressure()
sched: Rename arch_update_thermal_pressure into
arch_update_hw_pressure
sched/pelt: Remove shift of thermal clock
.../admin-guide/kernel-parameters.txt | 1 +
arch/arm/include/asm/topology.h | 6 +-
arch/arm64/include/asm/topology.h | 6 +-
drivers/base/arch_topology.c | 26 ++++----
drivers/cpufreq/cpufreq.c | 36 +++++++++++
drivers/cpufreq/qcom-cpufreq-hw.c | 4 +-
drivers/thermal/cpufreq_cooling.c | 3 -
include/linux/arch_topology.h | 8 +--
include/linux/cpufreq.h | 10 +++
include/linux/sched/topology.h | 8 +--
.../{thermal_pressure.h => hw_pressure.h} | 14 ++---
include/trace/events/sched.h | 2 +-
init/Kconfig | 12 ++--
kernel/sched/core.c | 8 +--
kernel/sched/fair.c | 63 +++++++++----------
kernel/sched/pelt.c | 18 +++---
kernel/sched/pelt.h | 16 ++---
kernel/sched/sched.h | 22 +------
18 files changed, 144 insertions(+), 119 deletions(-)
rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
--
2.34.1
On 08/01/2024 14:48, Vincent Guittot wrote:
> Following the consolidation and cleanup of CPU capacity in [1], this series
> reworks how the scheduler gets the pressures on CPUs. We need to take into
> account all pressures applied by cpufreq on the compute capacity of a CPU
> for dozens of ms or more, and not only the cpufreq cooling device or HW
> mitigations. We split the pressure applied on the CPU's capacity into two parts:
> - one from cpufreq and freq_qos
> - one from HW high-frequency mitigation.
>
> The next step will be to add a dedicated interface for long-standing
> capping of the CPU capacity (i.e. for seconds or more), like the
> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> account by this series, but as a temporary pressure, which is not always
> the best choice when we know that it will last for seconds or more.
I guess this is related to the 'user space system pressure' (*) slide of
your OSPM '23 talk.
Where do you draw the line when it comes to time between (*) and the
'medium pace system pressure' (e.g. thermal and FREQ_QOS).
IIRC, with (*) you want to rebuild the sched domains etc.
>
> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>
> Change since v2:
> - Rework cpufreq_update_pressure()
>
> Change since v1:
> - Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
> - Fix typos and comments
> - Mark sched_thermal_decay_shift boot param as deprecated
>
> Vincent Guittot (5):
> cpufreq: Add a cpufreq pressure feedback for the scheduler
> sched: Take cpufreq feedback into account
> thermal/cpufreq: Remove arch_update_thermal_pressure()
> sched: Rename arch_update_thermal_pressure into
> arch_update_hw_pressure
> sched/pelt: Remove shift of thermal clock
>
> .../admin-guide/kernel-parameters.txt | 1 +
> arch/arm/include/asm/topology.h | 6 +-
> arch/arm64/include/asm/topology.h | 6 +-
> drivers/base/arch_topology.c | 26 ++++----
> drivers/cpufreq/cpufreq.c | 36 +++++++++++
> drivers/cpufreq/qcom-cpufreq-hw.c | 4 +-
> drivers/thermal/cpufreq_cooling.c | 3 -
> include/linux/arch_topology.h | 8 +--
> include/linux/cpufreq.h | 10 +++
> include/linux/sched/topology.h | 8 +--
> .../{thermal_pressure.h => hw_pressure.h} | 14 ++---
> include/trace/events/sched.h | 2 +-
> init/Kconfig | 12 ++--
> kernel/sched/core.c | 8 +--
> kernel/sched/fair.c | 63 +++++++++----------
> kernel/sched/pelt.c | 18 +++---
> kernel/sched/pelt.h | 16 ++---
> kernel/sched/sched.h | 22 +------
> 18 files changed, 144 insertions(+), 119 deletions(-)
> rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/01/2024 14:48, Vincent Guittot wrote:
> > Following the consolidation and cleanup of CPU capacity in [1], this series
> > reworks how the scheduler gets the pressures on CPUs. We need to take into
> > account all pressures applied by cpufreq on the compute capacity of a CPU
> > for dozens of ms or more, and not only the cpufreq cooling device or HW
> > mitigations. We split the pressure applied on the CPU's capacity into two parts:
> > - one from cpufreq and freq_qos
> > - one from HW high-frequency mitigation.
> >
> > The next step will be to add a dedicated interface for long-standing
> > capping of the CPU capacity (i.e. for seconds or more), like the
> > scaling_max_freq of cpufreq sysfs. The latter is already taken into
> > account by this series, but as a temporary pressure, which is not always
> > the best choice when we know that it will last for seconds or more.
>
> I guess this is related to the 'user space system pressure' (*) slide of
> your OSPM '23 talk.
yes
>
> Where do you draw the line when it comes to time between (*) and the
> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
My goal is to consider the /sys/../scaling_max_freq as the 'user space
system pressure'
>
> IIRC, with (*) you want to rebuild the sched domains etc.
The easiest way would be to rebuild the sched_domains, but the cost is
not small, so I would prefer to skip the rebuild and add a new signal
that keeps track of this capped capacity.
>
> >
> > [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
> >
> > Change since v2:
> > - Rework cpufreq_update_pressure()
> >
> > Change since v1:
> > - Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
> > - Fix typos and comments
> > - Mark sched_thermal_decay_shift boot param as deprecated
> >
> > Vincent Guittot (5):
> > cpufreq: Add a cpufreq pressure feedback for the scheduler
> > sched: Take cpufreq feedback into account
> > thermal/cpufreq: Remove arch_update_thermal_pressure()
> > sched: Rename arch_update_thermal_pressure into
> > arch_update_hw_pressure
> > sched/pelt: Remove shift of thermal clock
> >
> > .../admin-guide/kernel-parameters.txt | 1 +
> > arch/arm/include/asm/topology.h | 6 +-
> > arch/arm64/include/asm/topology.h | 6 +-
> > drivers/base/arch_topology.c | 26 ++++----
> > drivers/cpufreq/cpufreq.c | 36 +++++++++++
> > drivers/cpufreq/qcom-cpufreq-hw.c | 4 +-
> > drivers/thermal/cpufreq_cooling.c | 3 -
> > include/linux/arch_topology.h | 8 +--
> > include/linux/cpufreq.h | 10 +++
> > include/linux/sched/topology.h | 8 +--
> > .../{thermal_pressure.h => hw_pressure.h} | 14 ++---
> > include/trace/events/sched.h | 2 +-
> > init/Kconfig | 12 ++--
> > kernel/sched/core.c | 8 +--
> > kernel/sched/fair.c | 63 +++++++++----------
> > kernel/sched/pelt.c | 18 +++---
> > kernel/sched/pelt.h | 16 ++---
> > kernel/sched/sched.h | 22 +------
> > 18 files changed, 144 insertions(+), 119 deletions(-)
> > rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>
On 09/01/2024 14:29, Vincent Guittot wrote:
> On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/01/2024 14:48, Vincent Guittot wrote:
>>> Following the consolidation and cleanup of CPU capacity in [1], this series
>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>> for dozens of ms or more, and not only the cpufreq cooling device or HW
>>> mitigations. We split the pressure applied on the CPU's capacity into two parts:
>>> - one from cpufreq and freq_qos
>>> - one from HW high-frequency mitigation.
>>>
>>> The next step will be to add a dedicated interface for long-standing
>>> capping of the CPU capacity (i.e. for seconds or more), like the
>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>> account by this series, but as a temporary pressure, which is not always
>>> the best choice when we know that it will last for seconds or more.
>>
>> I guess this is related to the 'user space system pressure' (*) slide of
>> your OSPM '23 talk.
>
> yes
>
>>
>> Where do you draw the line when it comes to time between (*) and the
>> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
>
> My goal is to consider the /sys/../scaling_max_freq as the 'user space
> system pressure'
>
>>
>> IIRC, with (*) you want to rebuild the sched domains etc.
>
> The easiest way would be to rebuild the sched_domains, but the cost is
> not small, so I would prefer to skip the rebuild and add a new signal
> that keeps track of this capped capacity.
Are you saying that you don't need to rebuild sched domains since
cpu_capacity information of the sched domain hierarchy is
independently updated via:
update_sd_lb_stats() {
update_group_capacity() {
if (!child)
update_cpu_capacity(sd, cpu) {
capacity = scale_rt_capacity(cpu) {
max = get_actual_cpu_capacity(cpu) <- (*)
}
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
sdg->sgc->max_capacity = capacity;
}
}
}
(*) influence of temporary and permanent (to be added) frequency
pressure on cpu_capacity (per-cpu and in sd data)
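IIUC, with this series the capacity used above ends up being computed roughly
like this (paraphrased from my reading of the patches, so helper names might
not match exactly):

static unsigned long get_actual_cpu_capacity(int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(cpu);

	/*
	 * Subtract the larger of the two pressure sources: the
	 * PELT-averaged HW high-freq mitigation signal and the
	 * cpufreq/freq_qos capping.
	 */
	capacity -= max(hw_load_avg(cpu_rq(cpu)), cpufreq_get_pressure(cpu));

	return capacity;
}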
example: hackbench on h960 with IPA:
cap min max
...
hackbench-2284 [007] .Ns.. 2170.796726: update_group_capacity: sdg !child cpu=7 1017 1017 1017
hackbench-2456 [007] ..s.. 2170.920729: update_group_capacity: sdg !child cpu=7 1018 1018 1018
<...>-2314 [007] ..s1. 2171.044724: update_group_capacity: sdg !child cpu=7 1011 1011 1011
hackbench-2541 [007] ..s.. 2171.168734: update_group_capacity: sdg !child cpu=7 918 918 918
hackbench-2558 [007] .Ns.. 2171.228716: update_group_capacity: sdg !child cpu=7 912 912 912
<...>-2321 [007] ..s.. 2171.352718: update_group_capacity: sdg !child cpu=7 812 812 812
hackbench-2553 [007] ..s.. 2171.476721: update_group_capacity: sdg !child cpu=7 640 640 640
<...>-2446 [007] ..s2. 2171.600743: update_group_capacity: sdg !child cpu=7 610 610 610
hackbench-2347 [007] ..s.. 2171.724738: update_group_capacity: sdg !child cpu=7 406 406 406
hackbench-2331 [007] .Ns1. 2171.848768: update_group_capacity: sdg !child cpu=7 390 390 390
hackbench-2421 [007] ..s.. 2171.972733: update_group_capacity: sdg !child cpu=7 388 388 388
...
On Wed, 10 Jan 2024 at 19:10, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 09/01/2024 14:29, Vincent Guittot wrote:
> > On Tue, 9 Jan 2024 at 12:34, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/01/2024 14:48, Vincent Guittot wrote:
> >>> Following the consolidation and cleanup of CPU capacity in [1], this series
> >>> reworks how the scheduler gets the pressures on CPUs. We need to take into
> >>> account all pressures applied by cpufreq on the compute capacity of a CPU
> >>> for dozens of ms or more, and not only the cpufreq cooling device or HW
> >>> mitigations. We split the pressure applied on the CPU's capacity into two parts:
> >>> - one from cpufreq and freq_qos
> >>> - one from HW high-frequency mitigation.
> >>>
> >>> The next step will be to add a dedicated interface for long-standing
> >>> capping of the CPU capacity (i.e. for seconds or more), like the
> >>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> >>> account by this series, but as a temporary pressure, which is not always
> >>> the best choice when we know that it will last for seconds or more.
> >>
> >> I guess this is related to the 'user space system pressure' (*) slide of
> >> your OSPM '23 talk.
> >
> > yes
> >
> >>
> >> Where do you draw the line when it comes to time between (*) and the
> >> 'medium pace system pressure' (e.g. thermal and FREQ_QOS).
> >
> > My goal is to consider the /sys/../scaling_max_freq as the 'user space
> > system pressure'
> >
> >>
> >> IIRC, with (*) you want to rebuild the sched domains etc.
> >
> > The easiest way would be to rebuild the sched_domains, but the cost is
> > not small, so I would prefer to skip the rebuild and add a new signal
> > that keeps track of this capped capacity.
>
> Are you saying that you don't need to rebuild sched domains since
> cpu_capacity information of the sched domain hierarchy is
> independently updated via:
>
> update_sd_lb_stats() {
>
> update_group_capacity() {
>
> if (!child)
> update_cpu_capacity(sd, cpu) {
>
> capacity = scale_rt_capacity(cpu) {
>
> max = get_actual_cpu_capacity(cpu) <- (*)
> }
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
> sdg->sgc->max_capacity = capacity;
> }
>
> }
>
> }
>
> (*) influence of temporary and permanent (to be added) frequency
> pressure on cpu_capacity (per-cpu and in sd data)
I'm more concerned about rd->max_cpu_capacity, which remains at the
original capacity and triggers spurious load balancing if we take into
account the userspace max freq instead of the original max compute
capacity of a CPU. And also about how to manage this in RT and DL.
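To illustrate what I mean (hypothetical sketch, not actual kernel code),
checks of this shape would keep firing once a long-standing userspace cap
makes the capped capacity permanently smaller than rd->max_cpu_capacity:

	/*
	 * Hypothetical illustration only: rd->max_cpu_capacity still
	 * holds the original, uncapped capacity, so with a long-standing
	 * userspace cap this condition can stay true forever and trigger
	 * pointless balancing towards a "bigger" CPU that effectively no
	 * longer exists.
	 */
	if (rq->misfit_task_load &&
	    get_actual_cpu_capacity(cpu_of(rq)) < rq->rd->max_cpu_capacity)
		kick_load_balance();	/* hypothetical helper */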
>
>
> example: hackbench on h960 with IPA:
> cap min max
> ...
> hackbench-2284 [007] .Ns.. 2170.796726: update_group_capacity: sdg !child cpu=7 1017 1017 1017
> hackbench-2456 [007] ..s.. 2170.920729: update_group_capacity: sdg !child cpu=7 1018 1018 1018
> <...>-2314 [007] ..s1. 2171.044724: update_group_capacity: sdg !child cpu=7 1011 1011 1011
> hackbench-2541 [007] ..s.. 2171.168734: update_group_capacity: sdg !child cpu=7 918 918 918
> hackbench-2558 [007] .Ns.. 2171.228716: update_group_capacity: sdg !child cpu=7 912 912 912
> <...>-2321 [007] ..s.. 2171.352718: update_group_capacity: sdg !child cpu=7 812 812 812
> hackbench-2553 [007] ..s.. 2171.476721: update_group_capacity: sdg !child cpu=7 640 640 640
> <...>-2446 [007] ..s2. 2171.600743: update_group_capacity: sdg !child cpu=7 610 610 610
> hackbench-2347 [007] ..s.. 2171.724738: update_group_capacity: sdg !child cpu=7 406 406 406
> hackbench-2331 [007] .Ns1. 2171.848768: update_group_capacity: sdg !child cpu=7 390 390 390
> hackbench-2421 [007] ..s.. 2171.972733: update_group_capacity: sdg !child cpu=7 388 388 388
> ...