timers/migration: Handle heterogenous CPU capacities

[PATCH 0/6] timers/migration: Handle heterogenous CPU capacities

Posted by Frederic Weisbecker 1 month, 3 weeks ago

Hi,

This is a late follow-up after:

	https://lore.kernel.org/lkml/20250910074251.8148-1-sehee1.jeong@samsung.com/

To summarize, CPUs with heterogeneous capacities migrate their timers
indiscriminately between big and little CPUs. In practice the timers often
end up migrated to big CPUs, increasing their idle target residency.

Thomas proposed isolating the hierarchies of big and little CPUs, so here
is an attempt. Note that I haven't tested this on real heterogeneous
hardware, so if you have some, please test it!

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	timers/core

HEAD: f0a87af6dab6f3a6dd8a603a2b9d7dcc86fd50e4

Thanks,
	Frederic
---

Frederic Weisbecker (6):
      timers/migration: Fix another hotplug activation race
      timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness
      timers/migration: Track CPUs in a hierarchy
      timers/migration: Split per-capacity hierarchies
      timers/migration: Handle capacity in connect tracepoints
      scripts/timers: Add timer_migration_tree.py

 include/trace/events/timer_migration.h |  24 ++--
 kernel/time/timer_migration.c          | 246 ++++++++++++++++++++++++---------
 kernel/time/timer_migration.h          |  19 +++
 scripts/timer_migration_tree.py        | 122 ++++++++++++++++
 4 files changed, 337 insertions(+), 74 deletions(-)

Re: [PATCH 0/6] timers/migration: Handle heterogenous CPU capacities

Posted by Christian Loehle 1 week, 6 days ago

On 4/23/26 17:53, Frederic Weisbecker wrote:
> [...]

Hi Frederic,
sorry for the late reaction to this, I completely missed it (CCing
linux-pm would have helped :) ).

I'm not convinced that unconditionally splitting the timer migration
hierarchy per-capacity is always the right tradeoff from a power point of
view. On some asymmetric systems we only have one or two CPUs in a given
capacity class. In that case the split can effectively remove most of the
useful timer migration opportunity for that class, even though allowing
migration across nearby capacities may still be better for idle residency.

I tested this on an Orion O6 system with the following topology:

online CPUs: 0-11

capacity 279:  CPUs 2,3,4,5
capacity 866:  CPUs 8,9
capacity 905:  CPUs 6,7
capacity 984:  CPUs 10,11
capacity 1024: CPUs 0,1

I compared the series up to and including the preparatory/refactoring
patch 3 against the full series including the per-capacity hierarchy split.
The numbers below are aggregate cpuidle residency deltas over a 600s run.

Idle workload:

variant    LPI-0     LPI-1     LPI-2     LPI-1+2
base       2298.7s   1253.8s   2817.0s   4070.8s
full       2298.8s   1306.1s   2758.7s   4064.7s
delta      +0.1s     +52.3s    -58.3s    -6.1s

Grouped by capacity class, the LPI-2 loss is mostly on the lower-capacity
CPUs:

group        base LPI-2   full LPI-2   delta full
279          1073.5s      1031.9s      -41.6s
866          502.5s       486.4s       -16.1s
905          499.7s       490.4s       -9.3s
984          488.8s       496.0s       +7.2s
1024         252.5s       254.0s       +1.5s
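
As a sanity check on the table above, the per-class LPI-2 deltas can be
summed and compared with the aggregate LPI-2 delta of the idle run. A minimal
Python sketch, with the numbers copied from the table (the variable names are
only illustrative):

```python
# LPI-2 residency in seconds per capacity class, copied from the table above.
# capacity -> (base, full)
lpi2_by_capacity = {
    279: (1073.5, 1031.9),
    866: (502.5, 486.4),
    905: (499.7, 490.4),
    984: (488.8, 496.0),
    1024: (252.5, 254.0),
}

# Per-class delta (full minus base), rounded to the table's precision.
deltas = {cap: round(full - base, 1)
          for cap, (base, full) in lpi2_by_capacity.items()}
total_delta = round(sum(deltas.values()), 1)

for cap, delta in deltas.items():
    print(f"{cap:>5}: {delta:+.1f}s")
print(f"total: {total_delta:+.1f}s")
```

The per-class deltas add up to the -58.3s LPI-2 delta reported for the idle
workload, so the grouped table accounts for the whole aggregate change.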

For a light tbench run (tbench -R 20 -t 600 4), the result is more mixed:

variant    LPI-0     LPI-1     LPI-2     LPI-1+2
base       2593.5s   1483.4s   410.3s    1893.6s
full       2605.3s   1446.5s   416.6s    1863.1s
delta      +11.8s    -36.9s    +6.3s     -30.5s

So tbench gets a small increase in deepest idle, but loses more in
LPI-1+2 overall.

If we do want to keep the per-capacity hierarchy split, maybe it's sufficient
to gate it on either there being only a small number of capacity classes or
all of them having >=4 CPUs?
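
To make the suggested gate concrete, here is a rough Python sketch of the
check. It is not code from the series: should_split(), MAX_CLASSES and
MIN_CPUS_PER_CLASS are hypothetical names, and the value 2 for "a small
number of capacity classes" is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical knobs, not taken from the series.
MAX_CLASSES = 2          # what counts as "a small number of capacity classes"
MIN_CPUS_PER_CLASS = 4   # minimum CPUs each class needs otherwise

def should_split(cpu_capacity):
    """Return True if the per-capacity split would be allowed.

    cpu_capacity maps a CPU number to its capacity value (non-empty).
    """
    class_sizes = Counter(cpu_capacity.values())
    few_classes = len(class_sizes) <= MAX_CLASSES
    all_large = min(class_sizes.values()) >= MIN_CPUS_PER_CLASS
    return few_classes or all_large

# Orion O6 topology from above.
orion_o6 = {0: 1024, 1: 1024, 2: 279, 3: 279, 4: 279, 5: 279,
            6: 905, 7: 905, 8: 866, 9: 866, 10: 984, 11: 984}
print(should_split(orion_o6))
```

With these assumed thresholds the Orion O6 (five classes, four of them with
only two CPUs) would not be split, while a plain two-class big.LITTLE system
would.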

Kind regards,
Christian

Re: [PATCH 0/6] timers/migration: Handle heterogenous CPU capacities

Posted by Frederic Weisbecker 1 week, 5 days ago

On Wed, Jun 03, 2026 at 11:50:58PM +0100, Christian Loehle wrote:
> [...]
> 
> Hi Frederic,
> sorry for the late reaction to this, I completely missed it (CCing
> linux-pm would have helped :) ).

Good point, I'll do that next time!

> 
> [...]
> 
> If we do want to keep the per-capacity hierarchy split, maybe it's sufficient
> to gate it on either there being only a small number of capacity classes or
> all of them having >=4 CPUs?

OK, I was afraid of something like that, i.e. it works for some use cases but
not for others.

And I don't know what to do. For example, if I apply your suggested constraints,
which hierarchy should the capacities with < 4 CPUs go to?

Thoughts?

> 
> Kind regards,
> Christian
> 

-- 
Frederic Weisbecker
SUSE Labs

Re: [PATCH 0/6] timers/migration: Handle heterogenous CPU capacities

Posted by Christian Loehle 1 week, 4 days ago

On 6/4/26 14:36, Frederic Weisbecker wrote:
> [...]
> 
> OK, I was afraid of something like that, i.e. it works for some use cases but
> not for others.
> 
> And I don't know what to do. For example, if I apply your suggested constraints,
> which hierarchy should the capacities with < 4 CPUs go to?
> 
> Thoughts?
> 

I sure have some thoughts, but I'm unsure what the best solution is.
A few things are bothering me:
1. In the original report the problem was that migrating timers from
little to big CPUs leads to a power regression, but of course they most
likely still benefit from the reverse migration, which makes static
partitioning seem counterintuitive to me in the first place. In particular,
since usually #little CPUs > #big CPUs, my intuition would be that this
migration should be more common, or is that not true? I'd also love to know
with which workload the original issue appeared.
2. While little->big timer migration might usually be bad for power, that's
not always true, depending on SoC and workload. We don't really know without
consulting the energy model. For most timers, though, the energy model
wouldn't be that useful anyway, as a good chunk of the decision comes down
to wasting potential idle energy rather than active energy, and the energy
model is unaware of the power savings of idle states.

For the static hierarchy split itself my ideas would be:

1. Don't do it if the resulting hierarchy is too awkward, e.g. single CPUs or
too many tiny groups. Obviously that risks excluding the system from the
original report.

2. Group only meaningfully different capacities, rather than exact
arch_scale_cpu_capacity() values. For example, use something like the
capacity_greater() margin so negligible capacity differences don't create
separate timer hierarchies. [1]

3. Have a limited number of buckets. Fixed thresholds such as <512
and >=512 would probably work, but are arbitrary.

4. Only start a new bucket if last_capacity != current_capacity &&
last_bucket_cpus >= 4. This feels awkward because the resulting hierarchy then
depends on CPU/hotplug ordering.
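
To illustrate the ordering problem with idea 4, here is a small Python sketch
of that greedy rule. It assumes that last_capacity means the capacity of the
previously visited CPU; bucket_in_order() is a hypothetical helper, not
something from the series.

```python
def bucket_in_order(cpus, capacity_of, min_cpus=4):
    """Walk CPUs in the given order and open a new bucket only when the
    capacity changes and the current bucket already holds min_cpus CPUs."""
    buckets = []
    last_capacity = None
    for cpu in cpus:
        cap = capacity_of[cpu]
        if not buckets or (cap != last_capacity and len(buckets[-1]) >= min_cpus):
            buckets.append([])
        buckets[-1].append(cpu)
        last_capacity = cap
    return buckets

# Orion O6 topology: CPU number -> capacity.
orion_o6 = {0: 1024, 1: 1024, 2: 279, 3: 279, 4: 279, 5: 279,
            6: 905, 7: 905, 8: 866, 9: 866, 10: 984, 11: 984}

# Same rule, two visiting orders: by CPU number and by ascending capacity.
by_cpu_number = bucket_in_order(sorted(orion_o6), orion_o6)
by_capacity = bucket_in_order(
    sorted(orion_o6, key=lambda cpu: (orion_o6[cpu], cpu)), orion_o6)

print(by_cpu_number)
print(by_capacity)
```

Visiting CPUs by number puts the two 1024 CPUs into the same bucket as the
four 279 CPUs, while visiting them by ascending capacity pairs the 1024 CPUs
with the 984 ones, so the same rule gives two different hierarchies.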

If we allow for a more dynamic migration strategy, I think I'd prefer the
decision to be based on observed idle opportunity rather than capacity alone.
Something like rq->avg_idle could make CPUs with shorter recent idle periods
more likely to handle timers, while avoiding CPUs that tend to get long/deep
idle residencies. Is that unreasonable from your end?

[1] NVIDIA Grace, for example, has capacities of 994, 997, 1000, 1002,
1005, 1008, 1010, 1013, 1016, 1018, 1021 and 1024.

This feels like it should all be one hierarchy bucket. On my Orion O6,
using the capacity_greater() margin would at least reduce the split to:

279 (4 CPUs)
866 + 905 (4 CPUs)
984 + 1024 (4 CPUs)

Nonetheless many SoCs are 4+2+1 or 4+3+1, so even that does not fully solve
the tiny hierarchy problem.
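
For reference, the margin-based grouping can be sketched in Python as below.
The comparison mirrors the scheduler's capacity_greater() macro as I
understand it (roughly a 5% margin). bucket_capacities() and the choice to
compare each capacity against the smallest one in the current bucket are
assumptions made for this sketch only.

```python
def capacity_greater(cap1, cap2):
    # Is cap1 noticeably (~5%) greater than cap2? Modeled on the
    # scheduler's capacity_greater() margin check.
    return cap1 * 1024 > cap2 * 1078

def bucket_capacities(capacities):
    """Group distinct capacity values so that values within the margin of
    the bucket's smallest capacity share one bucket."""
    buckets = []
    for cap in sorted(set(capacities)):
        if buckets and not capacity_greater(cap, buckets[-1][0]):
            buckets[-1].append(cap)
        else:
            buckets.append([cap])
    return buckets

orion_o6 = [279, 866, 905, 984, 1024]
grace = [994, 997, 1000, 1002, 1005, 1008, 1010, 1013, 1016, 1018, 1021, 1024]

print(bucket_capacities(orion_o6))
print(len(bucket_capacities(grace)))
```

With this rule the Orion O6 capacities fall into the three groups listed
above and all the Grace capacities land in a single bucket.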

Re: [PATCH 0/6] timers/migration: Handle heterogenous CPU capacities

Posted by Frederic Weisbecker 6 days, 15 hours ago

On Fri, Jun 05, 2026 at 11:10:20AM +0100, Christian Loehle wrote:
> [...]
> I sure have some thoughts, but I'm unsure what the best solution is.
> A few things are bothering me:
> 1. In the original report the problem was that migrating timers from
> little to big CPUs leads to a power regression, but of course they most
> likely still benefit from the reverse migration, which makes static
> partitioning seem counterintuitive to me in the first place. In particular,
> since usually #little CPUs > #big CPUs, my intuition would be that this
> migration should be more common, or is that not true? I'd also love to know
> with which workload the original issue appeared.
> 2. While little->big timer migration might usually be bad for power, that's
> not always true, depending on SoC and workload. We don't really know without
> consulting the energy model. For most timers, though, the energy model
> wouldn't be that useful anyway, as a good chunk of the decision comes down
> to wasting potential idle energy rather than active energy, and the energy
> model is unaware of the power savings of idle states.
> 
> For the static hierarchy split itself my ideas would be:
> 
> 1. Don't do it if the resulting hierarchy is too awkward, e.g. single CPUs or
> too many tiny groups. Obviously that risks excluding the system from the
> original report.
> 
> 2. Group only meaningfully different capacities, rather than exact
> arch_scale_cpu_capacity() values. For example, use something like the
> capacity_greater() margin so negligible capacity differences don't create
> separate timer hierarchies. [1]
> 
> 3. Have a limited number of buckets. Fixed thresholds such as <512
> and >=512 would probably work, but are arbitrary.
> 
> 4. Only start a new bucket if last_capacity != current_capacity &&
> last_bucket_cpus >= 4. This feels awkward because the resulting hierarchy then
> depends on CPU/hotplug ordering.
> 
> If we allow for a more dynamic migration strategy, I think I'd prefer the
> decision to be based on observed idle opportunity rather than capacity alone.
> Something like rq->avg_idle could make CPUs with shorter recent idle periods
> more likely to handle timers, while avoiding CPUs that tend to get long/deep
> idle residencies. Is that unreasonable from your end?

I guess it's feasible, but that doesn't take the capacity itself into account.
The initial issue was about timers migrating too often to big cores and
therefore keeping them awake too frequently. I guess the biggest issue is
when the last core going idle is a big core, because it's the one that will
handle all global timers for the whole system.

And perhaps it's a fundamental issue because big cores are probably busier by
nature.

That problem is not easy to solve...

> 
> [1] NVIDIA Grace, for example, has capacities of 994, 997, 1000, 1002,
> 1005, 1008, 1010, 1013, 1016, 1018, 1021 and 1024.

Urgh, who needs that?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs