Hi,
This is a late follow-up after:
https://lore.kernel.org/lkml/20250910074251.8148-1-sehee1.jeong@samsung.com/
To summarize, on systems with heterogeneous CPU capacities, timers migrate
indifferently between big and little CPUs. They often end up being
migrated to big CPUs, increasing their idle target residency.
Thomas proposed to isolate the hierarchy between big and little CPUs.
So here is a try. Note that I haven't tested this on real heterogeneous
hardware, so if you have some, please test it!
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/core
HEAD: f0a87af6dab6f3a6dd8a603a2b9d7dcc86fd50e4
Thanks,
Frederic
---
Frederic Weisbecker (6):
      timers/migration: Fix another hotplug activation race
      timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness
      timers/migration: Track CPUs in a hierarchy
      timers/migration: Split per-capacity hierarchies
      timers/migration: Handle capacity in connect tracepoints
      scripts/timers: Add timer_migration_tree.py

 include/trace/events/timer_migration.h |  24 ++--
 kernel/time/timer_migration.c          | 246 ++++++++++++++++++++++++---------
 kernel/time/timer_migration.h          |  19 +++
 scripts/timer_migration_tree.py        | 122 ++++++++++++++++
 4 files changed, 337 insertions(+), 74 deletions(-)

On 4/23/26 17:53, Frederic Weisbecker wrote:
> Thomas proposed to isolate the hierarchy between big and little CPUs.
> So here is a try. Note I haven't tested on real heterogenous hardware
> so if you have it, please test it!

Hi Frederic,
sorry for the late reaction to this, I completely missed it (CCing
linux-pm would have helped :) ).

I'm not convinced that unconditionally splitting the timer migration
hierarchy per-capacity is always the right tradeoff from a power point of
view. On some asymmetric systems we only have one or two CPUs in a given
capacity class. In that case the split can effectively remove most of the
useful timer migration opportunity for that class, even though allowing
migration across nearby capacities may still be better for idle residency.

I tested this on an Orion O6 system with the following topology:

online CPUs: 0-11

capacity  279: CPUs 2,3,4,5
capacity  866: CPUs 8,9
capacity  905: CPUs 6,7
capacity  984: CPUs 10,11
capacity 1024: CPUs 0,1

I compared the series up to and including the preparatory/refactoring
patch 3 against the full series including the per-capacity hierarchy split.
The numbers below are aggregate cpuidle residency deltas over a 600s run.

Idle workload:

variant   LPI-0     LPI-1     LPI-2     LPI-1+2
base      2298.7s   1253.8s   2817.0s   4070.8s
full      2298.8s   1306.1s   2758.7s   4064.7s
delta       +0.1s    +52.3s    -58.3s     -6.1s

Grouped by capacity class, the LPI-2 loss is mostly on the lower-capacity
CPUs:

group   base LPI-2   full LPI-2   delta (full - base)
279     1073.5s      1031.9s      -41.6s
866      502.5s       486.4s      -16.1s
905      499.7s       490.4s       -9.3s
984      488.8s       496.0s       +7.2s
1024     252.5s       254.0s       +1.5s

For a light tbench run (tbench -R 20 -t 600 4), the result is more mixed:

variant   LPI-0     LPI-1     LPI-2     LPI-1+2
base      2593.5s   1483.4s    410.3s   1893.6s
full      2605.3s   1446.5s    416.6s   1863.1s
delta      +11.8s    -36.9s     +6.3s    -30.5s

So tbench gets a small increase in deepest idle, but loses more in
LPI-1+2 overall.

If we do wanna keep the per-capacity hierarchy split, maybe it's sufficient to
gate this behind there being either a small number of capacity classes or
ensuring that they all have >=4 CPUs before splitting?

Kind regards,
Christian
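
The per-class LPI-2 table can be cross-checked against the aggregate idle-workload figures with a few lines of Python. This is only a sketch: the numbers are copied from the tables in this message, the CPU counts come from the reported Orion O6 topology, and the variable names are illustrative rather than taken from any script in the series.

```python
# Residency figures copied from the message above (seconds, 600s idle run).
# CPU counts per capacity class follow the reported Orion O6 topology.
cpus_per_class = {279: 4, 866: 2, 905: 2, 984: 2, 1024: 2}

# capacity class -> (base LPI-2, full LPI-2)
lpi2_by_class = {
    279: (1073.5, 1031.9),
    866: (502.5, 486.4),
    905: (499.7, 490.4),
    984: (488.8, 496.0),
    1024: (252.5, 254.0),
}

# Sum the per-class columns so they can be compared with the aggregate row.
base_sum = round(sum(base for base, _ in lpi2_by_class.values()), 1)
full_sum = round(sum(full for _, full in lpi2_by_class.values()), 1)

# Per-class delta (full - base), plus the same delta averaged per CPU.
delta = {cap: round(full - base, 1) for cap, (base, full) in lpi2_by_class.items()}
delta_per_cpu = {cap: round(d / cpus_per_class[cap], 2) for cap, d in delta.items()}

for cap in sorted(delta):
    print(f"{cap:>5}: {delta[cap]:+7.1f}s total, {delta_per_cpu[cap]:+6.2f}s per CPU")
print(f"sum: base {base_sum}s, full {full_sum}s, "
      f"delta {round(full_sum - base_sum, 1):+.1f}s")
```

The per-class sums line up with the aggregate LPI-2 row (2817.0s base, 2758.7s full), and dividing by the CPU count shows the loss per CPU is still largest in the 279 class, even though that class also has the most CPUs.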

On Wed, Jun 03, 2026 at 11:50:58PM +0100, Christian Loehle wrote:
> Hi Frederic,
> sorry for the late reaction to this, I completely missed it (CCing
> linux-pm would have helped :) ).

Good point, I'll do that next time!

> If we do wanna keep the per-capacity hierarchy split, maybe it's sufficient to
> gate this behind there being either a small number of capacity classes or
> ensuring that they all have >=4 CPUs before splitting?

OK, I was afraid of something like that, i.e. it works for some usages but
not for others.

And I don't know what to do. For example, if I apply your suggested
constraints, which hierarchy should the capacities with < 4 CPUs go to?

Thoughts?

-- 
Frederic Weisbecker
SUSE Labs

On 6/4/26 14:36, Frederic Weisbecker wrote:
> And I don't know what to do. For example, if I apply your suggested
> constraints, which hierarchy should the capacities with < 4 CPUs go to?
>
> Thoughts?

I sure have some thoughts, but I'm unsure what the best solution is.
A few things are bothering me:

1. In the original report the problem was that timers being migrated from
   little to big CPUs led to a power regression, but of course they most
   likely still benefit from the reverse migration, which makes static
   partitioning seem counterintuitive to me in the first place. In
   particular, because usually #little CPUs > #big CPUs, my intuition would
   be that that migration should be more common, or is that not true? I'd
   also love to know with what workload the original issue appeared.

2. While little->big timer migration might usually be bad for power, that's
   not always true, depending on SoC and workload; we don't really know
   without consulting the energy model. For most timers, though, the energy
   model wouldn't be that useful anyway, as a good chunk of the decision
   comes from wasting potential idle energy instead of active energy, and
   the energy model is unaware of the power savings of idle states.

For the static hierarchy split itself my ideas would be:

1. Don't do it if the resulting hierarchy is too awkward, e.g. single CPUs or
   too many tiny groups. Obviously that risks excluding the system from the
   original report.

2. Group only meaningfully different capacities, rather than exact
   arch_scale_cpu_capacity() values. For example, use something like the
   capacity_greater() margin so negligible capacity differences don't create
   separate timer hierarchies. [1]

3. Have a limited number of buckets; fixed thresholds such as <512
   and >=512 would probably work, but are arbitrary.

4. Only start a new bucket if last_capacity != current_capacity &&
   last_bucket_cpus >= 4. This feels awkward because the resulting hierarchy
   then depends on CPU/hotplug ordering.

If we allow for a more dynamic migration strategy, I think I'd prefer the
decision to be based on observed idle opportunity rather than capacity alone.
Something like rq->avg_idle could make CPUs with shorter recent idle periods
more likely to handle timers, while avoiding CPUs that tend to get long/deep
idle residencies. Is that unreasonable from your end?

[1] Nvidia Grace, for example, has capacities of
994
997
1000
1002
1005
1008
1010
1013
1016
1018
1021
1024

This feels like it should all be one hierarchy bucket.

On my Orion O6, using the capacity_greater() margin would at least reduce
the split to:

279          (4 CPUs)
866 + 905    (4 CPUs)
984 + 1024   (4 CPUs)

Nonetheless many SoCs are 4+2+1 or 4+3+1, so even that does not fully solve
the tiny hierarchy problem.
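
The margin-based grouping from idea 2 can be sketched in Python as follows. This is a hedged illustration only, not code from the series: `capacity_greater()` mirrors the scheduler macro as I understand it (`cap1 * 1024 > cap2 * 1078`, roughly a 5% margin), while `bucket_capacities()`, the bucket layout, and the rule of comparing against the lowest capacity in the current bucket are assumptions made for this example. The Grace dictionary assigns one made-up CPU number per reported capacity value purely to exercise the function.

```python
def capacity_greater(cap1, cap2):
    """Python rendering of the scheduler's ~5% capacity margin (assumed form)."""
    return cap1 * 1024 > cap2 * 1078


def bucket_capacities(cpu_capacity):
    """Group CPUs whose capacities are within the margin (hypothetical policy).

    Capacities are walked in ascending order. A new bucket is opened only when
    a capacity is noticeably greater than the lowest capacity ("floor") of the
    current bucket, so near-identical capacities share one bucket.
    """
    buckets = []
    for cpu, cap in sorted(cpu_capacity.items(), key=lambda kv: (kv[1], kv[0])):
        if not buckets or capacity_greater(cap, buckets[-1]["floor"]):
            buckets.append({"floor": cap, "caps": [], "cpus": []})
        if cap not in buckets[-1]["caps"]:
            buckets[-1]["caps"].append(cap)
        buckets[-1]["cpus"].append(cpu)
    return buckets


# Orion O6 topology as reported earlier in the thread (cpu -> capacity).
orion_o6 = {0: 1024, 1: 1024, 2: 279, 3: 279, 4: 279, 5: 279,
            6: 905, 7: 905, 8: 866, 9: 866, 10: 984, 11: 984}

# Capacity values quoted for Nvidia Grace; the CPU numbering is made up.
grace_caps = [994, 997, 1000, 1002, 1005, 1008, 1010, 1013, 1016, 1018, 1021, 1024]
grace = dict(enumerate(grace_caps))

for name, topo in (("Orion O6", orion_o6), ("Grace (illustrative)", grace)):
    print(name)
    for bucket in bucket_capacities(topo):
        caps = " + ".join(str(c) for c in bucket["caps"])
        print(f"  {caps} ({len(bucket['cpus'])} CPUs)")
```

Comparing against the bucket floor rather than the previous capacity is a deliberate choice in this sketch: it keeps a long chain of small steps from drifting arbitrarily far inside a single bucket. Under these assumptions the Orion O6 collapses to the three 4-CPU groups listed above, and the Grace values all land in one bucket.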

On Fri, Jun 05, 2026 at 11:10:20AM +0100, Christian Loehle wrote:
> If we allow for a more dynamic migration strategy, I think I'd prefer the
> decision to be based on observed idle opportunity rather than capacity alone.
> Something like rq->avg_idle could make CPUs with shorter recent idle periods
> more likely to handle timers, while avoiding CPUs that tend to get long/deep
> idle residencies. Is that unreasonable from your end?

I guess it's feasible, but that doesn't take into account the capacity itself.
The initial issue was about timers migrating too often to big cores and
therefore keeping them alive too frequently.

I guess the biggest issue is when the last core going idle is a big core,
because it's the one that will handle all global timers for the whole
system. And perhaps it's a fundamental issue, because big cores are probably
busier by nature.

That problem is not easy to solve...

> [1] Nvidia Grace, for example, has capacities of
> 994
> 997
> 1000
> 1002
> 1005
> 1008
> 1010
> 1013
> 1016
> 1018
> 1021
> 1024

Urgh, who needs that?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs