From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
Synchronized ticks mean that all processors will simultaneously process
ticks and enter the scheduler. So the contention increases as the number
of CPUs increases. The contention causes latency jitter that scales with
the number of processors.
Staggering the timer interrupt also helps mitigate voltage droop related
issues that may be observed in SOCs with large core counts.
See https://semiengineering.com/mitigating-voltage-droop/ for a more
detailed explanation.
Switch to skewed tick for systems with more than 64 processors.
Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
kernel/Kconfig.hz | 10 ++++++++++
kernel/time/tick-sched.c | 16 ++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index ce1435cb08b1..245d938d446b 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -57,3 +57,13 @@ config HZ
 
 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS
+
+config TICK_SKEW_LIMIT
+	int
+	default 64
+	help
+	  If the kernel is booted on systems with a large number of cpus then the
+	  concurrent execution of timer ticks causes long holdoffs due to
+	  serialization. Synchronous executions of interrupts can also cause
+	  voltage droop in SOCs. So switch to skewed mode. This mechanism
+	  can be overridden by specifying "tick_skew=x" on the kernel command line.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..aab7a1cc25c7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1554,7 +1554,7 @@ void tick_irq_enter(void)
 	tick_nohz_irq_enter();
 }
 
-static int sched_skew_tick;
+static int sched_skew_tick = -1;
 
 static int __init skew_tick(char *str)
 {
@@ -1572,6 +1572,16 @@ void tick_setup_sched_timer(bool hrtimer)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
+	/* Figure out if we should skew the tick */
+	if (sched_skew_tick < 0) {
+		if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
+			sched_skew_tick = 1;
+			pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
+				num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);
+		} else
+			sched_skew_tick = 0;
+	}
+
 	/* Emulate tick processing via per-CPU hrtimers: */
 	hrtimer_setup(&ts->sched_timer, tick_nohz_handler, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
@@ -1587,7 +1597,9 @@ void tick_setup_sched_timer(bool hrtimer)
 		do_div(offset, num_possible_cpus());
 		offset *= smp_processor_id();
 		hrtimer_add_expires_ns(&ts->sched_timer, offset);
-	}
+	} else
+		pr_info("Tick operating in synchronized mode.\n");
+
 
 	hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
 	if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
---
base-commit: 66701750d5565c574af42bef0b789ce0203e3071
change-id: 20250702-tick_skew-0e7858c10246
Best regards,
--
Christoph Lameter <cl@gentwo.org>
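
For readers who want to see what the stagger in the final tick-sched.c hunk
amounts to, here is a standalone sketch of the same arithmetic; HZ=250 and 128
possible CPUs are assumed example values, not figures taken from the patch:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint64_t tick_nsec = 1000000000ULL / 250;	/* HZ=250 -> 4 ms tick */
		const unsigned int ncpus = 128;			/* stand-in for num_possible_cpus() */
		const uint64_t step = (tick_nsec >> 1) / ncpus;	/* per-CPU stagger, as in the hunk */

		for (unsigned int cpu = 0; cpu < ncpus; cpu += 32)
			printf("cpu %3u: tick offset %llu ns\n", cpu,
			       (unsigned long long)(step * cpu));
		return 0;
	}

The offsets grow in ~15.6 us steps, so the last CPU fires just under half a
tick after CPU 0: every CPU still ticks once per period, just no longer all at
the same instant.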
Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 66701750d5565c574af42bef0b789ce0203e3071]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Skew-tick-for-systems-with-a-large-number-of-processors/20250703-034357
base:   66701750d5565c574af42bef0b789ce0203e3071
patch link:    https://lore.kernel.org/r/20250702-tick_skew-v1-1-ff8d73035b02%40gentwo.org
patch subject: [PATCH] Skew tick for systems with a large number of processors
config: arc-randconfig-002-20250703 (https://download.01.org/0day-ci/archive/20250703/202507032322.5jzirIYw-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250703/202507032322.5jzirIYw-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507032322.5jzirIYw-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/time/tick-sched.c: In function 'tick_setup_sched_timer':
>> kernel/time/tick-sched.c:1577:44: error: 'CONFIG_TICK_SKEW_LIMIT' undeclared (first use in this function); did you mean 'CONFIG_TICK_ONESHOT'?
    1577 |         if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
         |                                    ^~~~~~~~~~~~~~~~~~~~~~
         |                                    CONFIG_TICK_ONESHOT
   kernel/time/tick-sched.c:1577:44: note: each undeclared identifier is reported only once for each function it appears in


vim +1577 kernel/time/tick-sched.c

  1566	
  1567	/**
  1568	 * tick_setup_sched_timer - setup the tick emulation timer
  1569	 * @hrtimer: whether to use the hrtimer or not
  1570	 */
  1571	void tick_setup_sched_timer(bool hrtimer)
  1572	{
  1573		struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
  1574	
  1575		/* Figure out if we should skew the tick */
  1576		if (sched_skew_tick < 0) {
> 1577			if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
  1578				sched_skew_tick = 1;
  1579				pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
  1580					num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);
  1581			} else
  1582				sched_skew_tick = 0;
  1583		}
  1584	
  1585		/* Emulate tick processing via per-CPU hrtimers: */
  1586		hrtimer_setup(&ts->sched_timer, tick_nohz_handler, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
  1587	
  1588		if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
  1589			tick_sched_flag_set(ts, TS_FLAG_HIGHRES);
  1590	
  1591		/* Get the next period (per-CPU) */
  1592		hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
  1593	
  1594		/* Offset the tick to avert 'jiffies_lock' contention. */
  1595		if (sched_skew_tick) {
  1596			u64 offset = TICK_NSEC >> 1;
  1597			do_div(offset, num_possible_cpus());
  1598			offset *= smp_processor_id();
  1599			hrtimer_add_expires_ns(&ts->sched_timer, offset);
  1600		} else
  1601			pr_info("Tick operating in synchronized mode.\n");
  1602	
  1603	
  1604		hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
  1605		if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
  1606			hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED_HARD);
  1607		else
  1608			tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
  1609		tick_nohz_activate(ts);
  1610	}
  1611	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Christoph!

On Wed, Jul 02 2025 at 12:42, Christoph Lameter via wrote:

Subject starts with a subsystem followed by a colon and then the short
log. That has been that way forever and is clearly documented. You're not
really new to kernel development and I pointed that out to you before:

  https://lore.kernel.org/all/87o74m1oq7.ffs@tglx

No?

> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Synchronized ticks mean that all processors will simultaneously process
> ticks and enter the scheduler. So the contention increases as the number
> of CPUs increases. The contention causes latency jitter that scales with
> the number of processors.
>
> Staggering the timer interrupt also helps mitigate voltage droop related
> issues that may be observed in SOCs with large core counts.
> See https://semiengineering.com/mitigating-voltage-droop/ for a more
> detailed explanation.
>
> Switch to skewed tick for systems with more than 64 processors.

This lacks a proper explanation why that does not have any negative side
effects on existing deployments and application scenarios.

> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz

The tick related Kconfig options are in kernel/time/Kconfig

> +
> +config TICK_SKEW_LIMIT
> +	int
> +	default 64

That wants a range 0 NR_CPUS or such

> +	help
> +	  If the kernel is booted on systems with a large number of cpus then the
> +	  concurrent execution of timer ticks causes long holdoffs due to
> +	  serialization. Synchronous executions of interrupts can also cause
> +	  voltage droop in SOCs. So switch to skewed mode. This mechanism

How does 'So switch to skewed mode.' help the user to select any useful
value? This wants to have a proper explanation for picking a value which
is understandable by mere mortals and not some useless "expert" word
salad.

> +	  can be overridden by specifying "tick_skew=x" on the kernel command line.

Neither does it explain how that override affects the chosen value nor
update the documentation of the command line value to make users aware of
this behavioural change. For the casual reader this suggests that
tick_skew=x allows to change that number on the kernel command line, which
it does not.

> -static int sched_skew_tick;
> +static int sched_skew_tick = -1;

What's this magic -1 here? Can we please have some obvious and
understandable define for this?

> static int __init skew_tick(char *str)
> {
> @@ -1572,6 +1572,16 @@ void tick_setup_sched_timer(bool hrtimer)
> {
> 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
> +	/* Figure out if we should skew the tick */
> +	if (sched_skew_tick < 0) {

This is incompatible with the existing code, which is unfortunately stupid
already. Today 'tick_skew=-1' causes the tick to be skewed. Now it gets a
different meaning. Not that it matters much, but change logs are supposed
to mention user visible behavioural differences and argue why they don't
matter, no?

> +		if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
> +			sched_skew_tick = 1;
> +			pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
> +				num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);

I'm not convinced that this is useful, but that's the least of the issues.

> +		} else

The else clause wants curly brackets for symmetry.

> +			sched_skew_tick = 0;

The above aside. As you completely failed to provide at least the minimal
historical background in the change log, let me fill in the blanks.

commit 3704540b4829 ("tick management: spread timer interrupt") added the
skew unconditionally in 2007 to avoid lock contention on xtime lock.

commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew") removed
it in 2010 because the xtime lock contention was gone and the skew
affected the power consumption of slightly loaded _large_ servers.

commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
with a command line option to address contention and jitter issues on
larger systems.

So while you preserved the behaviour of the command line option in the
most obscure way, you did not even make an attempt to explain why this
change does not bring back the issues which caused the removal in commit
af5ab277ded0 or why they are irrelevant today.

"Scratches my itch" does not work and you know that. This needs to be
consolidated both on the implementation side and also on the user side.

Thanks for making me do your homework,

        tglx
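
One way the "obvious and understandable define" requested above could look;
this is a minimal sketch only, with hypothetical SKEW_TICK_* names that are
not part of the posted patch:

	/* Tri-state for sched_skew_tick; the names are illustrative */
	#define SKEW_TICK_AUTO	-1	/* decide at boot from the CPU count */
	#define SKEW_TICK_OFF	 0
	#define SKEW_TICK_ON	 1

	static int sched_skew_tick = SKEW_TICK_AUTO;

	/* ...later, in tick_setup_sched_timer(): */
	if (sched_skew_tick == SKEW_TICK_AUTO)
		sched_skew_tick = num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT ?
				  SKEW_TICK_ON : SKEW_TICK_OFF;

This would leave the existing skew_tick() boot-parameter parsing untouched
while making the auto-detection state self-documenting.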
On Thu, 3 Jul 2025, Thomas Gleixner wrote:

> The above aside. As you completely failed to provide at least the
> minimal historical background in the change log, let me fill in the
> blanks.
>
> commit 3704540b4829 ("tick management: spread timer interrupt") added the
> skew unconditionally in 2007 to avoid lock contention on xtime lock.

Right, but that was only one reason why the timer interrupts were
staggered.

> commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew")
> removed it in 2010 because the xtime lock contention was gone and the
> skew affected the power consumption of slightly loaded _large_ servers.

But then the tick also executes other code that can cause contention. Why
merge such an obviously problematic patch without considering the reasons
for the 2007 patch?

> commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
> with a command line option to address contention and jitter issues on
> larger systems.

And then issues resulted because the scaling issues were not considered
when merging the 2010 patch.

> So while you preserved the behaviour of the command line option in the
> most obscure way, you did not even make an attempt to explain why this
> change does not bring back the issues which caused the removal in commit
> af5ab277ded0 or why they are irrelevant today.

As pointed out in the patch description: the synchronized tick (aside from
the jitter) also causes power spikes on large core systems which can cause
system instabilities.

> "Scratches my itch" does not work and you know that. This needs to be
> consolidated both on the implementation side and also on the user
> side.

We can get to that, but I at least need some direction on how to approach
this and figure out the concerns that exist. Frankly, my initial idea was
just to remove the buggy patches since this caused a regression in
performance and system stability, but I guess there were power savings
concerns.

How can we address this issue in a better way, then? The kernel should not
come up all wobbly and cause power spikes every tick.
On Wed, Jul 02 2025 at 17:25, Christoph Lameter wrote:
> On Thu, 3 Jul 2025, Thomas Gleixner wrote:
>
>> The above aside. As you completely failed to provide at least the
>> minimal historical background in the change log, let me fill in the
>> blanks.
>>
>> commit 3704540b4829 ("tick management: spread timer interrupt") added the
>> skew unconditionally in 2007 to avoid lock contention on xtime lock.
>
> Right, but that was only one reason why the timer interrupts were
> staggered.

It was the main reason because all CPUs contended on xtime lock and other
global locks. The subsequent issues you describe were not observable back
then to the extent they are today for bloody obvious reasons.

>> commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew")
>> removed it in 2010 because the xtime lock contention was gone and the
>> skew affected the power consumption of slightly loaded _large_ servers.
>
> But then the tick also executes other code that can cause contention. Why
> merge such an obviously problematic patch without considering the reasons
> for the 2007 patch?

As I said above, the main reason was contention on xtime lock and some
other global locks. These contention issues had been resolved over time,
so the initial reason to have the skew was gone.

The power consumption issue was a valid reason to remove it and the
testing back then did not show any negative side effects. The subsequently
discovered issues were not observable and some of them got introduced by
later code changes.

Obviously the patch is problematic in hindsight, but hindsight is always
20/20.

>> commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
>> with a command line option to address contention and jitter issues on
>> larger systems.
>
> And then issues resulted because the scaling issues were not considered
> when merging the 2010 patch.

What are you trying to do here? Playing a blame game is not helping to
find a solution.

>> So while you preserved the behaviour of the command line option in the
>> most obscure way, you did not even make an attempt to explain why this
>> change does not bring back the issues which caused the removal in commit
>> af5ab277ded0 or why they are irrelevant today.
>
> As pointed out in the patch description: the synchronized tick (aside from
> the jitter) also causes power spikes on large core systems which can cause
> system instabilities.

That's a _NEW_ problem and has nothing to do with the power saving
concerns which led to af5ab277ded0.

>> "Scratches my itch" does not work and you know that. This needs to be
>> consolidated both on the implementation side and also on the user
>> side.
>
> We can get to that, but I at least need some direction on how to approach
> this and figure out the concerns that exist. Frankly, my initial idea was
> just to remove the buggy patches since this caused a regression in
> performance and system stability, but I guess there were power savings
> concerns.

Guessing is not a valid engineering approach, as you might know already.

It's not rocket science to validate whether these power saving concerns
still apply and to reach out to people who have been involved in this and
ask them to revalidate. I just Cc'ed Arjan for you.

> How can we address this issue in a better way, then?

By analysing the consequences of flipping the default for skew_tick to on,
which can be evaluated upfront trivially without a single line of code
change by adding 'skew_tick=1' to the kernel command line, running tests
and asking others to help evaluate.

There is only a limited range of scenarios which need to be looked at:

  - Big servers and the power saving issues on lightly loaded machines

  - Battery operated devices

  - Virtualization (guests)

That might not cover 100% of the use cases, but should be a good enough
coverage to base an informed decision on.

> The kernel should not come up all wobbly and cause power spikes every
> tick.

The kernel should not do a lot of things, but does them due to historical
decisions which turn out to be suboptimal when technology advances. The
power spike problem simply did not exist 18 years ago, at least not to the
extent that it mattered or caused concerns.

If we could have predicted the future and the consequences of ad hoc
decisions, we wouldn't have had a BKL, which took only 20 years of effort
to get rid of (except for the well hidden leftovers in tty).

But what we learned from the past is to avoid hacky ad hoc workarounds,
which are guaranteed to just make the situation worse.

Thanks,

        tglx
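
For completeness, the upfront evaluation suggested above needs no code
change: skew_tick= is an existing boot parameter, so a test run only needs
the option added to the kernel command line, for example via GRUB (file
paths and the regeneration command are distribution dependent):

	# /etc/default/grub -- append to the existing command line, then
	# regenerate the bootloader config (e.g. update-grub) and reboot
	GRUB_CMDLINE_LINUX="... skew_tick=1"

Comparing idle power and wakeup counts (e.g. with turbostat or powertop)
with and without the option, across the scenarios listed above, would
provide the data such a decision needs.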
On Thu, 3 Jul 2025, Thomas Gleixner wrote:

> >> So while you preserved the behaviour of the command line option in the
> >> most obscure way, you did not even make an attempt to explain why this
> >> change does not bring back the issues which caused the removal in commit
> >> af5ab277ded0 or why they are irrelevant today.
> >
> > As pointed out in the patch description: the synchronized tick (aside from
> > the jitter) also causes power spikes on large core systems which can cause
> > system instabilities.
>
> That's a _NEW_ problem and has nothing to do with the power saving
> concerns which led to af5ab277ded0.

The contemporary "advanced on chip power savings" really bite in this
scenario. ;-) It was rather surprising to see what can happen.

> It's not rocket science to validate whether these power saving concerns
> still apply and to reach out to people who have been involved in this
> and ask them to revalidate. I just Cc'ed Arjan for you.

They definitely apply on an Android phone with fewer cores. There you
would want to reduce the number of wakeups as much as possible to conserve
power, so it needs synchronized mode. That is why my initial thought was
to make it dependent on the number of active processors.

> There is only a limited range of scenarios which need to be looked at:
>
>   - Big servers and the power saving issues on lightly loaded machines

If it is only a few active cores and the system is basically idle then it
is better to have a synchronized tick, but if the system has lots of
active processors then the tick should be skewed.

So maybe one idea would be to have a counter of active ticks and skew them
if that number gets too high.

>   - Battery operated devices

These usually have 1-4 cores. So synchronized is obviously the best.

>   - Virtualization (guests)

I think there is work to do here to sync the ticks between host and guest
for further power savings.

> That might not cover 100% of the use cases, but should be a good enough
> coverage to base an informed decision on.

Yeah, let's see what others say on the matter.

> If we could have predicted the future and the consequences of ad hoc
> decisions, we wouldn't have had a BKL, which took only 20 years of
> effort to get rid of (except for the well hidden leftovers in tty).

Oh, the BKL was good. Synchronization was much faster after all and less
complex. I am sure a BKL approach on small systems would still improve
performance.
On Thu, Jul 03 2025 at 07:51, Christoph Lameter wrote:
> On Thu, 3 Jul 2025, Thomas Gleixner wrote:
>> It's not rocket science to validate whether these power saving concerns
>> still apply and to reach out to people who have been involved in this
>> and ask them to revalidate. I just Cc'ed Arjan for you.
>
> They definitely apply on an Android phone with fewer cores. There you
> would want to reduce the number of wakeups as much as possible to conserve
> power, so it needs synchronized mode.

That's kinda obvious, but with the new timer migration model, which stops
placing timers by crystal-ball logic, this might no longer be true and
needs actual data to back up that claim.

> That is why my initial thought was to make it dependent on the number of
> active processors.
>
>> There is only a limited range of scenarios which need to be looked at:
>>
>>   - Big servers and the power saving issues on lightly loaded machines
>
> If it is only a few active cores and the system is basically idle then it
> is better to have a synchronized tick, but if the system has lots of
> active processors then the tick should be skewed.

I agree with the latter, but is your 'few active cores' claim backed by
actual data taken from a current kernel, or based on historical evidence
and hearsay?

> So maybe one idea would be to have a counter of active ticks and skew them
> if that number gets too high.

The idea itself is not that horrible. Though we should tap into the
existing accounting resources to figure that out instead of adding yet
another ill-defined global counter to the mess. All the required metrics
should be there already.

Actually it should be solvable if you look at it just from a per-CPU
perspective. This assumes that NOHZ_IDLE is active, because if it is not
then you can just go and skew unconditionally.

If a CPU is busy, then it just arms the tick skewed. If it goes idle, then
it looks at the expected idle time, which is what NOHZ does already today.
If it decides to stop the tick until the next timer list expires, then it
aligns it.

Earlier expiring high resolution timers obviously override the initial
decision, but that's not much different from what is happening today
already.

>>   - Battery operated devices
>
> These usually have 1-4 cores. So synchronized is obviously the best.

Same question as above.

>> If we could have predicted the future and the consequences of ad hoc
>> decisions, we wouldn't have had a BKL, which took only 20 years of
>> effort to get rid of (except for the well hidden leftovers in tty).
>
> Oh, the BKL was good. Synchronization was much faster after all and less
> complex. I am sure a BKL approach on small systems would still improve
> performance.

Feel free to scale back to 4 cores and enjoy the undefined BKL semantics
forever in your own fork of 2.2.final :)

Thanks,

        tglx
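
To make the per-CPU idea above concrete, here is a rough sketch of the
decision it describes; this is not a working patch, the helper name and
hook point are hypothetical, and the interaction with the real NOHZ idle
path is glossed over:

	/*
	 * Illustrative only: busy CPUs keep a skewed tick to spread the
	 * scheduler-tick work, while a CPU whose tick is stopped for the
	 * next timer-list expiry aligns to the global period so lightly
	 * loaded machines keep batching wakeups.
	 */
	static u64 tick_skew_offset_ns(void)
	{
		u64 offset = TICK_NSEC >> 1;

		if (tick_nohz_tick_stopped())	/* idle, tick already stopped */
			return 0;

		do_div(offset, num_possible_cpus());
		return offset * smp_processor_id();
	}

Earlier-expiring high resolution timers would still override whatever this
picks, as noted above.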