From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
Synchronized ticks mean that all processors will simultaneously process
ticks and enter the scheduler. So the contention increases as the number
of CPUs increases. The contention causes latency jitter that scales with
the number of processors.
Staggering the timer interrupt also helps mitigate voltage droop related
issues that may be observed in SOCs with large core counts.
See https://semiengineering.com/mitigating-voltage-droop/ for a more
detailed explanation.
Switch to skewed tick for systems with more than 64 processors.
Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
kernel/Kconfig.hz | 10 ++++++++++
kernel/time/tick-sched.c | 16 ++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index ce1435cb08b1..245d938d446b 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -57,3 +57,13 @@ config HZ
 
 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS
+
+config TICK_SKEW_LIMIT
+	int
+	default 64
+	help
+	  If the kernel is booted on systems with a large number of cpus then the
+	  concurrent execution of timer ticks causes long holdoffs due to
+	  serialization. Synchronous executions of interrupts can also cause
+	  voltage droop in SOCs. So switch to skewed mode. This mechanism
+	  can be overridden by specifying "tick_skew=x" on the kernel command line.
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..aab7a1cc25c7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1554,7 +1554,7 @@ void tick_irq_enter(void)
 	tick_nohz_irq_enter();
 }
 
-static int sched_skew_tick;
+static int sched_skew_tick = -1;
 
 static int __init skew_tick(char *str)
 {
@@ -1572,6 +1572,16 @@ void tick_setup_sched_timer(bool hrtimer)
 {
 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
 
+	/* Figure out if we should skew the tick */
+	if (sched_skew_tick < 0) {
+		if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
+			sched_skew_tick = 1;
+			pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
+				num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);
+		} else
+			sched_skew_tick = 0;
+	}
+
 	/* Emulate tick processing via per-CPU hrtimers: */
 	hrtimer_setup(&ts->sched_timer, tick_nohz_handler, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
@@ -1587,7 +1597,9 @@ void tick_setup_sched_timer(bool hrtimer)
 		do_div(offset, num_possible_cpus());
 		offset *= smp_processor_id();
 		hrtimer_add_expires_ns(&ts->sched_timer, offset);
-	}
+	} else
+		pr_info("Tick operating in synchronized mode.\n");
+
 
 	hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
 	if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
---
base-commit: 66701750d5565c574af42bef0b789ce0203e3071
change-id: 20250702-tick_skew-0e7858c10246
Best regards,
--
Christoph Lameter <cl@gentwo.org>
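
For readers who want to see what the stagger in the final tick-sched.c hunk
amounts to, here is a standalone sketch of the same arithmetic; HZ=250 and 128
possible CPUs are assumed example values, not figures taken from the patch:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint64_t tick_nsec = 1000000000ULL / 250;	/* HZ=250 -> 4 ms tick */
		const unsigned int ncpus = 128;			/* stand-in for num_possible_cpus() */
		const uint64_t step = (tick_nsec >> 1) / ncpus;	/* per-CPU stagger, as in the hunk */

		for (unsigned int cpu = 0; cpu < ncpus; cpu += 32)
			printf("cpu %3u: tick offset %llu ns\n", cpu,
			       (unsigned long long)(step * cpu));
		return 0;
	}

The offsets grow in ~15.6 us steps, so the last CPU fires just under half a
tick after CPU 0: every CPU still ticks once per period, just no longer all at
the same instant.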
Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 66701750d5565c574af42bef0b789ce0203e3071]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Skew-tick-for-systems-with-a-large-number-of-processors/20250703-034357
base:   66701750d5565c574af42bef0b789ce0203e3071
patch link:    https://lore.kernel.org/r/20250702-tick_skew-v1-1-ff8d73035b02%40gentwo.org
patch subject: [PATCH] Skew tick for systems with a large number of processors
config: arc-randconfig-002-20250703 (https://download.01.org/0day-ci/archive/20250703/202507032322.5jzirIYw-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250703/202507032322.5jzirIYw-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507032322.5jzirIYw-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/time/tick-sched.c: In function 'tick_setup_sched_timer':
>> kernel/time/tick-sched.c:1577:44: error: 'CONFIG_TICK_SKEW_LIMIT' undeclared (first use in this function); did you mean 'CONFIG_TICK_ONESHOT'?
    1577 |         if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
         |                                    ^~~~~~~~~~~~~~~~~~~~~~
         |                                    CONFIG_TICK_ONESHOT
   kernel/time/tick-sched.c:1577:44: note: each undeclared identifier is reported only once for each function it appears in


vim +1577 kernel/time/tick-sched.c

  1566	
  1567	/**
  1568	 * tick_setup_sched_timer - setup the tick emulation timer
  1569	 * @hrtimer: whether to use the hrtimer or not
  1570	 */
  1571	void tick_setup_sched_timer(bool hrtimer)
  1572	{
  1573		struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
  1574	
  1575		/* Figure out if we should skew the tick */
  1576		if (sched_skew_tick < 0) {
> 1577			if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
  1578				sched_skew_tick = 1;
  1579				pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
  1580					num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);
  1581			} else
  1582				sched_skew_tick = 0;
  1583		}
  1584	
  1585		/* Emulate tick processing via per-CPU hrtimers: */
  1586		hrtimer_setup(&ts->sched_timer, tick_nohz_handler, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
  1587	
  1588		if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
  1589			tick_sched_flag_set(ts, TS_FLAG_HIGHRES);
  1590	
  1591		/* Get the next period (per-CPU) */
  1592		hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
  1593	
  1594		/* Offset the tick to avert 'jiffies_lock' contention. */
  1595		if (sched_skew_tick) {
  1596			u64 offset = TICK_NSEC >> 1;
  1597			do_div(offset, num_possible_cpus());
  1598			offset *= smp_processor_id();
  1599			hrtimer_add_expires_ns(&ts->sched_timer, offset);
  1600		} else
  1601			pr_info("Tick operating in synchronized mode.\n");
  1602	
  1603	
  1604		hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
  1605		if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
  1606			hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED_HARD);
  1607		else
  1608			tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
  1609		tick_nohz_activate(ts);
  1610	}
  1611	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Christoph!

On Wed, Jul 02 2025 at 12:42, Christoph Lameter via wrote:

Subject starts with a subsystem followed by a colon and then the short
log. That has been that way forever and is clearly documented. You're not
really new to kernel development and I pointed that out to you before:

  https://lore.kernel.org/all/87o74m1oq7.ffs@tglx

No?

> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Synchronized ticks mean that all processors will simultaneously process
> ticks and enter the scheduler. So the contention increases as the number
> of CPUs increases. The contention causes latency jitter that scales with
> the number of processors.
>
> Staggering the timer interrupt also helps mitigate voltage droop related
> issues that may be observed in SOCs with large core counts.
> See https://semiengineering.com/mitigating-voltage-droop/ for a more
> detailed explanation.
>
> Switch to skewed tick for systems with more than 64 processors.

This lacks a proper explanation why that does not have any negative side
effects on existing deployments and application scenarios.

> --- a/kernel/Kconfig.hz
> +++ b/kernel/Kconfig.hz

The tick related Kconfig options are in kernel/time/Kconfig

> +
> +config TICK_SKEW_LIMIT
> +	int
> +	default 64

That wants a range 0 NR_CPUS or such

> +	help
> +	  If the kernel is booted on systems with a large number of cpus then the
> +	  concurrent execution of timer ticks causes long holdoffs due to
> +	  serialization. Synchronous executions of interrupts can also cause
> +	  voltage droop in SOCs. So switch to skewed mode. This mechanism

How does 'So switch to skewed mode.' help the user to select any useful
value? This wants to have a proper explanation for picking a value which
is understandable by mere mortals and not some useless "expert" word
salad.

> +	  can be overridden by specifying "tick_skew=x" on the kernel command line.

Neither does it explain how that override affects the chosen value nor
update the documentation of the command line value to make users aware of
this behavioural change. For the casual reader this suggests that
tick_skew=x allows to change that number on the kernel command line, which
it does not.

> -static int sched_skew_tick;
> +static int sched_skew_tick = -1;

What's this magic -1 here? Can we please have some obvious and
understandable define for this?

> static int __init skew_tick(char *str)
> {
> @@ -1572,6 +1572,16 @@ void tick_setup_sched_timer(bool hrtimer)
> {
> 	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
>
> +	/* Figure out if we should skew the tick */
> +	if (sched_skew_tick < 0) {

This is incompatible with the existing code, which is unfortunately stupid
already. Today 'tick_skew=-1' causes the tick to be skewed. Now it gets a
different meaning. Not that it matters much, but change logs are supposed
to mention user visible behavioural differences and argue why they don't
matter, no?

> +		if (num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT) {
> +			sched_skew_tick = 1;
> +			pr_info("Tick skewed mode enabled. Possible cpus %u > %u\n",
> +				num_possible_cpus(), CONFIG_TICK_SKEW_LIMIT);

I'm not convinced that this is useful, but that's the least of the issues.

> +		} else

The else clause wants curly brackets for symmetry.

> +			sched_skew_tick = 0;

The above aside. As you completely failed to provide at least the minimal
historical background in the change log, let me fill in the blanks.

commit 3704540b4829 ("tick management: spread timer interrupt") added the
skew unconditionally in 2007 to avoid lock contention on xtime lock.

commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew") removed
it in 2010 because the xtime lock contention was gone and the skew
affected the power consumption of slightly loaded _large_ servers.

commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
with a command line option to address contention and jitter issues on
larger systems.

So while you preserved the behaviour of the command line option in the
most obscure way, you did not even make an attempt to explain why this
change does not bring back the issues which caused the removal in commit
af5ab277ded0 or why they are irrelevant today.

"Scratches my itch" does not work and you know that. This needs to be
consolidated both on the implementation side and also on the user side.

Thanks for making me do your homework,

        tglx
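
One way the "obvious and understandable define" requested above could look;
this is a minimal sketch only, with hypothetical SKEW_TICK_* names that are
not part of the posted patch:

	/* Tri-state for sched_skew_tick; the names are illustrative */
	#define SKEW_TICK_AUTO	-1	/* decide at boot from the CPU count */
	#define SKEW_TICK_OFF	 0
	#define SKEW_TICK_ON	 1

	static int sched_skew_tick = SKEW_TICK_AUTO;

	/* ...later, in tick_setup_sched_timer(): */
	if (sched_skew_tick == SKEW_TICK_AUTO)
		sched_skew_tick = num_possible_cpus() >= CONFIG_TICK_SKEW_LIMIT ?
				  SKEW_TICK_ON : SKEW_TICK_OFF;

This would leave the existing skew_tick() boot-parameter parsing untouched
while making the auto-detection state self-documenting.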
On Thu, 3 Jul 2025, Thomas Gleixner wrote:

> The above aside. As you completely failed to provide at least the
> minimal historical background in the change log, let me fill in the
> blanks.
>
> commit 3704540b4829 ("tick management: spread timer interrupt") added the
> skew unconditionally in 2007 to avoid lock contention on xtime lock.

Right, but that was only one reason why the timer interrupts were
staggered.

> commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew")
> removed it in 2010 because the xtime lock contention was gone and the
> skew affected the power consumption of slightly loaded _large_ servers.

But then the tick also executes other code that can cause contention. Why
merge such an obviously problematic patch without considering the reasons
for the 2007 patch?

> commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
> with a command line option to address contention and jitter issues on
> larger systems.

And then issues resulted because the scaling issues were not considered
when merging the 2010 patch.

> So while you preserved the behaviour of the command line option in the
> most obscure way, you did not even make an attempt to explain why this
> change does not bring back the issues which caused the removal in commit
> af5ab277ded0 or why they are irrelevant today.

As pointed out in the patch description: the synchronized tick (aside from
the jitter) also causes power spikes on large core systems which can cause
system instabilities.

> "Scratches my itch" does not work and you know that. This needs to be
> consolidated both on the implementation side and also on the user
> side.

We can get to that, but I at least need some direction on how to approach
this and figure out the concerns that exist. Frankly, my initial idea was
just to remove the buggy patches since this caused a regression in
performance and system stability, but I guess there were power savings
concerns.

How can we address this issue in a better way, then? The kernel should not
come up all wobbly and cause power spikes every tick.
On Wed, Jul 02 2025 at 17:25, Christoph Lameter wrote:
> On Thu, 3 Jul 2025, Thomas Gleixner wrote:
>
>> The above aside. As you completely failed to provide at least the
>> minimal historical background in the change log, let me fill in the
>> blanks.
>>
>> commit 3704540b4829 ("tick management: spread timer interrupt") added the
>> skew unconditionally in 2007 to avoid lock contention on xtime lock.
>
> Right, but that was only one reason why the timer interrupts were
> staggered.

It was the main reason because all CPUs contended on xtime lock and other
global locks. The subsequent issues you describe were not observable back
then to the extent they are today for bloody obvious reasons.

>> commit af5ab277ded0 ("clockevents: Remove the per cpu tick skew")
>> removed it in 2010 because the xtime lock contention was gone and the
>> skew affected the power consumption of slightly loaded _large_ servers.
>
> But then the tick also executes other code that can cause contention. Why
> merge such an obviously problematic patch without considering the reasons
> for the 2007 patch?

As I said above, the main reason was contention on xtime lock and some
other global locks. These contention issues had been resolved over time,
so the initial reason to have the skew was gone.

The power consumption issue was a valid reason to remove it and the
testing back then did not show any negative side effects. The subsequently
discovered issues were not observable and some of them got introduced by
later code changes.

Obviously the patch is problematic in hindsight, but hindsight is always
20/20.

>> commit 5307c9556bc1 ("tick: Add tick skew boot option") brought it back
>> with a command line option to address contention and jitter issues on
>> larger systems.
>
> And then issues resulted because the scaling issues were not considered
> when merging the 2010 patch.

What are you trying to do here? Playing a blame game is not helping to
find a solution.

>> So while you preserved the behaviour of the command line option in the
>> most obscure way, you did not even make an attempt to explain why this
>> change does not bring back the issues which caused the removal in commit
>> af5ab277ded0 or why they are irrelevant today.
>
> As pointed out in the patch description: the synchronized tick (aside from
> the jitter) also causes power spikes on large core systems which can cause
> system instabilities.

That's a _NEW_ problem and has nothing to do with the power saving
concerns which led to af5ab277ded0.

>> "Scratches my itch" does not work and you know that. This needs to be
>> consolidated both on the implementation side and also on the user
>> side.
>
> We can get to that, but I at least need some direction on how to approach
> this and figure out the concerns that exist. Frankly, my initial idea was
> just to remove the buggy patches since this caused a regression in
> performance and system stability, but I guess there were power savings
> concerns.

Guessing is not a valid engineering approach, as you might know already.

It's not rocket science to validate whether these power saving concerns
still apply and to reach out to people who have been involved in this and
ask them to revalidate. I just Cc'ed Arjan for you.

> How can we address this issue in a better way, then?

By analysing the consequences of flipping the default for skew_tick to on,
which can be evaluated upfront trivially without a single line of code
change by adding 'skew_tick=1' to the kernel command line, running tests
and asking others to help evaluate.

There is only a limited range of scenarios which need to be looked at:

  - Big servers and the power saving issues on lightly loaded machines

  - Battery operated devices

  - Virtualization (guests)

That might not cover 100% of the use cases, but should be a good enough
coverage to base an informed decision on.

> The kernel should not come up all wobbly and cause power spikes every
> tick.

The kernel should not do a lot of things, but does them due to historical
decisions which turn out to be suboptimal when technology advances. The
power spike problem simply did not exist 18 years ago, at least not to the
extent that it mattered or caused concerns.

If we could have predicted the future and the consequences of ad hoc
decisions, we wouldn't have had a BKL, which took only 20 years of effort
to get rid of (except for the well hidden leftovers in tty).

But what we learned from the past is to avoid hacky ad hoc workarounds,
which are guaranteed to just make the situation worse.

Thanks,

        tglx
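
For completeness, the upfront evaluation suggested above needs no code
change: skew_tick= is an existing boot parameter, so a test run only needs
the option added to the kernel command line, for example via GRUB (file
paths and the regeneration command are distribution dependent):

	# /etc/default/grub -- append to the existing command line, then
	# regenerate the bootloader config (e.g. update-grub) and reboot
	GRUB_CMDLINE_LINUX="... skew_tick=1"

Comparing idle power and wakeup counts (e.g. with turbostat or powertop)
with and without the option, across the scenarios listed above, would
provide the data such a decision needs.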
On Thu, 3 Jul 2025, Thomas Gleixner wrote:

> >> So while you preserved the behaviour of the command line option in the
> >> most obscure way, you did not even make an attempt to explain why this
> >> change does not bring back the issues which caused the removal in commit
> >> af5ab277ded0 or why they are irrelevant today.
> >
> > As pointed out in the patch description: the synchronized tick (aside from
> > the jitter) also causes power spikes on large core systems which can cause
> > system instabilities.
>
> That's a _NEW_ problem and has nothing to do with the power saving
> concerns which led to af5ab277ded0.

The contemporary "advanced on chip power savings" really bite in this
scenario. ;-) It was rather surprising to see what can happen.

> It's not rocket science to validate whether these power saving concerns
> still apply and to reach out to people who have been involved in this
> and ask them to revalidate. I just Cc'ed Arjan for you.

They definitely apply on an Android phone with fewer cores. There you
would want to reduce the number of wakeups as much as possible to conserve
power, so it needs synchronized mode. That is why my initial thought was
to make it dependent on the number of active processors.

> There is only a limited range of scenarios which need to be looked at:
>
>   - Big servers and the power saving issues on lightly loaded machines

If it is only a few active cores and the system is basically idle then it
is better to have a synchronized tick, but if the system has lots of
active processors then the tick should be skewed.

So maybe one idea would be to have a counter of active ticks and skew them
if that number gets too high.

>   - Battery operated devices

These usually have 1-4 cores. So synchronized is obviously the best.

>   - Virtualization (guests)

I think there is work to do here to sync the ticks between host and guest
for further power savings.

> That might not cover 100% of the use cases, but should be a good enough
> coverage to base an informed decision on.

Yeah, let's see what others say on the matter.

> If we could have predicted the future and the consequences of ad hoc
> decisions, we wouldn't have had a BKL, which took only 20 years of
> effort to get rid of (except for the well hidden leftovers in tty).

Oh, the BKL was good. Synchronization was much faster after all and less
complex. I am sure a BKL approach on small systems would still improve
performance.
On Thu, Jul 03 2025 at 07:51, Christoph Lameter wrote:
> On Thu, 3 Jul 2025, Thomas Gleixner wrote:
>> It's not rocket science to validate whether these power saving concerns
>> still apply and to reach out to people who have been involved in this
>> and ask them to revalidate. I just Cc'ed Arjan for you.
>
> They definitely apply on an Android phone with fewer cores. There you
> would want to reduce the number of wakeups as much as possible to conserve
> power, so it needs synchronized mode.

That's kinda obvious, but with the new timer migration model, which stops
placing timers by crystal-ball logic, this might no longer be true and
needs actual data to back up that claim.

> That is why my initial thought was to make it dependent on the number of
> active processors.
>
>> There is only a limited range of scenarios which need to be looked at:
>>
>>   - Big servers and the power saving issues on lightly loaded machines
>
> If it is only a few active cores and the system is basically idle then it
> is better to have a synchronized tick, but if the system has lots of
> active processors then the tick should be skewed.

I agree with the latter, but is your 'few active cores' claim backed by
actual data taken from a current kernel, or based on historical evidence
and hearsay?

> So maybe one idea would be to have a counter of active ticks and skew them
> if that number gets too high.

The idea itself is not that horrible. Though we should tap into the
existing accounting resources to figure that out instead of adding yet
another ill-defined global counter to the mess. All the required metrics
should be there already.

Actually it should be solvable if you look at it just from a per-CPU
perspective. This assumes that NOHZ_IDLE is active, because if it is not
then you can just go and skew unconditionally.

If a CPU is busy, then it just arms the tick skewed. If it goes idle, then
it looks at the expected idle time, which is what NOHZ does already today.
If it decides to stop the tick until the next timer list expires, then it
aligns it.

Earlier expiring high resolution timers obviously override the initial
decision, but that's not much different from what is happening today
already.

>>   - Battery operated devices
>
> These usually have 1-4 cores. So synchronized is obviously the best.

Same question as above.

>> If we could have predicted the future and the consequences of ad hoc
>> decisions, we wouldn't have had a BKL, which took only 20 years of
>> effort to get rid of (except for the well hidden leftovers in tty).
>
> Oh, the BKL was good. Synchronization was much faster after all and less
> complex. I am sure a BKL approach on small systems would still improve
> performance.

Feel free to scale back to 4 cores and enjoy the undefined BKL semantics
forever in your own fork of 2.2.final :)

Thanks,

        tglx
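
To make the per-CPU idea above concrete, here is a rough sketch of the
decision it describes; this is not a working patch, the helper name and
hook point are hypothetical, and the interaction with the real NOHZ idle
path is glossed over:

	/*
	 * Illustrative only: busy CPUs keep a skewed tick to spread the
	 * scheduler-tick work, while a CPU whose tick is stopped for the
	 * next timer-list expiry aligns to the global period so lightly
	 * loaded machines keep batching wakeups.
	 */
	static u64 tick_skew_offset_ns(void)
	{
		u64 offset = TICK_NSEC >> 1;

		if (tick_nohz_tick_stopped())	/* idle, tick already stopped */
			return 0;

		do_div(offset, num_possible_cpus());
		return offset * smp_processor_id();
	}

Earlier-expiring high resolution timers would still override whatever this
picks, as noted above.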