[PATCH RESEND] Prevent unexpected TSC to HPET clocksource fallback on many-socket systems
Posted by Daniel J Blueman 6 months, 2 weeks ago
On systems with many sockets, kernel timekeeping may quietly fall back from
using the inexpensive core-level TSCs to the expensive legacy socket HPET,
notably impacting application performance until the system is rebooted.
This may be triggered by adverse workloads generating considerable
coherency or processor mesh congestion.

This manifests in the kernel log as:
 clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
 clocksource:                       'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
 clocksource:                       'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
 clocksource:                       Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
 clocksource:                       'tsc' is current clocksource.
 tsc: Marking TSC unstable due to clocksource watchdog
 TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
 clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
 clocksource: Switched to clocksource hpet

Scale the default timekeeping watchdog uncertainty margin by the log2 of
the number of online NUMA nodes; this allows a more appropriate margin
from embedded systems to many-socket systems.
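
For illustration only (not part of the patch): a minimal user-space sketch,
assuming the 50us Kconfig default and the max(ilog2(nr_online_nodes), 1)
factor introduced below, showing the effective skew budget at different
node counts:

 /* Illustrative sketch: mirrors the patch's max(ilog2(n), 1) scaling */
 #include <stdio.h>

 static unsigned int ilog2_u(unsigned int v)	/* floor(log2(v)), v > 0 */
 {
 	unsigned int r = 0;

 	while (v >>= 1)
 		r++;
 	return r;
 }

 int main(void)
 {
 	const unsigned int base_us = 50;	/* CLOCKSOURCE_WATCHDOG_MAX_SKEW_US */
 	unsigned int nodes;

 	for (nodes = 1; nodes <= 16; nodes *= 2) {
 		unsigned int scale = ilog2_u(nodes) ? ilog2_u(nodes) : 1;

 		printf("%2u nodes -> %3u us budget\n", nodes, base_us * scale);
 	}
 	return 0;	/* 50, 50, 100, 150, 200 us for 1, 2, 4, 8, 16 nodes */
 }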

This fix successfully prevents HPET fallback on Eviden 12 socket/1440
thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
Numascale XNC node controllers.

Reviewed-by: Scott Hamilton <scott.hamilton@eviden.com>
Signed-off-by: Daniel J Blueman <daniel@quora.org>
---
 kernel/time/Kconfig       | 8 +++++---
 kernel/time/clocksource.c | 9 ++++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6..48dd517bc0b3 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -200,10 +200,12 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
 	int "Clocksource watchdog maximum allowable skew (in microseconds)"
 	depends on CLOCKSOURCE_WATCHDOG
 	range 50 1000
-	default 125
+	default 50
 	help
-	  Specify the maximum amount of allowable watchdog skew in
-	  microseconds before reporting the clocksource to be unstable.
+	  Specify the maximum allowable watchdog skew in microseconds, scaled
+	  by the log2 of the number of online NUMA nodes to track system
+	  latency, before reporting the clocksource to be unstable.
+
 	  The default is based on a half-second clocksource watchdog
 	  interval and NTP's maximum frequency drift of 500 parts
 	  per million.	If the clocksource is good enough for NTP,
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index bb48498ebb5a..43e2e9cc086a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -10,7 +10,9 @@
 #include <linux/device.h>
 #include <linux/clocksource.h>
 #include <linux/init.h>
+#include <linux/log2.h>
 #include <linux/module.h>
+#include <linux/nodemask.h>
 #include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
 #include <linux/tick.h>
 #include <linux/kthread.h>
@@ -133,9 +135,12 @@ static u64 suspend_start;
  * under test is not permitted to go below the 500ppm minimum defined
  * by MAX_SKEW_USEC.  This 500ppm minimum may be overridden using the
  * CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
+ *
+ * If overridden, linearly scale this value by the log2 of the number of
+ * online NUMA nodes for a reasonable upper bound on system latency.
  */
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC	CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC	(CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US * max(ilog2(nr_online_nodes), 1))
 #else
 #define MAX_SKEW_USEC	(125 * WATCHDOG_INTERVAL / HZ)
 #endif
@@ -1195,6 +1200,8 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
 	 * comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
 	 */
 	if (scale && freq && !cs->uncertainty_margin) {
+		pr_info("Using clocksource watchdog maximum skew of %uus\n", MAX_SKEW_USEC);
+
 		cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
 		if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
 			cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
-- 
2.48.1
Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource fallback on many-socket systems
Posted by John Stultz 6 months, 2 weeks ago
On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@quora.org> wrote:
>
> On systems with many sockets, kernel timekeeping may quietly fall back from
> using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> notably impacting application performance until the system is rebooted.
> This may be triggered by adverse workloads generating considerable
> coherency or processor mesh congestion.
>
> This manifests in the kernel log as:
>  clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
>  clocksource:                       'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
>  clocksource:                       'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
>  clocksource:                       Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
>  clocksource:                       'tsc' is current clocksource.
>  tsc: Marking TSC unstable due to clocksource watchdog
>  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>  sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
>  clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
>  clocksource: Switched to clocksource hpet
>
> Scale the default timekeeping watchdog uncertainty margin by the log2 of
> the number of online NUMA nodes; this allows a more appropriate margin
> from embedded systems to many-socket systems.

So, missing context from the commit message:
* Why is it "appropriate" for the TSC and HPET to be further out of
sync on numa machines?
* Why is log2(numa nodes) the right metric to scale by?

> This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> Numascale XNC node controllers.

I recognize improperly falling back to HPET is costly and unwanted,
but given the history of bad TSCs, why is this loosening of the sanity
checks actually safe?

The skew you've highlighted above looks to be > 800ppm (436888 ns over
the 503029760 ns interval is roughly 868 ppm), which is well beyond
what NTP can correct for, so it might be good to better explain
why this skew is happening (you mention congestion, so is the skew
consistent, or short term due to read latencies? if so would trying
again or changing how we sample be more appropriate than just growing
the acceptable skew window?).

These sorts of checks were important before as NUMA systems might have
separate crystals on different nodes, so the TSCs (and HPETs) could
drift relative to each other, and ignoring such a problem could result
in visible TSC inconsistencies.  So I just want to make sure this
isn't solving an issue for you but opening a problem for someone else.

thanks
-john
Re: [PATCH RESEND] Prevent unexpected TSC to HPET clocksource fallback on many-socket systems
Posted by Daniel J Blueman 6 months, 2 weeks ago
On Tue, 3 Jun 2025 at 09:35, John Stultz <jstultz@google.com> wrote:
>
> On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@quora.org> wrote:
> >
> > On systems with many sockets, kernel timekeeping may quietly fall back from
> > using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> > notably impacting application performance until the system is rebooted.
> > This may be triggered by adverse workloads generating considerable
> > coherency or processor mesh congestion.
> >
> > This manifests in the kernel log as:
> >  clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
> >  clocksource:                       'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
> >  clocksource:                       'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
> >  clocksource:                       Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
> >  clocksource:                       'tsc' is current clocksource.
> >  tsc: Marking TSC unstable due to clocksource watchdog
> >  TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> >  sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
> >  clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
> >  clocksource: Switched to clocksource hpet
> >
> > Scale the default timekeeping watchdog uncertainty margin by the log2 of
> > the number of online NUMA nodes; this allows a more appropriate margin
> > from embedded systems to many-socket systems.
>
> So, missing context from the commit message:
> * Why is it "appropriate" for the TSC and HPET to be further out of
> sync on numa machines?

I absolutely agree TSC skew is inappropriate. The TSCs are in sync
here using the same low-jitter base clock across all modules, meaning
this is an observability problem.

> * Why is log2(numa nodes) the right metric to scale by?

This is the simplest strategy I could find to model the latency arising
from congestion in the underlying cache coherency mesh, and it fits
previous and future processor architectures well.
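
For illustration (numbers derived from the patch, not separate
measurements): with the 50us base and max(ilog2(n), 1) scaling, 1-2 node
systems keep a 50us budget, 4 nodes get 100us, the 12-socket SH120
(ilog2(12) = 3) gets 150us, and the 16-socket SH160 (ilog2(16) = 4)
gets 200us.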

> > This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> > thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> > Numascale XNC node controllers.
>
> I recognize improperly falling back to HPET is costly and unwanted,
> but given the history of bad TSCs, why is this loosening of the sanity
> checks actually safe?

The current approach fails on large systems; as a result, the
interconnect market leaders behind these 12-16 socket systems require
users to boot with "tsc=nowatchdog".

Since this change introduces scaling, it conservatively tightens the
margin for 1-2 NUMA node systems, where these values have historically
been appropriate.

> The skew you've highlighted above looks to be > 800ppm, which is well
> beyond what NTP can correct for, so it might be good to better explain
> why this skew is happening (you mention congestion, so is the skew
> consistent, or short term due to read latencies? if so would trying
> again or changing how we sample be more appropriate than just growing
> the acceptable skew window?).

For the workloads I instrumented, the read latencies aren't
consistently high enough to trip the HPET fallback if further retries
were attempted, so characterising the read latencies as 'bursty' seems
reasonable.

Ultimately, this reflects complex dependency patterns in inter- and
intra-socket coherency queuing, so there is a somewhat higher baseline
latency.

> These sorts of checks were important before as NUMA systems might have
> separate crystals on different nodes, so the TSCs (and HPETs) could
> drift relative to each other, and ignoring such a problem could result
> in visible TSC inconsistencies.  So I just want to make sure this
> isn't solving an issue for you but opening a problem for someone else.

Yes, early cache-coherent interconnects didn't have an inter-module
shared base clock. The hierarchical software clocksource mechanism I
developed closed the gap to near-TSC performance, though at higher
jitter of course.

Definitely agreed that we want to detect systematic TSC skew; I am
happy to prepare an alternative approach if preferred.

Many thanks for the discussion on this John,
  Dan
-- 
Daniel J Blueman