On systems with many sockets, kernel timekeeping may quietly fall back from
using the inexpensive core-level TSCs to the expensive legacy socket HPET,
notably impacting application performance until the system is rebooted.
This may be triggered by adverse workloads that generate considerable
cache coherency or processor mesh congestion.
This manifests in the kernel log as:
clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
clocksource: 'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
clocksource: Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
clocksource: 'tsc' is current clocksource.
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
clocksource: Switched to clocksource hpet
Scale the default timekeeping watchdog uncertainty margin by the log2 of
the number of online NUMA nodes; this allows a more appropriate margin
across the range from embedded systems to many-socket systems.
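
As a rough illustration only (a hypothetical userspace sketch, not kernel
code; ilog2_u() approximates the kernel's ilog2(), which rounds down), the
effective maximum skew with the new 50us default works out as follows:

    /* Model the proposed scaling: base skew * max(ilog2(online NUMA nodes), 1). */
    #include <stdio.h>

    static unsigned int ilog2_u(unsigned int n)
    {
            unsigned int l = 0;

            while (n >>= 1)
                    l++;
            return l;
    }

    int main(void)
    {
            const unsigned int base_us = 50; /* CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US */
            const unsigned int nodes[] = { 1, 2, 4, 8, 12, 16 };

            for (unsigned int i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
                    unsigned int scale = ilog2_u(nodes[i]) ? ilog2_u(nodes[i]) : 1;

                    printf("%2u nodes -> %u us\n", nodes[i], base_us * scale);
            }
            return 0;
    }

This prints 50us for 1-2 nodes, 100us for 4-7 nodes, 150us for 8-15 nodes
and 200us for 16 nodes, so small systems keep a tight margin while
many-socket systems gain headroom.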
This fix successfully prevents HPET fallback on Eviden 12 socket/1440
thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
Numascale XNC node controllers.
Reviewed-by: Scott Hamilton <scott.hamilton@eviden.com>
Signed-off-by: Daniel J Blueman <daniel@quora.org>
---
kernel/time/Kconfig | 8 +++++---
kernel/time/clocksource.c | 9 ++++++++-
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index b0b97a60aaa6..48dd517bc0b3 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -200,10 +200,12 @@ config CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
int "Clocksource watchdog maximum allowable skew (in microseconds)"
depends on CLOCKSOURCE_WATCHDOG
range 50 1000
- default 125
+ default 50
help
- Specify the maximum amount of allowable watchdog skew in
- microseconds before reporting the clocksource to be unstable.
+ Specify the maximum allowable watchdog skew in microseconds, scaled
+ by the log2 of the number of online NUMA nodes to track system
+ latency, before reporting the clocksource to be unstable.
+
The default is based on a half-second clocksource watchdog
interval and NTP's maximum frequency drift of 500 parts
per million. If the clocksource is good enough for NTP,
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index bb48498ebb5a..43e2e9cc086a 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -10,7 +10,9 @@
#include <linux/device.h>
#include <linux/clocksource.h>
#include <linux/init.h>
+#include <linux/log2.h>
#include <linux/module.h>
+#include <linux/nodemask.h>
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
@@ -133,9 +135,12 @@ static u64 suspend_start;
* under test is not permitted to go below the 500ppm minimum defined
* by MAX_SKEW_USEC. This 500ppm minimum may be overridden using the
* CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option.
+ *
+ * If overridden, linearly scale this value by the log2 of the number of
+ * online NUMA nodes for a reasonable upper bound on system latency.
*/
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
-#define MAX_SKEW_USEC CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US
+#define MAX_SKEW_USEC (CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US * max(ilog2(nr_online_nodes), 1))
#else
#define MAX_SKEW_USEC (125 * WATCHDOG_INTERVAL / HZ)
#endif
@@ -1195,6 +1200,8 @@ void __clocksource_update_freq_scale(struct clocksource *cs, u32 scale, u32 freq
* comment preceding CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US above.
*/
if (scale && freq && !cs->uncertainty_margin) {
+ pr_info("Using clocksource watchdog maximum skew of %uus\n", MAX_SKEW_USEC);
+
cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
--
2.48.1
On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@quora.org> wrote:
>
> On systems with many sockets, kernel timekeeping may quietly fall back from
> using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> notably impacting application performance until the system is rebooted.
> This may be triggered by adverse workloads that generate considerable
> cache coherency or processor mesh congestion.
>
> This manifests in the kernel log as:
> clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
> clocksource: Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
> clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
> clocksource: Switched to clocksource hpet
>
> Scale the default timekeeping watchdog uncertainty margin by the log2 of
> the number of online NUMA nodes; this allows a more appropriate margin
> across the range from embedded systems to many-socket systems.

So, missing context from the commit message:
* Why is it "appropriate" for the TSC and HPET to be further out of
  sync on NUMA machines?
* Why is log2(NUMA nodes) the right metric to scale by?

> This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> Numascale XNC node controllers.

I recognize improperly falling back to HPET is costly and unwanted, but
given the history of bad TSCs, why is this loosening of the sanity checks
actually safe?

The skew you've highlighted above looks to be > 800ppm, which is well
beyond what NTP can correct for, so it might be good to better explain why
this skew is happening (you mention congestion, so is the skew consistent,
or short term due to read latencies? If so, would trying again or changing
how we sample be more appropriate than just growing the acceptable skew
window?).

These sorts of checks were important before, as NUMA systems might have
separate crystals on different nodes, so the TSCs (and HPETs) could drift
relative to each other, and ignoring such a problem could result in
visible TSC inconsistencies. So I just want to make sure this isn't
solving an issue for you but opening a problem for someone else.

thanks
-john
On Tue, 3 Jun 2025 at 09:35, John Stultz <jstultz@google.com> wrote:
>
> On Mon, Jun 2, 2025 at 3:34 PM Daniel J Blueman <daniel@quora.org> wrote:
> >
> > On systems with many sockets, kernel timekeeping may quietly fall back from
> > using the inexpensive core-level TSCs to the expensive legacy socket HPET,
> > notably impacting application performance until the system is rebooted.
> > This may be triggered by adverse workloads that generate considerable
> > cache coherency or processor mesh congestion.
> >
> > This manifests in the kernel log as:
> > clocksource: timekeeping watchdog on CPU1750: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource: 'hpet' wd_nsec: 503029760 wd_now: 48a38f74 wd_last: 47e3ab74 mask: ffffffff
> > clocksource: 'tsc' cs_nsec: 503466648 cs_now: 3224653e7bd cs_last: 3220d4f8d57 mask: ffffffffffffffff
> > clocksource: Clocksource 'tsc' skewed 436888 ns (0 ms) over watchdog 'hpet' interval of 503029760 ns (503 ms)
> > clocksource: 'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (882011139159, 1572951254)<-(913395032446, -29810979023)
> > clocksource: Checking clocksource tsc synchronization from CPU 1800 to CPUs 0,187,336,434,495,644,1719,1792.
> > clocksource: Switched to clocksource hpet
> >
> > Scale the default timekeeping watchdog uncertainty margin by the log2 of
> > the number of online NUMA nodes; this allows a more appropriate margin
> > across the range from embedded systems to many-socket systems.
>
> So, missing context from the commit message:
> * Why is it "appropriate" for the TSC and HPET to be further out of
>   sync on NUMA machines?

I absolutely agree TSC skew is inappropriate. The TSCs are in sync here,
using the same low-jitter base clock across all modules, meaning this is
an observability problem.

> * Why is log2(NUMA nodes) the right metric to scale by?

This is the simplest strategy I could determine to model latency from the
underlying cache coherency mesh congestion, and it fits well with previous
and future processor architectures.

> > This fix successfully prevents HPET fallback on Eviden 12 socket/1440
> > thread SH120 and 16 socket/1920 thread SH160 Intel SPR systems with
> > Numascale XNC node controllers.
>
> I recognize improperly falling back to HPET is costly and unwanted, but
> given the history of bad TSCs, why is this loosening of the sanity checks
> actually safe?

The current approach fails on large systems, so the interconnect market
leaders for these 12-16 socket systems require users to boot with
"tsc=nowatchdog". Since this change introduces scaling, it conservatively
tightens the margin for 1-2 NUMA node systems; those values have been
historically appropriate.

> The skew you've highlighted above looks to be > 800ppm, which is well
> beyond what NTP can correct for, so it might be good to better explain why
> this skew is happening (you mention congestion, so is the skew consistent,
> or short term due to read latencies? If so, would trying again or changing
> how we sample be more appropriate than just growing the acceptable skew
> window?).

For the workloads I instrumented, the read latencies aren't consistently
high enough to trip HPET fallback if there were further retrying, so
characterising the read latencies as 'bursty' might be reasonable.
Ultimately, this reflects complex dependency patterns in inter- and
intra-socket coherency queuing, so there is some higher baseline latency.

> These sorts of checks were important before, as NUMA systems might have
> separate crystals on different nodes, so the TSCs (and HPETs) could drift
> relative to each other, and ignoring such a problem could result in
> visible TSC inconsistencies. So I just want to make sure this isn't
> solving an issue for you but opening a problem for someone else.

Yes, we didn't have an inter-module shared base clock in early cache
coherent interconnects. The hierarchical software clocksource mechanism I
developed closed the gap to near-TSC performance, though at higher jitter
of course.

Definitely agreed that we want to detect systematic TSC skew; I am happy
to prepare an alternative approach if preferred.

Many thanks for the discussion on this, John.

Dan
--
Daniel J Blueman