From: Feng Tang <feng.tang@intel.com>
Bugs have been reported on 8-socket x86 machines in which the TSC was
wrongly disabled when the system was under a heavy workload:
[ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
[ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
[ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
[ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
[ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
[ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
[ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
[ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
[ 821.067990] clocksource: Switched to clocksource hpet
This can be reproduced by running memory-intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.
The reason for these issues is that when the system is under heavy load,
the read latency of the clocksources can be very high. Even lightweight
TSC reads can show high latencies, and latencies are much worse for
external clocksources such as HPET or the APIC PM timer. These latencies
can result in false-positive clocksource-unstable determinations.
Given that the clocksource watchdog is a continual diagnostic check that
runs twice a second, there is no need to rush it when the system is under
heavy load. Therefore, when high clocksource read latencies are detected,
suspend the watchdog timer for 5 minutes.
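To illustrate the idea, here is a simplified user-space sketch, not the
kernel's cs_watchdog_read(): the 50us threshold is an arbitrary stand-in
for WATCHDOG_MAX_SKEW, and CLOCK_MONOTONIC plus the TSC stand in for the
real watchdog/clocksource pair. The point is only that the clocksource
read is bracketed by two reads of the reference clock, and that a large
read-back delay of the bracket itself means the measurement was disturbed,
so the skew comparison is skipped rather than blaming the clocksource.

	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>
	#include <x86intrin.h>

	/* Arbitrary illustrative threshold, standing in for WATCHDOG_MAX_SKEW. */
	#define MAX_READBACK_DELAY_NS	50000ULL

	static uint64_t ref_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
	}

	int main(void)
	{
		/* Bracket the "clocksource" read with two reference-clock reads. */
		uint64_t wd_start = ref_ns();
		uint64_t tsc = __rdtsc();
		uint64_t wd_end = ref_ns();
		uint64_t wd_delay = wd_end - wd_start;

		if (wd_delay > MAX_READBACK_DELAY_NS) {
			/*
			 * The bracketing reads themselves took too long, so any
			 * apparent skew would mostly be read latency: skip the check.
			 */
			printf("wd-wd read-back delay of %lluns, clock-skew test skipped\n",
			       (unsigned long long)wd_delay);
			return 0;
		}

		printf("wd-wd read-back delay of %lluns (tsc=%llu), skew check can proceed\n",
		       (unsigned long long)wd_delay, (unsigned long long)tsc);
		return 0;
	}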
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
1 file changed, 32 insertions(+), 13 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index fc486cd972635..91836b727cef5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -387,6 +387,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
}
EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
+static inline void clocksource_reset_watchdog(void)
+{
+ struct clocksource *cs;
+
+ list_for_each_entry(cs, &watchdog_list, wd_list)
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
static void clocksource_watchdog(struct timer_list *unused)
{
u64 csnow, wdnow, cslast, wdlast, delta;
@@ -394,6 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
enum wd_read_status read_ret;
+ unsigned long extra_wait = 0;
u32 md;
spin_lock(&watchdog_lock);
@@ -413,13 +423,30 @@ static void clocksource_watchdog(struct timer_list *unused)
read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
- if (read_ret != WD_READ_SUCCESS) {
- if (read_ret == WD_READ_UNSTABLE)
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
+ if (read_ret == WD_READ_UNSTABLE) {
+ /* Clock readout unreliable, so give it up. */
+ __clocksource_unstable(cs);
continue;
}
+ /*
+ * When WD_READ_SKIP is returned, it means the system is likely
+ * under very heavy load, where the latency of reading the
+ * watchdog/clocksource is very high and affects the accuracy of
+ * the watchdog check. So give the system some space and suspend
+ * the watchdog check for 5 minutes.
+ */
+ if (read_ret == WD_READ_SKIP) {
+ /*
+ * As the watchdog timer will be suspended and
+ * cs->last could remain unchanged for 5 minutes, reset
+ * the counters.
+ */
+ clocksource_reset_watchdog();
+ extra_wait = HZ * 300;
+ break;
+ }
+
/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
atomic_read(&watchdog_reset_pending)) {
@@ -523,7 +550,7 @@ static void clocksource_watchdog(struct timer_list *unused)
* pair clocksource_stop_watchdog() clocksource_start_watchdog().
*/
if (!timer_pending(&watchdog_timer)) {
- watchdog_timer.expires += WATCHDOG_INTERVAL;
+ watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
add_timer_on(&watchdog_timer, next_cpu);
}
out:
@@ -548,14 +575,6 @@ static inline void clocksource_stop_watchdog(void)
watchdog_running = 0;
}
-static inline void clocksource_reset_watchdog(void)
-{
- struct clocksource *cs;
-
- list_for_each_entry(cs, &watchdog_list, wd_list)
- cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
-
static void clocksource_resume_watchdog(void)
{
atomic_inc(&watchdog_reset_pending);
--
2.31.1.189.g2e36527f23
On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> This can be reproduced by running memory intensive 'stream' tests,
> or some of the stress-ng subcases such as 'ioport'.
>
> The reason for these issues is the when system is under heavy load, the
> read latency of the clocksources can be very high. Even lightweight TSC
> reads can show high latencies, and latencies are much worse for external
> clocksources such as HPET or the APIC PM timer. These latencies can
> result in false-positive clocksource-unstable determinations.
>
> Given that the clocksource watchdog is a continual diagnostic check with
> frequency of twice a second, there is no need to rush it when the system
> is under heavy load. Therefore, when high clocksource read latencies
> are detected, suspend the watchdog timer for 5 minutes.

We should have enough heuristics in place by now to qualify the output of
the clocksource watchdog as a random number generator, right?
On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is the when system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the APIC PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

Glad to see that you are still keeping up your style, Thomas!  ;-)

We really do see the occasional clocksource skew in our fleet, and
sometimes it really is the TSC that is in disagreement with atomic-clock
time.  And the watchdog does detect these, for example, the 40,000
parts-per-million case discussed recently.

We therefore need a way to check this, but without producing false
positives on busy systems and without the current kneejerk reflex of
disabling TSC, thus rendering the system useless from a performance
standpoint for some important workloads.

Yes, if a system was 100% busy forever, this patch would suppress these
checks.  But 100% busy forever is not the common case, due to thermal
throttling and to security updates if nothing else.

With all that said, is there a better way to get the desired effects of
this patch?

							Thanx, Paul
On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> Yes, if a system was 100% busy forever, this patch would suppress these
> checks.  But 100% busy forever is not the common case, due to thermal
> throttling and to security updates if nothing else.
>
> With all that said, is there a better way to get the desired effects of
> this patch?

Sane hardware?
On Wed, Jan 11, 2023 at 10:19:50PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> > On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> > Yes, if a system was 100% busy forever, this patch would suppress these
> > checks.  But 100% busy forever is not the common case, due to thermal
> > throttling and to security updates if nothing else.
> >
> > With all that said, is there a better way to get the desired effects of
> > this patch?
>
> Sane hardware?

I must let Feng talk to his systems, but most of the systems I saw were
production systems.  A few were engineering samples, from which some
insanity might be expected behavior.

Clearly, something about the hardware or firmware was insane in order
to get this result, but that is what diagnostics are for, even on
engineering samples.

							Thanx, Paul
On Wed, Jan 11, 2023 at 01:32:10PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 11, 2023 at 10:19:50PM +0100, Thomas Gleixner wrote:
> > On Wed, Jan 11 2023 at 09:50, Paul E. McKenney wrote:
> > > On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> > > Yes, if a system was 100% busy forever, this patch would suppress these
> > > checks.  But 100% busy forever is not the common case, due to thermal
> > > throttling and to security updates if nothing else.
> > >
> > > With all that said, is there a better way to get the desired effects of
> > > this patch?
> >
> > Sane hardware?
>
> I must let Feng talk to his systems, but most of the systems I saw were
> production systems.  A few were engineering samples, from which some
> insanity might be expected behavior.

I've tested with several generations of Xeon servers, and they all can
reproduce the issue under stress-ng load.  Those platforms were not
bought off the market :), but they have the latest stepping and firmware,
which makes them close to production systems.

The issue originally came from a customer, and there were engineers who
reproduced it on production systems (even from different vendors).

Thanks,
Feng

> Clearly, something about the hardware or firmware was insane in order
> to get this result, but that is what diagnostics are for, even on
> engineering samples.
>
> 							Thanx, Paul
On Wed, Jan 11, 2023 at 12:26:58PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 04 2023 at 17:07, Paul E. McKenney wrote:
> > This can be reproduced by running memory intensive 'stream' tests,
> > or some of the stress-ng subcases such as 'ioport'.
> >
> > The reason for these issues is the when system is under heavy load, the
> > read latency of the clocksources can be very high. Even lightweight TSC
> > reads can show high latencies, and latencies are much worse for external
> > clocksources such as HPET or the APIC PM timer. These latencies can
> > result in false-positive clocksource-unstable determinations.
> >
> > Given that the clocksource watchdog is a continual diagnostic check with
> > frequency of twice a second, there is no need to rush it when the system
> > is under heavy load. Therefore, when high clocksource read latencies
> > are detected, suspend the watchdog timer for 5 minutes.
>
> We should have enough heuristics in place by now to qualify the output of
> the clocksource watchdog as a random number generator, right?

The issue was found on an 8-socket machine (around 400 cores, 800 CPUs),
and it seems that with larger and larger CPU counts, the occasional
latency spikes when reading HPET or even TSC become very high, which does
affect the accuracy of the clocksource watchdog check.  And unfortunately,
we can't just disable the watchdog for these 8-socket machines.

We tried a debug patch which disables interrupts, does consecutive reads
with 'rdtsc', and checks the delta between these reads (while the system
is running some heavy load).  Sometimes we can see up to 300 microseconds
of delta on a 2-socket Icelake machine.  We see similar results if we
replace the 'rdtsc' with 'rdtscp' or 'lfence;rdtsc'.  And I was told the
max latency is much higher on the 8-socket machine.

Thanks,
Feng
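For anyone who wants to reproduce a rough version of that measurement,
the sketch below is only a user-space approximation of the debug patch
described above: interrupts cannot be disabled outside the kernel, so
the observed deltas also include preemption and interrupt noise, and the
iteration count is arbitrary.  It records the largest gap between two
consecutive 'rdtsc' reads while the machine is otherwise loaded.

	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>

	int main(void)
	{
		uint64_t prev, now, delta, max_delta = 0;
		long i;

		prev = __rdtsc();
		for (i = 0; i < 100000000L; i++) {
			now = __rdtsc();
			delta = now - prev;
			if (delta > max_delta)
				max_delta = delta;
			prev = now;
		}

		/*
		 * Maximum gap between two consecutive TSC reads, in TSC cycles;
		 * divide by the TSC frequency to convert to time.  Under heavy
		 * load, occasional large gaps appear even though each individual
		 * read is cheap.
		 */
		printf("max consecutive-rdtsc delta: %llu cycles\n",
		       (unsigned long long)max_delta);
		return 0;
	}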