The "valid" readout delay between the two reads of the watchdog is larger
than the valid delta between the resulting watchdog and clocksource
intervals, which results in false positive watchdog results.
Assume TSC is the clocksource and HPET is the watchdog, and both have an
uncertainty margin of 250us (the default). The watchdog readout does:
1) wdnow = read(HPET);
2) csnow = read(TSC);
3) wdend = read(HPET);
The valid window for the delta between #1 and #3 is calculated by the
uncertainty margins of the watchdog and the clocksource:
m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
which results in 750us for the TSC/HPET case.
The actual interval comparison uses a smaller margin:
m = watchdog.uncertainty_margin + cs.uncertainty_margin;
which results in 500us for the TSC/HPET case.
That means the following scenario will trigger the watchdog:
Watchdog cycle N:
1) wdnow[N] = read(HPET);
2) csnow[N] = read(TSC);
3) wdend[N] = read(HPET);
Assume the delay between #1 and #2 is 100us and the delay between #1 and
#3 is within the 750us margin, i.e. the readout is considered valid.
Watchdog cycle N + 1:
4) wdnow[N + 1] = read(HPET);
5) csnow[N + 1] = read(TSC);
6) wdend[N + 1] = read(HPET);
If the delay between #4 and #6 is within the 750us margin then any delay
between #4 and #5 which is larger than 600us will fail the interval check
and mark the TSC unstable because the intervals are calculated against the
previous value:
wd_int = wdnow[N + 1] - wdnow[N];
cs_int = csnow[N + 1] - csnow[N];
Putting the above delays in place this results in:
cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
-> cs_int = wd_int + 510us;
which is obviously larger than the allowed 500us margin and results in
marking TSC unstable.
Fix this by using the same margin as the interval comparison. If the delay
between two watchdog reads is larger than that, then the readout was either
disturbed by interconnect congestion, NMIs or SMIs.
Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
---
kernel/time/clocksource.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -252,7 +252,7 @@ enum wd_read_status {
static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
- int64_t md = 2 * watchdog->uncertainty_margin;
+ int64_t md = watchdog->uncertainty_margin;
unsigned int nretries, max_retries;
int64_t wd_delay, wd_seq_delay;
u64 wd_end, wd_end2;
@@ -285,7 +285,7 @@ static enum wd_read_status cs_watchdog_r
* watchdog test.
*/
wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
- if (wd_seq_delay > md)
+ if (wd_seq_delay > 2 * md)
goto skip_test;
}
On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
> The "valid" readout delay between the two reads of the watchdog is larger
> than the valid delta between the resulting watchdog and clocksource
> intervals, which results in false positive watchdog results.
>
> Assume TSC is the clocksource and HPET is the watchdog, and both have an
> uncertainty margin of 250us (the default). The watchdog readout does:
>
> 1) wdnow = read(HPET);
> 2) csnow = read(TSC);
> 3) wdend = read(HPET);
>
> The valid window for the delta between #1 and #3 is calculated by the
> uncertainty margins of the watchdog and the clocksource:
>
> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 750us for the TSC/HPET case.
>
> The actual interval comparison uses a smaller margin:
>
> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 500us for the TSC/HPET case.
>
> That means the following scenario will trigger the watchdog:
>
> Watchdog cycle N:
>
> 1) wdnow[N] = read(HPET);
> 2) csnow[N] = read(TSC);
> 3) wdend[N] = read(HPET);
>
> Assume the delay between #1 and #2 is 100us and the delay between #1 and
> #3 is within the 750us margin, i.e. the readout is considered valid.
>
> Watchdog cycle N + 1:
>
> 4) wdnow[N + 1] = read(HPET);
> 5) csnow[N + 1] = read(TSC);
> 6) wdend[N + 1] = read(HPET);
>
> If the delay between #4 and #6 is within the 750us margin then any delay
> between #4 and #5 which is larger than 600us will fail the interval check
> and mark the TSC unstable because the intervals are calculated against the
> previous value:
>
> wd_int = wdnow[N + 1] - wdnow[N];
> cs_int = csnow[N + 1] - csnow[N];
>
> Putting the above delays in place this results in:
>
> cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
> -> cs_int = wd_int + 510us;
>
> which is obviously larger than the allowed 500us margin and results in
> marking TSC unstable.
>
> Fix this by using the same margin as the interval comparison. If the delay
> between two watchdog reads is larger than that, then the readout was either
> disturbed by interconnect congestion, NMIs or SMIs.
>
> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
> Reported-by: Daniel J Blueman <daniel@quora.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
OK, in case it still matters, you convinced me.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
I of course encourage others who provided testing and patches to also
test this. "Works for me!"
Thanx, Paul
> ---
> kernel/time/clocksource.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -252,7 +252,7 @@ enum wd_read_status {
>
> static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> {
> - int64_t md = 2 * watchdog->uncertainty_margin;
> + int64_t md = watchdog->uncertainty_margin;
> unsigned int nretries, max_retries;
> int64_t wd_delay, wd_seq_delay;
> u64 wd_end, wd_end2;
> @@ -285,7 +285,7 @@ static enum wd_read_status cs_watchdog_r
> * watchdog test.
> */
> wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
> - if (wd_seq_delay > md)
> + if (wd_seq_delay > 2 * md)
> goto skip_test;
> }
>
On 12/17/25 12:21 PM, Thomas Gleixner wrote:
> The "valid" readout delay between the two reads of the watchdog is larger
> than the valid delta between the resulting watchdog and clocksource
> intervals, which results in false positive watchdog results.
>
> Assume TSC is the clocksource and HPET is the watchdog, and both have an
> uncertainty margin of 250us (the default). The watchdog readout does:
>
> 1) wdnow = read(HPET);
> 2) csnow = read(TSC);
> 3) wdend = read(HPET);
>
> The valid window for the delta between #1 and #3 is calculated by the
> uncertainty margins of the watchdog and the clocksource:
>
> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 750us for the TSC/HPET case.
>
> The actual interval comparison uses a smaller margin:
>
> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 500us for the TSC/HPET case.
>
> That means the following scenario will trigger the watchdog:
>
> Watchdog cycle N:
>
> 1) wdnow[N] = read(HPET);
> 2) csnow[N] = read(TSC);
> 3) wdend[N] = read(HPET);
>
> Assume the delay between #1 and #2 is 100us and the delay between #1 and
> #3 is within the 750us margin, i.e. the readout is considered valid.
>
> Watchdog cycle N + 1:
>
> 4) wdnow[N + 1] = read(HPET);
> 5) csnow[N + 1] = read(TSC);
> 6) wdend[N + 1] = read(HPET);
>
> If the delay between #4 and #6 is within the 750us margin then any delay
> between #4 and #5 which is larger than 600us will fail the interval check
> and mark the TSC unstable because the intervals are calculated against the
> previous value:
>
> wd_int = wdnow[N + 1] - wdnow[N];
> cs_int = csnow[N + 1] - csnow[N];
>
> Putting the above delays in place this results in:
>
> cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
> -> cs_int = wd_int + 510us;
>
> which is obviously larger than the allowed 500us margin and results in
> marking TSC unstable.
>
> Fix this by using the same margin as the interval comparison. If the delay
> between two watchdog reads is larger than that, then the readout was either
> disturbed by interconnect congestion, NMIs or SMIs.
>
> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
> Reported-by: Daniel J Blueman <daniel@quora.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
> ---
> kernel/time/clocksource.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -252,7 +252,7 @@ enum wd_read_status {
>
> static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> {
> - int64_t md = 2 * watchdog->uncertainty_margin;
> + int64_t md = watchdog->uncertainty_margin;
> unsigned int nretries, max_retries;
> int64_t wd_delay, wd_seq_delay;
> u64 wd_end, wd_end2;
> @@ -285,7 +285,7 @@ static enum wd_read_status cs_watchdog_r
> * watchdog test.
> */
> wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
> - if (wd_seq_delay > md)
> + if (wd_seq_delay > 2 * md)
> goto skip_test;
> }
>
I believe the 2nd hunk isn't needed.
T1 = read(HPET);
T2 = read(TSC);
T3 = read(HPET);
T4 = read(HPET);
wd_delay = T3 - T1 <= md + cs->uncertainty_margin
wd_seq_delay = T4 - T3 > 2*md
wd_delay should be > wd_seq_delay. Here they are compared against roughly
the same threshold, assuming that cs has the same uncertainty margin as the
watchdog. Before commit 4ac1dd3245b9, the thresholds for wd_delay and
wd_seq_delay were WATCHDOG_MAX_SKEW and WATCHDOG_MAX_SKEW/2. So I
would suggest keeping the (wd_seq_delay > md) check.
Cheers, Longman
On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
> The "valid" readout delay between the two reads of the watchdog is larger
> than the valid delta between the resulting watchdog and clocksource
> intervals, which results in false positive watchdog results.
>
> Assume TSC is the clocksource and HPET is the watchdog, and both have an
> uncertainty margin of 250us (the default). The watchdog readout does:
>
> 1) wdnow = read(HPET);
> 2) csnow = read(TSC);
> 3) wdend = read(HPET);
4) wd_end2 = read(HPET);
>
> The valid window for the delta between #1 and #3 is calculated by the
> uncertainty margins of the watchdog and the clocksource:
>
> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 750us for the TSC/HPET case.
Yes, because this interval includes two watchdog reads (#1 and #3 above)
and one clocksource read (#2 above). We therefore need to allow two
watchdog uncertainties and one clocksource uncertainty.
> The actual interval comparison uses a smaller margin:
>
> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
>
> which results in 500us for the TSC/HPET case.
This is the (wd_seq_delay > md) comparison, right? If so, the reason
for this is because it is measuring only a pair of watchdog reads (#3
and #4). There is no clocksource read on the latency recheck, so we do
not include the cs->uncertainty_margin value, only the pair of watchdog
uncertainty values.
If this check fails, that indicates that the watchdog clocksource is much
slower than expected (for example, due to memory-system overload affecting
HPET on multicore systems), so we skip this measurement interval.
> That means the following scenario will trigger the watchdog:
>
> Watchdog cycle N:
>
> 1) wdnow[N] = read(HPET);
> 2) csnow[N] = read(TSC);
> 3) wdend[N] = read(HPET);
>
> Assume the delay between #1 and #2 is 100us and the delay between #1 and
> #3 is within the 750us margin, i.e. the readout is considered valid.
Yes. We expect at most 250us for #1, another 250us for #2, and yet
another 250us for #3.
> Watchdog cycle N + 1:
>
> 4) wdnow[N + 1] = read(HPET);
> 5) csnow[N + 1] = read(TSC);
> 6) wdend[N + 1] = read(HPET);
>
> If the delay between #4 and #6 is within the 750us margin then any delay
> between #4 and #5 which is larger than 600us will fail the interval check
> and mark the TSC unstable because the intervals are calculated against the
> previous value:
>
> wd_int = wdnow[N + 1] - wdnow[N];
> cs_int = csnow[N + 1] - csnow[N];
Except that getting 600us latency between #4 and #5 is not consistent
with a 250us uncertainty. If that is happening, the uncertainty should
instead be at least 300us.
> Putting the above delays in place this results in:
>
> cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
> -> cs_int = wd_int + 510us;
>
> which is obviously larger than the allowed 500us margin and results in
> marking TSC unstable.
Agreed, but due to the ->uncertainty_margin values being too small.
> Fix this by using the same margin as the interval comparison. If the delay
> between two watchdog reads is larger than that, then the readout was either
> disturbed by interconnect congestion, NMIs or SMIs.
>
> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
> Reported-by: Daniel J Blueman <daniel@quora.org>
If this is happening in real life, we have a couple of choices:
1. Increase the ->uncertainty_margin values to match the objective
universe.
2. In clocksource_watchdog(), replace "(abs(cs_nsec - wd_nsec) > md)"
with "(abs(cs_nsec - wd_nsec) > 2 * md)".
The rationale here is that the ->uncertainty_margin values are
two-tailed, as in the clocksource might report a value that is
->uncertainty_margin too early or ->uncertainty_margin too late. When I
was coding this, I instead assumed that ->uncertainty_margin
covered the full range, centered on the correct time value.
You would know better than would I.
My concern is that the patch below would force needless cs_watchdog_read()
retries.
Thoughts?
Thanx, Paul
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
> ---
> kernel/time/clocksource.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -252,7 +252,7 @@ enum wd_read_status {
>
> static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
> {
> - int64_t md = 2 * watchdog->uncertainty_margin;
> + int64_t md = watchdog->uncertainty_margin;
> unsigned int nretries, max_retries;
> int64_t wd_delay, wd_seq_delay;
> u64 wd_end, wd_end2;
> @@ -285,7 +285,7 @@ static enum wd_read_status cs_watchdog_r
> * watchdog test.
> */
> wd_seq_delay = cycles_to_nsec_safe(watchdog, wd_end, wd_end2);
> - if (wd_seq_delay > md)
> + if (wd_seq_delay > 2 * md)
> goto skip_test;
> }
>
On Wed, Dec 17 2025 at 16:48, Paul E. McKenney wrote:
> On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
>> The "valid" readout delay between the two reads of the watchdog is larger
>> than the valid delta between the resulting watchdog and clocksource
>> intervals, which results in false positive watchdog results.
>>
>> Assume TSC is the clocksource and HPET is the watchdog, and both have an
>> uncertainty margin of 250us (the default). The watchdog readout does:
>>
>> 1) wdnow = read(HPET);
>> 2) csnow = read(TSC);
>> 3) wdend = read(HPET);
> 4) wd_end2 = read(HPET);
That's completely irrelevant for the problem at hand.
>> The valid window for the delta between #1 and #3 is calculated by the
>> uncertainty margins of the watchdog and the clocksource:
>>
>> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
>>
>> which results in 750us for the TSC/HPET case.
>
> Yes, because this interval includes two watchdog reads (#1 and #3 above)
> and one clocksource read (#2 above). We therefore need to allow two
> watchdog uncertainties and one clocksource uncertainty.
That's a made-up and broken theory based on ill-defined heuristics.
The uncertainty of a clocksource is defined by:
1) The frequency it runs with, which affects the accuracy of the
time readout.
2) The access time
Your implementation defines that the uncertainty is at least 50us,
with a default of 125us and an upper limit of 1ms. Those are just made-up
numbers pulled out of thin air which have nothing to do with reality.
Declaring a TSC access time of up to 1ms and at least 50us is
hilarious.
Adding up these made-up margins and then doubling the watchdog margin to
validate the readout is just a hack to make it "work for me", thereby
breaking the whole machinery for existing problematic systems. See below.
>> The actual interval comparison uses a smaller margin:
>>
>> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
>>
>> which results in 500us for the TSC/HPET case.
>
> This is the (wd_seq_delay > md) comparison, right? If so, the reason
> for this is because it is measuring only a pair of watchdog reads (#3
> and #4). There is no clocksource read on the latency recheck, so we do
> not include the cs->uncertainty_margin value, only the pair of watchdog
> uncertainty values.
No. This is the check which does:
int64_t md = 2 * watchdog->uncertainty_margin;
...
*wdnow = watchdog->read(watchdog);
*csnow = cs->read(cs);
wd_end = watchdog->read(watchdog);
...
wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
if (wd_delay <= md + cs->uncertainty_margin) {
...
return WR_READ_SUCCESS;
}
It has nothing to do with the wd_seq_delay check.
> If this check fails, that indicates that the watchdog clocksource is much
> slower than expected (for example, due to memory-system overload affecting
> HPET on multicore systems), so we skip this measurement interval.
>
>> That means the following scenario will trigger the watchdog:
>>
>> Watchdog cycle N:
>>
>> 1) wdnow[N] = read(HPET);
>> 2) csnow[N] = read(TSC);
>> 3) wdend[N] = read(HPET);
>>
>> Assume the delay between #1 and #2 is 100us and the delay between #1 and
>> #3 is within the 750us margin, i.e. the readout is considered valid.
>
> Yes. We expect at most 250us for #1, another 250us for #2, and yet
> another 250us for #3.
>
>> Watchdog cycle N + 1:
>>
>> 4) wdnow[N + 1] = read(HPET);
>> 5) csnow[N + 1] = read(TSC);
>> 6) wdend[N + 1] = read(HPET);
>>
>> If the delay between #4 and #6 is within the 750us margin then any delay
>> between #4 and #5 which is larger than 600us will fail the interval check
>> and mark the TSC unstable because the intervals are calculated against the
>> previous value:
>>
>> wd_int = wdnow[N + 1] - wdnow[N];
>> cs_int = csnow[N + 1] - csnow[N];
>
> Except that getting 600us latency between #4 and #5 is not consistent
> with a 250us uncertainty. If that is happening, the uncertainty should
> instead be at least 300us.
That's utter nonsense.
>> Putting the above delays in place this results in:
>>
>> cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
>> -> cs_int = wd_int + 510us;
>>
>> which is obviously larger than the allowed 500us margin and results in
>> marking TSC unstable.
>
> Agreed, but due to the ->uncertainty_margin values being too small.
You seriously fail to understand basic math. Let me draw you a picture:
H = HPET read
T = TSC read
RW = read valid window
IW = interval valid window
RW-------------------------------------------RW
IW-------------------------IW
HT H <- Read 1 valid
H TH <- Read 2 valid
|---------------------------------| <- Interval too large
Q: How is increasing the uncertainty values fixing the underlying math
problem of RW > IW?
A: Not at all. It just papers over it and makes the false positive case
more unlikely by further weakening the accuracy.
>> Fix this by using the same margin as the interval comparison. If the delay
>> between two watchdog reads is larger than that, then the readout was either
>> disturbed by interconnect congestion, NMIs or SMIs.
>>
>> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
>> Reported-by: Daniel J Blueman <daniel@quora.org>
>
> If this is happening in real life, we have a couple of choices:
>
> 1. Increase the ->uncertainty_margin values to match the objective
> universe.
Which universe? The universe of made up math? See above.
> 2. In clocksource_watchdog(), replace "(abs(cs_nsec - wd_nsec) > md)"
> with "(abs(cs_nsec - wd_nsec) > 2 * md)".
>
> The rationale here is that the ->uncertainty_margin values are
> two-tailed, as in the clocksource might report a value that is
> ->uncertainty_margin too early or ->uncertainty_margin too late. When I
> was coding this, I instead assumed that ->uncertainty_margin
> covered the full range, centered on the correct time value.
So you propose the opposite of what I'm doing, which weakens the
watchdog even further.
> You would know better than would I.
>
> My concern is that the patch below would force needless cs_watchdog_read()
> retries.
That's not the end of the world and way better than degrading the
watchdog further.
It's already useless for its original purpose of detecting even slow TSC
skew between CPUs, because the comparison is relative and the uncertainty
margins are insanely big.
I just booted an old machine where the BIOS "hides" SMI time by saving
the TSC value on entry and restoring it on exit. The original watchdog
implementation caught that. Now it happily continues and I can observe
time going backwards between CPUs in user space. After an hour the
difference between the two CPUs is > 1sec and the system still claims
that everything is fine. And no, TSC_ADJUST does not catch that on this
machine because the CPU does not support it.
IOW, this whole big machine scalability hackery broke basic
functionality for existing systems.
We really need to go back to the drawing board and rethink this
machinery from scratch.
There are two main aspects to the watchdog:
1) Detect general frequency skew by comparing it against a known
(assumed to be) stable clocksource
This addresses the problems on older machines where the BIOS
fiddles with the CPU frequency behind the kernel's back.
That's a non-issue on modern CPUs because the TSC is constant
frequency.
2) Detect inter CPU skew
This addresses the problems where
A) the BIOS fiddles with the TSC behind the kernel's back (see
above) and the modification cannot be detected by the TSC_ADJUST
MSR due to its non-existence
Not an issue on modern CPUs
B) the sockets are not synchronized and the TSC frequency on them
drifts apart
C) hardware failure (something you mentioned back then when you
hacked this uncertainty magic up)
As I just validated on that old machine, the whole "scalability" hackery
broke #2A and #2B unless the modifications or drifts are massive. But if
they are not, it leaves both user space and the kernel with inconsistent
time.
That means we really have to look at this in two ways:
I) On old hardware where TSC is not constant frequency, the validation
against HPET is required.
II) On new hardware with constant frequency TSC and TSC_ADJUST_MSR the
HPET comparison is not really useful anymore as it does not detect
low drift issues between unsynchronized sockets.
What can be done instead?
- Make CPU0 the watchdog supervisor, which runs the timer and
orchestrates the machinery.
- Limit the HPET comparison to CPUs which lack constant frequency,
which makes the whole legacy I/O issue on large systems go away.
Run this comparison only on CPU0
- Instead of moving the timer around CPUs, utilize a SMP function call
similar to what the TSC synchronization mechanism does, i.e.
CPU0 CPUN
sync() sync()
t1 = rdtsc()
handover()
t2 = rdtsc()
handover()
t3 = rdtsc()
....
Each step validates that tN <= tN+1, which detects all issues of #2
above for both old and contemporary machines.
Of course this mechanism is also affected by interconnect
congestion, but the only side effect of interconnect congestion is
that the handover becomes slow, which means that the accuracy of
detecting time going backwards suffers temporarily.
With that, all these made-up uncertainty heuristics go completely away or
can be turned into something which actually reflects the reality of the
universe, i.e. the accuracy of the clocks and their access time.
I doubt it's needed because the original implementation worked just fine
and it's only relevant for the actual HPET/TSC comparison case. That is
then limited to museum pieces which are not really affected by the
scalability issues of today's machines. Especially not as the HPET/TSC
check is limited to CPU0, which is the one closest to the legacy
peripherals which contain the HPET or, in really old machines, the ACPI_PM
timer.
I'll go and utilize my copious spare time to implement this, but don't
expect it to materialize before 2026 :)
Thanks,
tglx
On Fri, Dec 19, 2025 at 11:13:05AM +0100, Thomas Gleixner wrote:
> On Wed, Dec 17 2025 at 16:48, Paul E. McKenney wrote:
> > On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
> >> The "valid" readout delay between the two reads of the watchdog is larger
> >> than the valid delta between the resulting watchdog and clocksource
> >> intervals, which results in false positive watchdog results.
OK, first, I have no objection to your reimplementing the clocksource
watchdog. Especially given that when making my changes, I felt the
need to remain within the current design, which was at times quite
constraining. I do feel the need to defend the current code, given
that limitation. But you knew that already.
In fact, you might well have been counting on it. ;-)
The desiderata from this end include:
o Avoid false positives from delays during measurement, including
delays from NMIs, SMIs, and vCPU preemption.
o Detect TSC drift due to bugs in firmware and hardware.
Others have in the past added:
o Avoid false positives due to extreme memory-system overload.
There are probably others that do not come immediately to mind.
> >> Assume TSC is the clocksource and HPET is the watchdog, and both have an
> >> uncertainty margin of 250us (the default). The watchdog readout does:
> >>
> >> 1) wdnow = read(HPET);
> >> 2) csnow = read(TSC);
> >> 3) wdend = read(HPET);
> > 4) wd_end2 = read(HPET);
>
> That's completely irrelevant for the problem at hand.
It is exactly one of the problems at hand. If we didn't have HPET readout
performance issues, we could dispense with that piece of the puzzle.
By the way, did this issue really occur on real hardware? Or are we
having a strictly theoretical discussion?
> >> The valid window for the delta between #1 and #3 is calculated by the
> >> uncertainty margins of the watchdog and the clocksource:
> >>
> >> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
> >>
> >> which results in 750us for the TSC/HPET case.
> >
> > Yes, because this interval includes two watchdog reads (#1 and #3 above)
> > and one clocksource read (#2 above). We therefore need to allow two
> > watchdog uncertainties and one clocksource uncertainty.
>
> That's a made-up and broken theory based on ill-defined heuristics.
If we want better heuristics, we either restrict ourselves to timer
hardware that has reasonable access latencies or we add something like
an ->access_time field to struct clocksource.
Otherwise, you will just be cutting off one end of the heuristic "blanket"
and sewing it onto the other.
Accounting more directly for vCPU preemption would also help a lot, as
that was the root cause of much (but by no means all) of the problem on
Meta's fleet.
> The uncertainty of a clocksource is defined by:
>
> 1) The frequency it runs with, which affects the accuracy of the
> time readout.
>
> 2) The access time
>
> Your implementation defines that the uncertainty is at least 50us,
> with a default of 125us and an upper limit of 1ms. Those are just made-up
> numbers pulled out of thin air which have nothing to do with reality.
These were set using test data from real hardware.
> Declaring a TSC access time of up to 1ms and at least 50us is
> hilarious.
Yes, in theory we could instead have an ->access_time that is used to
reject bad reads, so that ->uncertainty_margin would only be used to
detect timer skew. The discussion of such an addition didn't go far last
time around, but maybe now is a better time. Though I suspect that the
issues around determining what ->access_time should be set to are still
with us. But I would love to be proven wrong.
> Adding up these made-up margins and then doubling the watchdog margin to
> validate the readout is just a hack to make it "work for me", thereby
> breaking the whole machinery for existing problematic systems. See below.
Nope.
The readout #4 above into wd_end2 wasn't something my employer needed.
For our use cases, #1, #2, and #3 sufficed. It was instead needed by
other users running memory-bandwidth-heavy workloads on multi-socket
systems. This resulted in HPET access times that were astoundingly large.
One might naively hope that the HPETs in a given system could synchronize
themselves so that each CPU could read the HPET nearest it instead of
everyone reading CPU 0's HPET. Or that some other means would allow
reasonable access times. Hey, I can dream, can't I?
> >> The actual interval comparison uses a smaller margin:
> >>
> >> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
> >>
> >> which results in 500us for the TSC/HPET case.
> >
> > This is the (wd_seq_delay > md) comparison, right? If so, the reason
> > for this is because it is measuring only a pair of watchdog reads (#3
> > and #4). There is no clocksource read on the latency recheck, so we do
> > not include the cs->uncertainty_margin value, only the pair of watchdog
> > uncertainty values.
>
> No. This is the check which does:
>
> int64_t md = 2 * watchdog->uncertainty_margin;
> ...
>
> *wdnow = watchdog->read(watchdog);
> *csnow = cs->read(cs);
> wd_end = watchdog->read(watchdog);
> ...
>
> wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
> if (wd_delay <= md + cs->uncertainty_margin) {
> ...
> return WR_READ_SUCCESS;
> }
>
> It has nothing to do with the wd_seq_delay check.
Eh? Here wd_delay is twice watchdog uncertainty plus clocksource
uncertainty.
> > If this check fails, that indicates that the watchdog clocksource is much
> > slower than expected (for example, due to memory-system overload affecting
> > HPET on multicore systems), so we skip this measurement interval.
> >
> >> That means the following scenario will trigger the watchdog:
> >>
> >> Watchdog cycle N:
> >>
> >> 1) wdnow[N] = read(HPET);
> >> 2) csnow[N] = read(TSC);
> >> 3) wdend[N] = read(HPET);
> >>
> >> Assume the delay between #1 and #2 is 100us and the delay between #1 and
> >> #3 is within the 750us margin, i.e. the readout is considered valid.
> >
> > Yes. We expect at most 250us for #1, another 250us for #2, and yet
> > another 250us for #3.
> >
> >> Watchdog cycle N + 1:
> >>
> >> 4) wdnow[N + 1] = read(HPET);
> >> 5) csnow[N + 1] = read(TSC);
> >> 6) wdend[N + 1] = read(HPET);
> >>
> >> If the delay between #4 and #6 is within the 750us margin then any delay
> >> between #4 and #5 which is larger than 600us will fail the interval check
> >> and mark the TSC unstable because the intervals are calculated against the
> >> previous value:
> >>
> >> wd_int = wdnow[N + 1] - wdnow[N];
> >> cs_int = csnow[N + 1] - csnow[N];
> >
> > Except that getting 600us latency between #4 and #5 is not consistent
> > with a 250us uncertainty. If that is happening, the uncertainty should
> > instead be at least 300us.
>
> That's utter nonsense.
Or alternatively, you simply do not yet understand it. ;-)
> >> Putting the above delays in place this results in:
> >>
> >> cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
> >> -> cs_int = wd_int + 510us;
> >>
> >> which is obviously larger than the allowed 500us margin and results in
> >> marking TSC unstable.
> >
> > Agreed, but due to the ->uncertainty_margin values being too small.
>
> You seriously fail to understand basic math.
Well, if that is the best you can do... ;-)
> Let me draw you a picture:
>
> H = HPET read
> T = TSC read
> RW = read valid window
> IW = interval valid window
>
> RW-------------------------------------------RW
> IW-------------------------IW
>
> HT H <- Read 1 valid
> H TH <- Read 2 valid
> |---------------------------------| <- Interval too large
>
> Q: How is increasing the uncertainty values fixing the underlying math
> problem of RW > IW?
>
> A: Not at all. It just papers over it and makes the false positive case
> more unlikely by further weakening the accuracy.
Yep. Given the current lack of something like ->access_time to go along
with the current ->uncertainty_margin, there are limits to what we can do.
Again, was the report from a real workload running on real hardware, or
are we simply having a theoretical discussion of what might happen?
> >> Fix this by using the same margin as the interval comparison. If the delay
> >> between two watchdog reads is larger than that, then the readout was either
> >> disturbed by interconnect congestion, NMIs or SMIs.
> >>
> >> Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
> >> Reported-by: Daniel J Blueman <daniel@quora.org>
> >
> > If this is happening in real life, we have a couple of choices:
> >
> > 1. Increase the ->uncertainty_margin values to match the objective
> > universe.
>
> Which universe? The universe of made up math? See above.
The universe of the real hardware Daniel was using, should it turn
out that this is a real problem encountered running a real workload on
real hardware.
> > 2. In clocksource_watchdog(), replace "(abs(cs_nsec - wd_nsec) > md)"
> > with "(abs(cs_nsec - wd_nsec) > 2 * md)".
> >
> > The rationale here is that the ->uncertainty_margin values are
> > two-tailed, as in the clocksource might report a value that is
> > ->uncertainty_margin too early or ->uncertainty_margin too late. When I
> > was coding this, I instead assumed that ->uncertainty_margin
> > covered the full range, centered on the correct time value.
>
> So you propose the opposite of what I'm doing, which weakens the
> watchdog even further.
>
> > You would know better than would I.
> >
> > My concern is that the patch below would force needless cs_watchdog_read()
> > retries.
>
> That's not the end of the world and way better than degrading the
> watchdog further.
But what you proposed is just a further tweak of the heuristics you so
energetically decry above.
> It's already useless for its original purpose of detecting even slow skew
> of TSC between CPUs because it is relative and the uncertainty margins
> are insanely big.
Really? The uncertainty margins are way smaller than they were before
read #3 was introduced. Don't get me wrong, it would be great to make
them even smaller.
> I just booted an old machine where the BIOS "hides" SMI time by saving
> the TSC value on entry and restoring it on exit. The original watchdog
> implementation caught that. Now it happily continues and I can observe
> time going backwards between CPUs in user space. After an hour the
> difference between the two CPUs is > 1sec and the system still claims
> that everything is fine. And no, TSC_ADJUST does not catch that on this
> machine because the CPU does not support it.
>
> IOW, this whole big machine scalability hackery broke basic
> functionality for existing systems.
OK, that is at least a real problem, even if on antique hardware.
A problem that surprises me, but I don't have that sort of hardware,
so I will take your word for it. Though you would need what, a 62.5
millisecond SMI for the old watchdog to catch it, right?
(Yes, yes, I have seen far longer SMIs. Don't get me started...)
> We really need to go back to the drawing board and rethink this
> machinery from scratch.
>
> There are two main aspects to the watchdog:
>
> 1) Detect general frequency skew by comparing it against a known
> (assumed to be) stable clocksource
>
> This addresses the problems on older machines where the BIOS
> fiddles with the CPU frequency behind the kernel's back.
Fair enough.
> That's a non-issue on modern CPUs because the TSC is constant
> frequency.
Give or take the occasional messed-up motherboard or firmware, where
the frequency is constant---but wrong. :-/
> 2) Detect inter CPU skew
>
> This addresses the problems where
>
> A) the BIOS fiddles with the TSC behind the kernel's back (see
> above) and the modification cannot be detected by the TSC_ADJUST
> MSR due to its non-existence
>
> Not an issue on modern CPUs
>
> B) the sockets are not synchronized and the TSC frequency on them
> drifts apart
>
> C) hardware failure (something you mentioned back then when you
> hacked this uncertainty magic up)
3) Deal with insanely slow readout times for some clocksources.
Which your smp_call_function() approach described below should handle.
> As I just validated on that old machine the whole "scalability" hackery
> broke #2A and #2B unless the modifications or drifts are massive. But if
> they are not it leaves both user space and kernel with inconsistent
> time.
Agreed. Imperfect though the current setup might be, there were reasons
why we changed it.
> That means we really have to look at this in two ways:
>
> I) On old hardware where TSC is not constant frequency, the validation
> against HPET is required.
>
> II) On new hardware with constant frequency TSC and TSC_ADJUST_MSR the
> HPET comparison is not really useful anymore as it does not detect
> low drift issues between unsynchronized sockets.
>
> What can be done instead?
>
> - Make CPU0 the watchdog supervisor, which runs the timer and
> orchestrates the machinery.
Fine on x86, but last I knew some systems can run without CPU 0.
Maybe the boot CPU? Though that can be offlined on some systems.
Handoff? For NO_HZ_FULL people, confine to the housekeeping CPUs?
Or is x86 the only system with more than one timer?
> - Limit the HPET comparison to CPUs which lack constant frequency,
> which makes the whole legacy I/O issue on large systems go away.
I thought that was already the default, and that you need to use the
tsc=watchdog kernel boot parameter to override that default in order to
use a known-stable TSC as the watchdog for HPET and/or ACPI PM timer.
> Run this comparison only on CPU0
>
> - Instead of moving the timer around CPUs, utilize a SMP function call
> similar to what the TSC synchronization mechanism does, i.e.
>
> CPU0 CPUN
> sync() sync()
> t1 = rdtsc()
> handover()
> t2 = rdtsc()
> handover()
> t3 = rdtsc()
> ....
>
> Each step validates that tN <= TN+1, which detects all issues of #2
> above for both old and contemporary machines.
>
> Of course this mechanism is also affected by interconnect
> congestion, but the only side effect of interconnect congestion is
> that the handover becomes slow, which means that the accuracy of
> detecting time going backwards suffers temporarily.
Is this going to be OK for real-time workloads? I guess that you have at
least as good a chance as I would to convince them. The more aggressive
NO_HZ_FULL people might have concerns. Especially these guys:
https://arxiv.org/abs/2509.03855
I have similar code in the current clocksource watchdog, but it doesn't
run until clocksource skew is detected, in other words, only after
something has likely gone sideways.
> With that all these made up uncertainty heuristics go completely away or
> can be turned into something which actually reflects the reality of the
> universe, i.e. the accuracy of the clocks and their access time.
>
> I doubt it's needed because the original implementation worked just fine
> and it's only relevant for the actual HPET/TSC comparison case. That is
> then limited to museum pieces which are not really affected by the
> scalability issues of today's machines. Especially not as the HPET/TSC
> check is limited to CPU0, which is the one closest to the legacy
> peripherals which contain the HPET or in real old machines the ACPI_PM
> timer.
Give or take firmware/hardware bugs that cause the TSC to run at a constant
but incorrect rate. Maybe interact with userspace facilities such as NTP?
> I'll go and utilize my copious spare time to implement this, but don't
> expect it to materialize before 2026 :)
The current implementation is working for us, so your schedule is our
schedule. ;-)
Thanx, Paul
On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> On Fri, Dec 19, 2025 at 11:13:05AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 17 2025 at 16:48, Paul E. McKenney wrote:
>> > On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
>> >> Assume TSC is the clocksource and HPET is the watchdog and both have an
>> >> uncertainty margin of 250us (default). The watchdog readout does:
>> >>
>> >> 1) wdnow = read(HPET);
>> >> 2) csnow = read(TSC);
>> >> 3) wdend = read(HPET);
>> > 4) wd_end2 = read(HPET);
>>
>> That's completely irrelevant for the problem at hand.
>
> It is exactly one of the problems at hand. If we didn't have HPET readout
> performance issues, we could dispense with that piece of the puzzle.
It's part of the "make it work" hack, but unrelated to the fact that the
readout valid window is larger than the interval valid window.
> By the way, did this issue really occur on real hardware? Or are we
> having a strictly theoretical discussion?
It has been reported by Daniel and it happens on real hardware.
>> >> The valid window for the delta between #1 and #3 is calculated by the
>> >> uncertainty margins of the watchdog and the clocksource:
>> >>
>> >> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
>> >>
>> >> which results in 750us for the TSC/HPET case.
>> >
>> > Yes, because this interval includes two watchdog reads (#1 and #3 above)
>> > and one clocksource read (#2 above). We therefore need to allow two
>> > watchdog uncertainties and one clocksource uncertainty.
>>
>> That's a made up and broken theory based on ill defined heuristics.
>
> If we want better heuristics, we either restrict ourselves to timer
> hardware that has reasonable access latencies or we add something like
> an ->access_time field to struct clocksource.
>
> Otherwise, you will just be cutting off one end of the heuristic "blanket"
> and sewing on to another.
>
> Accounting more directly for vCPU preemption would also help a lot, as
> that was the root cause of much (but by no means all) of the problem on
> Meta's fleet.
The goal is to get rid of the heuristics completely.
>> The uncertainty of a clocksource is defined by:
>>
>> 1) The frequency it runs with, which affects the accuracy of the
>> time readout.
>>
>> 2) The access time
>>
>> Your implementation defines that the uncertainty is at least 50us,
>> with a default of 125us and an upper limit of 1ms. That's just made up
>> numbers pulled out of thin air which have nothing to do with reality.
>
> These were set using test data from real hardware.
>
>> Declaring a TSC access time of up to 1ms and at least 50us is
>> hilarious.
>
> Yes, in theory we could instead have an ->access_time that is used to
> reject bad reads, so that ->uncertainty_margin would only be used to
> detect timer skew. The discussion of such an addition didn't go far last
> time around, but maybe now is a better time. Though I suspect that the
> issues around determining what ->access_time should be set to are still
> with us. But I would love to be proven wrong.
>
>> Adding up these made up margins and then doubling the watchdog margin to
>> validate the readout is just a hack to make it "work for me" and thereby
>> breaking the whole machinery for existing problematic systems. See below.
>
> Nope.
>
> The readout #4 above into wd_end2 wasn't something my employer needed.
> For our use cases, #1, #2, and #3 sufficed. It was instead needed by
> other users running memory-bandwidth-heavy workloads on multi-socket
> systems. This resulted in HPET access times that were astoundingly large.
>
> One might naively hope that the HPETs in a given system could synchronize
> themselves so that each CPU could read the HPET nearest it instead of
> everyone reading CPU 0's HPET. Or that some other means would allow
> reasonable access times. Hey, I can dream, can't I?
By that point, at the latest, it should have been clear that heuristics
aren't going to cut it.
>> >> The actual interval comparison uses a smaller margin:
>> >>
>> >> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
>> >>
>> >> which results in 500us for the TSC/HPET case.
>> >
>> > This is the (wd_seq_delay > md) comparison, right? If so, the reason
>> > for this is because it is measuring only a pair of watchdog reads (#3
>> > and #4). There is no clocksource read on the latency recheck, so we do
>> > not include the cs->uncertainty_margin value, only the pair of watchdog
>> > uncertainty values.
>>
>> No. This is the check which does:
>>
>> int64_t md = 2 * watchdog->uncertainty_margin;
>> ...
>>
>> *wdnow = watchdog->read(watchdog);
>> *csnow = cs->read(cs);
>> wd_end = watchdog->read(watchdog);
>> ...
>>
>> wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
>> if (wd_delay <= md + cs->uncertainty_margin) {
>> ...
>> return WR_READ_SUCCESS;
>> }
>>
>> It has nothing to do with the wd_seq_delay check.
>
> Eh? Here wd_delay is twice watchdog uncertainty plus clocksource
> uncertainty.
Correct. As implemented by Paul McKenney, no?
>> Let me draw you a picture:
>>
>> H = HPET read
>> T = TSC read
>> RW = read valid window
>> IW = interval valid window
>>
>> RW-------------------------------------------RW
>> IW-------------------------IW
>>
>> HT H <- Read 1 valid
>> H TH <- Read 2 valid
>> |---------------------------------| <- Interval too large
>>
>> Q: How is increasing the uncertainty values fixing the underlying math
>> problem of RW > IW?
>>
>> A: Not at all. It just papers over it and makes the false positive case
>> more unlikely by further weakening the accuracy.
>
> Yep. Given the current lack of something like ->access_time to go along
> with the current ->uncertainty_margin, there are limits to what we can do.
Well, we can at least limit the damage by making RW and IW consistent, no?
> Again, was the report from a real workload running on real hardware, or
> are we simply having a theoretical discussion of what might happen?
It's real hardware and again even in theory it's entirely clear that
having a readout validation window which is larger than the interval
validation window leads exactly to the problem reported by
Daniel. No?
> OK, that is at least a real problem, even if on antique hardware.
> A problem that surprises me, but I don't have that sort of hardware,
> so I will take your word for it. Though you would need what, a 62.5
> millisecond SMI for the old watchdog to catch it, right?
>
> (Yes, yes, I have seen far longer SMIs. Don't get me started...)
It's an SMI-plagued horrorshow where BIOS writers did value add!
>> We really need to go back to the drawing board and rethink this
>> machinery from scratch.
>>
>> There are two main aspects to the watchdog:
>>
>> 1) Detect general frequency skew by comparing it against a known
>> (assumed to be) stable clocksource
>>
>> This addresses the problems on older machines where the BIOS
>> fiddles with the CPU frequency behind the kernel's back.
>
> Fair enough.
>
>> That's a non-issue on modern CPUs because the TSC is constant
>> frequency.
>
> Give or take the occasional messed-up motherboard or firmware, where
> the frequency is constant---but wrong. :-/
We still can enforce the frequency check even for constant frequency
systems.
>> 2) Detect inter CPU skew
>>
>> This addresses the problems where
>>
>> A) the BIOS fiddles with the TSC behind the kernel's back (see
>> above) and the modification cannot be detected by the TSC_ADJUST
>> MSR due to its non-existence
>>
>> Not an issue on modern CPUs
>>
>> B) the sockets are not synchronized and the TSC frequency on them
>> drifts apart
>>
>> C) hardware failure (something you mentioned back then when you
>> hacked this uncertainty magic up)
>
> 3) Deal with insanely slow readout times for some clocksources.
> > Which your smp_call_function() approach described below should handle.
That's an implementation detail, but not part of what the watchdog
should actually detect, i.e. frequency change and inter CPU skew.
>> As I just validated on that old machine the whole "scalability" hackery
>> broke #2A and #2B unless the modifications or drifts are massive. But if
>> they are not it leaves both user space and kernel with inconsistent
>> time.
>
> Agreed. Imperfect though the current setup might be, there were reasons
> why we changed it.
>
>> That means we really have to look at this in two ways:
>>
>> I) On old hardware where TSC is not constant frequency, the validation
>> against HPET is required.
>>
>> II) On new hardware with constant frequency TSC and TSC_ADJUST_MSR the
>> HPET comparison is not really useful anymore as it does not detect
>> low drift issues between unsynchronized sockets.
>>
>> What can be done instead?
>>
>> - Make CPU0 the watchdog supervisor, which runs the timer and
>> orchestrates the machinery.
>
> Fine on x86, but last I knew some systems can run without CPU 0.
> Maybe the boot CPU? Though that can be offlined on some systems.
> Handoff? For NO_HZ_FULL people, confine to the housekeeping CPUs?
Sure, where you assign it is just a detail.
> Or is x86 the only system with more than one timer?
The watchdog is only enabled on x86 and MIPS, but none of the MIPS
clocksources sets the MUST_VERIFY flag. PARISC sets the MUST_VERIFY
flag, but does not enable the watchdog. So it's effectively x86.
It seems that everything else either has a sane timer or at least
believes so.
>> - Limit the HPET comparison to CPUs which lack constant frequency,
>> which makes the whole legacy I/O issue on large systems go away.
>
> I thought that was already the default, and that you need to use the
> tsc=watchdog kernel boot parameter to override that default in order to
> use a known-stable TSC as the watchdog for HPET and/or ACPI PM timer.
Correct.
>> Run this comparison only on CPU0
>>
>> - Instead of moving the timer around CPUs, utilize a SMP function call
>> similar to what the TSC synchronization mechanism does, i.e.
>>
>> CPU0 CPUN
>> sync() sync()
>> t1 = rdtsc()
>> handover()
>> t2 = rdtsc()
>> handover()
>> t3 = rdtsc()
>> ....
>>
>> Each step validates that tN <= TN+1, which detects all issues of #2
>> above for both old and contemporary machines.
>>
>> Of course this mechanism is also affected by interconnect
>> congestion, but the only side effect of interconnect congestion is
>> that the handover becomes slow, which means that the accuracy of
>> detecting time going backwards suffers temporarily.
>
> Is this going to be OK for real-time workloads? I guess that you have at
> least as good a chance as I would to convince them. The more aggressive
> NO_HZ_FULL people might have concerns. Especially these guys:
>
> https://arxiv.org/abs/2509.03855
I'm sure that the wishful thinking department, which ignores the
system-immanent latencies and demonstrates great results on a system
tuned to run only the demonstrator, will go apeshit.
> I have similar code in the current clocksource watchdog, but it doesn't
> run until clocksource skew is detected, in other words, only after
> something has likely gone sideways.
Of course we can exclude the nohz full CPUs, but that's again an
implementation detail and does not change the underlying design.
>> With that all these made up uncertainty heuristics go completely away or
>> can be turned into something which actually reflects the reality of the
>> universe, i.e. the accuracy of the clocks and their access time.
>>
>> I doubt it's needed because the original implementation worked just fine
>> and it's only relevant for the actual HPET/TSC comparison case. That is
>> then limited to museum pieces which are not really affected by the
> >> scalability issues of today's machines. Especially not as the HPET/TSC
>> check is limited to CPU0, which is the one closest to the legacy
>> peripherals which contain the HPET or in real old machines the ACPI_PM
>> timer.
>
> Give or take firmware/hardware bugs that cause the TSC to run at a constant
> but incorrect rate. Maybe interact with userspace facilities such as
> NTP?
Either we enforce HPET/ACPI_PM comparison on the supervisor CPU or let
user space deal with it. If NTP detects time going sideways, but being
consistent between CPUs, then it can switch the clocksource to HPET or
ACPI_PM already today.
>> I'll go and utilize my copious spare time to implement this, but don't
>> expect it to materialize before 2026 :)
>
> The current implementation is working for us, so your schedule is our
> schedule. ;-)
Thanks for confirming that this is based on "works for me" :)
Thanks,
tglx
On Sat, Dec 20, 2025 at 09:18:35AM +0100, Thomas Gleixner wrote:
> On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> > On Fri, Dec 19, 2025 at 11:13:05AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 17 2025 at 16:48, Paul E. McKenney wrote:
> >> > On Wed, Dec 17, 2025 at 06:21:05PM +0100, Thomas Gleixner wrote:
> >> >> Assume TSC is the clocksource and HPET is the watchdog and both have an
> >> >> uncertainty margin of 250us (default). The watchdog readout does:
> >> >>
> >> >> 1) wdnow = read(HPET);
> >> >> 2) csnow = read(TSC);
> >> >> 3) wdend = read(HPET);
> >> > 4) wd_end2 = read(HPET);
> >>
> >> That's completely irrelevant for the problem at hand.
> >
> > It is exactly one of the problems at hand. If we didn't have HPET readout
> > performance issues, we could dispense with that piece of the puzzle.
>
> It's part of the "make it work" hack, but unrelated to the fact that the
> readout valid window is larger than the interval valid window.
Fair enough. But let the record show that this part of the "make it work"
hack was making it work for others, not for my employer.
> > By the way, did this issue really occur on real hardware? Or are we
> > having a strictly theoretical discussion?
>
> It has been reported by Daniel and it happens on real hardware.
OK, for whatever it is worth at this point, I have given a Tested-by
and Reviewed-by on your patch starting this thread.
> >> >> The valid window for the delta between #1 and #3 is calculated by the
> >> >> uncertainty margins of the watchdog and the clocksource:
> >> >>
> > >> >> m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
> >> >>
> >> >> which results in 750us for the TSC/HPET case.
> >> >
> >> > Yes, because this interval includes two watchdog reads (#1 and #3 above)
> >> > and one clocksource read (#2 above). We therefore need to allow two
> >> > watchdog uncertainties and one clocksource uncertainty.
> >>
> >> That's a made up and broken theory based on ill defined heuristics.
> >
> > If we want better heuristics, we either restrict ourselves to timer
> > hardware that has reasonable access latencies or we add something like
> > an ->access_time field to struct clocksource.
> >
> > Otherwise, you will just be cutting off one end of the heuristic "blanket"
> > and sewing on to another.
> >
> > Accounting more directly for vCPU preemption would also help a lot, as
> > that was the root cause of much (but by no means all) of the problem on
> > Meta's fleet.
>
> The goal is to get rid of the heuristics completely.
A way of seeing the vCPU preemption would be a step in that direction,
though I am guessing you have something more ambitious in mind.
> >> The uncertainty of a clocksource is defined by:
> >>
> >> 1) The frequency it runs with, which affects the accuracy of the
> >> time readout.
> >>
> >> 2) The access time
> >>
> >> Your implementation defines that the uncertainty is at least 50us,
> >> with a default of 125us and an upper limit of 1ms. That's just made up
> >> numbers pulled out of thin air which have nothing to do with reality.
> >
> > These were set using test data from real hardware.
> >
> >> Declaring a TSC access time of up to 1ms and at least 50us is
> >> hilarious.
> >
> > Yes, in theory we could instead have an ->access_time that is used to
> > reject bad reads, so that ->uncertainty_margin would only be used to
> > detect timer skew. The discussion of such an addition didn't go far last
> > time around, but maybe now is a better time. Though I suspect that the
> > issues around determining what ->access_time should be set to are still
> > with us. But I would love to be proven wrong.
> >
> >> Adding up these made up margins and then doubling the watchdog margin to
> >> validate the readout is just a hack to make it "work for me" and thereby
> >> breaking the whole machinery for existing problematic systems. See below.
> >
> > Nope.
> >
> > The readout #4 above into wd_end2 wasn't something my employer needed.
> > For our use cases, #1, #2, and #3 sufficed. It was instead needed by
> > other users running memory-bandwidth-heavy workloads on multi-socket
> > systems. This resulted in HPET access times that were astoundingly large.
> >
> > One might naively hope that the HPETs in a given system could synchronize
> > themselves so that each CPU could read the HPET nearest it instead of
> > everyone reading CPU 0's HPET. Or that some other means would allow
> > reasonable access times. Hey, I can dream, can't I?
>
> By that point, at the latest, it should have been clear that heuristics
> aren't going to cut it.
I guess we all can dream. ;-)
> >> >> The actual interval comparison uses a smaller margin:
> >> >>
> > >> >> m = watchdog.uncertainty_margin + cs.uncertainty_margin;
> >> >>
> >> >> which results in 500us for the TSC/HPET case.
> >> >
> >> > This is the (wd_seq_delay > md) comparison, right? If so, the reason
> >> > for this is because it is measuring only a pair of watchdog reads (#3
> >> > and #4). There is no clocksource read on the latency recheck, so we do
> >> > not include the cs->uncertainty_margin value, only the pair of watchdog
> >> > uncertainty values.
> >>
> >> No. This is the check which does:
> >>
> >> int64_t md = 2 * watchdog->uncertainty_margin;
> >> ...
> >>
> >> *wdnow = watchdog->read(watchdog);
> >> *csnow = cs->read(cs);
> >> wd_end = watchdog->read(watchdog);
> >> ...
> >>
> >> wd_delay = cycles_to_nsec_safe(watchdog, *wdnow, wd_end);
> >> if (wd_delay <= md + cs->uncertainty_margin) {
> >> ...
> >> return WR_READ_SUCCESS;
> >> }
> >>
> >> It has nothing to do with the wd_seq_delay check.
> >
> > Eh? Here wd_delay is twice watchdog uncertainty plus clocksource
> > uncertainty.
>
> Correct. As implemented by Paul McKenney, no?
Yep.
> >> Let me draw you a picture:
> >>
> >> H = HPET read
> >> T = TSC read
> >> RW = read valid window
> >> IW = interval valid window
> >>
> >> RW-------------------------------------------RW
> >> IW-------------------------IW
> >>
> >> HT H <- Read 1 valid
> >> H TH <- Read 2 valid
> >> |---------------------------------| <- Interval too large
> >>
> >> Q: How is increasing the uncertainty values fixing the underlying math
> >> problem of RW > IW?
> >>
> >> A: Not at all. It just papers over it and makes the false positive case
> >> more unlikely by further weakening the accuracy.
> >
> > Yep. Given the current lack of something like ->access_time to go along
> > with the current ->uncertainty_margin, there are limits to what we can do.
>
> Well, we can at least limit the damage by making RW and IW consistent, no?
Fair enough.
> > Again, was the report from a real workload running on real hardware, or
> > are we simply having a theoretical discussion of what might happen?
>
> It's real hardware and again even in theory it's entirely clear that
> having a readout validation window which is larger than the interval
> validation window leads exactly to the problem reported by
> Daniel. No?
Given reasonable clocksource access times, it would not be a problem
other than in theory, but given the world we actually live in, no
further argument.
> > OK, that is at least a real problem, even if on antique hardware.
> > A problem that surprises me, but I don't have that sort of hardware,
> > so I will take your word for it. Though you would need what, a 62.5
> > millisecond SMI for the old watchdog to catch it, right?
> >
> > (Yes, yes, I have seen far longer SMIs. Don't get me started...)
>
> It's an SMI-plagued horrorshow where BIOS writers did value add!
Huh. I should have asked you long ago... Do you mean "horrorshow"
in the Clockwork Orange sense? Seems unlikely in this case, but you
never know...
And yes, SMIs (even those that did not lie about the system's age) would
also have been a problem in the original code that just did the single
read from each clocksource.
> >> We really need to go back to the drawing board and rethink this
> >> machinery from scratch.
> >>
> >> There are two main aspects to the watchdog:
> >>
> >> 1) Detect general frequency skew by comparing it against a known
> >> (assumed to be) stable clocksource
> >>
> >> This addresses the problems on older machines where the BIOS
> >> fiddles with the CPU frequency behind the kernel's back.
> >
> > Fair enough.
> >
> >> That's a non-issue on modern CPUs because the TSC is constant
> >> frequency.
> >
> > Give or take the occasional messed-up motherboard or firmware, where
> > the frequency is constant---but wrong. :-/
>
> We still can enforce the frequency check even for constant frequency
> systems.
This is where you tell me that some hardware uses the same external
hardware clock input for both TSC and HPET? :-(
> >> 2) Detect inter CPU skew
> >>
> >> This addresses the problems where
> >>
> >> A) the BIOS fiddles with the TSC behind the kernel's back (see
> >> above) and the modification cannot be detected by the TSC_ADJUST
> >> MSR due to its non-existence
> >>
> >> Not an issue on modern CPUs
> >>
> >> B) the sockets are not synchronized and the TSC frequency on them
> >> drifts apart
> >>
> >> C) hardware failure (something you mentioned back then when you
> >> hacked this uncertainty magic up)
> >
> > 3) Deal with insanely slow readout times for some clocksources.
> > Which your smp_call_function() approach described below should handle.
>
> That's an implementation detail, but not part of what the watchdog
> should actually detect, i.e. frequency change and inter CPU skew.
It might well be an implementation detail, but even if so it is a detail
that complicates the heuristics.
> >> As I just validated on that old machine the whole "scalability" hackery
> >> broke #2A and #2B unless the modifications or drifts are massive. But if
> >> they are not it leaves both user space and kernel with inconsistent
> >> time.
> >
> > Agreed. Imperfect though the current setup might be, there were reasons
> > why we changed it.
> >
> >> That means we really have to look at this in two ways:
> >>
> >> I) On old hardware where TSC is not constant frequency, the validation
> >> against HPET is required.
> >>
> >> II) On new hardware with constant frequency TSC and TSC_ADJUST_MSR the
> >> HPET comparison is not really useful anymore as it does not detect
> >> low drift issues between unsynchronized sockets.
> >>
> >> What can be done instead?
> >>
> >> - Make CPU0 the watchdog supervisor, which runs the timer and
> >> orchestrates the machinery.
> >
> > Fine on x86, but last I knew some systems can run without CPU 0.
> > Maybe the boot CPU? Though that can be offlined on some systems.
> > Handoff? For NO_HZ_FULL people, confine to the housekeeping CPUs?
>
> Sure, where you assign it is just a detail.
>
> > Or is x86 the only system with more than one timer?
>
> The watchdog is only enabled on x86 and MIPS, but none of the MIPS
> clocksources sets the MUST_VERIFY flag. PARISC sets the MUST_VERIFY
> flag, but does not enable the watchdog. So it's effectively x86.
>
> It seems that everything else either has a sane timer or at least
> believes so.
There are at least a couple that have historically been sane, but the
usual situation features more claims than actual implementations.
> >> - Limit the HPET comparison to CPUs which lack constant frequency,
> >> which makes the whole legacy I/O issue on large systems go away.
> >
> > I thought that was already the default, and that you need to use the
> > tsc=watchdog kernel boot parameter to override that default in order to
> > use a known-stable TSC as the watchdog for HPET and/or ACPI PM timer.
>
> Correct.
OK.
> >> Run this comparison only on CPU0
> >>
> >> - Instead of moving the timer around CPUs, utilize a SMP function call
> >> similar to what the TSC synchronization mechanism does, i.e.
> >>
> >> CPU0 CPUN
> >> sync() sync()
> >> t1 = rdtsc()
> >> handover()
> >> t2 = rdtsc()
> >> handover()
> >> t3 = rdtsc()
> >> ....
> >>
> >> Each step validates that tN <= tN+1, which detects all issues of #2
> >> above for both old and contemporary machines.
> >>
> >> Of course this mechanism is also affected by interconnect
> >> congestion, but the only side effect of interconnect congestion is
> >> that the handover becomes slow, which means that the accuracy of
> >> detecting time going backwards suffers temporarily.
> >
> > Is this going to be OK for real-time workloads? I guess that you have at
> > least as good a chance as I would to convince them. The more aggressive
> > NO_HZ_FULL people might have concerns. Especially these guys:
> >
> > https://arxiv.org/abs/2509.03855
>
> I'm sure that the wishful thinking department which ignores the system
> immanent latencies and demonstrates great results on a system which is
> tuned to run only the demonstrator will go apeshit.
They had better have very carefully qualified their hardware, firmware,
kernel, and applications, that is for certain.
> > I have similar code in the current clocksource watchdog, but it doesn't
> > run until clocksource skew is detected, in other words, only after
> > something has likely gone sideways.
>
> Of course we can exclude the nohz full CPUs, but that's again an
> implementation detail and does not change the underlying design.
And the nohz_full people will need to have also carefully qualified their
setup because anything that I know of that can check the nohz_full CPUs'
TSCs will inflict jitter.
> >> With that all these made up uncertainty heuristics go completely away or
> >> can be turned into something which actually reflects the reality of the
> >> universe, i.e. the accuracy of the clocks and their access time.
> >>
> >> I doubt it's needed because the original implementation worked just fine
> >> and it's only relevant for the actual HPET/TSC comparison case. That is
> >> then limited to museum pieces which are not really affected by the
> >> scalability issues of today's machines. Especially not as the HPET/TSC
> >> check is limited to CPU0, which is the one closest to the legacy
> >> peripherals which contain the HPET or in real old machines the ACPI_PM
> >> timer.
> >
> > Give or take firmware/hardware bugs that cause the TSC to run at a constant
> > but incorrect rate. Maybe interact with userspace facilities such as
> > NTP?
>
> Either we enforce HPET/ACPI_PM comparison on the supervisor CPU or let
> user space deal with it. If NTP detects time going sideways, but being
> consistent between CPUs, then it can switch the clocksource to HPET or
> ACPI_PM already today.
NTP is quite happy to silently compensate for some serious skew, as
long as that skew is consistent. Way beyond the advertised 500PPM.
Of course that means that userspace and kernel times diverge, which can
have "interesting" results.
> >> I'll go and utilize my copious spare time to implement this, but don't
> >> expect it to materialize before 2026 :)
> >
> > The current implementation is working for us, so your schedule is our
> > schedule. ;-)
>
> Thanks for confirming that this is based on "works for me" :)
C'mon, Thomas!
You are younger than I am so you should remember our discussion just
days ago in which I pointed out that "works for me" was quite a bit
simpler than the current setup that was made to work for quite a few
other people. And it is not my fault that you didn't test back then on
your museum-piece system that has firmware that reprograms the TSCs in
order to lie about the system's age! ;-)
Thanx, Paul
> Thanks,
>
> tglx
On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> On Fri, Dec 19, 2025 at 11:13:05AM +0100, Thomas Gleixner wrote:
>> > My concern is that the patch below would force needless cs_watchdog_read()
>> > retries.
>>
>> That's not the end of the world and way better than degrading the
>> watchdog further.
>
> But what you proposed is just a further tweak of the heuristics you so
> energetically decry above.
I fully agree that it is a bandaid fix, but it makes the machinery
consistent and correct. It prevents false positives, which are
inevitable in the current design. No?
Thanks,
tglx
On Sat, Dec 20, 2025 at 09:37:40AM +0100, Thomas Gleixner wrote:
> On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> > On Fri, Dec 19, 2025 at 11:13:05AM +0100, Thomas Gleixner wrote:
> >> > My concern is that the patch below would force needless cs_watchdog_read()
> >> > retries.
> >>
> >> That's not the end of the world and way better than degrading the
> >> watchdog further.
> >
> > But what you proposed is just a further tweak of the heuristics you so
> > energetically decry above.
>
> I fully agree that it is a bandaid fix, but it makes the machinery
> consistent and correct. It prevents false positives, which are
> inevitable in the current design. No?
It might also be necessary to compensate by tweaking up the number of
retries by adjusting clocksource_get_max_watchdog_retry().
Thanx, Paul
On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> One might naively hope that the HPETs in a given system could synchronize
> themselves so that each CPU could read the HPET nearest it instead of
> everyone reading CPU 0's HPET. Or that some other means would allow
> reasonable access times. Hey, I can dream, can't I?
Sounds more like a nightmare TBH.
On Sat, Dec 20, 2025 at 09:38:31AM +0100, Thomas Gleixner wrote:
> On Fri, Dec 19 2025 at 16:18, Paul E. McKenney wrote:
> > One might naively hope that the HPETs in a given system could synchronize
> > themselves so that each CPU could read the HPET nearest it instead of
> > everyone reading CPU 0's HPET. Or that some other means would allow
> > reasonable access times. Hey, I can dream, can't I?
>
> Sounds more like a nightmare TBH.
There are systems that manage to provide reasonable timer access times and
fully synchronized timers, and have been for many decades. So give me ugly
hardware, and I will have at most zero sympathy for complaints about
software that is less than perfectly beautiful.
Thanx, Paul
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: c06343be0b4e03fe319910dd7a5d5b9929e1c0cb
Gitweb: https://git.kernel.org/tip/c06343be0b4e03fe319910dd7a5d5b9929e1c0cb
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Wed, 17 Dec 2025 18:21:05 +01:00
Committer: Thomas Gleixner <tglx@kernel.org>
CommitterDate: Wed, 21 Jan 2026 11:33:11 +01:00
clocksource: Reduce watchdog readout delay limit to prevent false positives
The "valid" readout delay between the two reads of the watchdog is larger
than the valid delta between the resulting watchdog and clocksource
intervals, which results in false positive watchdog results.
Assume TSC is the clocksource and HPET is the watchdog and both have an
uncertainty margin of 250us (default). The watchdog readout does:
1) wdnow = read(HPET);
2) csnow = read(TSC);
3) wdend = read(HPET);
The valid window for the delta between #1 and #3 is calculated by the
uncertainty margins of the watchdog and the clocksource:
m = 2 * watchdog.uncertainty_margin + cs.uncertainty_margin;
which results in 750us for the TSC/HPET case.
The actual interval comparison uses a smaller margin:
m = watchdog.uncertainty_margin + cs.uncertainty_margin;
which results in 500us for the TSC/HPET case.
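The two margins above can be written out directly; a minimal sketch (helper names are made up, not the kernel's) showing how the 750us readout window and the 500us interval window come out of the default 250us margins:

```c
#include <stdint.h>

/*
 * Illustrative helpers, values in microseconds. With the default
 * 250us uncertainty margin for both TSC and HPET, the readout-delay
 * window is 750us while the interval-comparison window is only 500us.
 */
static int64_t readout_margin(int64_t wd_margin, int64_t cs_margin)
{
	return 2 * wd_margin + cs_margin;	/* window for #1 .. #3 */
}

static int64_t interval_margin(int64_t wd_margin, int64_t cs_margin)
{
	return wd_margin + cs_margin;		/* window for wd_int vs cs_int */
}
```

The mismatch between the two windows is exactly the gap the patch closes.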
That means the following scenario will trigger the watchdog:
Watchdog cycle N:
1) wdnow[N] = read(HPET);
2) csnow[N] = read(TSC);
3) wdend[N] = read(HPET);
Assume the delay between #1 and #2 is 100us and the delay between #1 and
#3 is within the 750us margin, i.e. the readout is considered valid.
Watchdog cycle N + 1:
4) wdnow[N + 1] = read(HPET);
5) csnow[N + 1] = read(TSC);
6) wdend[N + 1] = read(HPET);
If the delay between #4 and #6 is within the 750us margin then any delay
between #4 and #5 which is larger than 600us will fail the interval check
and mark the TSC unstable because the intervals are calculated against the
previous value:
wd_int = wdnow[N + 1] - wdnow[N];
cs_int = csnow[N + 1] - csnow[N];
Putting the above delays in place this results in:
cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us);
-> cs_int = wd_int + 510us;
which is obviously larger than the allowed 500us margin and results in
marking TSC unstable.
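The arithmetic of the scenario can be checked in isolation. A minimal sketch in microseconds (all values illustrative; csnow is expressed on the watchdog timebase and the TSC is assumed perfectly stable, so the excess comes purely from the readout delays):

```c
#include <stdint.h>

/*
 * Worked example from the changelog: 100us between #1 and #2 in cycle
 * N, 610us between #4 and #5 in cycle N+1. Returns cs_int - wd_int.
 */
static int64_t example_excess_us(void)
{
	int64_t wdnow_n  = 0;			/* #1 */
	int64_t csnow_n  = wdnow_n + 100;	/* #2: 100us after #1 */
	int64_t wdnow_n1 = 500000;		/* #4: next watchdog cycle */
	int64_t csnow_n1 = wdnow_n1 + 610;	/* #5: 610us after #4 */

	int64_t wd_int = wdnow_n1 - wdnow_n;
	int64_t cs_int = csnow_n1 - csnow_n;

	return cs_int - wd_int;	/* 510us, above the 500us interval margin */
}
```

Both readouts individually pass the 750us readout check, yet the combined 510us excess fails the 500us interval check and the TSC is declared unstable despite being perfectly sane.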
Fix this by using the same margin as the interval comparison. If the delay
between two watchdog reads is larger than that, then the readout was either
disturbed by interconnect congestion, NMIs or SMIs.
Fixes: 4ac1dd3245b9 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin")
Reported-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/
Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx
---
kernel/time/clocksource.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index a1890a0..df71949 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -252,7 +252,7 @@ enum wd_read_status {
static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
- int64_t md = 2 * watchdog->uncertainty_margin;
+ int64_t md = watchdog->uncertainty_margin;
unsigned int nretries, max_retries;
int64_t wd_delay, wd_seq_delay;
u64 wd_end, wd_end2;