[PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by Lei Chen 11 months ago
timekeeping_apply_adjustment tries to adjust tkr_mono.mult by @mult_adj.
If @mult_adj > 0, tk->tkr_mono.xtime_nsec will be decreased by @offset.
Then timekeeping_update flushes the shadow_timekeeper to the real tk and
vdso data region, and the rollback becomes visible.

The drawing below illustrates why timekeeping_apply_adjustment
decreases tk->tkr_mono.xtime_nsec.

     cycle_interval       offset        clock_delta
x-----------------------x----------x---------------------------x

P0                      P1         P2                         P3

N(P) denotes the nanosecond count at point P.

Assume timekeeping_apply_adjustment runs at P1, with unaccumulated
cycles @offset. Then tkr_mono.mult is adjusted from M1 to M2.

Since offset happens before the tkr_mono.mult adjustment, we want to
achieve:
N(P3) == offset * M1 + clock_delta * M2 + N(P1)   -------- (1)

But at P3, the code works as following:
N(P3) := (offset + clock_delta) * M2 + N(P1)
       = offset * M2 + clock_delta * M2 + N(P1)

Apparently, N(P3) deviates from equation (1). To correct it, N(P1)
should be adjusted at P2:
N(P1) -= offset * (M2 - M1)

To fix this issue, the patch accumulates offset into tk, and exports
N(P2) to the real tk and vdso.

tk.tkr_mono := N(P2) = N(P1) + offset * M1

Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
N(P3) := N(P2) + clock_delta * M2
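
For concreteness, a worked example with made-up values: let offset = 4,
clock_delta = 6, M1 = 10, M2 = 11 and N(P1) = 1000. Equation (1) wants
N(P3) = 4 * 10 + 6 * 11 + 1000 = 1106, but the unadjusted code computes
(4 + 6) * 11 + 1000 = 1110. Subtracting offset * (M2 - M1) = 4 from
N(P1) restores 1106.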

Signed-off-by: Lei Chen <lei.chen@smartx.com>
---
 kernel/time/timekeeping.c | 44 +++++++++------------------------------
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f195..65647f7bbc69 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1934,15 +1934,14 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
 {
 	s64 interval = tk->cycle_interval;
 
-	if (mult_adj == 0) {
+	if (mult_adj == 0)
 		return;
-	} else if (mult_adj == -1) {
+
+	if (mult_adj == -1)
 		interval = -interval;
-		offset = -offset;
-	} else if (mult_adj != 1) {
+	else if (mult_adj != 1)
 		interval *= mult_adj;
-		offset *= mult_adj;
-	}
+
 
 	/*
 	 * So the following can be confusing.
@@ -1963,33 +1962,8 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
 	 * Which can be shortened to:
 	 *	xtime_interval += cycle_interval
 	 *
-	 * So offset stores the non-accumulated cycles. Thus the current
-	 * time (in shifted nanoseconds) is:
-	 *	now = (offset * adj) + xtime_nsec
-	 * Now, even though we're adjusting the clock frequency, we have
-	 * to keep time consistent. In other words, we can't jump back
-	 * in time, and we also want to avoid jumping forward in time.
-	 *
-	 * So given the same offset value, we need the time to be the same
-	 * both before and after the freq adjustment.
-	 *	now = (offset * adj_1) + xtime_nsec_1
-	 *	now = (offset * adj_2) + xtime_nsec_2
-	 * So:
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * adj_2) + xtime_nsec_2
-	 * And we know:
-	 *	adj_2 = adj_1 + 1
-	 * So:
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * (adj_1+1)) + xtime_nsec_2
-	 *	(offset * adj_1) + xtime_nsec_1 =
-	 *		(offset * adj_1) + offset + xtime_nsec_2
-	 * Canceling the sides:
-	 *	xtime_nsec_1 = offset + xtime_nsec_2
-	 * Which gives us:
-	 *	xtime_nsec_2 = xtime_nsec_1 - offset
-	 * Which simplifies to:
-	 *	xtime_nsec -= offset
+	 * Since mult will be changed, @offset should be accumulated with the
+	 * original mult value.
 	 */
 	if ((mult_adj > 0) && (tk->tkr_mono.mult + mult_adj < mult_adj)) {
 		/* NTP adjustment caused clocksource mult overflow */
@@ -1997,9 +1971,11 @@ static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
 		return;
 	}
 
+	tk->tkr_mono.xtime_nsec += offset * tk->tkr_mono.mult;
+	tk->tkr_mono.cycle_last += offset;
+
 	tk->tkr_mono.mult += mult_adj;
 	tk->xtime_interval += interval;
-	tk->tkr_mono.xtime_nsec -= offset;
 }
 
 /*
-- 
2.44.0
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by Thomas Gleixner 10 months, 4 weeks ago
On Mon, Mar 10 2025 at 11:00, Lei Chen wrote:
> To fix this issue, the patch accumulates offset into tk, and exports
> N(P2) to the real tk and vdso.
>
> tk.tkr_mono := N(P2) = N(P1) + offset * M1
>
> Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> N(P3) := N(P2) + clock_delta * M2

Your analysis is mostly correct, but it is only correct versus the
coarse timekeeper.

Moving everything forward to P2 breaks the periodic accumulation and
therefore NTP. Monitoring NTP/chrony immediately shows that they are
behaving differently and do not really converge.

The real problem is that the introduction of the coarse time accessors
completely missed the fact that xtime_nsec is adjusted by multiplier
changes. This went unnoticed for about 15 years :)

So the obvious cure is to leave the accumulation logic alone and to
provide a separate coarse_nsec instance, which compensates for the
offset.

The offset contains the remainder of the periodic accumulation from the
point where timekeeping_advance() read the clocksource at T1.

At the point of readout T1, the nanoseconds value was:

     T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
                   (xtime_nsec[OLD] + offset * mult[OLD]) >> shift;

After the accumulation and a possible multiplier update, that becomes:

     T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
                   (xtime_nsec[NEW] + offset * mult[NEW]) >> shift;

If the unmodified accumulation math is correct then:

     T1nsec[OLD] == T1nsec[NEW]
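
As a quick sanity check of that invariant with made-up numbers: take
shift = 4, mult[OLD] = 16, offset = 8, xtime_nsec[OLD] = 100 and a +1
multiplier adjustment, so mult[NEW] = 17 and the existing compensation
sets xtime_nsec[NEW] = 100 - 8 = 92. With xtime_sec unchanged:

     T1nsec[OLD] = (100 + 8 * 16) >> 4 = 228 >> 4 = 14
     T1nsec[NEW] = ( 92 + 8 * 17) >> 4 = 228 >> 4 = 14

Both sides agree, as required.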

The patch below implements exactly that and survives lightweight testing,
where the time jumps are no longer observable either in the kernel or in
userspace.

It needs quite some eyeballs, testing and validation.

Thanks,

        tglx
---
Subject: timekeeping: Decouple coarse time readout from xtime
From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 14 Mar 2025 14:15:59 +0100

< Insert change log>

Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Not-Yet-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/timekeeper_internal.h |    8 +++--
 kernel/time/timekeeping.c           |   54 ++++++++++++++++++++++++++++++++----
 kernel/time/vsyscall.c              |    4 +-
 3 files changed, 55 insertions(+), 11 deletions(-)

--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @offs_real:			Offset clock monotonic -> clock realtime
  * @offs_boot:			Offset clock monotonic -> clock boottime
  * @offs_tai:			Offset clock monotonic -> clock tai
- * @tai_offset:			The current UTC to TAI offset in seconds
+ * @coarse_nsec:		The nanoseconds part for coarse time getters
  * @tkr_raw:			The readout base structure for CLOCK_MONOTONIC_RAW
  * @raw_sec:			CLOCK_MONOTONIC_RAW  time in seconds
  * @clock_was_set_seq:		The sequence number of clock was set events
@@ -76,6 +76,7 @@ struct tk_read_base {
  *				ntp shifted nano seconds.
  * @ntp_err_mult:		Multiplication factor for scaled math conversion
  * @skip_second_overflow:	Flag used to avoid updating NTP twice with same second
+ * @tai_offset:			The current UTC to TAI offset in seconds
  *
  * Note: For timespec(64) based interfaces wall_to_monotonic is what
  * we need to add to xtime (or xtime corrected for sub jiffy times)
@@ -100,7 +101,7 @@ struct tk_read_base {
  * which results in the following cacheline layout:
  *
  * 0:	seqcount, tkr_mono
- * 1:	xtime_sec ... tai_offset
+ * 1:	xtime_sec ... coarse_nsec
  * 2:	tkr_raw, raw_sec
  * 3,4: Internal variables
  *
@@ -121,7 +122,7 @@ struct timekeeper {
 	ktime_t			offs_real;
 	ktime_t			offs_boot;
 	ktime_t			offs_tai;
-	s32			tai_offset;
+	u32			coarse_nsec;
 
 	/* Cacheline 2: */
 	struct tk_read_base	tkr_raw;
@@ -144,6 +145,7 @@ struct timekeeper {
 	u32			ntp_error_shift;
 	u32			ntp_err_mult;
 	u32			skip_second_overflow;
+	s32			tai_offset;
 };
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -164,10 +164,20 @@ static inline struct timespec64 tk_xtime
 	return ts;
 }
 
+static inline struct timespec64 tk_xtime_coarse(const struct timekeeper *tk)
+{
+	struct timespec64 ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = tk->coarse_nsec;
+	return ts;
+}
+
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec64 *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
 	tk->tkr_mono.xtime_nsec = (u64)ts->tv_nsec << tk->tkr_mono.shift;
+	tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
 }
 
 static void tk_xtime_add(struct timekeeper *tk, const struct timespec64 *ts)
@@ -175,6 +185,7 @@ static void tk_xtime_add(struct timekeep
 	tk->xtime_sec += ts->tv_sec;
 	tk->tkr_mono.xtime_nsec += (u64)ts->tv_nsec << tk->tkr_mono.shift;
 	tk_normalize_xtime(tk);
+	tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
 }
 
 static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec64 wtm)
@@ -708,6 +719,8 @@ static void timekeeping_forward_now(stru
 		tk_normalize_xtime(tk);
 		delta -= incr;
 	}
+
+	tk->coarse_nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
 }
 
 /**
@@ -804,8 +817,8 @@ EXPORT_SYMBOL_GPL(ktime_get_with_offset)
 ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;
-	unsigned int seq;
 	ktime_t base, *offset = offsets[offs];
+	unsigned int seq;
 	u64 nsecs;
 
 	WARN_ON(timekeeping_suspended);
@@ -813,7 +826,7 @@ ktime_t ktime_get_coarse_with_offset(enu
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 		base = ktime_add(tk->tkr_mono.base, *offset);
-		nsecs = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+		nsecs = tk->coarse_nsec;
 
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -1926,6 +1939,33 @@ static int __init timekeeping_init_ops(v
 device_initcall(timekeeping_init_ops);
 
 /*
+ * Update the nanoseconds part for the coarse time keepers. They can't rely
+ * on xtime_nsec because xtime_nsec is adjusted when the multiplication
+ * factor of the clock is adjusted. See timekeeping_apply_adjustment().
+ *
+ * This is required because tk_read_base::cycle_last must be advanced by
+ * timekeeper::cycle_interval so that the accumulation happens with a
+ * periodic reference.
+ *
+ * But that adjustment of xtime_nsec can make it go backward to compensate
+ * for a larger multiplier.
+ *
+ * @offset contains the leftover cycles which were not accumulated.
+ * Therefore the nanoseconds portion of the time when the clocksource was
+ * read in timekeeping_advance() is:
+ *
+ *	nsec = (xtime_nsec + offset * mult) >> shift;
+ *
+ * Calculate that value and store it in timekeeper::coarse_nsec, from where
+ * the coarse time getters consume it.
+ */
+static inline void timekeeping_update_coarse_nsecs(struct timekeeper *tk, u64 offset)
+{
+	offset *= tk->tkr_mono.mult;
+	tk->coarse_nsec = (tk->tkr_mono.xtime_nsec + offset) >> tk->tkr_mono.shift;
+}
+
+/*
  * Apply a multiplier adjustment to the timekeeper
  */
 static __always_inline void timekeeping_apply_adjustment(struct timekeeper *tk,
@@ -2205,6 +2245,8 @@ static bool timekeeping_advance(enum tim
 	 */
 	clock_set |= accumulate_nsecs_to_secs(tk);
 
+	timekeeping_update_coarse_nsecs(tk, offset);
+
 	timekeeping_update_from_shadow(&tk_core, clock_set);
 
 	return !!clock_set;
@@ -2248,7 +2290,7 @@ void ktime_get_coarse_real_ts64(struct t
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
@@ -2271,7 +2313,7 @@ void ktime_get_coarse_real_ts64_mg(struc
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		*ts = tk_xtime(tk);
+		*ts = tk_xtime_coarse(tk);
 		offset = tk_core.timekeeper.offs_real;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
@@ -2350,12 +2392,12 @@ void ktime_get_coarse_ts64(struct timesp
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
 
-		now = tk_xtime(tk);
+		now = tk_xtime_coarse(tk);
 		mono = tk->wall_to_monotonic;
 	} while (read_seqcount_retry(&tk_core.seq, seq));
 
 	set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec,
-				now.tv_nsec + mono.tv_nsec);
+				  now.tv_nsec + mono.tv_nsec);
 }
 EXPORT_SYMBOL(ktime_get_coarse_ts64);
 
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -97,12 +97,12 @@ void update_vsyscall(struct timekeeper *
 	/* CLOCK_REALTIME_COARSE */
 	vdso_ts		= &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME_COARSE];
 	vdso_ts->sec	= tk->xtime_sec;
-	vdso_ts->nsec	= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	vdso_ts->nsec	= tk->coarse_nsec;
 
 	/* CLOCK_MONOTONIC_COARSE */
 	vdso_ts		= &vdata[CS_HRES_COARSE].basetime[CLOCK_MONOTONIC_COARSE];
 	vdso_ts->sec	= tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
-	nsec		= tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec		= tk->coarse_nsec;
 	nsec		= nsec + tk->wall_to_monotonic.tv_nsec;
 	vdso_ts->sec	+= __iter_div_u64_rem(nsec, NSEC_PER_SEC, &vdso_ts->nsec);
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by John Stultz 10 months, 4 weeks ago
On Fri, Mar 14, 2025 at 12:41 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Mar 10 2025 at 11:00, Lei Chen wrote:
> > To fix this issue, the patch accumulates offset into tk, and exports
> > N(P2) to the real tk and vdso.
> >
> > tk.tkr_mono := N(P2) = N(P1) + offset * M1
> >
> > Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> > N(P3) := N(P2) + clock_delta * M2
>
> Your analysis is mostly correct, but it is only correct versus the
> coarse timekeeper.
>
> Moving everything forward to P2 breaks the periodic accumulation and
> therefore NTP. Monitoring NTP/chrony immediately shows that they are
> behaving differently and do not really converge.
>
> The real problem is that the introduction of the coarse time accessors
> completely missed the fact that xtime_nsec is adjusted by multiplier
> changes. This went unnoticed for about 15 years :)
>
> So the obvious cure is to leave the accumulation logic alone and to
> provide a separate coarse_nsec instance, which compensates for the
> offset.
>
> The offset contains the remainder of the periodic accumulation from the
> point where timekeeping_advance() read the clocksource at T1.
>
> At the point of readout T1, the nanoseconds value was:
>
>      T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
>                    (xtime_nsec[OLD] + offset * mult[OLD]) >> shift;
>
> After the accumulation and a possible multiplier update, that becomes:
>
>      T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
>                    (xtime_nsec[NEW] + offset * mult[NEW]) >> shift;
>
> If the unmodified accumulation math is correct then:
>
>      T1nsec[OLD] == T1nsec[NEW]
>
> The patch below implements exactly that and survives lightweight testing,
> where the time jumps are no longer observable either in the kernel or in
> userspace.
>
> It needs quite some eyeballs, testing and validation.
>

Hey Thomas,
 Nice work chasing this down. The approach looks ok, but I'm wary of
adding more state to manage.

Since this issue apparently only happens via do_adjtimex() calling
timekeeping_advance(TK_ADV_FREQ), resulting in a mult adjustment being
made without any accumulation before it, the approach I'm testing is,
in that case, to do something like timekeeping_forward_now() to
accumulate the small offset before doing the mult adjustment. I've got
to move a little bit of code around to do this cleanly, but I'll send a
patch out here shortly.

thanks
-john
[RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by John Stultz 10 months, 4 weeks ago
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing
time inconsistencies.

Lei tracked down that this was being caused by the adjustment
  tk->tkr_mono.xtime_nsec -= offset;

which is made to compensate for the unaccumulated cycles in
offset when the mult value is adjusted forward, so that
the non-_COARSE clockids don't see inconsistencies.

However, the _COARSE clockids don't use the mult*offset value
in their calculations, so this subtraction can cause the
_COARSE clockids to jump back a bit.

Now, by design, this negative adjustment should be fine, because
the logic run from timekeeping_adjust() is done after we
accumulate approx mult*interval_cycles into xtime_nsec.
The accumulated (mult*interval_cycles) will be larger than the
(mult_adj*offset) value subtracted from xtime_nsec, and both
operations are done together under the tk_core.lock, so the net
change to xtime_nsec should always be positive.
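
For example, with made-up numbers mult = 16, cycle_interval = 10,
offset = 7 and mult_adj = 1: the accumulation adds 16 * 10 = 160 to
xtime_nsec while the adjustment subtracts 1 * 7 = 7, so the net change
stays positive.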

However, do_adjtimex() calls into timekeeping_advance() as well,
since we want to apply the ntp freq adjustment immediately.
In this case, we don't return early when the offset is smaller
than interval_cycles, so we don't end up accumulating any time
into xtime_nsec. But we do go on to call timekeeping_adjust(),
which modifies the mult value, and subtracts from xtime_nsec
to correct for the new mult value.

Here because we did not accumulate anything, we have a window
where the _COARSE clockids that don't utilize the mult*offset
value can see an inconsistency.

So to fix this, rework the timekeeping_advance() logic a bit
so that when we are called from do_adjtimex() and the offset
is smaller than cycle_interval, we call
timekeeping_forward(), to first accumulate the sub-interval
time into xtime_nsec. Then with no unaccumulated cycles in
offset, we can do the mult adjustment without worry of the
subtraction having an impact.

NOTE: This was implemented as a potential alternative to
Thomas' approach here:
   https://lore.kernel.org/lkml/87cyej5rid.ffs@tglx/

And similarly, it needs some additional review and testing,
as it was developed while packing for conference travel.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Cc: Lei Chen <lei.chen@smartx.com>
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Diagnosed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/time/timekeeping.c | 87 ++++++++++++++++++++++++++++-----------
 1 file changed, 62 insertions(+), 25 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 1e67d076f1955..6f3a145e7b113 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -682,18 +682,18 @@ static void timekeeping_update_from_shadow(struct tk_data *tkd, unsigned int act
 }
 
 /**
- * timekeeping_forward_now - update clock to the current time
+ * timekeeping_forward - update clock to given cycle now value
  * @tk:		Pointer to the timekeeper to update
+ * @cycle_now:  Current clocksource read value
  *
  * Forward the current clock to update its state since the last call to
  * update_wall_time(). This is useful before significant clock changes,
  * as it avoids having to deal with this time offset explicitly.
  */
-static void timekeeping_forward_now(struct timekeeper *tk)
+static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now)
 {
-	u64 cycle_now, delta;
+	u64 delta;
 
-	cycle_now = tk_clock_read(&tk->tkr_mono);
 	delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				  tk->tkr_mono.clock->max_raw_delta);
 	tk->tkr_mono.cycle_last = cycle_now;
@@ -710,6 +710,21 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	}
 }
 
+/**
+ * timekeeping_forward_now - update clock to the current time
+ * @tk:		Pointer to the timekeeper to update
+ *
+ * Forward the current clock to update its state since the last call to
+ * update_wall_time(). This is useful before significant clock changes,
+ * as it avoids having to deal with this time offset explicitly.
+ */
+static void timekeeping_forward_now(struct timekeeper *tk)
+{
+	u64 cycle_now = tk_clock_read(&tk->tkr_mono);
+
+	timekeeping_forward(tk, cycle_now);
+}
+
 /**
  * ktime_get_real_ts64 - Returns the time of day in a timespec64.
  * @ts:		pointer to the timespec to be set
@@ -2151,6 +2166,45 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	return offset;
 }
 
+static u64 timekeeping_accumulate(struct timekeeper *tk, u64 now, u64 offset,
+				  unsigned int *clock_set)
+{
+	struct timekeeper *real_tk = &tk_core.timekeeper;
+	int shift = 0, maxshift;
+
+	/*
+	 * If we have a sub-cycle_interval offset, we are likely
+	 * doing a TK_ADV_FREQ adjustment, so accumulate everything
+	 * so we don't have a remainder offset when later adjusting
+	 * the multiplier.
+	 */
+	if (offset < real_tk->cycle_interval) {
+		timekeeping_forward(tk, now);
+		*clock_set = 1;
+		return 0;
+	}
+
+	/*
+	 * With NO_HZ we may have to accumulate many cycle_intervals
+	 * (think "ticks") worth of time at once. To do this efficiently,
+	 * we calculate the largest doubling multiple of cycle_intervals
+	 * that is smaller than the offset.  We then accumulate that
+	 * chunk in one go, and then try to consume the next smaller
+	 * doubled multiple.
+	 */
+	shift = ilog2(offset) - ilog2(tk->cycle_interval);
+	shift = max(0, shift);
+	/* Bound shift to one less than what overflows tick_length */
+	maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1;
+	shift = min(shift, maxshift);
+	while (offset >= tk->cycle_interval) {
+		offset = logarithmic_accumulation(tk, offset, shift, clock_set);
+		if (offset < tk->cycle_interval << shift)
+			shift--;
+	}
+	return offset;
+}
+
 /*
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
@@ -2160,8 +2214,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	unsigned int clock_set = 0;
-	int shift = 0, maxshift;
-	u64 offset;
+	u64 cycle_now, offset;
 
 	guard(raw_spinlock_irqsave)(&tk_core.lock);
 
@@ -2169,7 +2222,8 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	if (unlikely(timekeeping_suspended))
 		return false;
 
-	offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
+	cycle_now = tk_clock_read(&tk->tkr_mono);
+	offset = clocksource_delta(cycle_now,
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
 				   tk->tkr_mono.clock->max_raw_delta);
 
@@ -2177,24 +2231,7 @@ static bool timekeeping_advance(enum timekeeping_adv_mode mode)
 	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
 		return false;
 
-	/*
-	 * With NO_HZ we may have to accumulate many cycle_intervals
-	 * (think "ticks") worth of time at once. To do this efficiently,
-	 * we calculate the largest doubling multiple of cycle_intervals
-	 * that is smaller than the offset.  We then accumulate that
-	 * chunk in one go, and then try to consume the next smaller
-	 * doubled multiple.
-	 */
-	shift = ilog2(offset) - ilog2(tk->cycle_interval);
-	shift = max(0, shift);
-	/* Bound shift to one less than what overflows tick_length */
-	maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
-	shift = min(shift, maxshift);
-	while (offset >= tk->cycle_interval) {
-		offset = logarithmic_accumulation(tk, offset, shift, &clock_set);
-		if (offset < tk->cycle_interval<<shift)
-			shift--;
-	}
+	offset = timekeeping_accumulate(tk, cycle_now, offset, &clock_set);
 
 	/* Adjust the multiplier to correct NTP error */
 	timekeeping_adjust(tk, offset);
-- 
2.49.0.rc1.451.g8f38331e32-goog
Re: [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by Thomas Gleixner 10 months, 4 weeks ago
On Fri, Mar 14 2025 at 17:37, John Stultz wrote:
> Now, by design, this negative adjustment should be fine, because
> the logic run from timekeeping_adjust() is done after we
> accumulate approx mult*interval_cycles into xtime_nsec.
> The accumulated (mult*interval_cycles) will be larger than the
> (mult_adj*offset) value subtracted from xtime_nsec, and both
> operations are done together under the tk_core.lock, so the net
> change to xtime_nsec should always be positive.

/should/is/

We'd better be confident about that :)

> However, do_adjtimex() calls into timekeeping_advance() as well,
> since we want to apply the ntp freq adjustment immediately.
> In this case, we don't return early when the offset is smaller
> than interval_cycles, so we don't end up accumulating any time
> into xtime_nsec. But we do go on to call timekeeping_adjust(),
> which modifies the mult value, and subtracts from xtime_nsec
> to correct for the new mult value.

We don't do anything. :)

> Here because we did not accumulate anything, we have a window
> where the _COARSE clockids that don't utilize the mult*offset
> value can see an inconsistency.
>
> So to fix this, rework the timekeeping_advance() logic a bit
> so that when we are called from do_adjtimex() and the offset
> is smaller than cycle_interval, we call
> timekeeping_forward(), to first accumulate the sub-interval
> time into xtime_nsec. Then with no unaccumulated cycles in
> offset, we can do the mult adjustment without worry of the
> subtraction having an impact.

It's a smart solution. I briefly pondered something similar, but I'm not
really fond of the fact that it causes a clock_was_set() event for no
good reason.

clock_was_set() means that there is a time jump. But that's absolutely
not the case with do_adjtimex() changing the frequency for quick
adjustments. That does not affect continuity at all.

That event causes avoidable overhead in the kernel, but it's also
exposed to user space via timerfd TFD_TIMER_CANCEL_ON_SET.
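
For reference, a minimal sketch of that user space exposure
(illustrative only): a timerfd armed with TFD_TIMER_CANCEL_ON_SET has
its read() fail with ECANCELED when CLOCK_REALTIME is set:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>

int main(void)
{
	struct itimerspec its = {};
	struct timespec now;
	uint64_t ticks;
	int fd;

	fd = timerfd_create(CLOCK_REALTIME, 0);
	if (fd < 0) {
		perror("timerfd_create");
		return 1;
	}

	/* Arm an absolute timer far in the future */
	clock_gettime(CLOCK_REALTIME, &now);
	its.it_value.tv_sec = now.tv_sec + 3600;

	if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
			    &its, NULL)) {
		perror("timerfd_settime");
		return 1;
	}

	/* read() fails with ECANCELED when the clock is set discontinuously */
	if (read(fd, &ticks, sizeof(ticks)) < 0 && errno == ECANCELED)
		printf("observed a clock-was-set event\n");
	return 0;
}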

I have no really strong opinion about that, but the route through
clock_was_set() triggers my semantic mismatch sensors :)

> NOTE: This was implemented as a potential alternative to
> Thomas' approach here:
>    https://lore.kernel.org/lkml/87cyej5rid.ffs@tglx/
>
> And similarly, it needs some additional review and testing,
> as it was developed while packing for conference travel.

We can debate that next week over your favourite beverage :)

Have a safe trip!

Thanks,

        tglx
Re: [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by John Stultz 10 months, 4 weeks ago
On Sat, Mar 15, 2025 at 12:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, Mar 14 2025 at 17:37, John Stultz wrote:
> > Here because we did not accumulate anything, we have a window
> > where the _COARSE clockids that don't utilize the mult*offset
> > value can see an inconsistency.
> >
> > So to fix this, rework the timekeeping_advance() logic a bit
> > so that when we are called from do_adjtimex() and the offset
> > is smaller than cycle_interval, we call
> > timekeeping_forward(), to first accumulate the sub-interval
> > time into xtime_nsec. Then with no unaccumulated cycles in
> > offset, we can do the mult adjustment without worry of the
> > subtraction having an impact.
>
> It's a smart solution. I briefly pondered something similar, but I'm not
> really fond of the fact that it causes a clock_was_set() event for no
> good reason.
>
> clock_was_set() means that there is a time jump. But that's absolutely
> not the case with do_adjtimex() changing the frequency for quick
> adjustments. That does not affect continuity at all.

Oh, good point.  I wasn't thinking clearly about the semantics, and
was being a little paranoid there (as most calls to
timekeeping_forward_now() have clock_was_set() calls that follow). I
suspect I can do away with that bit and avoid it.  I'll respin later
this week.

> That event causes avoidable overhead in the kernel, but it's also
> exposed to user space via timerfd TFD_TIMER_CANCEL_ON_SET.
>
> I have no really strong opinion about that, but the route through
> clock_was_set() triggers my semantic mismatch sensors :)

Yeah, it's a fair point, thanks for raising it!

> > NOTE: This was implemented as a potential alternative to
> > Thomas' approach here:
> >    https://lore.kernel.org/lkml/87cyej5rid.ffs@tglx/
> >
> > And similarly, it needs some additional review and testing,
> > as it was developed while packing for conference travel.
>
> We can debate that next week over your favourite beverage :)

Looking forward to it :)
-john
Re: [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by Thomas Gleixner 10 months, 3 weeks ago
On Sat, Mar 15 2025 at 16:22, John Stultz wrote:
> On Sat, Mar 15, 2025 at 12:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > So to fix this, rework the timekeeping_advance() logic a bit
>> > so that when we are called from do_adjtimex() and the offset
>> > is smaller than cycle_interval, we call
>> > timekeeping_forward(), to first accumulate the sub-interval
>> > time into xtime_nsec. Then with no unaccumulated cycles in
>> > offset, we can do the mult adjustment without worry of the
>> > subtraction having an impact.
>>
>> It's a smart solution. I briefly pondered something similar, but I'm not
>> really fond of the fact that it causes a clock_was_set() event for no
>> good reason.
>>
>> clock_was_set() means that there is a time jump. But that's absolutely
>> not the case with do_adjtimex() changing the frequency for quick
>> adjustments. That does not affect continuity at all.
>
> Oh, good point.  I wasn't thinking clearly about the semantics, and
> was being a little paranoid there (as most calls to
> timekeeping_forward_now() have clock_was_set() calls that follow). I
> suspect I can do away with that bit and avoid it.  I'll respin later
> this week.

Actually your patch is not even emitting a clock_was_set() event:

> +	if (offset < real_tk->cycle_interval) {
> +		timekeeping_forward(tk, now);
> +		*clock_set = 1;
> +		return 0;
> +	}

#define TK_CLEAR_NTP            (1 << 0)
#define TK_CLOCK_WAS_SET        (1 << 1)

So it clears NTP instead. Not really what you want either. :)

But yes, it simply can forward the clock without side effects.

I think that this should be done for all TK_ADV_FREQ adjustments. In
case of such asynchronous adjustments it does not make any sense to take
the old ntp_error value into account and accumulate some more. In fact
this simply should clear ntp_error so the new value goes into effect as
provided by NTP and not skewed by ntp_error.

These async adjustments (usually very small ones) happen when the
current source degrades and chronyd/ntpd switches over to a new server.

Something like the below.

Thanks

        tglx
---
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -682,20 +682,19 @@ static void timekeeping_update_from_shad
 }
 
 /**
- * timekeeping_forward_now - update clock to the current time
+ * timekeeping_forward - update clock to given cycle now value
  * @tk:		Pointer to the timekeeper to update
+ * @cycle_now:  Current clocksource read value
  *
  * Forward the current clock to update its state since the last call to
  * update_wall_time(). This is useful before significant clock changes,
  * as it avoids having to deal with this time offset explicitly.
  */
-static void timekeeping_forward_now(struct timekeeper *tk)
+static void timekeeping_forward(struct timekeeper *tk, u64 cycle_now)
 {
-	u64 cycle_now, delta;
+	u64 delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
+				      tk->tkr_mono.clock->max_raw_delta);
 
-	cycle_now = tk_clock_read(&tk->tkr_mono);
-	delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask,
-				  tk->tkr_mono.clock->max_raw_delta);
 	tk->tkr_mono.cycle_last = cycle_now;
 	tk->tkr_raw.cycle_last  = cycle_now;
 
@@ -711,6 +710,21 @@ static void timekeeping_forward_now(stru
 }
 
 /**
+ * timekeeping_forward_now - update clock to the current time
+ * @tk:		Pointer to the timekeeper to update
+ *
+ * Forward the current clock to update its state since the last call to
+ * update_wall_time(). This is useful before significant clock changes,
+ * as it avoids having to deal with this time offset explicitly.
+ */
+static void timekeeping_forward_now(struct timekeeper *tk)
+{
+	u64 cycle_now = tk_clock_read(&tk->tkr_mono);
+
+	timekeeping_forward(tk, cycle_now);
+}
+
+/**
  * ktime_get_real_ts64 - Returns the time of day in a timespec64.
  * @ts:		pointer to the timespec to be set
  *
@@ -2151,6 +2165,54 @@ static u64 logarithmic_accumulation(stru
 	return offset;
 }
 
+static u64 timekeeping_accumulate(struct timekeeper *tk, u64 offset,
+				  enum timekeeping_adv_mode mode,
+				  unsigned int *clock_set)
+{
+	int shift = 0, maxshift;
+
+	/*
+	 * TK_ADV_FREQ indicates that adjtimex(2) directly set the
+	 * frequency or the tick length.
+	 *
+	 * Accumulate the offset, so that the new multiplier starts from
+	 * now. This is required as otherwise, for offsets which are
+	 * smaller than tk::cycle_interval, timekeeping_adjust() could set
+	 * xtime_nsec backwards, which subsequently causes time to go
+	 * backwards in the coarse time getters. But even for the case
+	 * where offset is greater than tk::cycle_interval the periodic
+	 * accumulation does not have much value.
+	 *
+	 * Also reset tk::ntp_error as it does not make sense to keep the
+	 * old accumulated error around in this case.
+	 */
+	if (mode == TK_ADV_FREQ) {
+		timekeeping_forward(tk, tk->tkr_mono.cycle_last + offset);
+		tk->ntp_error = 0;
+		return 0;
+	}
+
+	/*
+	 * With NO_HZ we may have to accumulate many cycle_intervals
+	 * (think "ticks") worth of time at once. To do this efficiently,
+	 * we calculate the largest doubling multiple of cycle_intervals
+	 * that is smaller than the offset.  We then accumulate that
+	 * chunk in one go, and then try to consume the next smaller
+	 * doubled multiple.
+	 */
+	shift = ilog2(offset) - ilog2(tk->cycle_interval);
+	shift = max(0, shift);
+	/* Bound shift to one less than what overflows tick_length */
+	maxshift = (64 - (ilog2(ntp_tick_length()) + 1)) - 1;
+	shift = min(shift, maxshift);
+	while (offset >= tk->cycle_interval) {
+		offset = logarithmic_accumulation(tk, offset, shift, clock_set);
+		if (offset < tk->cycle_interval << shift)
+			shift--;
+	}
+	return offset;
+}
+
 /*
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
@@ -2160,7 +2222,6 @@ static bool timekeeping_advance(enum tim
 	struct timekeeper *tk = &tk_core.shadow_timekeeper;
 	struct timekeeper *real_tk = &tk_core.timekeeper;
 	unsigned int clock_set = 0;
-	int shift = 0, maxshift;
 	u64 offset;
 
 	guard(raw_spinlock_irqsave)(&tk_core.lock);
@@ -2177,24 +2238,7 @@ static bool timekeeping_advance(enum tim
 	if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)
 		return false;
 
-	/*
-	 * With NO_HZ we may have to accumulate many cycle_intervals
-	 * (think "ticks") worth of time at once. To do this efficiently,
-	 * we calculate the largest doubling multiple of cycle_intervals
-	 * that is smaller than the offset.  We then accumulate that
-	 * chunk in one go, and then try to consume the next smaller
-	 * doubled multiple.
-	 */
-	shift = ilog2(offset) - ilog2(tk->cycle_interval);
-	shift = max(0, shift);
-	/* Bound shift to one less than what overflows tick_length */
-	maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
-	shift = min(shift, maxshift);
-	while (offset >= tk->cycle_interval) {
-		offset = logarithmic_accumulation(tk, offset, shift, &clock_set);
-		if (offset < tk->cycle_interval<<shift)
-			shift--;
-	}
+	offset = timekeeping_accumulate(tk, offset, mode, &clock_set);
 
 	/* Adjust the multiplier to correct NTP error */
 	timekeeping_adjust(tk, offset);
Re: [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by John Stultz 10 months, 3 weeks ago
On Sun, Mar 16, 2025 at 9:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, Mar 15 2025 at 16:22, John Stultz wrote:
> > On Sat, Mar 15, 2025 at 12:23 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > So to fix this, rework the timekeeping_advance() logic a bit
> >> > so that when we are called from do_adjtimex() and the offset
> >> > is smaller than cycle_interval, we call
> >> > timekeeping_forward(), to first accumulate the sub-interval
> >> > time into xtime_nsec. Then with no unaccumulated cycles in
> >> > offset, we can do the mult adjustment without worry of the
> >> > subtraction having an impact.
> >>
> >> It's a smart solution. I briefly pondered something similar, but I'm not
> >> really fond of the fact that it causes a clock_was_set() event for no
> >> good reason.
> >>
> >> clock_was_set() means that there is a time jump. But that's absolutely
> >> not the case with do_adjtimex() changing the frequency for quick
> >> adjustments. That does not affect continuity at all.
> >
> > Oh, good point.  I wasn't thinking clearly about the semantics, and
> > was being a little paranoid there (as most calls to
> > timekeeping_forward_now() have clock_was_set() calls that follow). I
> > suspect I can do away with that bit and avoid it.  I'll respin later
> > this week.
>
> Actually your patch is not even emitting a clock_was_set() event:
>
> > +     if (offset < real_tk->cycle_interval) {
> > +             timekeeping_forward(tk, now);
> > +             *clock_set = 1;
> > +             return 0;
> > +     }
>
> #define TK_CLEAR_NTP            (1 << 0)
> #define TK_CLOCK_WAS_SET        (1 << 1)
>
> So it clears NTP instead. Not really what you want either. :)

Hey Thomas,
   Sorry for the slow reply here. So I agree with you that we don't
want to set clock_set above, that was my mistake. But this last bit I
don't think is right, as timekeeping_advance() just returns a bool
(return !!clock_set;), which is used to decide whether to call
clock_was_set() - not the argument passed to clock_was_set().


> But yes, it simply can forward the clock without side effects.
>
> I think that this should be done for all TK_ADV_FREQ adjustments. In
> case of such asynchronous adjustments it does not make any sense to take
> the old ntp_error value into account and accumulate some more. In fact
> this simply should clear ntp_error so the new value goes into effect as
> provided by NTP and not skewed by ntp_error.
>
> These async adjustments (usually very small ones) happen when the
> current source degrades and chronyd/ntpd switches over to a new server.
>
> Something like the below.

So I finally got a chance to look at the diff between your change and
mine, and your changes look good to me. Thanks again for catching the
clock_set thinko, and I agree clearing ntp_error looks like the right
thing to do.  I'm going to do some testing here with them and resubmit
shortly.

thanks
-john
Re: [RFC PATCH 1/2] time/timekeeping: Fix possible inconsistencies in _COARSE clockids
Posted by Thomas Gleixner 10 months, 3 weeks ago
On Thu, Mar 20 2025 at 19:01, John Stultz wrote:
> On Sun, Mar 16, 2025 at 9:56 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> #define TK_CLEAR_NTP            (1 << 0)
>> #define TK_CLOCK_WAS_SET        (1 << 1)
>>
>> So it clears NTP instead. Not really what you want either. :)
>
> Hey Thomas,
>    Sorry for the slow reply here.   So I agree with you that we don't
> want to set clock_set above, that was my mistake. But this last bit I
> don't think is right, as timekeeping advance() just returns a bool
> (return !!clock_set;), which is used to decide to call clock_was_set()
> or not  - not the argument passed to clock_was_set().

timekeeping_advance() also uses the clock_set internally.

      clock_set |= ....
      timekeeping_update_from_shadow(..., clock_set);

timekeeping_update_from_shadow() evaluates the TK... bits.

Thanks,

        tglx
[RFC PATCH 2/2] selftests/timers: Improve skew_consistency by testing with other clockids
Posted by John Stultz 10 months, 4 weeks ago
Lei Chen reported a bug with CLOCK_MONOTONIC_COARSE having
inconsistencies when NTP is adjusting the clock frequency.

This has gone seemingly undetected for ~15 years, illustrating a
clear gap in our testing.

The skew_consistency test is intended to catch this sort of
problem, but was focused on only evaluating CLOCK_MONOTONIC, and
thus missed the problem on CLOCK_MONOTONIC_COARSE.

So adjust the test to run with all clockids for 60 seconds each
instead of 10 minutes with just CLOCK_MONOTONIC.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: linux-kselftest@vger.kernel.org
Cc: kernel-team@android.com
Cc: Lei Chen <lei.chen@smartx.com>
Reported-by: Lei Chen <lei.chen@smartx.com>
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
Signed-off-by: John Stultz <jstultz@google.com>
---
 tools/testing/selftests/timers/skew_consistency.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/timers/skew_consistency.c b/tools/testing/selftests/timers/skew_consistency.c
index 83450145fe657..46c391d7f45dc 100644
--- a/tools/testing/selftests/timers/skew_consistency.c
+++ b/tools/testing/selftests/timers/skew_consistency.c
@@ -47,7 +47,7 @@ int main(int argc, char **argv)
 
 	pid = fork();
 	if (!pid)
-		return system("./inconsistency-check -c 1 -t 600");
+		return system("./inconsistency-check -t 60");
 
 	ppm = 500;
 	ret = 0;
-- 
2.49.0.rc1.451.g8f38331e32-goog
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by John Stultz 11 months ago
On Sun, Mar 9, 2025 at 8:00 PM Lei Chen <lei.chen@smartx.com> wrote:
>
> timekeeping_apply_adjustment tries to adjust tkr_mono.mult by @mult_adj.
> If @mult_adj > 0, tk->tkr_mono.xtime_nsec will be decreased by @offset.
> Then timekeeping_update flushes the shadow_timekeeper to the real tk and
> vdso data region, and the rollback becomes visible.
>
> The drawing below illustrates why timekeeping_apply_adjustment
> decreases tk->tkr_mono.xtime_nsec.
>
>      cycle_interval       offset        clock_delta
> x-----------------------x----------x---------------------------x
>
> P0                      P1         P2                         P3
>
> N(P) denotes the nanosecond count at point P.
>
> Assume timekeeping_apply_adjustment runs at P1, with unaccumulated
> cycles @offset. Then tkr_mono.mult is adjusted from M1 to M2.
>
> Since offset happens before the tkr_mono.mult adjustment, we want to
> achieve:
> N(P3) == offset * M1 + clock_delta * M2 + N(P1)   -------- (1)
>
> But at P3, the code works as following:
> N(P3) := (offset + clock_delta) * M2 + N(P1)
>        = offset * M2 + clock_delta * M2 + N(P1)
>
> Apparently, N(P3) deviates from equation (1). To correct it, N(P1)
> should be adjusted at P2:
> N(P1) -= offset * (M2 - M1)
>
> To fix this issue, the patch accumulates offset into tk, and exports
> N(P2) to the real tk and vdso.
>
> tk.tkr_mono := N(P2) = N(P1) + offset * M1
>
> Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> N(P3) := N(P2) + clock_delta * M2
>
> Signed-off-by: Lei Chen <lei.chen@smartx.com>

Thanks for the email and the patch!

So, I'm having a bit of a hard time understanding the issue you're
actually seeing. It seems to be that you're seeing
CLOCK_MONOTONIC_COARSE go backwards?

The approach in your proposed patch seems to undo some of the
cycle_interval chunked accumulation, which was intentionally avoiding
the multiplication. Instead it tries to accumulate the rest of the
sub-cycle_interval unaccumulated delta. I don't think this is correct,
as it likely would cause problems with the error accounting, as we
accumulate the error in chunks calculated to match the cycle_interval
chunks.

Additionally, your changes are all generic to CLOCK_MONOTONIC, but
your subject suggests only MONOTONIC_COARSE is having the problem?
Could you explain in more detail the problem you are observing, and
how it triggers?
It seems like reading CLOCK_MONOTONIC_COARSE while making a large
negative NTP adjustment would be able to reproduce this?

thanks
-john
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by Lei Chen 11 months ago
Hi John,
Thanks for your reply.

On Fri, Mar 14, 2025 at 1:20 AM John Stultz <jstultz@google.com> wrote:
>
> On Sun, Mar 9, 2025 at 8:00 PM Lei Chen <lei.chen@smartx.com> wrote:
> >
> > timekeeping_apply_adjustment tries to adjust tkr_mono.mult by @mult_adj.
> > If @mult_adj > 0, tk->tkr_mono.xtime_nsec will be decreased by @offset.
> > Then timekeeping_update flushes the shadow_timekeeper to the real tk and
> > vdso data region, and the rollback becomes visible.
> >
> > The drawing below illustrates why timekeeping_apply_adjustment
> > decreases tk->tkr_mono.xtime_nsec.
> >
> >      cycle_interval       offset        clock_delta
> > x-----------------------x----------x---------------------------x
> >
> > P0                      P1         P2                         P3
> >
> > N(P) denotes the nanosecond count at point P.
> >
> > Assume timekeeping_apply_adjustment runs at P1, with unaccumulated
> > cycles @offset. Then tkr_mono.mult is adjusted from M1 to M2.
> >
> > Since offset happens before the tkr_mono.mult adjustment, we want to
> > achieve:
> > N(P3) == offset * M1 + clock_delta * M2 + N(P1)   -------- (1)
> >
> > But at P3, the code works as following:
> > N(P3) := (offset + clock_delta) * M2 + N(P1)
> >        = offset * M2 + clock_delta * M2 + N(P1)
> >
> > Apparently, N(P3) deviates from equation (1). To correct it, N(P1)
> > should be adjusted at P2:
> > N(P1) -= offset * (M2 - M1)
> >
> > To fix this issue, the patch accumulates offset into tk, and exports
> > N(P2) to the real tk and vdso.
> >
> > tk.tkr_mono := N(P2) = N(P1) + offset * M1
> >
> > Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> > N(P3) := N(P2) + clock_delta * M2
> >
> > Signed-off-by: Lei Chen <lei.chen@smartx.com>
>
> Thanks for the email and the patch!
>
> So, I'm having a bit of a hard time understanding the issue you're
> actually seeing. It seems to be that you're seeing
> CLOCK_MONOTONIC_COARSE go backwards?
>
I'm sorry for that.
Yes, it's CLOCK_MONOTONIC_COARSE that goes backwards.

I hope the code flow can help to explain it.

In user space, clock_gettime(CLOCK_MONOTONIC_COARSE) actually reads
tk->xtime_sec and tk->tkr_mono.xtime_nsec.

But when ntp calls adjtimex, the code works as following:
do_adjtimex
    timekeeping_advance
        timekeeping_apply_adjustment
             tk->tkr_mono.xtime_nsec -= offset; ------------------- (1)
    timekeeping_update
        update_vsyscall    -------------------------(2)

At (1), if offset > 0, xtime_nsec will go backwards.
And after (2), CLOCK_MONOTONIC_COARSE will go backwards.
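
A minimal userspace checker along these lines (an illustrative sketch;
the inconsistency-check selftest does the same more thoroughly):

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec prev, now;

	clock_gettime(CLOCK_MONOTONIC_COARSE, &prev);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC_COARSE, &now);
		/* CLOCK_MONOTONIC_COARSE must never go backwards */
		if (now.tv_sec < prev.tv_sec ||
		    (now.tv_sec == prev.tv_sec && now.tv_nsec < prev.tv_nsec)) {
			printf("backwards: %lld.%09ld -> %lld.%09ld\n",
			       (long long)prev.tv_sec, prev.tv_nsec,
			       (long long)now.tv_sec, now.tv_nsec);
			return 1;
		}
		prev = now;
	}
}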

> The approach in your proposed patch seems to undo some of the
> cycle_interval chunked accumulation, which was intentionally avoiding
> the multiplication. Instead it tries to accumulate the rest of the
> sub-cycle_interval unaccumulated delta. I don't think this is correct,
> as it likely would cause problems with the error accounting, as we
> accumulate the error in chunks calculated to match the cycle_interval
> chunks.
Thanks for your suggestion.
Can we just skip modifying tk->tkr_mono.xtime_nsec
in timekeeping_apply_adjustment()?
>
> Additionally, your changes are all generic to CLOCK_MONOTONIC, but
> your subject suggests only MONOTONIC_CORASE is having the problem?
> Could you explain in more detail the problem you are observing, and
> how it triggers?
> It seems like reading CLOCK_MONOTONIC_COARSE while making a large
> negative NTP adjustment would be able to reproduce this?
>
> thanks
> -john
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by John Stultz 10 months, 4 weeks ago
On Thu, Mar 13, 2025 at 11:32 PM Lei Chen <lei.chen@smartx.com> wrote:
>
> Hi John,
> Thanks for your reply.
>
> On Fri, Mar 14, 2025 at 1:20 AM John Stultz <jstultz@google.com> wrote:
> >
> > On Sun, Mar 9, 2025 at 8:00 PM Lei Chen <lei.chen@smartx.com> wrote:
> > >
> > > timekeeping_apply_adjustment tries to adjust tkr_mono.mult by @mult_adj.
> > > If @mult_adj > 0, tk->tkr_mono.xtime_nsec will be decreased by @offset.
> > > Then timekeeping_update flushes the shadow_timekeeper to the real tk and
> > > vdso data region, and the rollback becomes visible.
> > >
> > > The drawing below illustrates why timekeeping_apply_adjustment
> > > decreases tk->tkr_mono.xtime_nsec.
> > >
> > >      cycle_interval       offset        clock_delta
> > > x-----------------------x----------x---------------------------x
> > >
> > > P0                      P1         P2                         P3
> > >
> > > N(P) denotes the nanosecond count at point P.
> > >
> > > Assume timekeeping_apply_adjustment runs at P1, with unaccumulated
> > > cycles @offset. Then tkr_mono.mult is adjusted from M1 to M2.
> > >
> > > Since offset happens before the tkr_mono.mult adjustment, we want to
> > > achieve:
> > > N(P3) == offset * M1 + clock_delta * M2 + N(P1)   -------- (1)
> > >
> > > But at P3, the code works as following:
> > > N(P3) := (offset + clock_delta) * M2 + N(P1)
> > >        = offset * M2 + clock_delta * M2 + N(P1)
> > >
> > > Apparently, N(P3) deviates from equation (1). To correct it, N(P1)
> > > should be adjusted at P2:
> > > N(P1) -= offset * (M2 - M1)
> > >
> > > To fix this issue, the patch accumulates offset into tk, and exports
> > > N(P2) to the real tk and vdso.
> > >
> > > tk.tkr_mono := N(P2) = N(P1) + offset * M1
> > >
> > > Then at P3, we calculate N(P3) based on N(P2) instead of N(P1):
> > > N(P3) := N(P2) + clock_delta * M2
> > >
> > > Signed-off-by: Lei Chen <lei.chen@smartx.com>
> >
> > Thanks for the email and the patch!
> >
> > So, I'm having a bit of a hard time understanding the issue you're
> > actually seeing. It seems to be that you're seeing
> > CLOCK_MONOTONIC_COARSE go backwards?
> >
> I'm sorry for that.
> Yes, it's CLOCK_MONOTONIC_COARSE that goes backwards.
>
> I hope the code flow can help to explain it.
>
> In user space, clock_gettime(CLOCK_MONOTONIC_COARSE) actually reads
> tk->xtime_sec and tk->tkr_mono.xtime_nsec.
>
> But when ntp calls adjtimex, the code works as following:
> do_adjtimex
>     timekeeping_advance
>         timekeeping_apply_adjustment
>              tk->tkr_mono.xtime_nsec -= offset; ------------------- (1)
>     timekeeping_update
>         update_vsyscall    -------------------------(2)
>
> At (1) , if offset > 0, xtime_nsec will go backwards.
> And  after (2) CLOCK_MONOTONIC_COARSE will go backwards.

So, I understand we subtract offset from xtime_nsec when the mult is
incremented, as this is necessary to avoid time inconsistencies with
the non-coarse clockids, since we have unaccumulated cycles in offset.

Briefly:
mult_2 = mult_1 + 1
xtime_nsec_1 + (mult_1 * offset)  ==  xtime_nsec_2  + (mult_2 * offset)
  ==  xtime_nsec_2  + ((mult_1 + 1) * offset)
  ==  xtime_nsec_2  + (mult_1 * offset) + offset

Then cancelling from both sides:
xtime_nsec_1  ==  xtime_nsec_2  + offset
And re-arranging as such:
xtime_nsec_2 == xtime_nsec_1 - offset

So yeah, I see the concern, as when we are dealing with the _COARSE
clocks, we don't use the offset value. So the subtracting offset from
xtime_nsec initially seems problematic.

But the key here is the timekeeping_adjust() logic, which adjusts the
multiplier and is all done *after* we do the accumulation, adding
(possibly multiple) (mult*cycle_interval) intervals from the offset to
xtime_nsec.

After the accumulation, offset must be smaller than cycle_interval.
So the negative (un-multiplied) offset-sized adjustment to xtime_nsec
should be smaller than what was added to it immediately before. Both
accumulation and adjustment are done atomically together under the
tk_core.lock, so I'd not expect to see an inconsistency here.

My suspicion is that if we are coming into timekeeping_advance() more
frequently than every cycle_interval cycles, then it's possible we didn't
actually accumulate anything, but had some leftover ntp error that
triggered a mult adjustment, causing xtime_nsec to be decremented
without the normal accumulation before it, opening up a window for
the inconsistency.

The "if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)"
check there is supposed to catch that, but with
timekeeping_advance(TK_ADV_FREQ) it looks like during an ntp
adjustment we can try to do the mult adjustment without accumulation.
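
For illustration, a sketch of the trigger side (assumes root and a
concurrently running coarse-clock consistency checker; the
skew_consistency selftest does effectively this):

#include <stdio.h>
#include <unistd.h>
#include <sys/timex.h>

int main(void)
{
	long ppm = 500;

	for (;;) {
		struct timex tx = {
			.modes	= ADJ_FREQUENCY,
			.freq	= ppm << 16,	/* units of scaled ppm */
		};

		if (adjtimex(&tx) < 0) {
			perror("adjtimex");
			return 1;
		}
		ppm = -ppm;	/* alternate the skew direction */
		usleep(10000);
	}
}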

Thomas just tells me now he's found a fix, so hopefully that will be
incoming soon. I'll probably be drafting my own approach just to make
sure we're aligned.

I've also found this is pretty easy to reproduce but unfortunately the
kselftest skew_consistency focused on MONOTONIC and didn't check
_COARSE clockids, so I'll try to tweak that test as well so that we
have better coverage for this case.

Thanks so much for finding and reporting this!

thanks
-john
Re: [PATCH] Fix rolling back of CLOCK_MONOTONIC_COARSE
Posted by Thomas Gleixner 10 months, 4 weeks ago
On Fri, Mar 14 2025 at 12:21, John Stultz wrote:
> On Thu, Mar 13, 2025 at 11:32 PM Lei Chen <lei.chen@smartx.com> wrote:
> My suspicion is that if we are coming into timekeeping_advance() more
> frequently than every cycle_interval cycles, then it's possible we didn't
> actually accumulate anything, but had some leftover ntp error that
> triggered a mult adjustment, causing xtime_nsec to be decremented
> without the normal accumulation before it, opening up a window for
> the inconsistency.
>
> The "if (offset < real_tk->cycle_interval && mode == TK_ADV_TICK)"
> check there is supposed to catch that, but with
> timekeeping_advance(TK_ADV_FREQ) it looks like during an ntp
> adjustment we can try to do the mult adjustment without accumulation.

Even if the delta is > cycle_interval in the normal TK_ADV_TICK case,
the accumulation will always have a leftover offset.

After the accumulation, timekeeping_advance() invokes timekeeping_adjust(offset).

timekeeping_adjust() does multiplier adjustment based on [NTP] tick
length and ntp error. That's completely independent of TK_ADV_FREQ.

So _ANY_ adjustment of the multiplier has to adjust xtime_nsec so that

     T1nsec[OLD] = xtime_sec[OLD] * NSEC_PER_SEC +
		   (xtime_nsec[OLD] + offset * mult[OLD]) >> shift;

     T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
		   (xtime_nsec[NEW] + offset * mult[NEW]) >> shift;

results in:

     T1nsec[OLD] == T1nsec[NEW]

So while Lei's change fulfils that requirement by advancing xtime_nsec
so that this becomes:

     T1nsec[NEW] = xtime_sec[NEW] * NSEC_PER_SEC +
                   xtime_nsec[NEW] >> shift;

it changes a fundamental property of the timekeeping code:

  cycle_last has to be incremented by multiples of cycle_interval to
  ensure that the accumulation, on which the NTP correction relies, has
  a periodic base. That allows steering the clock in a coordinated way
  against ntp_tick_length and guarantees smooth error accumulation.

With Lei's changes this property is lost because every multiplier
adjustment moves cycle_last around by the random offset which happened
to be left over in the accumulation. So the periodic advancement of
cycle_last gets lost and turns into a self-modifying random number
generator, depending on the size of the remainder (offset).

Now coming back to your suspicion that this is related to TK_ADV_FREQ.
That's one way to trigger it, but even the regular steering algorithm,
which smoothes the multiplier adjustment with

           mult [+-] = 1

exposes this because there is no guarantee that the remaining offset is
small enough to be invisible. All it takes is:

      (offset >> shift) >= 1

which means it's greater than or equal to one nanosecond. Assume a clock
frequency of 1GHz. Then

      mult ~= shift

So all it takes is to hit timekeeping_advance() at a point in time where
the remainder becomes greater than 1 << shift, which is pretty much
guaranteed to happen with NOHZ on a fully idle system.

In the 1GHz clock frequency case with HZ=100 and the in-use

   1 << shift = 1 << 23 = 8388608

this opens a window of ~1.62 milliseconds, which is easy enough to
hit. It does not even require NOHZ because in the NTP steering case the
period of the tick is no longer the same as the cycle_interval. Which
means that the remainder slightly increases per tick and hits the window
where the remainder becomes large enough roughly once per second.
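
Working the numbers under those assumptions (1GHz clock, HZ=100,
shift = 23): a tick is cycle_interval = 10^7 cycles (10ms), and a one
nanosecond step needs offset >= 1 << 23 = 8388608 cycles (~8.39ms), so
the exposed window per tick is 10^7 - 8388608 = 1611392 cycles, i.e.
roughly 1.6ms.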

I've instrumented timekeeping_advance() accordingly and the
trace_printk() nicely fills the trace buffer over time with xtime_nsec
going backwards. Except for the TK_ADV_FREQ case, this is always just
one nanosecond, but it's one nanosecond too much.

As I said in the previous mail, the fundamental flaw of the coarse time
getters is the assumption that xtime_sec + xtime_nsec does not go
backwards (except when the clock was set).

That's not the case and we can't change that without rewriting the
accumulation algorithm. I looked at the math required and it becomes a
total nightmare to compensate for that. I really don't have any
interest in chasing the resulting NTP regressions for the next year
just for that.

Providing a separate and accurate coarse_nsec value is easy enough and
does not touch any of the well established and correct accumulation and
steering code at all.

Thanks,

        tglx