This is the bugfix subset of the RFC series last posted at
https://lore.kernel.org/all/20260520135207.37826-1-dwmw2@infradead.org/

These patches stand alone and fix several long-standing precision issues
in the kernel's NTP drift tracking. The feed-forward clock discipline
which was the reason I was *looking* remains waiting in the wings (and
my ffclock branch) for now; this is just the precision fixes on which
it will depend.

My mental model is that the kernel tracks three time series:
• (A): xtime, reported as CLOCK_REALTIME.
• (B): xtime+ntp_error, the time NTP *wants* to report right now.
• (C): xtime+ntp_error+time_{adjust,offset}, what NTP wants to skew to.

Ignore the units. They'll make your head hurt.

The fundamental unit of timekeeping is the tick, which is a fixed number
(cycle_interval) of clocksource cycles. Each tick, a multiplier 'mult'
is chosen which represents the time of a single cycle during that tick.
Since the actual tick length generally isn't a multiple of the fixed
cycle_interval, we end up dithering between a rounded-down 'mult' which
results in a slightly short tick, and 'mult+1' which results in a
slightly long tick, such that the average comes out right. And thus
the time we report (A) sawtooths above and below the time we actually
*want* to be reporting (B). The difference between the two is tracked
in the ntp_error variable.
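
The dithering described above can be sketched as a toy model. This is
not the kernel's code: the toy_* names, the struct layout and the sign
convention (ntp_error > 0 meaning (A) is behind (B)) are assumptions
made for illustration, and 'target' stands in for the desired tick
length in the same shifted-nanosecond units as cycle_interval * mult.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of the mult / mult+1 dithering; not kernel code. */
struct toy_tk {
	uint32_t cycle_interval;	/* clocksource cycles per tick */
	uint32_t mult;			/* rounded-down length of one cycle */
	int64_t ntp_error;		/* (B) - (A), same units as target */
};

static void toy_init(struct toy_tk *tk, uint32_t cycle_interval,
		     uint64_t target)
{
	assert(cycle_interval > 0);
	tk->cycle_interval = cycle_interval;
	tk->mult = (uint32_t)(target / cycle_interval);	/* round down */
	tk->ntp_error = 0;
}

/* Advance one tick and return the length actually added to (A). */
static uint64_t toy_tick(struct toy_tk *tk, uint64_t target)
{
	/* Behind the wanted time? Then run this tick slightly long. */
	uint32_t mult = tk->mult + (tk->ntp_error > 0 ? 1 : 0);
	uint64_t applied = (uint64_t)tk->cycle_interval * mult;

	/* Track how far (A) now is from (B). */
	tk->ntp_error += (int64_t)target - (int64_t)applied;
	return applied;
}

/* Run nticks ticks and return the largest |ntp_error| seen. */
static int64_t toy_run(uint32_t cycle_interval, uint64_t target,
		       unsigned int nticks)
{
	struct toy_tk tk;
	int64_t worst = 0;

	toy_init(&tk, cycle_interval, target);
	for (unsigned int i = 0; i < nticks; i++) {
		toy_tick(&tk, target);
		int64_t e = tk.ntp_error < 0 ? -tk.ntp_error : tk.ntp_error;
		if (e > worst)
			worst = e;
	}
	return worst;
}
```

In this model the sawtooth stays bounded: |ntp_error| never reaches
cycle_interval, because a short tick adds the remainder
(target % cycle_interval) and a long tick subtracts the rest of
cycle_interval.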

Patch 2 is a single bug fix to that tracking; when an offset is applied
to xtime (A) in timekeeping_apply_adjustment() to ensure monotonicity
when 'mult' is being changed part-way through a tick, we forgot to
account for that change in ntp_error, which meant that the adjustment
became a permanent fixture and wasn't dithered away by subsequent
choices of 'mult'.

Patch 3 is cleaning up an older hack which also deals with remainder
errors. A coarse clocksource like the ACPI PM timer at 3.579545 MHz has
a period of 279.365ns per cycle, which isn't a factor of 1ms; we end up
using 3580 cycles which means 1.000127ms per tick. The "xtime_remainder"
variable introduced in commit a386b5af8edd was intended to represent
that +127PPM discrepancy. It was added to the NTP-adjusted tick length
internally by the timekeeping code, where to all intents and purposes it
looked like a constant error bias that was *failing* to account for the
delta between 3580 * mult and the intended tick length.

In fact it *was* a consistent bias, and what it achieves is to allow
NTP its full ±500PPM skew around the nominal counter frequency, rather
than *starting* at +127PPM and only allowing NTP to skew by +373/-627
PPM. Without it, everything works fine, but you do *see* that NTP is
applying that -127PPM skew. Which is why I completely ripped it out in
the previous v4 series. This time I've refactored it so that the NTP
code in the kernel can actually see it and properly factor it into
the tick_length calculation in ntp_update_frequency(). Not only is
this more consistent and less of a headache, but importantly it
means that when the reference-tracking code comes along later, the
constant bias can actually be taken into account properly too.
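
The ACPI PM timer arithmetic above can be checked with a few lines of
integer maths. The helper names are hypothetical, and rounding to the
nearest cycle and nanosecond is a simplification for illustration, not
necessarily how the kernel derives cycle_interval.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ULL

/* Whole cycles closest to the nominal tick length. */
static uint64_t toy_cycles_per_tick(uint64_t freq_hz, uint64_t tick_ns)
{
	return (freq_hz * tick_ns + TOY_NSEC_PER_SEC / 2) / TOY_NSEC_PER_SEC;
}

/* Real duration of that many cycles, rounded to the nearest ns. */
static uint64_t toy_actual_tick_ns(uint64_t freq_hz, uint64_t cycles)
{
	assert(freq_hz > 0);
	return (cycles * TOY_NSEC_PER_SEC + freq_hz / 2) / freq_hz;
}

/* Signed error of the real tick against the nominal one, in PPM. */
static int64_t toy_tick_error_ppm(uint64_t actual_ns, uint64_t nominal_ns)
{
	int64_t diff = (int64_t)actual_ns - (int64_t)nominal_ns;

	return diff * 1000000 / (int64_t)nominal_ns;
}

/* Skew range left to NTP if a constant bias eats into +/- max_ppm. */
static void toy_residual_range(int64_t bias_ppm, int64_t max_ppm,
			       int64_t *hi, int64_t *lo)
{
	*hi = max_ppm - bias_ppm;
	*lo = -max_ppm - bias_ppm;
}
```

With freq_hz = 3579545 and a 1000000 ns nominal tick, this gives 3580
cycles, a 1000127 ns real tick, a +127PPM error, and a residual range
of +373/-627 PPM when the bias is charged against NTP's ±500PPM.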

Then we come to the phase adjustments via time_offset and time_adjust,
which come from ADJ_OFFSET and ADJ_ADJTIME respectively. The former
applies its delta exponentially as an infinite impulse response, while
the latter does so linearly at a maximum rate of 500µs/s.
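
As a rough sketch of the difference between the two shapes, the
per-second step of each can be modelled like this. The toy_* names are
hypothetical and 'shift' is a stand-in for whatever time constant the
PLL is using; this is not the code in second_overflow().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_SLEW_NS 500000	/* 500µs per second */

/*
 * Exponential (time_offset style): each second deliver a fixed
 * fraction, 1/2^shift, of whatever offset is still outstanding.
 */
static int64_t toy_offset_chunk(int64_t *time_offset_ns, unsigned int shift)
{
	assert(shift < 63);
	int64_t delta = *time_offset_ns / ((int64_t)1 << shift);

	*time_offset_ns -= delta;
	return delta;
}

/*
 * Linear (time_adjust style): each second deliver what is left,
 * clamped to the maximum slew rate in either direction.
 */
static int64_t toy_adjust_chunk(int64_t *time_adjust_ns)
{
	int64_t delta = *time_adjust_ns;

	if (delta > TOY_MAX_SLEW_NS)
		delta = TOY_MAX_SLEW_NS;
	if (delta < -TOY_MAX_SLEW_NS)
		delta = -TOY_MAX_SLEW_NS;
	*time_adjust_ns -= delta;
	return delta;
}
```

A 5000µs request takes ten one-second steps in the linear model, while
the exponential model delivers ever-smaller chunks of the remainder.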

They were both... approximate, at best. Each second, second_overflow()
would tweak the effective tick_length such that a new 'mult' was
calculated to run faster or slower as intended. We just kind of assumed
that there would be HZ ticks running at that adjusted length, and didn't
care or didn't notice that, especially in a NOHZ setup, the number of
ticks which run with that adjusted rate is only very *approximately* in
the vicinity of HZ.

By tweaking the per-second adjustment to impart a 'skew_delta' to the
timekeeping tick length and moving the actual accounting to the per-tick
code, we can recover the precision such that asking for a phase shift of
5000µs ends up with the kernel's timekeeping clamped *precisely*
5000000±1 ns ahead of the line it used to be tracking, not "around
4997500 ns" as it was before.
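
Why the tick count matters can be shown with a deliberately simple
model. It is only an illustration of the accounting problem, not the
mechanism these patches implement: the toy_* names are hypothetical,
the numbers are made up, and only positive budgets are handled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed-HZ style: pick a per-tick skew as if exactly hz ticks will
 * run, then see what is delivered when 'ticks' ticks really run.
 */
static int64_t toy_delivered_assumed(int64_t budget_ns, unsigned int hz,
				     unsigned int ticks)
{
	assert(hz > 0);
	return (budget_ns / hz) * (int64_t)ticks;
}

/*
 * Per-tick accounting: every tick takes its share out of what is left
 * and never more than that. Anything undelivered stays in
 * *remaining_ns to be carried into the next second.
 */
static int64_t toy_delivered_accounted(int64_t *remaining_ns,
				       unsigned int hz, unsigned int ticks)
{
	assert(hz > 0 && *remaining_ns >= 0);
	int64_t per_tick = *remaining_ns / hz;
	int64_t delivered = 0;

	for (unsigned int i = 0; i < ticks; i++) {
		int64_t step = per_tick;

		if (step > *remaining_ns)
			step = *remaining_ns;
		*remaining_ns -= step;
		delivered += step;
	}
	return delivered;
}
```

With a 5000000 ns budget and hz = 1000, the assumed-HZ model delivers
4990000 ns if only 998 ticks run and overshoots to 5015000 ns if 1003
run. The accounted model always satisfies delivered + remaining ==
budget, so nothing is lost or applied twice.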

And then in the final patch, we can remove the 'tick_length_base' as
we never actually *adjust* tick_length directly; it's all done through
skew_delta.

Most of this just didn't *matter* when NTP was constantly tweaking and
re-applying adjustments. But with it all fixed, we can now add a
timekeeping_set_reference() function which simply sets the tick_length
and time_offset and lets the kernel run, *precisely* tracking the
counter to real time relationship that it's given from either VMClock,
or userspace tooling which disciplines the TSC directly (perhaps via PPS
capture devices with PTM which latch the actual counter value at the
moment of the pulse).

As before, the actual reference clock support is in my branch at
https://git.infradead.org/?p=users/dwmw2/linux.git;a=shortlog;h=refs/heads/ffclock
and the QEMU patch to pass through the host's clock discipline is
https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough

David Woodhouse (6):
  MAINTAINERS: Add Miroslav as timekeeping reviewer
  timekeeping: Account for monotonicity adjustment in ntp_error
  timekeeping: Account for clocksource tick quantisation via NTP
  timekeeping: Drive time_offset skew via per-tick ntp_error transfer
  timekeeping: deliver adjtime() time_adjust via skew_delta
  ntp: Remove tick_length_base, use tick_length directly

 MAINTAINERS                         |   1 +
 include/linux/timekeeper_internal.h |   9 +-
 kernel/time/ntp.c                   | 195 ++++++++++++++++++++++++++++++------
 kernel/time/ntp_internal.h          |   5 +-
 kernel/time/timekeeping.c           |  80 +++++++++++++--
 5 files changed, 242 insertions(+), 48 deletions(-)