timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

[RFC PATCH v2 0/8] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

Posted by David Woodhouse 1 week ago

This is v2 of the series to add feed-forward clock discipline, allowing
a guest kernel to lock its system clock directly to a hypervisor-provided
vmclock reference with sub-10ns precision and no drift.

The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit. Unlike guest-driven NTP, it allows for accurate time
to be preserved across live migration.
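
As a rough illustration of the linear function above, a guest-side
evaluation might look like the following sketch. The struct layout, the
field names and the 32.32 fixed-point period are assumptions made for
this example and are not the vmclock ABI; `unsigned __int128` is a
GCC/Clang extension.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified reference line. The real vmclock page has a
 * different layout (and a sequence counter for consistent reads, which
 * is omitted here).
 */
struct ref_line {
	uint64_t counter_value;	/* counter reading at the reference point */
	uint64_t base_ns;	/* reference time at that reading, in ns */
	uint64_t period_frac;	/* ns per counter tick, 32.32 fixed point */
};

/* time = base + (counter - counter_value) * period */
static uint64_t ref_time_ns(const struct ref_line *r, uint64_t counter)
{
	uint64_t delta = counter - r->counter_value;
	/* 128-bit intermediate so that large deltas do not overflow */
	unsigned __int128 ns = (unsigned __int128)delta * r->period_frac;

	return r->base_ns + (uint64_t)(ns >> 32);
}
```

With a period of 0.5ns per tick (period_frac = 1 << 31), a counter that
has advanced 2000 ticks past counter_value yields base_ns + 1000.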

The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.

The previous series introduced an external oracle to drive the per-tick
dithering mechanism towards the reference clock. By fixing all the
inaccuracies and systematic drift in the kernel's own tracking, we can
dispense with the external oracle and just configure the timekeeping
using the existing frequency/tick_length and time_offset/ntp_error
mechanisms.

Changes since v1 (RFC):
 • Fixed three additional issues in the timekeeping code that were
   discovered during nanosecond-precision testing with the vmclock
   reference:
   - The clawback adjustment in timekeeping_apply_adjustment() moved
     xtime without updating ntp_error (patch 2).
   - The exponential tail of ntp_offset_chunk() asymptotically approached
     zero, preventing convergence to the final nanosecond (patch 3).
   - A divide-by-zero in timekeeping_adjust() when cycle_interval is
     momentarily zero during TSC recalibration on KVM guests (patch 4).
 • Replaced the per-tick absolute reference clamping with a cleaner
   mechanism: the skew from time_offset is now driven by per-tick
   transfer into ntp_error with a matching mult adjustment, rather than
   by inflating tick_length (patch 7). This gives exact per-tick
   accounting of the time_offset drain with no rounding loss.
 • The timekeeping_set_reference() API (patch 5) sets time_offset and
   the frequency, letting the standard skew mechanism handle convergence.
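
To make the exponential-tail problem concrete, here is a toy integer
model; it is not the kernel's ntp_offset_chunk(), and the names, the
shift and the minimum step are all hypothetical. Each tick drains a
fixed fraction of the remaining offset. Without a clamp the chunk
eventually rounds to zero and a residual is left behind forever; with
one possible clamp (a minimum step, or whatever remains if that is
smaller) the offset reaches exactly zero in a finite number of ticks.

```c
#include <stdint.h>
#include <stdlib.h>

#define DECAY_SHIFT	4	/* illustrative: drain 1/16 of the remainder per tick */
#define MIN_CHUNK	16	/* illustrative: smallest step once the decay gets tiny */

/* Amount of the remaining offset to apply on this tick. */
static int64_t offset_chunk(int64_t remaining, int clamp)
{
	int64_t chunk = remaining / (1 << DECAY_SHIFT);	/* truncates toward zero */

	if (clamp && llabs(chunk) < MIN_CHUNK) {
		if (llabs(remaining) <= MIN_CHUNK)
			chunk = remaining;	/* finish off what is left */
		else
			chunk = remaining > 0 ? MIN_CHUNK : -MIN_CHUNK;
	}
	return chunk;
}

/* Ticks needed to drain the offset to zero, or -1 if it never gets there. */
static int ticks_to_converge(int64_t offset, int clamp, int max_ticks)
{
	int n = 0;

	while (offset != 0 && n < max_ticks) {
		offset -= offset_chunk(offset, clamp);
		n++;
	}
	return offset == 0 ? n : -1;
}
```

In this model, ticks_to_converge(10000, 0, 100000) returns -1 because the
unclamped decay stalls with a nonzero remainder, while the clamped
variant returns a positive tick count for both +10000 and -10000.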

The series:

Patches 1-4: Timekeeping bugfixes (suitable for stable/independent review)
  1. Remove stale xtime_remainder from ntp_error accumulation.
  2. Account for clawback adjustment in ntp_error.
  3. Clamp time_offset delta to prevent infinite exponential tail.
  4. Guard against divide-by-zero during clocksource recalibration.

Patches 5-6: Feed-forward reference clock infrastructure
  5. Add timekeeping_set_reference() API for external clock references.
  6. Wire ptp_vmclock to call timekeeping_set_reference() on probe.

Patch 7: Improved time_offset skew mechanism
  7. Drive time_offset skew via per-tick ntp_error transfer instead of
     tick_length inflation, with mult adjustment for dithering bandwidth.
     (We can't *yet* kill tick_length_base; I have to frown at adjtime()
     some more first.)

Patch 8: Host-side vmclock page export (WIP)
  8. Add /dev/vmclock_host miscdev for VMM consumption.

Tested with QEMU passing through a vmclock device to a guest¹. The guest
clock converges to the reference within seconds and remains within
single-digit nanoseconds indefinitely, with no further external
correction. Injecting a ±10µs offset via ntp_set_time_offset() converges
to the target via the same exponential decay as before over about 70
seconds, and retains the same single-digit nanosecond jitter around
precisely ±10000ns once converged. Obviously in real usage, the
reference will be periodically changing too, but the feed-forward setup
does rely on the kernel being able to converge to, and remain on, the
precise line it's given.

¹ https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough

David Woodhouse (8):
  timekeeping: Remove xtime_remainder from ntp_error accumulation
  timekeeping: Account for clawback adjustment in ntp_error
  timekeeping: Clamp time_offset delta to prevent infinite tail
  timekeeping: Guard against divide-by-zero in timekeeping_adjust
  timekeeping: Add absolute reference for feed-forward clock discipline
  ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
  timekeeping: Drive time_offset skew via per-tick ntp_error transfer
  WIP: kernel/time: Add /dev/vmclock_host miscdev