[RFC PATCH v3 00/10] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock

David Woodhouse posted 10 patches 4 days, 9 hours ago
Posted by David Woodhouse 4 days, 9 hours ago
This is v3 of the series to allow feed-forward clock discipline, allowing
a guest kernel to lock its system clock directly to a hypervisor-provided
vmclock reference with nanosecond precision and no drift.

With all the drift-inducing bugs in the core timekeeping resolved in the 
first patches of the series, the RFC timekeeping_set_reference() 
function basically just sets the tick length and time_offset according 
to the reference, and lets the now-fixed core timekeeping get on with 
its job.

The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit. Unlike guest-driven NTP, it allows for accurate time
to be preserved across live migration.
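
As a rough illustration of the linear time function above, here is a
minimal sketch in C. The struct and its field names are hypothetical and
much simpler than the real vmclock ABI; it assumes the period is stored
as a fixed-point value scaled by 2^shift, and it relies on the GCC/Clang
unsigned __int128 extension. Reading the page consistently while the
hypervisor updates it is not shown.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified snapshot of the shared page. The field names
 * and layout are assumptions for illustration, not the vmclock ABI.
 */
struct ref_snapshot {
	uint64_t base_ns;        /* reference time at counter_value, in ns */
	uint64_t counter_value;  /* counter reading that base_ns refers to */
	uint64_t period_frac;    /* ns per counter tick, scaled by 2^shift */
	unsigned int shift;      /* fixed-point shift for period_frac */
};

/* time = base + (counter - counter_value) * period */
uint64_t ref_time_ns(const struct ref_snapshot *s, uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;

	/* A 128-bit intermediate avoids overflow for large deltas. */
	unsigned __int128 scaled = (unsigned __int128)delta * s->period_frac;

	return s->base_ns + (uint64_t)(scaled >> s->shift);
}
```

With a period of 0.5ns (period_frac = 1 << 31, shift = 32), a counter
that has advanced by 1000 ticks since counter_value yields base + 500ns.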

The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.

Now that I am testing *without* nohz in the guest, the jitter is no
longer as high as single-digit nanoseconds. The offset settles at ±1ns
and stays there, although convergence takes a while now that the
exponential tail clamping has been dropped.

Changes since v2:
 • Renamed "clawback" to "monotonicity adjustment" throughout (patch 2).
 • Dropped the exponential tail clamping (v2 patch 3).
 • Converted adjtime() to use time_offset to deliver skew too, and
   removed the separate 'tick_length_base', since adjusting tick_length
   directly is no longer used to skew the clock. The skew_delta does
   basically the same thing, but makes it easier to get the accounting
   right.
 • The timekeeping_set_reference() API (patch 8) now takes tk_core.lock
   and computes the phase offset internally, eliminating the race window
   that existed in v2 between setting the reference and the tick code
   consuming it.
 • vmclock_host (patch 10) is no longer marked WIP — it has a selftest
   and proper locking (but is still RFC).
 • Added MAINTAINERS entry for Miroslav Lichvar as timekeeping reviewer
   (patch 1).

The series:

Patch 1: MAINTAINERS update
  1. Add Miroslav Lichvar as timekeeping reviewer.

Patches 2-3: Timekeeping bugfixes (suitable for stable/independent review)
  2. Remove stale xtime_remainder from ntp_error accumulation.
  3. Account for monotonicity adjustment in ntp_error.

Patch 4: Independent bugfix
  4. Guard against divide-by-zero during clocksource recalibration.

Patches 5-7: NTP rework — eliminate tick_length_base
  5. Drive time_offset skew via per-tick ntp_error transfer instead of
     tick_length inflation, with mult adjustment for dithering bandwidth.
  6. Convert adjtime() to use time_offset directly instead of inflating
     tick_length, removing the rounding loss that prevented convergence.
  7. Remove tick_length_base entirely — tick_length is now always the
     NTP-disciplined value with no per-tick inflation.

Patches 8-9: Feed-forward reference clock infrastructure
  8. Add timekeeping_set_reference() API for external clock references.
  9. Wire ptp_vmclock to call timekeeping_set_reference() on probe.

Patch 10: Host-side vmclock page export
  10. Add /dev/vmclock_host miscdev for VMM consumption, with selftest.
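
The "rounding loss that prevented convergence" mentioned for patch 6 can
be illustrated with a toy model. This is not the kernel code, and both
function names are made up for this sketch. It only contrasts two ways
of delivering an offset over a number of ticks: a fixed, truncated
per-tick share, and draining an accumulator that tracks what is still
owed.

```c
#include <stdint.h>

/*
 * Deliver 'offset' as a fixed, truncated per-tick share.
 * Returns the part of the offset that is never delivered.
 */
int64_t residual_fixed_share(int64_t offset, int64_t ticks)
{
	int64_t per_tick = offset / ticks;  /* truncation drops the remainder */
	int64_t applied = per_tick * ticks;

	return offset - applied;
}

/*
 * Deliver 'offset' by draining a remaining-offset accumulator: each tick
 * takes an equal share of whatever is still outstanding.
 * Returns the part of the offset that is never delivered.
 */
int64_t residual_drained(int64_t offset, int64_t ticks)
{
	int64_t remaining = offset;

	for (int64_t left = ticks; left > 0; left--) {
		int64_t step = remaining / left;

		remaining -= step;
	}
	return remaining;
}
```

For an offset of 1000 units over 7 ticks, the fixed share delivers
7 × 142 = 994 and leaves 6 units undelivered, while the drained
accumulator ends at exactly zero because the last tick takes everything
that remains.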

Tested with QEMU passing through a vmclock device to a guest. The guest
clock converges to the reference within seconds and remains within
single-digit nanoseconds indefinitely, with no further external
correction.

https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough


David Woodhouse (10):
      MAINTAINERS: Add Miroslav as timekeeping reviewer
      timekeeping: Remove xtime_remainder from ntp_error accumulation
      timekeeping: Account for monotonicity adjustment in ntp_error
      timekeeping: Guard against divide-by-zero in timekeeping_adjust
      timekeeping: Drive time_offset skew via per-tick ntp_error transfer
      ntp: Convert adjtime() to use time_offset instead of tick_length inflation
      ntp: Remove tick_length_base, use tick_length directly
      timekeeping: Add absolute reference for feed-forward clock discipline
      ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
      kernel/time: Add /dev/vmclock_host miscdev

 MAINTAINERS                                        |   1 +
 drivers/ptp/ptp_vmclock.c                          |  95 +++++
 include/linux/timekeeper_internal.h                |   6 +-
 include/linux/timekeeping_reference.h              |  19 +
 kernel/time/Kconfig                                |   7 +
 kernel/time/Makefile                               |   1 +
 kernel/time/ntp.c                                  | 132 +++++--
 kernel/time/ntp_internal.h                         |   9 +
 kernel/time/timekeeping.c                          |  76 +++-
 kernel/time/vmclock_host.c                         | 391 +++++++++++++++++++++
 tools/testing/selftests/timers/vmclock_host_test.c | 171 +++++++++
 11 files changed, 867 insertions(+), 41 deletions(-)

Re: [RFC PATCH v3 00/10] timekeeping: Fix drift tracking precision and add feed-forward discipline via vmclock
Posted by David Woodhouse 1 day, 10 hours ago
On Wed, 2026-05-20 at 14:33 +0100, David Woodhouse wrote:
> This is v3 of the series to allow feed-forward clock discipline, allowing
> a guest kernel to lock its system clock directly to a hypervisor-provided
> vmclock reference with nanosecond precision and no drift.
> 
> With all the drift-inducing bugs in the core timekeeping resolved in the 
> first patches of the series, the RFC timekeeping_set_reference() 
> function basically just sets the tick length and time_offset according 
> to the reference, and lets the now-fixed core timekeeping get on with
> its job.

I wanted to see what effect that had, if any, on normal NTP timekeeping
with chrony.

I set up four identical bare metal hosts on a baseline 7.1-rc4+ kernel,
with and without my timekeeping fixes: a pair with NO_HZ_IDLE and a
pair with HZ_PERIODIC (at HZ=1000).

I'll let the test run over the weekend. Collating the data is a
semi-manual set of script hacks, but for the time being I am keeping
this page updated with the latest results:
https://david.woodhou.se/ntptest/

Any feedback on the analysis — and especially on the raw data that I'm
collecting — would be welcome.

This is just using chrony and NTP, no PHC as I wanted to simulate a
"normal" consumer-style setup.

In a future test I'll set up a proper feed-forward discipline and use
timekeeping_set_reference() to steer the kernel's timekeeping based
directly on the TSC.

I suspect that will give better results especially in the nohz case,
because chrony can only discipline what it *sees*, and has no way to
see the "intended" values including ntp_error and time_offset; only
what the actual xtime output has sawtoothed to after prolonged and
unpredictable periods of no ticks to drive the 'mult' dithering.
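
The sawtooth described above can be illustrated with a toy model of
'mult' dithering. This is a simplified sketch and not the kernel
implementation: DEN, the function name and the 'stride' parameter are
all assumptions made for illustration. The ideal per-tick increment has
a fractional part frac/DEN, but only an integer multiplier can be
applied, so each update picks mult or mult + 1 from the sign of the
accumulated error. The 'stride' parameter models how many ticks' worth
of time pass between updates, as happens when ticks are skipped.

```c
#include <stdint.h>

#define DEN 1000  /* denominator of the fractional part (illustrative) */

/*
 * Returns the largest |error| seen, in units of 1/DEN, where error is
 * the ideal accumulated time minus the actually accumulated time.
 * Expects 0 <= frac < DEN.
 */
int64_t dither_max_error(int64_t frac, int64_t updates, int64_t stride)
{
	int64_t error = 0;
	int64_t max_abs = 0;

	for (int64_t i = 0; i < updates; i++) {
		/* Behind the ideal: run with mult + 1 until the next update. */
		int64_t adj = error > 0 ? 1 : 0;
		int64_t abs_err;

		/* The chosen multiplier stays in force for 'stride' ticks. */
		error += stride * (frac - adj * DEN);

		abs_err = error < 0 ? -error : error;
		if (abs_err > max_abs)
			max_abs = abs_err;
	}
	return max_abs;
}
```

With an update on every tick (stride = 1) the error stays below one
unit of the multiplier (DEN). With stride = 100 the same decision is
held for 100 ticks at a time, so the sawtooth amplitude grows in
proportion, and that larger excursion is all an external observer gets
to see.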