[RFC PATCH 0/4] timekeeping: add feed-forward clock discipline via vmclock

David Woodhouse posted 4 patches 4 weeks, 1 day ago

This proof-of-concept series adds support for feed-forward clock
discipline, allowing a guest kernel to lock its system clock directly to 
a hypervisor-provided vmclock reference with sub-microsecond precision 
and no drift.

The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit.
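
As a rough illustration of the linear time function above, here is a
minimal userspace sketch. The struct layout, the field names other than
counter_value, and the 32.32 fixed-point period are assumptions made to
keep the arithmetic readable; the real vmclock ABI uses fractional
seconds with a shift, and a real reader would also need to guard against
concurrent updates of the page. The 128-bit type is a GCC/Clang
extension.

```c
#include <stdint.h>

/*
 * Simplified, hypothetical snapshot of a vmclock-style reference.
 * This is NOT the real vmclock ABI layout; plain nanoseconds and a
 * 32.32 fixed-point period are assumed for illustration only.
 */
struct ref_snapshot {
	uint64_t base_ns;       /* reference time at counter_value */
	uint64_t counter_value; /* counter reading that base_ns refers to */
	uint64_t period_fp;     /* ns per counter tick, 32.32 fixed point */
};

/*
 * time = base + (counter - counter_value) * period
 *
 * Assumes counter >= counter_value; an earlier counter reading would
 * wrap the unsigned subtraction.
 */
static uint64_t ref_time_ns(const struct ref_snapshot *ref, uint64_t counter)
{
	uint64_t delta = counter - ref->counter_value;

	/* 128-bit intermediate so that large deltas do not overflow. */
	unsigned __int128 ns = (unsigned __int128)delta * ref->period_fp;

	return ref->base_ns + (uint64_t)(ns >> 32);
}
```

With period_fp = 1ULL << 31 (0.5 ns per tick, i.e. a 2 GHz counter), a
reading 200 ticks past counter_value yields base_ns + 100.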

The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.

Patch 1 fixes a pre-existing bug in the ntp_error accumulator that
caused systematic drift when running at a fixed frequency (without NTP
continuously correcting). This is a standalone bugfix suitable for
stable backport.

Patch 2 adds /dev/vmclock_host, a miscdev that exports the host's
NTP-disciplined time as a vmclock page. This is what QEMU (or another
VMM) maps into the guest.

Patch 3 adds the timekeeping_set_reference() API and the feed-forward
clamping mechanism. When a reference is active, the dithering decision
(mult vs mult+1) uses an absolute comparison against the reference line
instead of the relative ntp_error accumulator. This also sets
time_offset for faster initial convergence.
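
To make the absolute-comparison idea concrete, here is a toy model of
the dithering decision. It is not the code from patch 3: the struct, the
function names and the "shifted nanoseconds" bookkeeping are all
hypothetical, and the sketch assumes the reference rate lies between
mult and mult+1 per cycle.

```c
#include <stdint.h>

/*
 * Hypothetical model, not the patch's code. Times are kept in
 * "shifted nanoseconds" (ns << shift), so one cycle adds 'mult' units.
 */
struct model_clock {
	uint64_t now_shifted; /* accumulated clock time */
	uint32_t mult;        /* base multiplier, rounded down */
};

/* Choose the multiplier for the next tick by absolute comparison. */
static uint32_t pick_mult(const struct model_clock *clk, uint64_t ref_shifted)
{
	/* Behind the reference line: run slightly fast, else run at mult. */
	return clk->now_shifted < ref_shifted ? clk->mult + 1 : clk->mult;
}

/* Advance the clock by one tick of 'cycles' counter cycles. */
static void tick(struct model_clock *clk, uint64_t cycles, uint64_t ref_shifted)
{
	clk->now_shifted += cycles * pick_mult(clk, ref_shifted);
}

/*
 * Run 'ticks' ticks against a reference that advances by ref_step per
 * tick, and return the largest absolute clock-minus-reference error.
 */
static uint64_t max_abs_error(struct model_clock *clk, uint64_t cycles,
			      uint64_t ref_step, unsigned int ticks)
{
	uint64_t ref = clk->now_shifted;
	uint64_t worst = 0;

	for (unsigned int i = 0; i < ticks; i++) {
		uint64_t err;

		tick(clk, cycles, ref);
		ref += ref_step;
		err = clk->now_shifted > ref ? clk->now_shifted - ref
					     : ref - clk->now_shifted;
		if (err > worst)
			worst = err;
	}
	return worst;
}
```

Because each decision compares against the reference line itself rather
than an accumulated relative error, the error in this model stays within
one tick's worth of dither (here, 'cycles' shifted units) no matter how
many ticks are run.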

Patch 4 wires ptp_vmclock to call timekeeping_set_reference() on probe
and on each notification from the hypervisor, enabling automatic
guest-to-host time synchronization.

Tested with QEMU passing through /dev/vmclock_host to a guest¹. The
guest clock converges to the host's vmclock reference within ~60 seconds
and remains locked to within ±500ns indefinitely, with no userspace NTP
daemon required.

TODO:
 • Graceful handling of clocksource switch (guest starts on kvmclock and
   only later switches to the final TSC). Although a modern guest on a
   modern kernel should *never* use kvmclock, and the guest was *told*
   in CPUID what the TSC frequency was, so I don't know why it was doing
   any of that.
 • Review accuracy of the base time reference exported from vmclock_host
   (I think I need to properly account for ntp_error at the time of the
   snapshot, as my clocks sometimes converge on a point a few µs away
   from the host's CLOCK_REALTIME).

But generally as a proof of concept it seems to be working fairly well.
Especially after the fix in patch#1 which makes the kernel *actually*
run its clock at the rate that adjtimex() tells it to.

In a dedicated hosting environment, the host userspace would probably
be disciplining the TSC against external NTP/1PPS and feeding that
*directly* to guests as a vmclock page, and would want to *also* feed
that into the kernel. We'll want a userspace API for that.

But let's get the part where Thomas calls me an idiot over with first :)

¹ https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough

David Woodhouse (4):
      timekeeping: Remove xtime_remainder from ntp_error accumulation
      WIP: kernel/time: Add /dev/vmclock_host miscdev
      timekeeping: Add absolute reference for feed-forward clock discipline
      ptp_vmclock: Feed reference to timekeeping for feed-forward discipline

 drivers/ptp/ptp_vmclock.c                          |  26 ++
 include/linux/timekeeper_internal.h                |   2 -
 include/linux/timekeeping_reference.h              |  35 +++
 include/linux/vmclock_host.h                       |  17 ++
 kernel/time/Kconfig                                |   8 +
 kernel/time/Makefile                               |   1 +
 kernel/time/ntp.c                                  |  36 ++-
 kernel/time/ntp_internal.h                         |   3 +
 kernel/time/timekeeping.c                          |  89 +++++-
 kernel/time/vmclock_host.c                         | 319 +++++++++++++++++++++
 tools/testing/selftests/timers/vmclock_host_test.c | 171 +++++++++++
 11 files changed, 699 insertions(+), 8 deletions(-)