This proof-of-concept series adds support for feed-forward clock
discipline, allowing a guest kernel to lock its system clock directly to
a hypervisor-provided vmclock reference with sub-microsecond precision
and no drift.
The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit.
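As a rough illustration of that linear function, here is a minimal sketch of evaluating it in fixed point. The struct and field names are hypothetical and much simpler than the real vmclock ABI, which has more fields and a sequence counter for consistent reads. It also assumes a compiler with unsigned __int128 (GCC or Clang on a 64-bit target).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical, simplified snapshot of a vmclock-style page.
 * These names are illustrative only, not the real ABI.
 */
struct ref_line {
	uint64_t base_ns;       /* reference time at counter_value, in ns */
	uint64_t counter_value; /* counter reading at which base_ns is valid */
	uint64_t period_frac;   /* ns per counter tick, 32.32 fixed point */
};

/* time = base + (counter - counter_value) * period */
static uint64_t ref_line_time_ns(const struct ref_line *r, uint64_t counter)
{
	/* Sketch only handles reads at or after the snapshot point. */
	assert(counter >= r->counter_value);

	uint64_t delta = counter - r->counter_value;

	/* 128-bit intermediate so a large delta does not overflow. */
	unsigned __int128 ns = (unsigned __int128)delta * r->period_frac;

	return r->base_ns + (uint64_t)(ns >> 32);
}
```

With period_frac set to 2^31 (0.5 ns per tick, i.e. a 2 GHz counter), a delta of 2000 ticks adds 1000 ns to the base.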
The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.
Patch 1 fixes a pre-existing bug in the ntp_error accumulator that
caused systematic drift when running at a fixed frequency (without NTP
continuously correcting). This is a standalone bugfix suitable for
stable backport.
Patch 2 adds /dev/vmclock_host, a miscdev that exports the host's
NTP-disciplined time as a vmclock page. This is what QEMU (or another
VMM) maps into the guest.
Patch 3 adds the timekeeping_set_reference() API and the feed-forward
clamping mechanism. When a reference is active, the dithering decision
(mult vs mult+1) uses an absolute comparison against the reference line
instead of the relative ntp_error accumulator. This also sets
time_offset for faster initial convergence.
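To show why an absolute comparison keeps the error bounded, here is a toy model of the dithering decision. It is not the code from patch 3: the function name, the units and the point in the tick at which the comparison happens are all assumptions made for illustration. Each tick advances the clock by either cycles * mult or cycles * (mult + 1), depending only on whether the clock is currently behind the reference line.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model, not the kernel implementation. Times are in arbitrary
 * fixed-point units and all names are hypothetical.
 * Returns the signed error (clock - reference) after the last tick.
 */
static int64_t dither_against_reference(uint32_t mult, uint32_t cycles,
					 int64_t ref_per_tick,
					 int64_t start_err, unsigned int ticks)
{
	int64_t slow = (int64_t)cycles * mult;       /* advance using mult */
	int64_t fast = (int64_t)cycles * (mult + 1); /* advance using mult+1 */
	int64_t clock = start_err; /* clock reading; reference starts at 0 */
	int64_t ref = 0;

	/* The reference slope must lie between the two available rates. */
	assert(slow <= ref_per_tick && ref_per_tick <= fast);

	for (unsigned int i = 0; i < ticks; i++) {
		/* Absolute decision: behind the line -> fast, otherwise slow. */
		clock += (clock < ref) ? fast : slow;
		ref += ref_per_tick;
	}
	return clock - ref;
}
```

Because the decision looks at the absolute position rather than an accumulated relative error, any initial offset is walked off at one dither step per tick, and once the clock reaches the line the error in this model never exceeds one dither step (cycles units) in magnitude.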
Patch 4 wires ptp_vmclock to call timekeeping_set_reference() on probe
and on each notification from the hypervisor, enabling automatic
guest-to-host time synchronization.
Tested with QEMU passing through /dev/vmclock_host to a guest¹. The
guest clock converges to the host's vmclock reference within ~60 seconds
and remains locked to within ±500ns indefinitely, with no userspace NTP
daemon required.
TODO:
• Graceful handling of clocksource switch (guest starts on kvmclock and
only later switches to the final TSC). Although a modern guest on a
modern kernel should *never* use kvmclock, and the guest was *told*
in CPUID what the TSC frequency was, so I don't know why it was doing
any of that.
• Review accuracy of the base time reference exported from vmclock_host
(I think I need to properly account for ntp_error at the time of the
snapshot, as my clocks sometimes converge on a point a few µs away
from the host's CLOCK_REALTIME).
But generally as a proof of concept it seems to be working fairly well.
Especially after the fix in patch 1, which makes the kernel *actually*
run its clock at the frequency that adjtimex() tells it to.
In a dedicated hosting environment, the host userspace would probably
be disciplining the TSC against external NTP/1PPS and feeding that
*directly* to guests as a vmclock page, and would want to *also* feed
that into the kernel. We'll want a userspace API for that.
But let's get the part where Thomas calls me an idiot over with first :)
¹ https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough
David Woodhouse (4):
  timekeeping: Remove xtime_remainder from ntp_error accumulation
  WIP: kernel/time: Add /dev/vmclock_host miscdev
  timekeeping: Add absolute reference for feed-forward clock discipline
  ptp_vmclock: Feed reference to timekeeping for feed-forward discipline

 drivers/ptp/ptp_vmclock.c                          |  26 ++
 include/linux/timekeeper_internal.h                |   2 -
 include/linux/timekeeping_reference.h              |  35 +++
 include/linux/vmclock_host.h                       |  17 ++
 kernel/time/Kconfig                                |   8 +
 kernel/time/Makefile                               |   1 +
 kernel/time/ntp.c                                  |  36 ++-
 kernel/time/ntp_internal.h                         |   3 +
 kernel/time/timekeeping.c                          |  89 +++++-
 kernel/time/vmclock_host.c                         | 319 +++++++++++++++++++++
 tools/testing/selftests/timers/vmclock_host_test.c | 171 +++++++++++
 11 files changed, 699 insertions(+), 8 deletions(-)