This is v2 of the series to add feed-forward clock discipline, allowing
a guest kernel to lock its system clock directly to a hypervisor-provided
vmclock reference with sub-10ns precision and no drift.
The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
provides a shared memory page containing a linear time function:
time = base + (counter - counter_value) × period. The guest can read
this at any time to determine the hypervisor's view of the current time,
without a VM exit. Unlike guest-driven NTP, it allows for accurate time
to be preserved across live migration.
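To make the linear function concrete, here is a minimal sketch of the
guest-side calculation. The struct and field names are hypothetical and
simplified for illustration; the real page layout is defined by the
vmclock specification and has more fields, including a sequence counter
for consistent reads. The period is assumed to be a 32.32 fixed-point
count of nanoseconds per counter tick, and the 128-bit intermediate
relies on the GCC/Clang __int128 extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified snapshot of the shared page (not the real layout). */
struct ref_snapshot {
	uint64_t counter_value;	/* counter reading at the reference point */
	uint64_t base_ns;	/* time at the reference point, in nanoseconds */
	uint64_t period_frac;	/* ns per counter tick, 32.32 fixed point (assumed) */
};

/* time = base + (counter - counter_value) * period */
static uint64_t ref_time_ns(const struct ref_snapshot *s, uint64_t counter)
{
	uint64_t delta;
	unsigned __int128 scaled;

	assert(s != NULL);

	delta = counter - s->counter_value;
	/* A 128-bit product avoids overflow for large counter deltas. */
	scaled = (unsigned __int128)delta * s->period_frac;

	return s->base_ns + (uint64_t)(scaled >> 32);
}
```

With a 1GHz counter (period_frac = 1 << 32), a delta of 1000 cycles adds
exactly 1000ns to the base.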
The existing ptp_vmclock driver already exposes this as a PTP clock for
userspace consumers (phc2sys, chrony). This series adds kernel-internal
consumption: the tick mechanism can clamp directly to the vmclock
reference, eliminating the need for NTP to discipline the guest clock.
The previous series introduced an external oracle to drive the per-tick
dithering mechanism towards the reference clock. By fixing all the
inaccuracies and systematic drift in the kernel's own tracking, we can
dispense with the external oracle and just configure the timekeeping
using the existing frequency/tick_length and time_offset/ntp_error
mechanisms.
Changes since v1 (RFC):
• Fixed three additional issues in the timekeeping code that were
discovered during nanosecond-precision testing with the vmclock
reference:
- The clawback adjustment in timekeeping_apply_adjustment() moved
xtime without updating ntp_error (patch 2).
- The exponential tail of ntp_offset_chunk() asymptotically approached
zero, preventing convergence to the final nanosecond (patch 3).
- A divide-by-zero in timekeeping_adjust() when cycle_interval is
momentarily zero during TSC recalibration on KVM guests (patch 4).
• Replaced the per-tick absolute reference clamping with a cleaner
mechanism: the skew from time_offset is now driven by per-tick
transfer into ntp_error with a matching mult adjustment, rather than
by inflating tick_length (patch 7). This gives exact per-tick
accounting of the time_offset drain with no rounding loss.
• The timekeeping_set_reference() API (patch 5) sets time_offset and
the frequency, letting the standard skew mechanism handle convergence.
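The exponential-tail problem from patch 3 can be illustrated with a toy
integer model. This is not the kernel's ntp_offset_chunk(), whose units
and scaling differ, and the clamp used in the actual patch may differ
too; drain_step(), steps_to_zero() and min_chunk are hypothetical names
for this sketch only. Each step removes a fixed fraction of the
remaining offset, so the amount removed shrinks along with the offset
and the last part is never drained unless a minimum step is enforced.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One decay step: remove (|offset| >> shift) from the offset, but never
 * less than min_chunk and never more than what remains.
 */
static int64_t drain_step(int64_t offset, unsigned int shift, int64_t min_chunk)
{
	int64_t mag, chunk;

	assert(shift < 63 && offset != INT64_MIN);

	mag = offset < 0 ? -offset : offset;
	chunk = mag >> shift;
	if (chunk < min_chunk)
		chunk = min_chunk;
	if (chunk > mag)
		chunk = mag;

	return offset < 0 ? offset + chunk : offset - chunk;
}

/* Count steps until the offset reaches zero, giving up after max_steps. */
static unsigned int steps_to_zero(int64_t offset, unsigned int shift,
				  int64_t min_chunk, unsigned int max_steps)
{
	unsigned int n = 0;

	while (offset != 0 && n < max_steps) {
		offset = drain_step(offset, shift, min_chunk);
		n++;
	}
	return n;
}
```

In this model, with min_chunk set to 0 the offset stops shrinking once
it drops below 1 << shift; with min_chunk set to 1 it reaches zero in a
bounded number of steps.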
The series:
Patches 1-4: Timekeeping bugfixes (suitable for stable/independent review)
1. Remove stale xtime_remainder from ntp_error accumulation.
2. Account for clawback adjustment in ntp_error.
3. Clamp time_offset delta to prevent infinite exponential tail.
4. Guard against divide-by-zero during clocksource recalibration.
Patches 5-6: Feed-forward reference clock infrastructure
5. Add timekeeping_set_reference() API for external clock references.
6. Wire ptp_vmclock to call timekeeping_set_reference() on probe.
Patch 7: Improved time_offset skew mechanism
7. Drive time_offset skew via per-tick ntp_error transfer instead of
tick_length inflation, with mult adjustment for dithering bandwidth.
(we can't *yet* kill tick_length_base; I have to frown at adjtime()
some more first).
Patch 8: Host-side vmclock page export (WIP)
8. Add /dev/vmclock_host miscdev for VMM consumption.
Tested with QEMU passing through a vmclock device to a guest¹. The guest
clock converges to the reference within seconds and remains within
single digit nanoseconds indefinitely, with no further external
correction. Injecting a ±10µs offset via ntp_set_time_offset() converges
to the target via the same exponential decay as before over about 70
seconds, and retains the same single-digit nanosecond jitter around
precisely ±10000ns once converged. Obviously in real usage, the
reference will be periodically changing too, but the feed-forward setup
does rely on the kernel being able to converge to, and remain on, the
precise line it's given.
¹ https://git.infradead.org/?p=users/dwmw2/qemu.git;a=shortlog;h=refs/heads/vmclock-passthrough
David Woodhouse (8):
timekeeping: Remove xtime_remainder from ntp_error accumulation
timekeeping: Account for clawback adjustment in ntp_error
timekeeping: Clamp time_offset delta to prevent infinite tail
timekeeping: Guard against divide-by-zero in timekeeping_adjust
timekeeping: Add absolute reference for feed-forward clock discipline
ptp_vmclock: Feed reference to timekeeping for feed-forward discipline
timekeeping: Drive time_offset skew via per-tick ntp_error transfer
WIP: kernel/time: Add /dev/vmclock_host miscdev
drivers/ptp/ptp_vmclock.c | 79 +++++
include/linux/timekeeper_internal.h | 3 +-
include/linux/timekeeping_reference.h | 19 ++
include/linux/vmclock_host.h | 17 ++
kernel/time/Kconfig | 8 +
kernel/time/Makefile | 1 +
kernel/time/ntp.c | 72 ++++-
kernel/time/ntp_internal.h | 6 +
kernel/time/timekeeping.c | 83 +++++-
kernel/time/vmclock_host.c | 319 +++++++++++++++++++++
tools/testing/selftests/timers/vmclock_host_test.c | 171 +++++++++++
11 files changed, 766 insertions(+), 12 deletions(-)
On Sun, May 17, 2026 at 10:25:37PM +0100, David Woodhouse wrote:
> The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
> provides a shared memory page containing a linear time function:
> time = base + (counter - counter_value) × period. The guest can read
> this at any time to determine the hypervisor's view of the current time,
> without a VM exit.

That sounds nice.

> The existing ptp_vmclock driver already exposes this as a PTP clock for
> userspace consumers (phc2sys, chrony). This series adds kernel-internal
> consumption: the tick mechanism can clamp directly to the vmclock
> reference, eliminating the need for NTP to discipline the guest clock.

I'm not very familiar with the VM timekeeping and other code. If I
understand this idea correctly, by loading the ptp_vmclock module the
guest kernel is giving the host control of its clock.

Changes in the host's REALTIME/MONOTONIC clock frequency are mirrored
to the guest's clock.

Differences larger than 100 milliseconds are corrected by step,
whether the guest applications like it or not. Smaller steps and
errors accumulated due to a delay in the frequency update (is there a
limit to this delay?) are corrected by the kernel NTP PLL (with the
default time constant?).

When the guest is migrated to a different host, the frequency offset
between the two hosts is injected to the NTP frequency (assuming
REALTIME clocks of the hosts have zero frequency error at that
moment?).

Have you considered a different approach that would address the
problem with frequency step by adjusting the guest's clocksource
frequency to match the original host? That would correct all system
clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
AUX clocks.

The guest would still be in control of its clock and follow its own
preferences to stepping, maximum frequency errors, etc. It could still
compare the stability and accuracy of the host's clock and use it for
synchronization only when it's actually better than other available
time sources (some VPS providers are known to have poorly synchronized
host clocks).

An AUX clock could be used to more accurately compare
frequencies of the two hosts, ignoring phase corrections.

There is a work in progress for chrony to support MONOTONIC_RAW as the
main clock. It would be nice if that could be corrected in migrations.
That seems to be a common cause of disruptions of public NTP servers.

Polling for notifications about clock changes caused by migrations and
system suspend+resume would be useful in any case.

-- 
Miroslav Lichvar
On Tue, 2026-05-19 at 15:16 +0200, Miroslav Lichvar wrote:
> On Sun, May 17, 2026 at 10:25:37PM +0100, David Woodhouse wrote:
> > The vmclock device (https://uapi-group.org/specifications/specs/vmclock/)
> > provides a shared memory page containing a linear time function:
> > time = base + (counter - counter_value) × period. The guest can read
> > this at any time to determine the hypervisor's view of the current time,
> > without a VM exit.
>
> That sounds nice.

The design has two major purposes:

 • Atomically letting the guest know that live migration has perturbed
   its clock. Without this, some distributed databases which rely on
   precision timestamps on transactions for eventual coherency were
   getting corrupted when guests were live migrated.

 • Avoiding the redundant work of having *hundreds* of guests on the
   same host *all* calibrating the same underlying oscillator, while
   enjoying the added fun of steal time as they're trying to do so.

Right now, the implementations in both QEMU and the EC2 Nitro Hypervisor
only implement part 1, the disruption signal.

I plan for QEMU to use the vmclock_host driver from this series, along
with the QEMU patch I linked, to expose the host's real time clock for
guests to follow.

For dedicated hosting environments like EC2, we don't care very much
about the host's timekeeping; that host kernel exists *only* to host
KVM guests. The host userspace can ignore the host's timekeeping
completely and manage the relationship of the counter to real time
directly, and in some cases will have hardware which will latch the
actual CPU's counter at the moment of a 1PPS signal. We'll feed that
counter-to-realtime information *directly* to guests.

(And will probably export timekeeping_set_reference() via a syscall of
some kind so that we *can* set the host's clock from it too, if I can't
find a way to precisely do so through adjtimex.)

> > The existing ptp_vmclock driver already exposes this as a PTP clock for
> > userspace consumers (phc2sys, chrony). This series adds kernel-internal
> > consumption: the tick mechanism can clamp directly to the vmclock
> > reference, eliminating the need for NTP to discipline the guest clock.
>
> I'm not very familiar with the VM timekeeping and other code. If I
> understand this idea correctly, by loading the ptp_vmclock module the
> guest kernel is giving the host control of its clock.

Right *now*, the ptp_vmclock module is only providing a PTP clock for
userspace to discipline the kernel against, as noted above. But yes,
the intent of what I'm doing here is to bypass all that complexity and
manage the explicit counter-to-time relationship *directly* within the
guest kernel.

I did briefly play with simulating 1PPS, and injecting PPS events at
the precise time that a PPS signal *would* have triggered, to the
cycle:
https://lore.kernel.org/all/87cb97d5a26d0f4909d2ba2545c4b43281109470.camel@infradead.org/

> Changes in the host's REALTIME/MONOTONIC clock frequency are mirrored
> to the guest's clock.

Strictly, "changes in the realtime clock frequency advertised by the
vmclock device", but basically yes.

> Differences larger than 100 milliseconds are corrected by step,
> whether the guest applications like it or not. Smaller steps and
> errors accumulated due to a delay in the frequency update (is there a
> limit to this delay?) are corrected by the kernel NTP PLL (with the
> default time constant?).

That behaviour isn't set in stone for vmclock; I'm still only
experimenting with the part where it *can* set the frequency, and an
offset that the kernel will converge to and *stay* on.

Right now it just calls my ntp_set_time_offset() which doesn't step at
all, and always just injects via ->time_offset (the NTP PLL). Much the
same as legacy adjtime() AIUI.

> When the guest is migrated to a different host, the frequency offset
> between the two hosts is injected to the NTP frequency (assuming
> REALTIME clocks of the hosts have zero frequency error at that
> moment?).

When the advertised frequency changes (either due to the ongoing clock
discipline on the host, or because of migration to a new host), the new
frequency is injected directly into tick_length.

> Have you considered a different approach that would address the
> problem with frequency step by adjusting the guest's clocksource
> frequency to match the original host? That would correct all system
> clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> AUX clocks.

You mean TSC scaling to change the frequency of the actual counter?

When stepping between non-identical hosts, that might be helpful. But
we still have to deal with the variance of the counter over time even
without migration in the picture.

> The guest would still be in control of its clock and follow its own
> preferences to stepping, maximum frequency errors, etc. It could still
> compare the stability and accuracy of the host's clock and use it for
> synchronization only when it's actually better than other available
> time sources (some VPS providers are known to have poorly synchronized
> host clocks).

I think that mode is already available as a PTP clock, isn't it?

While of course it should be optional for the guest, I'm deliberately
optimising for the case here where the hosting provider *does* get it
right and *can* be trusted.

> An AUX clock could be used to more accurately compare
> frequencies of the two hosts, ignoring phase corrections.
>
> There is a work in progress for chrony to support MONOTONIC_RAW as the
> main clock. It would be nice if that could be corrected in migrations.

Not sure I understand this. I thought the whole point of MONOTONIC_RAW
is that it *isn't* skewed by NTP?

> That seems to be a common cause of disruptions of public NTP servers.
> Polling for notifications about clock changes caused by migrations and
> system suspend+resume would be useful in any case.

That much you can do today with /dev/vmclock even when it isn't
exposing the actual time information.

Timekeeping in migration is fairly hosed in KVM. I don't think there
are many implementations that actually set the TSC correctly on the
destination host. But that's a different story...
On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> The design has two major purposes:
>  • Avoiding the redundant work of having *hundreds* of guests on the
>    same host *all* calibrating the same underlying oscillator, while
>    enjoying the added fun of steal time as they're trying to do so.

But isn't that work still duplicated, only moved to the kernel?

The userspace part could be a simple loop waiting for vmclock
notifications and following the changes of the host. The only
difference would be a longer delay, but still insignificant for the
intended purpose, right?

I don't like the idea of adding more clock control loops to the kernel
much. It's a complexity that will likely grow as different
requirements come and the code will be even more difficult to
understand. IMHO the NTP PLL and hard PPS loops shouldn't have been
included in the kernel. The kernel time control API should have been
just setting/stepping the time and changing the frequency, both possibly
at a specified time instead of the time of the call.

> > Have you considered a different approach that would address the
> > problem with frequency step by adjusting the guest's clocksource
> > frequency to match the original host? That would correct all system
> > clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> > AUX clocks.
>
> You mean TSC scaling to change the frequency of the actual counter?

Yes, in hardware if available, or in software if not. An additional
32-bit multiplier applied like this:

	cycles += (cycles * mult) >> shift

Larger adjustments can be done in the normal multiplier for all clocks.

> When stepping between non-identical hosts, that might be helpful. But
> we still have to deal with the variance of the counter over time even
> without migration in the picture.

Whatever is synchronizing the guest clock to the host (using the PHC
or vmclock page) will take care of that? The point is to avoid
migrations causing a frequency step.

I'm not sure what identical and non-identical hosts mean in this
context, same nominal CPU frequency, or a CPU tied to the same crystal
oscillator?

> > The guest would still be in control of its clock and follow its own
> > preferences to stepping, maximum frequency errors, etc. It could still
> > compare the stability and accuracy of the host's clock and use it for
> > synchronization only when it's actually better than other available
> > time sources (some VPS providers are known to have poorly synchronized
> > host clocks).
>
> I think that mode is already available as a PTP clock, isn't it?

Yes, but it's slow due to missing frequency transfer, not feed-forward
as you call it. The host's frequency offset could be exposed in the
PHC's timex.

> > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > main clock. It would be nice if that could be corrected in migrations.
>
> Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> is that it *isn't* skewed by NTP?

It isn't adjusted, but it can be used as a stable reference avoiding
the multiplier-induced jitter, interference from other processes, and
synchronization loops, e.g. when an NTP client is synchronizing to an
NTP server running on the same system (in different containers).

-- 
Miroslav Lichvar
On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > The design has two major purposes:
> >
> >  • Avoiding the redundant work of having *hundreds* of guests on the
> >    same host *all* calibrating the same underlying oscillator, while
> >    enjoying the added fun of steal time as they're trying to do so.
>
> But isn't that work still duplicated, only moved to the kernel?

Not the actual calibration of the TSC against real time, no. It is the
*host* which gets the 1PPS signal and does all the work of tracking and
smoothing the frequency drift over time.

The guest basically gets the same as a vDSO, *telling* it a
relationship from TSC to real time. Many guests in trustworthy hosting
environments will just use that and want to feed it directly to the
guest kernel timekeeping.

Others might want to take a more opinionated stance, as you describe
below. Those probably *would* duplicate some of the effort, in order to
form their opinion.

> The userspace part could be a simple loop waiting for vmclock
> notifications and following the changes of the host. The only
> difference would be a longer delay, but still insignificant for the
> intended purpose, right?
>
> I don't like the idea of adding more clock control loops to the kernel
> much.

I completely agree. I am absolutely not planning to add any more clock
control to the kernel than we already have. As you say, we probably
have too many already.

> It's a complexity that will likely grow as different
> requirements come and the code will be even more difficult to
> understand. IMHO the NTP PLL and hard PPS loops shouldn't have been
> included in the kernel. The kernel time control API should have been
> just setting/stepping the time and changing the frequency, both possibly
> at a specified time instead of the time of the call.

There is merit in that argument. The kernel already has a separation
between the core timekeeping code in timekeeping.c and the rest of the
NTP code in ntp.c which does the higher level control.

The timekeeping_set_reference() added in my patch *only* uses the
existing basic timekeeping code, taking the vDSO-like information that
I mentioned above, and using it to set the frequency and offset for the
kernel's core timekeeping to follow.

There's a cleaner version in my tree now, because having fixed all the
errors in the core timekeeping which were introducing drift, the
implementation of timekeeping_set_reference() can be a *whole* lot
simpler than it was in my initial proof of concept; it now really can
just set the tick length and time_offset, and let it run:
https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=c62bf50eca

> > > Have you considered a different approach that would address the
> > > problem with frequency step by adjusting the guest's clocksource
> > > frequency to match the original host? That would correct all system
> > > clocks, i.e. not only REALTIME/MONOTONIC, but also MONOTONIC_RAW and
> > > AUX clocks.
> >
> > You mean TSC scaling to change the frequency of the actual counter?
>
> Yes, in hardware if available, or in software if not. An additional
> 32-bit multiplier applied like this:
>
> 	cycles += (cycles * mult) >> shift
>
> Larger adjustments can be done in the normal multiplier for all clocks.
>
> > When stepping between non-identical hosts, that might be helpful. But
> > we still have to deal with the variance of the counter over time even
> > without migration in the picture.
>
> Whatever is synchronizing the guest clock to the host (using the PHC
> or vmclock page) will take care of that? The point is to avoid
> migrations causing a frequency step.
>
> I'm not sure what identical and non-identical hosts mean in this
> context, same nominal CPU frequency, or a CPU tied to the same crystal
> oscillator?

I meant same nominal frequency.

I'm not sure what scaling the guest TSC would buy us. Sure, it would
minimise the frequency step at the moment of migration, but a naïve
guest which isn't using vmclock's disruption signal is screwed on live
migration *anyway*, because there's *also* a step change in the actual
TSC value which is bounded by the real time synchronization of the
source and destination host.

Anything the guest has done for itself to calibrate the source host's
TSC must be entirely thrown away on migration. The vmclock allows the
destination host to immediately say "here, use this instead".

AFAICT scaling the TSC would just add complexity and wouldn't help
much. And TSC scaling is pretty much x86-specific; other architectures
have a *defined* counter frequency and don't need to support scaling.
I'm not a fan :)

> > > The guest would still be in control of its clock and follow its own
> > > preferences to stepping, maximum frequency errors, etc. It could still
> > > compare the stability and accuracy of the host's clock and use it for
> > > synchronization only when it's actually better than other available
> > > time sources (some VPS providers are known to have poorly synchronized
> > > host clocks).
> >
> > I think that mode is already available as a PTP clock, isn't it?
>
> Yes, but it's slow due to missing frequency transfer, not feed-forward
> as you call it. The host's frequency offset could be exposed in the
> PHC's timex.

Yes, that makes a lot of sense.

You can literally open /dev/vmclock and consume it *however* you like
from userspace. You can even poll() and get woken when there's an
update. I think that would be a great thing for chrony to learn to do
(and that's how you get the disruption signal too).

> > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > main clock. It would be nice if that could be corrected in migrations.
> >
> > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > is that it *isn't* skewed by NTP?
>
> It isn't adjusted, but it can be used as a stable reference avoiding
> the multiplier-induced jitter, interference from other processes, and
> synchronization loops, e.g. when an NTP client is synchronizing to an
> NTP server running on the same system (in different containers).

We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
we? Do all our clock discipline of the *TSC* against the external
sources, and then use the same timekeeping_set_reference() to ask the
kernel's core timekeeping to track the TSC-to-realtime relationship
that we desire?

That's exactly what I'm planning to do for a dedicated hosting
environment. I think the patches which allow PTP to return paired
timestamps with reference to TSC instead of CLOCK_MONOTONIC landed in
the net-next tree today?

(for TSC, read 'arch counter, timebase, etc.'; none of this is
x86-specific but 'TSC' is quicker to type...)
On Wed, May 20 2026 at 13:21, David Woodhouse wrote:
> On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
>> It isn't adjusted, but it can be used as a stable reference avoiding
>> the multiplier-induced jitter, interference from other processes, and
>> synchronization loops, e.g. when an NTP client is synchronizing to an
>> NTP server running on the same system (in different containers).
>
> We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> we? Do all our clock discipline of the *TSC* against the external
> sources, and then use the same timekeeping_set_reference() to ask the
> kernel's core timekeeping to track the TSC-to-realtime relationship
> that we desire?
>
> That's exactly what I'm planning to do for a dedicated hosting
> environment. I think the patches which allow PTP to return paired
> timestamps with reference to TSC instead of CLOCK_MONOTONIC landed in
> the net-next tree today?
Bah.
> (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> specific but 'TSC' is quicker to type...)
As I said in the other thread, that's just creating yet another private
mechanism instead of collecting the counter value together with e.g.
CLOCK_REALTIME or utilizing the PTM correlated one which is available in
get_device_system_crosststamp().
Can we please stop creating specialized interfaces and instead make them
generic, so they can be used for everything?
Then you can go and extend the posix-timer interface with
clock_set_time_reference() (or whatever name we come up with) and
provide the functionality for all steerable clocks. That'd allow chronyd
to completely ignore the kernel side NTP PLL and do everything in user
space. That obviously needs some thought and input from the chrony
folks, but that's a long term useful solution and not some 'scratch my
itch' side channel.
Thanks,
tglx
On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
>
> As I said in the other thread, that's just creating yet another private
> mechanism instead of collecting the counter value together with e.g.
> CLOCK_REALTIME

On the plus side, at least he wasn't providing a counter value at *all*
for the system timestamps, which is better than using a bogus one :)

Can we have a signed-off-by for your ktime_get_snapshot_id() please?

> or utilizing the PTM correlated one which is available in
> get_device_system_crosststamp().

AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
driver literally did expose the cycle count used to create the device
timestamp, which is equivalent to PTM and looked correct for that
part?

> Can we please stop creating specialized interfaces and instead make them
> generic, so they can be used for everything?

Of course.

> Then you can go and extend the posix-timer interface with
> clock_set_time_reference() (or whatever name we come up with) and
> provide the functionality for all steerable clocks. That'd allow chronyd
> to completely ignore the kernel side NTP PLL and do everything in user
> space. That obviously needs some thought and input from the chrony
> folks, but that's a long term useful solution and not some 'scratch my
> itch' side channel.

Yeah, that's a neat idea. I deliberately hadn't even *proposed* a
userspace API for that at all yet; for the timekeeping part I'm just
working on the basic *concepts* and the accounting fixes that make it
all actually work, with a hack to unconditionally call it directly from
vmclock for now. That is in order to solicit exactly that feedback and
design a long term solution that works for everyone, before going too
far down any particular implementation path.

I like clock_set_time_reference(). I'll have a play and see what I can
come up with. It would want to carry error bounds information too.

Having a clock_get_time_reference() would be nice too for QEMU to use,
but that would just be a snapshot and wouldn't get updated when the
clock is adjusted, whereas the /dev/vmclock_host thing I have in my tree
right now can at least use a gtod notifier, and the userspace device is
pollable. And it can export everything we need in one go.

More thought required on that one... but I'm very keen *not* to let
that one get forgotten, because I want this to work optimally for the
general case of QEMU running on a standard general purpose host, not
*only* the dedicated hosting setup where userspace is prepared to do
all the work.
On Thu, May 21 2026 at 22:06, David Woodhouse wrote:
> On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
>>
>> As I said in the other thread, that's just creating yet another private
>> mechanism instead of collecting the counter value together with e.g.
>> CLOCK_REALTIME
>
> On the plus side, at least he wasn't providing a counter value at *all*
> for the system timestamps, which is better than using a bogus one :)
At least ... for now :)
> Can we have a signed-off-by for your ktime_get_snapshot_id() please?
Are you kidding? That's a PoC to demonstrate how it should be done and
it needs some thought to implement it correctly along with the
get_device_system_crosststamp() one, which is actually not entirely
correct as I noticed a few minutes ago.
>> or utilizing the PTM correlated one which is available in
>> get_device_system_crosststamp().
>
> AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
> driver literally did expose the cycle count used to create the device
> timestamp, which is equivalent to PTM and looked correct for that
> part?
The vmclock driver lives in its own made-up world, so yes this looks
consistent at first glance.
Thanks,
tglx
On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 22:06, David Woodhouse wrote:
> > On Thu, 2026-05-21 at 20:30 +0200, Thomas Gleixner wrote:
> > >
> > > As I said in the other thread, that's just creating yet another private
> > > mechanism instead of collecting the counter value together with e.g.
> > > CLOCK_REALTIME
> >
> > On the plus side, at least he wasn't providing a counter value at *all*
> > for the system timestamps, which is better than using a bogus one :)
>
> At least ... for now :)
>
> > Can we have a signed-off-by for your ktime_get_snapshot_id() please?
>
> Are you kidding? That's a PoC to demonstrate how it should be done and
> it needs some thought to implement it correctly along with the
> get_device_system_crosststamp() one, which is actually not entirely
> correct as I noticed a few minutes ago.

Obviously. But to take a PoC and then do that thought and turn it into
something we can use, it still needs a Co-developed-by: and thus a
Signed-off-by: if you would be so kind.

> > > or utilizing the PTM correlated one which is available in
> > > get_device_system_crosststamp().
> >
> > AFAICT that was the *only* one he was exposing, wasn't it? The vmclock
> > driver literally did expose the cycle count used to create the device
> > timestamp, which is equivalent to PTM and looked correct for that
> > part?
>
> The vmclock driver lives in its own made-up world, so yes this looks
> consistent at first glance.

Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.

And they *use* it... and wait, get_device_system_crosststamp() already
*does* require the device to generate a system_counterval_t, so your
nightmare world where driver authors might pull it out of their
posterior *already* exists, doesn't it?

And we have things like stmmac which already populate it using
CSID_X86_ART.

So at least for PTP_SYS_OFFSET_PRECISE, isn't Arthur's patch literally
only exporting the same counter values that the driver *already*
creates? I'm not quite sure why we have all these histrionics about
drivers not being able to create those reliably?

Yes, there's plenty to improve as discussed, and we should probably
have get_device_system_crosststamp() copy the values from the
system_counterval on its local stack into the system_device_crosststamp
rather than asking the driver to pass it back through separate fields
in the attributes.
On Fri, May 22 2026 at 11:01, David Woodhouse wrote:
> On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> Obviously. But to take a PoC and then do that thought and turn it into
> something we can use, it still needs a Co-developed-by: and thus a
> Signed-off-by: if you would be so kind.
git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
is the work in progress state as of now. I'm not going to touch it in
the next days and it's still in a rough uncompiled state. It lacks quite
some change logs and the last patch needs to be split up.
I'll go and have a look next week so that I can rethink the approach
with a clear mind.
> Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
> drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.
>
> And they *use* it... and wait, get_device_system_crosststamp() already
> *does* require the device to generate a system_counterval_t, so your
> nightmare world where driver authors might pull it out of their
> posterior *already* exists, doesn't it?
It exists. Because get_device_system_crosststamp() does _NOT_ propagate
the counter values after converting them to actual TSC values. The
half-baked snippet I provided you earlier does exactly that (but wrong).
The version in the git branch should be halfway functional.
> And we have things like stmmac which already populate it using
> CSID_X86_ART.
It does not populate back into PTP land. That's a
get_device_system_crosststamp() internal handshake where the driver
callback provides the PTM time stamp and tells the core which clock
source it is based on. The core converts it to the system clocksource
cycles, e.g. ART to TSC, and then calculates MONO_RAW and REALTIME from
it, optionally with an extra snapshot that allows historical
interpolation for devices where the timestamp retrieval takes ages.
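The conversion step described here (the driver hands back a counter
value plus the ID of the clock it was latched from, and the core
rescales it into clocksource cycles) can be sketched in userspace C. This
is a minimal sketch, not the kernel code: it assumes the base clock
relates to the clocksource by a numerator/denominator ratio plus an
offset, and the names base_clock, scale_cycles and base_to_cs_cycles are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a base clock (e.g. ART) relative to the
 * clocksource (e.g. TSC): cs_cycles = base_cycles * num / den + offset. */
struct base_clock {
	uint32_t numerator;
	uint32_t denominator;
	uint64_t offset;
};

/* Scale *val by num/den. Splitting val into quotient and remainder of
 * the denominator keeps the remainder product within 64 bits. */
static bool scale_cycles(uint64_t *val, uint32_t num, uint32_t den)
{
	uint64_t quot, rem;

	if (!num || !den)
		return false;

	quot = *val / den;
	rem = *val % den;
	*val = quot * num + (rem * num) / den;
	return true;
}

/* Convert a base-clock counter value into clocksource cycles. The input
 * value is left untouched; the result is written to *cs_cycles. */
static bool base_to_cs_cycles(const struct base_clock *base,
			      uint64_t base_cycles, uint64_t *cs_cycles)
{
	uint64_t cycles = base_cycles;

	if (!scale_cycles(&cycles, base->numerator, base->denominator))
		return false;

	*cs_cycles = cycles + base->offset;
	return true;
}
```

The driver never needs to know the ratio or the offset; it only reports
the raw value and which clock it belongs to.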
> So at least for PTP_SYS_OFFSET_PRECISE, isn't Arthur's patch literally
> only exporting the same counter values that the driver *already*
> creates? I'm not quite sure why we have all these histrionics about
> drivers not being able to create those reliably?
The driver reads it from the hardware but it does not know how to
convert them back to TSC or anything else. For the driver it's an opaque
piece of data which it read out of a register or got retrieved through a
firmware query.
> Yes, there's plenty to improve as discussed, and we should probably
> have get_device_system_crosststamp() copy the values from the
> system_counterval on its local stack into the system_device_crosststamp
> rather than asking the driver to pass it back through separate fields
> in the attributes.
See the original snippet and the git tree how that is done by extending
the cross time stamp structure and storing all the information there,
which is what PTP hands in:
    system_cross_timestamp sct;

    ptp->info->getcrosstimestamp(..., &sct)
      driver_getcrosstimestamp(...., *sct) {
        get_device_system_crosststamp(callback, context, ..., sct) {
          system_counterval_t scv;
          ktime_t device_time;

          do {
            ...
            callback(&device_time, &scv, context) {
              read_snapshot(&pch_time, &ptm_time);
              *device_time = munge(pch_time);
              scv->cycles = ptm_time;
              scv->cs_id = ART;
            }
            ....
            cs_cycles = convert_ptm_to_cs(scv.cycles, scv.cs_id);
            real = timekeeping_convert_to_real(cs_cycles);
            raw = timekeeping_convert_to_raw(cs_cycles);
          } while (seq_retry());

          sct->device = device_time;
          sct->real = real;
          sct->raw = raw;
        }
So the new parts are that system_cross_timestamp gains a
system_counter_val and get_device_system_crosststamp() fills that in:
          sct->counter.cycles = cs_cycles;
          sct->counter.cs_id = csid;
On X86 you get the TSC cycles (derived from ART) and CSID_X86_TSC.
That goes all the way back to the PTP layer. Which means magically _all_
existing users of get_device_system_crosststamp() will provide that data
out of the box.
The existing PRECISE usecase will just ignore sct.counter. Your new
stuff can use it and fill in the related attributes in the user space
attr struct.
This raises an interesting question. Must any of the existing PTM using
drivers implement that new extended getcrosstimestampattr() callback, in
order to expose the cycles/csid in attr or can you fall back to the
existing callback and have the rest of the fields 0?
Same question arises if you change the pre/post timestamp helpers to
utilize ktime_get_snapshot_id(). All existing drivers which use them
will then automatically retrieve cs_cycles/cs_id.
The other change I did to get_device_system_crosststamp() is to let the
PTP core hand in the clock ID, so it can retrieve either REALTIME or AUX
clocks, which enables the whole AUX world to utilize PTM too once the
PTP IOCTL is updated accordingly.
Can you please make the new PTP_SYS_OFFSET_PRECISE_ATTRS and
PTP_SYS_OFFSET_EXTENDED_ATTRS so that user space can convey the CLOCK
ID, like it does today with PTP_SYS_OFFSET_EXTENDED?
Thanks,
tglx
On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>
> git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
system_device_crosststamp") my version ditched the system_counterval_t
on the stack and just used the one in xtstamp directly.
The convert_base_to_cs() function probably wants to scv->id=cs->id for
itself anyway; otherwise it's leaving behind an inconsistent
system_counterval_t object which... will lead to exactly the bug my
first version of that had, that I see you avoided :)
On Fri, May 22 2026 at 17:50, David Woodhouse wrote:
> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>>
>> git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
>
> In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
> system_device_crosststamp") my version ditched the system_counterval_t
> on the stack and just used the one in xtstamp directly.
Which is wrong. I did it the way I did for a very good reason.
> The convert_base_to_cs() function probably wants to scv->id=cs->id for
> itself anyway; otherwise it's leaving behind an inconsistent
> system_counterval_t object which... will lead to exactly the bug my
> first version of that had, that I see you avoided :)
No. It can't because that would corrupt the object for the retry case,
which would then hand back the wrong value.
The object _IS_ consistent because the csid in there is related to the
PTM value and not to the clocksource. The function updates the @cycles
value and leaves everything else untouched. The clock ID for the @cycles
value is guaranteed to be the clock ID of the system clocksource, so
using this is the right thing to do.
Just because it looks tempting or your AI buddy told you so doesn't make
it correct.
Thanks,
tglx
On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 17:50, David Woodhouse wrote:
>> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>>>
>>> git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
>>
>> In 94dd85a8d0a ("timekeeping: Add system_counterval_t to struct
>> system_device_crosststamp") my version ditched the system_counterval_t
>> on the stack and just used the one in xtstamp directly.
>
> Which is wrong. I did it the way I did for a very good reason.
>
>> The convert_base_to_cs() function probably wants to scv->id=cs->id for
>> itself anyway; otherwise it's leaving behind an inconsistent
>> system_counterval_t object which... will lead to exactly the bug my
>> first version of that had, that I see you avoided :)
>
> No. It can't because that would corrupt the object for the retry case,
> which would then hand back the wrong value.
>
> The object _IS_ consistent because the csid in there is related to the
> PTM value and not to the clocksource. The function updates the @cycles
> value and leaves everything else untouched. The clock ID for the @cycles
> value is guaranteed to be the clock ID of the system clocksource, so
> using this is the right thing to do.
>
> Just because it looks tempting or your AI buddy told you so doesn't make
> it correct.
And it's worse. We both are wrong :)
There is an existing bug in that code for the retry case. Fix below.
Thanks,
tglx
---
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1343,12 +1343,14 @@ static bool convert_clock(u64 *val, u32
 	return true;
 }
 
-static bool convert_base_to_cs(struct system_counterval_t *scv)
+static bool convert_base_to_cs(struct system_counterval_t *scv, u64 *cycles)
 {
 	struct clocksource *cs = tk_core.timekeeper.tkr_mono.clock;
 	struct clocksource_base *base;
 	u32 num, den;
 
+	*cycles = scv->cycles;
+
 	/* The timestamp was taken from the time keeper clock source */
 	if (cs->id == scv->cs_id)
 		return true;
@@ -1364,10 +1366,10 @@ static bool convert_base_to_cs(struct sy
 	num = scv->use_nsecs ? cs->freq_khz : base->numerator;
 	den = scv->use_nsecs ? USEC_PER_SEC : base->denominator;
 
-	if (!convert_clock(&scv->cycles, num, den))
+	if (!convert_clock(cycles, num, den))
 		return false;
 
-	scv->cycles += base->offset;
+	*cycles += base->offset;
 	return true;
 }
 
@@ -1479,9 +1481,8 @@ int get_device_system_crosststamp(int (*
 		 * installed timekeeper clocksource
 		 */
 		if (system_counterval.cs_id == CSID_GENERIC ||
-		    !convert_base_to_cs(&system_counterval))
+		    !convert_base_to_cs(&system_counterval, &cycles))
 			return -ENODEV;
-		cycles = system_counterval.cycles;
 
 		/*
 		 * Check whether the system counter value provided by the
On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
>
> There is an existing bug in that code for the retry case. Fix below.

There is none. It's just too hot to think straight. The counterval is
updated once per retry ....
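The point of the retraction, that the counter value is rewritten on
every pass of the retry loop, can be illustrated with a small userspace
simulation. This is a sketch under stated assumptions only: every name
is hypothetical, the "conversion" is a stand-in multiplication, and the
forced retries merely imitate a concurrent timekeeper update.

```c
#include <stdint.h>

struct counterval {
	uint64_t cycles;
};

/* Stand-in for the driver callback: always reports the same latched
 * hardware value, overwriting whatever was in *scv before. */
static void device_callback(struct counterval *scv)
{
	scv->cycles = 1000;
}

/* Stand-in for an in-place base-to-clocksource conversion. */
static void convert_in_place(struct counterval *scv)
{
	scv->cycles = scv->cycles * 3 + 7;
}

/* Run the read loop, forcing 'retries' extra passes to simulate the
 * sequence count changing underneath the reader. */
static uint64_t read_crosststamp(unsigned int retries)
{
	struct counterval scv;
	uint64_t cycles;

	do {
		device_callback(&scv);   /* refilled on every pass */
		convert_in_place(&scv);  /* so it is never converted twice */
		cycles = scv.cycles;
	} while (retries--);

	return cycles;
}
```

Because the callback sits inside the loop, the result is the same no
matter how many passes are taken; in-place conversion would only go
wrong if the callback ran once, outside the loop.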
On 24 May 2026 17:36:04 BST, Thomas Gleixner <tglx@kernel.org> wrote:
>On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
>> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
>>
>> There is an existing bug in that code for the retry case. Fix below.
>
>There is none. It's just too hot to think straight. The counterval is
>updated once per retry ....

Yeah, and setting the csid in it at the same time as changing the
actual cycle count seemed to make a lot of sense to me. I didn't even
ask the AI friend about that; it's entirely crap at anything where you
have to take the blinkers off. But it *can* type fast and do test
iterations, so it has its place as long as you know you can't trust
anything it says :)
On Sun, May 24 2026 at 17:37, Thomas Gleixner wrote:
> On Sun, May 24 2026 at 17:15, Thomas Gleixner wrote:
> And it's worse. We both are wrong :)
>
> There is an existing bug in that code for the retry case. Fix below.

I've updated the git branch accordingly.
On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 11:01, David Woodhouse wrote:
> > On Fri, 2026-05-22 at 10:02 +0200, Thomas Gleixner wrote:
> > Obviously. But to take a PoC and then do that thought and turn it into
> > something we can use, it still needs a Co-developed-by: and thus a
> > Signed-off-by: if you would be so kind.
>
> git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git timers/ptp/timekeeping
>
> is the work in progress state as of now. I'm not going to touch it in
> the next days and it's still in a rough uncompiled state. It lacks quite
> some change logs and the last patch needs to be split up.
>
> I'll go and have a look next week so that I can rethink the approach
> with a clear mind.
Thanks. I'll have a play with it. With ptp_read_system_p{re,ost}ts()
also populating pre/post system_counterval_t fields in the struct
ptp_system_timestamp, I can do a bit more cleanup of vmclock than you
have there by using them; I'll work that in.
> > Heh, the 'made up world' of which you speak is KVM. The older KVM PTP
> > drivers get a CSID_X86_TSC or CSID_ARM_ARCH_COUNTER value too.
> >
> > And they *use* it... and wait, get_device_system_crosststamp() already
> > *does* require the device to generate a system_counterval_t, so your
> > nightmare world where driver authors might pull it out of their
> > posterior *already* exists, doesn't it?
>
> It exists. Because get_device_system_crosststamp() does _NOT_ propagate
> the counter values after converting them to actual TSC values. The
> half-baked snippet I provided you earlier does exactly that (but wrong).
> The version in the git branch should be halfway functional.
Ah, I see it. convert_base_to_cs().
> The existing PRECISE usecase will just ignore sct.counter. Your new
> stuff can use it and fill in the related attributes in the user space
> attr struct.
Perfect.
> This raises an interesting question. Must any of the existing PTM using
> drivers implement that new extended getcrosstimestampattr() callback, in
> order to expose the cycles/csid in attr or can you fall back to the
> existing callback and have the rest of the fields 0?
>
> Same question arises if you change the pre/post timestamp helpers to
> utilize ktime_get_snapshot_id(). All existing drivers which use them
> will then automatically retrieve cs_cycles/cs_id.
Taking those in reverse order... yes, this means that with a new
variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
values even for the system parts of those ABA timestamps, even for non-
PTM clocks, and discipline the TSC/archcounter against the external
clock.
Currently I have userspace which literally does rdtsc() either side of
calling the ioctl :)
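The ABA approach mentioned here (read the system side, read the device,
read the system side again) reduces to simple arithmetic once the three
values are in hand. What follows is a minimal sketch, assuming all three
readings have already been converted to nanoseconds; the struct layout
and the function name aba_best_offset are hypothetical and not part of
any PTP interface.

```c
#include <stddef.h>
#include <stdint.h>

/* One A-B-A sample: system time before, device time, system time after.
 * All values in nanoseconds; the layout is illustrative only. */
struct aba_sample {
	int64_t pre_ns;
	int64_t dev_ns;
	int64_t post_ns;
};

/* Pick the sample with the narrowest pre/post window and estimate the
 * device-minus-system offset at the window midpoint. Returns the window
 * width, which bounds the uncertainty, or -1 if no sample is usable. */
static int64_t aba_best_offset(const struct aba_sample *s, size_t n,
			       int64_t *offset_ns)
{
	int64_t best_window = -1;

	for (size_t i = 0; i < n; i++) {
		int64_t window = s[i].post_ns - s[i].pre_ns;

		if (window < 0)
			continue;	/* system clock stepped mid-sample */
		if (best_window < 0 || window < best_window) {
			best_window = window;
			*offset_ns = s[i].dev_ns - (s[i].pre_ns + window / 2);
		}
	}
	return best_window;
}
```

The same arithmetic applies whether the A readings are system
timestamps or raw counter values; only the units change.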
And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
through get_device_system_crosststamp() as described, and all just
works? It's just that we now allow userspace to *see* the counter value
that the driver was already generating.
So to your questions: although there's new userspace ioctl support, the
*drivers* don't need any modification for that (as long as they use the
standard prets/postts helpers).
The remaining question is the device timestamp part (the 'B' in the ABA
sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
that get a counterval?
I don't have a strong opinion. On one hand we'd have to find a way to
convert it from PTM for devices where it actually *is* PTM, and that's
what PTP_SYS_OFFSET_PRECISE is *for*.
But on the other hand, can't the conversion be a whole lot simpler than
get_device_system_crosststamp() because it's not actually dealing with
any timekeepers; it's basically only invoking convert_base_to_cs()?
And the ioctl should support it *all* but just have a clear way of
indicating that any of the optional fields including the attrs are
*not* populated (or use 0/max values maybe?).
So no, I don't think any driver *has* to add any attr support in order
to expose counter values to userspace. The only reason I asked Arthur
to mix those things up was for the *userspace* API, to avoid adding yet
another ioctl over and over again. And now I feel bad for doing so :)
> The other change I did to get_device_system_crosststamp() is to let the
> PTP core hand in the clock ID, so it can retrieve either REALTIME or AUX
> clocks, which enables the whole AUX world to utilize PTM too once the
> PTP IOCTL is updated accordingly.
>
> Can you please make the new PTP_SYS_OFFSET_PRECISE_ATTRS and
> PTP_SYS_OFFSET_EXTENDED_ATTRS so that user space can convey the CLOCK
> ID, like it does today with PTP_SYS_OFFSET_EXTENDED?
Ack (on Arthur's behalf).
On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
>> This raises an interesting question. Must any of the existing PTM using
>> drivers implement that new extended getcrosstimestampattr() callback, in
>> order to expose the cycles/csid in attr or can you fall back to the
>> existing callback and have the rest of the fields 0?
>>
>> Same question arises if you change the pre/post timestamp helpers to
>> utilize ktime_get_snapshot_id(). All existing drivers which use them
>> will then automatically retrieve cs_cycles/cs_id.
>
> Taking those in reverse order... yes, this means that with a new
> variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
> values even for the system parts of those ABA timestamps, even for non-
> PTM clocks, and discipline the TSC/archcounter against the external
> clock.
Correct.
> Currently I have userspace which literally does rdtsc() either side of
> calling the ioctl :)
Why am I not surprised? :)
> And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
> through get_device_system_crosststamp() as described, and all just
> works? It's just that we now allow userspace to *see* the counter value
> that the driver was already generating.
A new variant of PRECISE
> So to your questions: although there's new userspace ioctl support, the
> *drivers* don't need any modification for that (as long as they use the
> standard prets/postts helpers).
Yes.
> The remaining question is the device timestamp part (the 'B' in the ABA
> sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
> that get a counterval?
PTM-capable drivers support cross timestamps, which, with a new
version of PTP_SYS_OFFSET_PRECISE, will expose the system counterval. No
ABA for that as it's hardware latched AB.
> I don't have a strong opinion. On one hand we'd have to find a way to
> convert it from PTM for devices where it actually *is* PTM, and that's
> what PTP_SYS_OFFSET_PRECISE is *for*.
Correct.
> But on the other hand, can't the conversion be a whole lot simpler than
> get_device_system_crosststamp() because it's not actually dealing with
> any timekeepers; it's basically only invoking convert_base_to_cs()?
But what for? If you have PTM, use PRECISE. There is _zero_ value of
having pre/post timestamps when the hardware already does the correlated
precise sampling, no?
> And the ioctl should support it *all* but just have a clear way of
> indicating that any of the optional fields including the attrs are
> *not* populated (or use 0/max values maybe?).
Yes.
> So no, I don't think any driver *has* to add any attr support in order
> to expose counter values to userspace. The only reason I asked Arthur
> to mix those things up was for the *userspace* API, to avoid adding yet
> another ioctl over and over again. And now I feel bad for doing so :)
I think you can create _one_ data structure variant, which fits both
EXTENDED_ATTR and PRECISE_ATTR:
struct attrs {
	u32	valid;
	u32	error_bound;
	....
	u32	reserved[N];
};
@valid tells user space, which of the attributes has been filled in by
the driver. That avoids bounds based validity checks, which are a pain
as you might end up with different bounds for every attribute. Having a
valid flags field avoids that completely.
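A minimal sketch of how a consumer might use such a @valid field,
assuming one bit per attribute. The bit names, the example_b member and
the accessor attrs_get_error_bound() are hypothetical; the real names
and values would come from the UAPI header once the interface is
settled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute bits, one per optional field. */
#define ATTR_VALID_ERROR_BOUND	(1u << 0)
#define ATTR_VALID_EXAMPLE_B	(1u << 1)

struct attrs {
	uint32_t valid;		/* which of the fields below are populated */
	uint32_t error_bound;
	uint32_t example_b;
	uint32_t reserved[5];
};

/* Read an attribute only if the producer marked it valid. A value of 0
 * in error_bound then remains a legitimate reading rather than a
 * "not set" sentinel. */
static bool attrs_get_error_bound(const struct attrs *a, uint32_t *out)
{
	if (!(a->valid & ATTR_VALID_ERROR_BOUND))
		return false;
	*out = a->error_bound;
	return true;
}
```

With an all-zero struct the accessor reports "not available", which is
exactly what a driver without attribute support would produce.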
struct devtime {
	struct ptp_clock_time	device_time;
	struct attrs		attrs;
};

struct systime {
	u64	sys_systime;
	u64	sys_rawtime;
	u64	sys_counter;
	u32	sys_counter_id;
	u32	reserved;
};
Exposing @sys_counter_id requires exposing CSID_* in the user space ABI
reliably, as otherwise a kernel internal CSID enum change would blow up
the user space guess work. Your ptp_counter_id approach is error prone.
struct timestamp {
	union {
		struct systime	systime;
		struct systime	pre_systime;
	};
	struct devtime	devtime;
	struct systime	post_systime;
};

struct request {
	u32		valid;
	clockid_t	clock_id;
	unsigned int	num_samples;
	u32		reserved[N];
};
I'd rather have @valid here too. The 'zero the reserved' members approach
is a pain as new kernels have to map 0 to default behavior instead of
being free to make 0 mean what they intend. @valid allows you to use
other sizes than u32 for future fields. All you have to take care of is
to keep the existing fields at the same place as before.
struct ioctl_data {
	struct request		request;
	struct timestamp	timestamps[];
};
So for both PTP_SYS_OFFSET_EXTENDED_ATTRS and
PTP_SYS_OFFSET_PRECISE_ATTRS user space allocates enough space to
accommodate data::request::num_samples.
For PTP_SYS_OFFSET_PRECISE_ATTRS num_samples has to be 1 and
data::timestamps[0].post_systime is zeroed by the kernel because it has
no meaning.
So now in the kernel you do:
static long ptp_sys_offset_extended_attrs(struct ptp_clock *ptp, void __user *argptr)
{
	struct ioctl_data __user *data = argptr;
	struct request request;
	int ret;

	if (copy_from_user(&request, &data->request, sizeof(request)))
		return -EFAULT;

	if (!extattr_request_valid(request))
		return -EINVAL;

	for (unsigned int i = 0; i < request.num_samples; i++) {
		struct ptp_system_timestamp sts = { .clock_id = request.clock_id, };
		struct timestamp uts = { };
		struct timespec64 dev_ts;

		/* Prefer the attribute-aware callback, else the plain one */
		if (ptp->info->gettimex64_attr)
			ret = ptp->info->gettimex64_attr(ptp->info, &dev_ts, &sts,
							 &uts.devtime.attrs);
		else if (ptp->info->gettimex64)
			ret = ptp->info->gettimex64(ptp->info, &dev_ts, &sts);
		else
			return -EOPNOTSUPP;

		if (ret)
			return ret;

		uts.pre_systime = mangle(sts.pre_systime);
		uts.devtime.device_time = mangle(dev_ts);
		uts.post_systime = mangle(sts.post_systime);

		/* copy_to_user() returns the number of bytes NOT copied */
		if (copy_to_user(&data->timestamps[i], &uts, sizeof(uts)))
			return -EFAULT;
	}
	return 0;
}
static long ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *argptr)
{
	struct ioctl_data __user *data = argptr;
	struct request request;
	int ret;

	if (copy_from_user(&request, &data->request, sizeof(request)))
		return -EFAULT;

	if (!preciseattr_request_valid(request))
		return -EINVAL;

	struct system_device_crosststamp xtstamp = { .clock_id = request.clock_id, };
	struct timestamp uts = { };

	/* Prefer the attribute-aware callback, else the plain one */
	if (ptp->info->getcrosststamp_attr)
		ret = ptp->info->getcrosststamp_attr(ptp->info, &xtstamp,
						     &uts.devtime.attrs);
	else if (ptp->info->getcrosststamp)
		ret = ptp->info->getcrosststamp(ptp->info, &xtstamp);
	else
		return -EOPNOTSUPP;

	if (ret)
		return ret;

	/* post_systime stays zeroed: it has no meaning for PRECISE */
	uts.systime = mangle(xtstamp.systime);
	uts.devtime.device_time = mangle(xtstamp.device);

	/* copy_to_user() returns the number of bytes NOT copied */
	if (copy_to_user(&data->timestamps[0], &uts, sizeof(uts)))
		return -EFAULT;
	return 0;
}
Or something like this, which immediately enables the functionality for
all drivers which implement the getcrosststamp() or the gettimex64()
callbacks with a unified user space data structure.
The attributes.valid bits are all zero, and once drivers implement
the _attr callback variants, those attributes supported by the driver
will magically appear with the corresponding valid bits set.
Hmm?
Thanks,
tglx
On Sun, 2026-05-24 at 14:36 +0200, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 17:23, David Woodhouse wrote:
> > On Fri, 2026-05-22 at 17:28 +0200, Thomas Gleixner wrote:
> > > This raises an interesting question. Must any of the existing PTM using
> > > drivers implement that new extended getcrosstimestampattr() callback, in
> > > order to expose the cycles/csid in attr or can you fall back to the
> > > existing callback and have the rest of the fields 0?
> > >
> > > Same question arises if you change the pre/post timestamp helpers to
> > > utilize ktime_get_snapshot_id(). All existing drivers which use them
> > > will then automatically retrieve cs_cycles/cs_id.
> >
> > Taking those in reverse order... yes, this means that with a new
> > variant of PTP_SYS_OFFSET_EXTENDED, userspace can see actual counter
> > values even for the system parts of those ABA timestamps, even for
> > non-PTM clocks, and discipline the TSC/archcounter against the
> > external clock.
>
> Correct.
>
> > Currently I have userspace which literally does rdtsc() either side of
> > calling the ioctl :)
>
> Why am I not surprised? :)

To be fair, I *told* them to do it like that in the short term, knowing
it would annoy me enough to chase up the cycles-in-PTP thing. And hey,
it worked :)

> > And PTM devices can be used with PTP_SYS_OFFSET_PRECISE, which goes
> > through get_device_system_crosststamp() as described, and all just
> > works? It's just that we now allow userspace to *see* the counter value
> > that the driver was already generating.
>
> A new variant of PRECISE

Right.

> > So to your questions: although there's new userspace ioctl support, the
> > *drivers* don't need any modification for that (as long as they use the
> > standard prets/postts helpers).
>
> Yes.
>
> > The remaining question is the device timestamp part (the 'B' in the ABA
> > sandwich) for PTP_SYS_OFFSET_EXTENDED with PTM-capable drivers. Should
> > that get a counterval?
>
> PTM-capable drivers support cross timestamps, which, with a new
> version of PTP_SYS_OFFSET_PRECISE, will expose the system counterval. No
> ABA for that as it's hardware latched AB.
>
> > I don't have a strong opinion. On one hand we'd have to find a way to
> > convert it from PTM for devices where it actually *is* PTM, and that's
> > what PTP_SYS_OFFSET_PRECISE is *for*.
>
> Correct.
>
> > But on the other hand, can't the conversion be a whole lot simpler than
> > get_device_system_crosststamp() because it's not actually dealing with
> > any timekeepers; it's basically only invoking convert_base_to_cs()?
>
> But what for? If you have PTM, use PRECISE. There is _zero_ value of
> having pre/post timestamps when the hardware already does the correlated
> precise sampling, no?

The PTM mode and support of PRECISE (or the variant) is currently
fairly esoteric: very few devices support it. So I'm not sure we should
expect generic userspace to always even try.

So there may be some merit in having EXTENDED use the precise hardware
paired timestamp. Maybe we don't necessarily care about returning
*cycles* but if we *do* use a PTM-capable device (and I'm including the
virt TSC-based ones here too), then we kind of want the ABA *all* to be
at the same clock cycle. Which is what I've already done for vmclock.

> > And the ioctl should support it *all* but just have a clear way of
> > indicating that any of the optional fields including the attrs are
> > *not* populated (or use 0/max values maybe?).
>
> Yes.
>
> > So no, I don't think any driver *has* to add any attr support in order
> > to expose counter values to userspace. The only reason I asked Arthur
> > to mix those things up was for the *userspace* API, to avoid adding yet
> > another ioctl over and over again. And now I feel bad for doing so :)
>
> I think you can create _one_ data structure variant, which fits both
> EXTENDED_ATTR and PRECISE_ATTR:

<...>

Yeah, that looks eminently sensible. I've been feeding Arthur
suggestions along those lines but only nudges; you've fleshed it out in
*far* more detail; thanks!
On Sun, May 24 2026 at 14:13, David Woodhouse wrote:
> On Sun, 2026-05-24 at 14:36 +0200, Thomas Gleixner wrote:
>> > But on the other hand, can't the conversion be a whole lot simpler than
>> > get_device_system_crosststamp() because it's not actually dealing with
>> > any timekeepers; it's basically only invoking convert_base_to_cs()?
>>
>> But what for? If you have PTM, use PRECISE. There is _zero_ value of
>> having pre/post timestamps when the hardware already does the correlated
>> precise sampling, no?
>
> The PTM mode and support of PRECISE (or the variant) is currently
> fairly esoteric: very few devices support it. So I'm not sure we should
> expect generic userspace to always even try.
There are not so many PTM capable devices to begin with. And yes, user
space which cares about time and accuracy _should_ try it.
> So there may be some merit in having EXTENDED use the precise hardware
> paired timestamp. Maybe we don't necessarily care about returning
> *cycles* but if we *do* use a PTM-capable device (and I'm including the
> virt TSC-based ones here too), then we kind of want the ABA *all* to be
> at the same clock cycle. Which is what I've already done for vmclock.
If you can do ABA at the same clock cycle, then just implement the cross
timestamp callback and use that.
For PTM capable devices which lack cross timestamp support in the
driver, adding the magic PTM value field in the ABA timestamp won't make
it magically be filled in. So someone has to touch the driver anyway and
then adding the actual cross time support is not much more effort than
adding support for the new field in the extended callback.
Also user space which wants to use the cycles stuff needs to implement
the new IOCTLs anyway. The cycles won't show up magically in the
existing IOCTLs either. So if you make the data struct identical, then
it's really not rocket science to try precise first and then fall back to
extended.
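A minimal sketch of that try-precise-then-extended logic on the user
space side, assuming both queries fill the same result layout and that
an unsupported precise query is signalled with EOPNOTSUPP. The types
clock_sample and clock_ops and the stub backends are hypothetical; a
real program would wrap the respective ioctl() calls in the two hooks.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical unified result, mirroring the idea of one data layout
 * for both query types. */
struct clock_sample {
	int precise;		/* 1: cross timestamp, 0: pre/post sample */
	int64_t pre_ns;		/* system time (the only one if precise) */
	int64_t dev_ns;
	int64_t post_ns;	/* unused when precise */
};

/* Hypothetical backend hooks; each returns 0 or a negative errno. */
struct clock_ops {
	int (*query_precise)(struct clock_sample *out);
	int (*query_extended)(struct clock_sample *out);
};

/* Try the cross timestamp first; fall back to the pre/post variant only
 * when the device reports that precise sampling is unsupported. */
static int sample_clock(const struct clock_ops *ops, struct clock_sample *out)
{
	int ret = ops->query_precise(out);

	if (ret == 0) {
		out->precise = 1;
		return 0;
	}
	if (ret != -EOPNOTSUPP)
		return ret;	/* genuine failure: do not mask it */

	ret = ops->query_extended(out);
	if (ret == 0)
		out->precise = 0;
	return ret;
}

/* Illustrative stubs: a device without cross timestamp support. */
static int stub_precise_unsupported(struct clock_sample *out)
{
	(void)out;
	return -EOPNOTSUPP;
}

static int stub_extended(struct clock_sample *out)
{
	out->pre_ns = 100;
	out->dev_ns = 150;
	out->post_ns = 120;
	return 0;
}
```

Either way the caller has to look at the precise flag to know whether
post_ns carries any meaning.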
Actually with the identical data struct you could make that _ONE_ new
IOCTL and the kernel uses cross time stamps if the device supports it or
extended if not. All it has to do is to report the choice and therefore
the number and nature of the samples back to user space. Not rocket
science either.
But in both variants (separate or unified IOCTL) user space has to handle
the data sets correctly.
No strong opinion on that as I have no clue about user space :)
Thanks,
tglx
On Wed, May 20, 2026 at 01:21:46PM +0100, David Woodhouse wrote:
> On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> > On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > > The design has two major purposes:
> >
> > >  • Avoiding the redundant work of having *hundreds* of guests on the
> > >    same host *all* calibrating the same underlying oscillator, while
> > >    enjoying the added fun of steal time as they're trying to do so.
> >
> > But isn't that work still duplicated, only moved to the kernel?
>
> Not the actual calibration of the TSC against real time, no. It is the
> *host* which gets the 1PPS signal and does all the work of tracking and
> smoothing the frequency drift over time. The guest basically gets the
> same as a vDSO, *telling* it a relationship from TSC to real time.

Ok, but I don't see why the phase corrections of the guest need to be
in the kernel.

> > I don't like the idea of adding more clock control loops to the kernel
> > much.
>
> I completely agree. I am absolutely not planning to add any more clock
> control to the kernel than we already have. As you say, we probably
> have too many already.

If the vmclock driver is feeding the PLL with the offset between the
host and guest clocks, I think that would count as a loop.

> I'm not sure what scaling the guest TSC would buy us. Sure, it would
> minimise the frequency step at the moment of migration, but a naïve
> guest which isn't using vmclock's disruption signal is screwed on live
> migration *anyway*, because there's *also* a step change in the actual
> TSC value which is bounded by the real time synchronization of the
> source and destination host.

The TSC offset can be corrected too. I thought that was already
happening.

> AFAICT scaling the TSC would just add complexity and wouldn't help
> much.

I think it's a better place to be solving this kind of problem. It's
compensating for a hardware change. It doesn't need to happen only at
migration. You could adjust the frequency continuously if you really
wanted, kind of like synchronous ethernet is doing for clocks over
network, improving the stability of the physical clock and phase
corrections are done on top of it at a higher level.

> And TSC scaling is pretty much x86-specific; other architectures have a
> *defined* counter frequency and don't need to support scaling.

There can be a software fallback if hardware scaling and/or offset is
not supported.

> > > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > > main clock. It would be nice if that could be corrected in migrations.
> > >
> > > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > > is that it *isn't* skewed by NTP?
> >
> > It isn't adjusted, but it can be used as a stable reference avoiding
> > the multiplier-induced jitter, interference from other processes, and
> > synchronization loops, e.g. when an NTP client is synchronizing to an
> > NTP server running on the same system (in different containers).
>
> We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> we?
> (for TSC, read 'arch counter, timebase, etc.' - none of this is
> x86-specific but 'TSC' is quicker to type...)

Meaning userspace would have to duplicate the kernel's handling of the
counter (wrapping and scaling) just to avoid a single multiplication in
the vDSO?

--
Miroslav Lichvar
On Thu, 2026-05-21 at 08:35 +0200, Miroslav Lichvar wrote:
> On Wed, May 20, 2026 at 01:21:46PM +0100, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 12:39 +0200, Miroslav Lichvar wrote:
> > > On Tue, May 19, 2026 at 04:50:41PM +0100, David Woodhouse wrote:
> > > > The design has two major purposes:
> > >
> > > > • Avoiding the redundant work of having *hundreds* of guests on the
> > > > same host *all* calibrating the same underlying oscillator, while
> > > > enjoying the added fun of steal time as they're trying to do so.
> > >
> > > But isn't that work still duplicated, only moved to the kernel?
> >
> > Not the actual calibration of the TSC against real time, no. It is the
> > *host* which gets the 1PPS signal and does all the work of tracking and
> > smoothing the frequency drift over time. The guest basically gets the
> > same as a vDSO, *telling* it a relationship from TSC to real time.
>
> Ok, but I don't see why the phase corrections of the guest need to be
> in the kernel.
I'm not sure I understand.
There are no 'phase corrections' as such, except of course that the
phase of the guest kernel's clock does get corrected, and naturally
that does have to take effect inside the guest kernel.
I think the key here is that this is not a feedback loop based on
corrections to the existing clock output; this is a feedforward design
as described in https://dl.acm.org/doi/pdf/10.1109/TNET.2011.2158443
It seems that when Julien et al lamented that, "Until now, however,
there has been a serious practical issue inhibiting feed-forward
approaches: a lack of kernel support", the basics were actually there
in the kernel's core timekeeping all along.
We didn't have to *do* anything to the core timekeeping other than fix
a few bugs that the NTP feedback mechanism always masked — who *cares*
if there's a systematic +5PPM drift due to accounting errors, as NTP
can just interpret that as the counter running 5PPM fast and adjust for
it?
Although I don't think the errors are quite that consistent, as they
vary with tick length and even from tick to tick with the mult±1
dithering and interrupt latency — so I wouldn't be surprised if these
fixes made a detectable improvement even in the normal NTP case.
> > > I don't like the idea of adding more clock control loops to the kernel
> > > much.
> >
> > I completely agree. I am absolutely not planning to add any more clock
> > control to the kernel than we already have. As you say, we probably
> > have too many already.
>
> If the vmclock driver is feeding the PLL with the offset between the
> host and guest clocks, I think that would count as a loop.
It's not an offset; it's a direct feed-forward "when the TSC is <this>
the time is <this>" relationship, like a vDSO does.
https://uapi-group.org/specifications/specs/vmclock/
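
As a rough illustration of that feed-forward relationship, the sketch below
evaluates the linear function time = base + (counter - counter_value) × period
in plain C. The struct layout and the names ref_snapshot, ref_time and
ref_time_at() are hypothetical simplifications made up for this example; they
are not the vmclock ABI, whose real structure has more fields (see the
specification linked above). It also assumes a compiler that provides
unsigned __int128, such as GCC or Clang on a 64-bit target.

```c
#include <stdint.h>

/* Hypothetical, simplified snapshot of a vmclock-style reference.
 * Field names are illustrative only, not the real shared-page layout. */
struct ref_snapshot {
	uint64_t counter_value;   /* counter reading at the reference point */
	uint64_t time_sec;        /* reference time, whole seconds */
	uint64_t time_frac_sec;   /* reference time, fraction in units of 2^-64 s */
	uint64_t period_frac_sec; /* counter period, units of 2^-64 s per cycle */
};

struct ref_time {
	uint64_t sec;
	uint64_t frac_sec;        /* units of 2^-64 s */
};

/* time = base + (counter - counter_value) * period
 *
 * No feedback from the guest's own clock is involved: the result depends
 * only on the counter reading and the parameters supplied by the host. */
static struct ref_time ref_time_at(const struct ref_snapshot *s, uint64_t counter)
{
	uint64_t delta = counter - s->counter_value;
	/* 64x64 -> 128 bit product, in units of 2^-64 s */
	unsigned __int128 elapsed = (unsigned __int128)delta * s->period_frac_sec;
	/* Adding a 64-bit fraction to the product cannot overflow 128 bits */
	unsigned __int128 total = elapsed + s->time_frac_sec;
	struct ref_time t;

	t.sec = s->time_sec + (uint64_t)(total >> 64); /* carry into whole seconds */
	t.frac_sec = (uint64_t)total;                  /* remaining fraction */
	return t;
}
```

With a (deliberately unrealistic) period of half a second per cycle, three
cycles after a reference point at 10.0 s this yields 11.5 s.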
The core motivation is for virtual machines (and especially for
consistent time across live migration), but hardware implementations
should be possible using PCIe PTM. I keep meaning to get my hands on a
TimeCard and play, but there are only so many hours in the day...
> > I'm not sure what scaling the guest TSC would buy us. Sure, it would
> > minimise the frequency step at the moment of migration, but a naïve
> > guest which isn't using vmclock's disruption signal is screwed on live
> > migration *anyway*, because there's *also* a step change in the actual
> > TSC value which is bounded by the real time synchronization of the
> > source and destination host.
>
> The TSC offset can be corrected too. I thought that was already
> happening.
Yes, it is. The TSC offset (and the guest's KVM clock, which is a whole
different sad story) can be corrected a bit — but the *accuracy* with
which they can be corrected is limited to the accuracy of the source
vs. destination hosts' time synchronization.
If the guest has been using NTP or a PHC to discipline the counter of
the source host that it just came from, carefully tracking not only the
perceived time, but also error bounds in order to ensure coherency of,
say, a distributed database... there is no way that we can migrate it
to a new host and 'fake' the frequency/offset on the new host to
sufficiently match. Database corruption ensues.
The best thing to do is to advertise a disruption signal ("throw away
anything you know about the existing counter"), and provide information
on the new host in that {cycle_count, reference time, counter period,
error bounds} form to allow the guest to return to service as soon as
possible.
Which is precisely what vmclock does.
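
A minimal sketch of how a guest might act on such a disruption signal is shown
below. Everything in it is an assumption for illustration: the structures
host_clock_page and guest_clock_state and the function guest_refresh() are
hypothetical, the field set is reduced to the {cycle_count, reference time,
counter period, error bounds} tuple described above, and the locking or
sequence counting that a real shared page needs is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of what the host advertises. */
struct host_clock_page {
	uint64_t disruption_marker; /* changes when the counter was disrupted */
	uint64_t counter_value;     /* cycle count at the reference point */
	uint64_t time_sec;          /* reference time, whole seconds */
	uint64_t period_frac_sec;   /* counter period, units of 2^-64 s */
	uint64_t error_bound_ns;    /* host's stated error bound */
};

/* Hypothetical guest-side cache of the last parameters it adopted. */
struct guest_clock_state {
	bool valid;
	uint64_t seen_marker;
	struct host_clock_page params;
};

/* Adopt the host's current parameters.
 *
 * Returns true if the marker changed since the last refresh, meaning
 * anything the guest had learned about the old counter (frequency
 * estimates, error bounds) must be thrown away rather than blended
 * with the new values. */
static bool guest_refresh(struct guest_clock_state *g,
			  const struct host_clock_page *page)
{
	bool disrupted = g->valid && g->seen_marker != page->disruption_marker;

	g->params = *page; /* always take the host's values wholesale */
	g->seen_marker = page->disruption_marker;
	g->valid = true;
	return disrupted;
}
```

The point of the design is that the guest never tries to reconcile old and new
counters; it replaces its state in one step and can return to service using
the error bounds the new host supplies.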
> > AFAICT scaling the TSC would just add complexity and wouldn't help
> > much.
>
> I think it's a better place to be solving this kind of problems. It's
> compensating for a hardware change. It doesn't need to happen only at
> migration. You could adjust the frequency continuously if you really
> wanted, kind of like synchronous ethernet is doing for clocks over
> network, improving the stability of the physical clock and phase
> corrections are done on top of it at a higher level.
On the *host* side I might accept a PLL on the actual hardware
oscillator and the 1PPS signal... :)
> > And TSC scaling is pretty much x86-specific; other architectures have a
> > *defined* counter frequency and don't need to support scaling.
>
> There can be a software fallback if hardware scaling and/or offset is
> not supported.
Right. This *is* the software fallback, because the hardware scaling
and offset aren't sufficient even if we only care about x86 where the
former is supported.
> > > > > There is a work in progress for chrony to support MONOTONIC_RAW as the
> > > > > main clock. It would be nice if that could be corrected in migrations.
> > > >
> > > > Not sure I understand this. I thought the whole point of MONOTONIC_RAW
> > > > is that it *isn't* skewed by NTP?
> > >
> > > It isn't adjusted, but it can be used as a stable reference avoiding
> > > the multiplier-induced jitter, interference from other processes, and
> > > synchronization loops, e.g. when an NTP client is synchronizing to an
> > > NTP server running on the same system (in different containers).
> >
> > We could just use the TSC for this, instead of MONOTONIC_RAW, couldn't
> > we?
>
> > (for TSC, read 'arch counter, timebase, etc.' — none of this is x86-
> > specific but 'TSC' is quicker to type...)
>
> Meaning userspace would have to duplicate the kernel's handling of
> the counter (wrapping and scaling) just to avoid a single
> multiplication in the vDSO?
Hm yeah, I guess that makes sense.
The way I've done it in these proof of concept patches is counter-
based, because the interface between host and guest (and from that
theoretical hardware implementation) *is* necessarily in terms of the
hardware — we get told the relationship of the actual *counter* to
realtime.
But as long as the conversions in both directions are quick and
accurate there's no fundamental reason why it *couldn't* be expressed
in terms of MONOTONIC_RAW as it's being passed around.
In my RFC, it's just a call to timekeeping_set_reference() which uses
the *existing* mechanisms to just set tick_length and time_offset
accordingly. Which naturally takes counter-based units too.
But I certainly don't think that doing so *unconditionally* from the
vmclock driver in my proof of concept is the right thing to do.
Userspace needs to set policy like that.
And I wasn't stunningly happy with timekeeping_set_reference() passing
fractional seconds in the vmclock (seconds<<64) units instead of the
native (nanoseconds<<32) of the timekeeping code.
So maybe timekeeping_set_reference() should take its input in
MONOTONIC_RAW terms, and the raw information from vmclock should be
converted accordingly? I can try that...
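
For a sense of what that unit conversion involves, the sketch below converts
a fraction of a second expressed in (seconds<<64) units into the
(nanoseconds<<32) fixed-point form. The helper name frac_sec_to_ns_shift32()
is hypothetical and does not appear in the series; the code assumes
unsigned __int128 is available and simply truncates rather than rounds.

```c
#include <stdint.h>

#define EXAMPLE_NSEC_PER_SEC 1000000000ULL

/* Convert a fraction of a second in units of 2^-64 s into
 * nanoseconds scaled by 2^32 (truncating).
 *
 *   frac_sec * 1e9        -> nanoseconds scaled by 2^64
 *   (frac_sec * 1e9) >> 32 -> nanoseconds scaled by 2^32
 *
 * The intermediate needs up to 94 bits, hence the 128-bit type; the
 * result is below 1e9 << 32 and so fits comfortably in 64 bits. */
static uint64_t frac_sec_to_ns_shift32(uint64_t frac_sec)
{
	unsigned __int128 ns_shift64 =
		(unsigned __int128)frac_sec * EXAMPLE_NSEC_PER_SEC;

	return (uint64_t)(ns_shift64 >> 32);
}
```

For example, half a second (1 << 63 in the input units) comes out as
500000000 << 32.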
On the *host* side, I anticipate two modes of operation.
A dedicated hosting environment only really cares about disciplining
the host kernel's TSC, and absolutely doesn't *care* about the host
kernel's timekeeping. That's just for logs.
For migrating KVM guests as accurately as possible, we set the guest
*TSC* (scaling and) offset based on our understanding of the host TSC
on both source and destination. The KVM APIs for doing this based on
the kernel's own CLOCK_REALTIME are... a source of sadness. There's a
whole 30-patch series in flight to deal with that, which you can look
at if you like pain, but the tl;dr is that we get the host kernel's
timekeeping out of the picture as *much* as possible and operate in
terms of the TSC. Migrate the guest kernel's TSC as accurately as
possible, and everything *else* in the guest is derived from that.
So in that dedicated environment, userspace will take our hardware
devices which literally latch the *counter* value on a 1PPS signal, or
use NTP if they really have to fall back to that, and discipline the
*counter*, then use that information to both provide the vmclock for
guests, and migrate guests as accurately as possible. All in userspace,
*necessarily* in raw counter terms.
But hey, it's nice for logs to have good timestamps too, so we can feed
it to the kernel's CLOCK_REALTIME as an afterthought. Probably by using
a userspace hook for timekeeping_set_reference(). I haven't yet looked
at whether the existing adjtimex() can be used/abused/extended to allow
for precisely setting tick_length/time_offset like that.
And then there's the 'normal' host side, with a host kernel running
chrony and a few guests in QEMU. Obviously this mode needs to be
properly taken into account as a first class citizen, which is why I've
built the support that's already *in* QEMU (disruption signal only) and
now the vmclock_host and additional QEMU patch to expose that.
Again it needs to be in terms of the guest TSC by the time the VMM
actually puts it in the shared page, but I'm entirely open to input on
how we get it *out* of the kernel's timekeeping. I do tend to have the
opinion that what we should expose to guests is the "intended" clock,
with ntpdata->time_offset built in and *not* including the constant ±1
changes to 'mult' from the dithering, but using the *actual* intended
frequency from tick_length / cycle_interval.
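
Read literally, "tick_length / cycle_interval" divides the intended
nanoseconds per tick by the counter cycles per tick. The sketch below shows
just that arithmetic. It assumes tick_length is in the (nanoseconds<<32)
units mentioned earlier and that cycle_interval is a plain cycle count for
the same interval; the function name is hypothetical, and this is standalone
arithmetic rather than kernel code.

```c
#include <stdint.h>

/* Intended counter period in nanoseconds scaled by 2^32 per cycle.
 *
 * tick_length:    intended length of one tick, ns << 32 (assumed units)
 * cycle_interval: counter cycles in one tick
 *
 * Unlike the integer 'mult' applied on any single tick, this value does
 * not flip by one from tick to tick; it reflects the frequency the
 * dithering is trying to achieve on average. Returns 0 if
 * cycle_interval is 0. */
static uint64_t intended_period_ns_shift32(uint64_t tick_length,
					   uint64_t cycle_interval)
{
	if (cycle_interval == 0)
		return 0;
	return tick_length / cycle_interval;
}
```

For a 4 ms tick on a 3 GHz counter (12,000,000 cycles per tick) this gives
one third of a nanosecond per cycle, i.e. 1431655765 in 2^-32 ns units.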
But other than that, I'm prepared to consider the whole of the
vmclock_host export part as a straw man, and entirely happy to
completely reimplement it however you like, if you have strong
opinions. I just needed to get *something* implemented and working, as
a starting point.