As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
force masterclock update on vCPU hotplug"), each unnecessary
KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
Although that commit addressed the kvm-clock drift issue during vCPU
hotplug, there are still unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests
during live migration on the target host.
The patchset below was authored by David Woodhouse. Two of the patches aim
to avoid unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests.
[RFC PATCH v3 00/21] Cleaning up the KVM clock mess
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
The current patchset has three patches.
PATCH 1 is a partial copy of "[RFC PATCH v3 10/21] KVM: x86: Fix software
TSC upscaling in kvm_update_guest_time()", as Sean suggested, "Please do
this in a separate patch. There's no need to squeeze it in here, and this
change is complex/subtle enough as it is.", and David's authorship is
preserved.
PATCH 2 clears unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests at the end of
KVM_SET_CLOCK, if the masterclock is already active.
PATCH 3 avoids unnecessary updates of ka->master_kernel_ns and
ka->master_cycle_now in pvclock_update_vm_gtod_copy(), if the masterclock is
already active and will remain active.
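For reference, here is a rough sketch of the idea behind PATCH 2 (not the
actual patch; the helper name is purely illustrative), assuming it runs at
the end of kvm_vm_ioctl_set_clock() when the masterclock stayed active:

	/*
	 * Illustrative sketch only: if KVM_SET_CLOCK left the masterclock
	 * active, per-vCPU masterclock update requests queued earlier
	 * (e.g. by TSC writes) are redundant and can be dropped.
	 */
	static void kvm_drop_stale_masterclock_requests(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu;
		unsigned long i;

		if (!kvm->arch.use_master_clock)
			return;

		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
	}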
David Woodhouse (1):
KVM: x86: Fix compute_guest_tsc() to cope with negative delta
Dongli Zhang (2):
KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK
KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()
arch/x86/kvm/x86.c | 73 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 52 insertions(+), 21 deletions(-)
base-commit: 944aacb68baf7624ab8d277d0ebf07f025ca137c
Thank you very much!
Dongli Zhang
Hi David,
On 1/15/26 12:22 PM, Dongli Zhang wrote:
> As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> force masterclock update on vCPU hotplug"), each unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
>
> Although that commit addressed the kvm-clock drift issue during vCPU
> hotplug, there are still unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests
> during live migration on the target host.
>
> The patchset below was authored by David Woodhouse. Two of the patches aim
> to avoid unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests.
>
> [RFC PATCH v3 00/21] Cleaning up the KVM clock mess
> https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org/
>
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
>
> The current patchset has three patches.
>
> PATCH 1 is a partial copy of "[RFC PATCH v3 10/21] KVM: x86: Fix software
> TSC upscaling in kvm_update_guest_time()", as Sean suggested, "Please do
> this in a separate patch. There's no need to squeeze it in here, and this
> change is complex/subtle enough as it is.", and David's authorship is
> preserved.
>
Please let me know if this is inappropriate and whether I should have
confirmed with you before reusing your code from the patch below, with your
authorship preserved.
[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
The objective is to trigger a discussion on whether there is any quick,
short-term solution to mitigate the kvm-clock drift issue. We can also
resurrect your patchset.
I have some other work in QEMU userspace.
[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
The combination of changes in QEMU and this KVM patchset can make kvm-clock
drift during live migration very very trivial.
Thank you very much!
Dongli Zhang
On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>
> Please let me know if this is inappropriate and whether I should have
> confirmed with you before reusing your code from the patch below, with your
> authorship preserved.
>
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>
> The objective is to trigger a discussion on whether there is any quick,
> short-term solution to mitigate the kvm-clock drift issue. We can also
> resurrect your patchset.
>
> I have some other work in QEMU userspace.
>
> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>
> The combination of changes in QEMU and this KVM patchset can make kvm-clock
> drift during live migration very very trivial.
>
> Thank you very much!

Not at all inappropriate; thank you so much for updating it. I've been
meaning to do so but it's never made it back to the top of my list.

I don't believe that the existing KVM_SET_CLOCK is viable though. The
aim is that you should be able to create a new KVM on the same host and
set the kvmclock, and the contents of the pvclock that the new guest
sees should be *identical*. Not just 'close'.

I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
feasible, so I'd very much prefer that any resurrection of this series
should include that, even if some of the other patches are dropped for
now.

Thanks again.
Hi David,
On 1/15/26 1:13 PM, David Woodhouse wrote:
> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>
>> Please let me know if this is inappropriate and whether I should have
>> confirmed with you before reusing your code from the patch below, with your
>> authorship preserved.
>>
>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>
>> The objective is to trigger a discussion on whether there is any quick,
>> short-term solution to mitigate the kvm-clock drift issue. We can also
>> resurrect your patchset.
>>
>> I have some other work in QEMU userspace.
>>
>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>
>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>> drift during live migration very very trivial.
>>
>> Thank you very much!
>
> Not at all inappropriate; thank you so much for updating it. I've been
> meaning to do so but it's never made it back to the top of my list.
>
> I don't believe that the existing KVM_SET_CLOCK is viable though. The
> aim is that you should be able to create a new KVM on the same host and
> set the kvmclock, and the contents of the pvclock that the new guest
> sees should be *identical*. Not just 'close'.
>
> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
> feasible, so I'd very much prefer that any resurrection of this series
> should include that, even if some of the other patches are dropped for
> now.
>
> Thanks again.
Thank you very much for the feedback.
The issue addressed by this patchset cannot be resolved only by
KVM_[GS]ET_CLOCK_GUEST.
The problem I am trying to solve is avoiding unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
clock updates - typically proportional to the number of vCPUs.
As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected kvm-clock
forward/backward drift.
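For context, this is roughly how each pending request becomes a masterclock
update once a vCPU is unpaused and re-enters the guest (paraphrased from the
request handling in vcpu_enter_guest() in arch/x86/kvm/x86.c), so N vCPUs
with stale requests can re-snapshot the masterclock up to N times:

	/* Paraphrased from vcpu_enter_guest(): every vCPU that still has
	 * KVM_REQ_MASTERCLOCK_UPDATE pending performs a full masterclock
	 * update on its next entry. */
	if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
		kvm_update_masterclock(vcpu->kvm);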
Therefore, rather than KVM_[GS]ET_CLOCK_GUEST, this patchset is more closely
related to two other patches of yours, which define a new policy to minimize
KVM_REQ_MASTERCLOCK_UPDATE.
[RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
[RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
from each other
Consider the combination of QEMU and KVM. The following details explain the
problems I am trying to address.
(Assuming TSC scaling is *inactive*)
## Problem 1. Account the live migration downtimes into kvm-clock and guest_tsc.
So far, QEMU/KVM live migration does not account for all of the elapsed blackout downtime.
For example, if a guest is live-migrated to a file, left idle for one hour, and
then restored from that file to the target host, the one-hour blackout period
will not be reflected in the kvm-clock or guest TSC.
This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU. I have sent a QEMU patch (and just received your feedback on that thread).
[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
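Roughly, the userspace side of the idea looks like the sketch below (an
illustration only, not the actual QEMU patch; src_vm_fd/dst_vm_fd are
hypothetical names). KVM_GET_CLOCK with KVM_CLOCK_REALTIME reports the
kvm-clock and the host realtime sampled at the same instant, and KVM_SET_CLOCK
with the same flag makes KVM add the elapsed realtime (i.e. the blackout) back
into the kvm-clock:

	/* Requires <linux/kvm.h> and <sys/ioctl.h>. */
	struct kvm_clock_data data;

	/* Source host, while the guest is paused: save the kvm-clock plus
	 * the host realtime captured at the same instant. */
	ioctl(src_vm_fd, KVM_GET_CLOCK, &data);

	/* Target host, before resuming: KVM compares data.realtime with the
	 * current realtime and adds the difference to the restored clock. */
	if (data.flags & KVM_CLOCK_REALTIME) {
		data.flags = KVM_CLOCK_REALTIME;
		ioctl(dst_vm_fd, KVM_SET_CLOCK, &data);
	}

The guest TSC would be advanced in a similar way, e.g. via the
KVM_VCPU_TSC_OFFSET attribute of the KVM_VCPU_TSC_CTRL group.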
## Problem 2. The kvm-clock drifts due to changes in the PVTI data.
Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
migration the amount of drift is not determined by the time elapsed between two
masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
not stopped or resumed at the same point in time.
For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
guest_clock on the source host. This is effectively equivalent to stopping their
counters. However, they are not stopped simultaneously: guest_tsc stops at time
point P1, while guest_clock stops at time point P2.
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
... ...
- kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_GET_CLOCK ===> P2
On the target host, QEMU restores the saved values using MSR_IA32_TSC and
KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
guest_clock resumes counting at time point P4.
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
... ...
- kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
- KVM_SET_CLOCK ====> P4
Therefore, below are the equations I use to calculate the expected kvm-clock drift.
T1_ns  = P2 - P1 (nanoseconds)
T2_tsc = P4 - P3 (cycles)
T2_ns  = pvclock_scale_delta(T2_tsc,
                             hv_clock_src.tsc_to_system_mul,
                             hv_clock_src.tsc_shift)

if (T2_ns > T1_ns)
        backward drift: T2_ns - T1_ns
else if (T1_ns > T2_ns)
        forward drift: T1_ns - T2_ns
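The same calculation in plain C, for illustration only (scale_delta() here is
a user-space re-implementation of the kernel's pvclock_scale_delta() from
arch/x86/include/asm/pvclock.h, using 128-bit arithmetic):

	#include <stdint.h>

	/* 32.32 fixed-point scaling of a TSC delta into nanoseconds, as
	 * pvclock_scale_delta() does: apply tsc_shift first, then multiply
	 * by tsc_to_system_mul and keep the upper 64 bits of the product. */
	static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int8_t shift)
	{
		if (shift < 0)
			delta >>= -shift;
		else
			delta <<= shift;
		return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
	}

	/* t1_ns = P2 - P1, t2_tsc = P4 - P3; a positive result is a backward
	 * drift of the kvm-clock, a negative result is a forward drift. */
	static int64_t expected_drift_ns(uint64_t t1_ns, uint64_t t2_tsc,
					 uint32_t mul_frac, int8_t shift)
	{
		return (int64_t)(scale_delta(t2_tsc, mul_frac, shift) - t1_ns);
	}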
To fix this issue, ideally both guest_tsc and guest_clock should be stopped and
resumed at exactly the same time.
As you mentioned in the QEMU patch, "the kvmclock should be a fixed relationship
from the guest's TSC which doesn't change for the whole lifetime of the guest."
Fortunately, taking advantage of KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
QEMU can achieve the same goal.
[PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
## Problem 3. Unfortunately, unnecessary KVM_REQ_MASTERCLOCK_UPDATE requests are
being triggered for the vCPUs.
During kvm_synchronize_tsc() or kvm_arch_tsc_set_attr(KVM_VCPU_TSC_OFFSET),
KVM_REQ_MASTERCLOCK_UPDATE requests may be set either before or after KVM_SET_CLOCK.
As a result, once all vCPUs are unpaused, these unnecessary
KVM_REQ_MASTERCLOCK_UPDATE requests can lead to kvm-clock drift.
Indeed, PATCH 1 and PATCH 3 from this patch set are, by themselves, sufficient
to mitigate the issue.
With the above changes in both QEMU and KVM, a same-host live migration of a
4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
results in only about 4 nanoseconds of backward drift in my test environment.
We may even be able to make further improvements in QEMU to rule out the
remaining 4 nanoseconds.
old_clock->tsc_timestamp = 32041800585
old_clock->system_time = 3639151
old_clock->tsc_to_system_mul = 3186238974
old_clock->tsc_shift = -1
new_clock->tsc_timestamp = 213016088950
new_clock->system_time = 67131895453
new_clock->tsc_to_system_mul = 3186238974
new_clock->tsc_shift = -1
If I do not introduce the ~10 seconds of downtime on purpose during live
migration, the drift is always 0 nanoseconds.
I introduce downtime on purpose by stopping the target QEMU before live
migration. The target QEMU will not resume until the 'cont' command is issued in
the QEMU monitor.
Regarding the goal, I would appreciate any quick solution (even a short-term
one) or half-measure that helps to:
- Account for live migration downtime.
- Minimize kvm-clock drift (especially backward).
Thank you very much!
Dongli Zhang
On Fri, 2026-01-16 at 01:31 -0800, Dongli Zhang wrote:
>
> With the above changes in both QEMU and KVM, a same-host live migration of a
> 4-vCPU VM with approximately 10 seconds of downtime (introduced on purpose)
> results in only about 4 nanoseconds of backward drift in my test environment.
> We may even be able to make further improvements in QEMU to rule out the
> remaining 4 nanoseconds.

On the same host, even with TSC scaling, there is no excuse for *any*
errors on live migration.

The *offset* of the host → guest TSC should remain precisely the same. And
the calculation of KVM clock from guest TSC should remain precisely the
same. Absolutely *zero* error.

Don't bother with "improvements" which still don't get that right.
On 1/16/26 1:31 AM, Dongli Zhang wrote:
> Hi David,
>
> On 1/15/26 1:13 PM, David Woodhouse wrote:
>> On Thu, 2026-01-15 at 12:37 -0800, Dongli Zhang wrote:
>>>
>>> Please let me know if this is inappropriate and whether I should have
>>> confirmed with you before reusing your code from the patch below, with your
>>> authorship preserved.
>>>
>>> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
>>> https://lore.kernel.org/all/20240522001817.619072-11-dwmw2@infradead.org/
>>>
>>> The objective is to trigger a discussion on whether there is any quick,
>>> short-term solution to mitigate the kvm-clock drift issue. We can also
>>> resurrect your patchset.
>>>
>>> I have some other work in QEMU userspace.
>>>
>>> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
>>> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>>>
>>> The combination of changes in QEMU and this KVM patchset can make kvm-clock
>>> drift during live migration very very trivial.
>>>
>>> Thank you very much!
>>
>> Not at all inappropriate; thank you so much for updating it. I've been
>> meaning to do so but it's never made it back to the top of my list.
>>
>> I don't believe that the existing KVM_SET_CLOCK is viable though. The
>> aim is that you should be able to create a new KVM on the same host and
>> set the kvmclock, and the contents of the pvclock that the new guest
>> sees should be *identical*. Not just 'close'.
>>
>> I believe we need Jack's KVM_[GS]ET_CLOCK_GUEST for that to be
>> feasible, so I'd very much prefer that any resurrection of this series
>> should include that, even if some of the other patches are dropped for
>> now.
>>
>> Thanks again.
>
> Thank you very much for the feedback.
>
> The issue addressed by this patchset cannot be resolved only by
> KVM_[GS]ET_CLOCK_GUEST.
>
> The problem I am trying to solve is avoiding unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE requests. Even when using KVM_[GS]ET_CLOCK_GUEST, if
> vCPUs already have pending KVM_REQ_MASTERCLOCK_UPDATE requests, unpausing the
> vCPUs from the host userspace VMM (i.e., QEMU) can still trigger multiple master
> clock updates - typically proportional to the number of vCPUs.
>
> As we know, each KVM_REQ_MASTERCLOCK_UPDATE can cause unexpected kvm-clock
> forward/backward drift.
>
> Therefore, rather than KVM_[GS]ET_CLOCK_GUEST, this patchset is more closely
> related to two other patches of yours, which define a new policy to minimize
> KVM_REQ_MASTERCLOCK_UPDATE.
>
> [RFC PATCH v3 10/21] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
> [RFC PATCH v3 15/21] KVM: x86: Allow KVM master clock mode when TSCs are offset
> from each other
>
>
> Consider the combination of QEMU and KVM. The following details explain the
> problems I am trying to address.
>
> (Assuming TSC scaling is *inactive*)
>
>
> ## Problem 1. Account the live migration downtimes into kvm-clock and guest_tsc.
>
> So far, QEMU/KVM live migration does not account for all of the elapsed blackout downtime.
> For example, if a guest is live-migrated to a file, left idle for one hour, and
> then restored from that file to the target host, the one-hour blackout period
> will not be reflected in the kvm-clock or guest TSC.
>
> This can be resolved by leveraging KVM_VCPU_TSC_CTRL and KVM_CLOCK_REALTIME in
> QEMU. I have sent a QEMU patch (and just received your feedback on that thread).
>
> [PATCH 1/1] target/i386/kvm: account blackout downtime for kvm-clock and guest TSC
> https://lore.kernel.org/qemu-devel/20251009095831.46297-1-dongli.zhang@oracle.com/
>
>
> ## Problem 2. The kvm-clock drifts due to changes in the PVTI data.
>
> Unlike the previous vCPU hotplug-related kvm-clock drift issue, during live
> migration the amount of drift is not determined by the time elapsed between two
> masterclock updates. Instead, it occurs because guest_clock and guest_tsc are
> not stopped or resumed at the same point in time.
>
> For example, MSR_IA32_TSC and KVM_GET_CLOCK are used to save guest_tsc and
> guest_clock on the source host. This is effectively equivalent to stopping their
> counters. However, they are not stopped simultaneously: guest_tsc stops at time
> point P1, while guest_clock stops at time point P2.
>
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
> ... ...
> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_GET_CLOCK ===> P2
>
> On the target host, QEMU restores the saved values using MSR_IA32_TSC and
> KVM_SET_CLOCK. As a result, guest_tsc resumes counting at time point P3, while
> guest_clock resumes counting at time point P4.
>
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
> ... ...
> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
> - KVM_SET_CLOCK ====> P4
>
>
> Therefore, below are the equations I use to calculate the expected kvm-clock drift.
>
> T1_ns = P2 - P1 (nanoseconds)
> T2_tsc = P4 - P3 (cycles)
> T2_ns = pvclock_scale_delta(T2_tsc,
> hv_clock_src.tsc_to_system_mul,
> hv_clock_src.tsc_shift)
>
> if (T2_ns > T1_ns)
> backward drift: T2_ns - T1_ns
> else if (T1_ns > T2_ns)
> forward drift: T1_ns - T2_ns
Here are more details to explain how the kvm-clock drift during live migration
can be predicted.
As we know, when the masterclock is active, the PVTI is a snapshot taken at a
specific point in time that serves as the base for calculating the kvm-clock.
struct pvclock_vcpu_time_info {
	u32   version;
	u32   pad0;
	u64   tsc_timestamp;
	u64   system_time;
	u32   tsc_to_system_mul;
	s8    tsc_shift;
	u8    flags;
	u8    pad[2];
} __attribute__((__packed__)); /* 32 bytes */
PVTI->tsc_timestamp is the guest TSC when the PVTI is snapshotted.
PVTI->system_time is the kvm-clock when the PVTI is snapshotted.
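For reference, the guest computes the kvm-clock from this snapshot roughly as
follows (a simplified restatement of __pvclock_read_cycles(), ignoring the
version/seqlock handling):

	/* kvm-clock = snapshotted system_time plus the scaled number of
	 * guest TSC cycles elapsed since the snapshot was taken. */
	u64 kvmclock_ns = pvti->system_time +
			  pvclock_scale_delta(rdtsc() - pvti->tsc_timestamp,
					      pvti->tsc_to_system_mul,
					      pvti->tsc_shift);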
Ideally, the data in the PVTI remains unchanged.
However, let's assume the PVTI data changes *every nanosecond*.
As you mentioned, "the kvmclock should be a fixed relationship from the guest's
TSC which doesn't change for the whole lifetime of the guest."
We expect both PVTI->tsc_timestamp and PVTI->system_time to increment at the
same speed.
For instance, after GT_0 guest TSC cycles ...
PVTI->tsc_timestamp += GT_0
PVTI->system_time += pvclock_scale_delta(GT_0)
... after another GT_1 guest TSC cycles ...
PVTI->tsc_timestamp += GT_1
PVTI->system_time += pvclock_scale_delta(GT_1)
... ...
... ...
... after another GT_N guest TSC cycles ...
PVTI->tsc_timestamp += GT_N
PVTI->system_time += pvclock_scale_delta(GT_N)
However, in QEMU, the guest TSC and kvm-clock are not stopped or resumed at the
same time.
P2 − P1 is the number of nanoseconds by which PVTI->system_time increments after
PVTI->tsc_timestamp stops incrementing.
P4 − P3 is the number of guest TSC cycles (assuming TSC scaling is inactive) by
which PVTI->tsc_timestamp increments while PVTI->system_time remains stopped
until P4.
Therefore, if (P4 − P3), once scaled to nanoseconds, is greater than (P2 − P1),
PVTI->tsc_timestamp moves forward more than PVTI->system_time, which causes the
kvm-clock to go backward.
Thank you very much!
Dongli Zhang