Hi,
Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
elapsed time during downtime blackout?
1. guest_clock=T1, realtime=R1.
2. (qemu) stop
3. Wait for several seconds.
4. (qemu) cont
5. guest_clock=T2, realtime=R2.
Should (T1 == T2), or (R2 - R1 == T2 - T1)?
For instance, suppose the guest clocksource is 'tsc'. It keeps incrementing
during the QEMU downtime blackout.
[root@vm ~]# while true; do date; sleep 1; done
Tue Sep 9 15:28:37 PDT 2025
Tue Sep 9 15:28:38 PDT 2025
Tue Sep 9 15:28:39 PDT 2025
Tue Sep 9 15:28:40 PDT 2025
Tue Sep 9 15:28:41 PDT 2025
Tue Sep 9 15:28:42 PDT 2025
Tue Sep 9 15:28:43 PDT 2025 ===> (qemu) stop, wait for 14 seconds.
---> 14 seconds!
Tue Sep 9 15:28:57 PDT 2025 ===> (qemu) cont
Tue Sep 9 15:28:58 PDT 2025
Tue Sep 9 15:28:59 PDT 2025
Tue Sep 9 15:29:00 PDT 2025
Tue Sep 9 15:29:01 PDT 2025
However, 'kvm-clock' stops incrementing during the blackout.
[root@vm ~]# while true; do date; sleep 1; done
Tue Sep 9 15:35:59 PDT 2025
Tue Sep 9 15:36:00 PDT 2025
Tue Sep 9 15:36:01 PDT 2025
Tue Sep 9 15:36:02 PDT 2025
Tue Sep 9 15:36:03 PDT 2025 ===> (qemu) stop, wait for many seconds.
---> No gap!
Tue Sep 9 15:36:04 PDT 2025 ===> (qemu) cont
Tue Sep 9 15:36:05 PDT 2025
Tue Sep 9 15:36:06 PDT 2025
Tue Sep 9 15:36:07 PDT 2025
Tue Sep 9 15:36:08 PDT 2025
Tue Sep 9 15:36:09 PDT 2025
Tue Sep 9 15:36:10 PDT 2025
Tue Sep 9 15:36:11 PDT 2025
Tue Sep 9 15:36:12 PDT 2025
There are many use cases that can involve a long or short downtime blackout.
- stop/cont
- savevm/loadvm
- live migration, especially from/to a file.
- dump-guest-memory
- cpr?
KVM already exposes 'KVM_CLOCK_REALTIME' and 'KVM_VCPU_TSC_OFFSET' to help
account for all elapsed time.
https://lore.kernel.org/all/20210916181538.968978-1-oupton@google.com/
This is a prototype to demonstrate how QEMU can count elapsed downtime by taking
advantage of 'KVM_CLOCK_REALTIME'.
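The intended effect can be modelled in a few lines of C. This is a minimal sketch of the arithmetic only, not the QEMU or KVM implementation: 'struct sketch_clock_data', 'SKETCH_CLOCK_REALTIME' and 'sketch_clock_after_restore()' are hypothetical stand-ins for 'struct kvm_clock_data', the 'KVM_CLOCK_REALTIME' flag, and the clock value the guest would observe after KVM_SET_CLOCK. It assumes the restored clock is advanced by the realtime that elapsed since the save, and is left unchanged when the flag is absent or realtime did not move forward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local flag; stands in for KVM_CLOCK_REALTIME. */
#define SKETCH_CLOCK_REALTIME (1u << 0)

/* Hypothetical stand-in for the struct kvm_clock_data fields used here. */
struct sketch_clock_data {
    uint64_t clock;     /* kvm-clock value in ns at save time */
    uint32_t flags;     /* SKETCH_CLOCK_REALTIME if 'realtime' is valid */
    uint64_t realtime;  /* host realtime in ns at save time */
};

/*
 * Clock value expected after restore: the saved clock plus the realtime
 * that elapsed while the guest was not running. Falls back to the saved
 * value when no realtime reference is available or it went backwards.
 */
uint64_t sketch_clock_after_restore(const struct sketch_clock_data *saved,
                                    uint64_t now_realtime_ns)
{
    assert(saved != NULL);

    uint64_t clock = saved->clock;

    if ((saved->flags & SKETCH_CLOCK_REALTIME) &&
        now_realtime_ns > saved->realtime) {
        clock += now_realtime_ns - saved->realtime;
    }
    return clock;
}
```

With a clock of 1000 ns saved at realtime 5000 ns and restored at realtime 19000 ns, the sketch yields 15000 ns. Without the flag it yields 1000 ns, which corresponds to the "No gap!" behaviour in the kvm-clock transcript above.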
From b97a514ac227645010ce3d1012af3a4943413844 Mon Sep 17 00:00:00 2001
From: Dongli Zhang <dongli.zhang@oracle.com>
Date: Thu, 18 Sep 2025 14:59:42 -0700
Subject: [PATCH 1/1] target/i386/kvm: take advantage of KVM_CLOCK_REALTIME
The Linux kernel commit c68dc1b577ea ("KVM: x86: Report host tsc and
realtime values in KVM_GET_CLOCK") introduced 'realtime' field and
KVM_CLOCK_REALTIME.
The 'realtime' value is saved through KVM_GET_CLOCK and restored via
KVM_SET_CLOCK. This enables the KVM clock to advance by the amount of
realtime that elapsed during downtime in operations such as live migration,
stop/cont, and savevm/loadvm.
This patch/feature allows QEMU to take advantage of KVM_CLOCK_REALTIME.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
hw/i386/kvm/clock.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index f56382717f..906346ce2f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -38,6 +38,8 @@ struct KVMClockState {
     /*< public >*/
 
     uint64_t clock;
+    uint64_t realtime;
+    uint32_t flags;
     bool clock_valid;
 
     /* whether the 'clock' value was obtained in the 'paused' state */
@@ -107,7 +109,10 @@ static void kvm_update_clock(KVMClockState *s)
         fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(-ret));
         abort();
     }
+
     s->clock = data.clock;
+    s->flags = data.flags & KVM_CLOCK_REALTIME;
+    s->realtime = data.realtime;
 
     /* If kvm_has_adjust_clock_stable() is false, KVM_GET_CLOCK returns
      * essentially CLOCK_MONOTONIC plus a guest-specific adjustment. This
@@ -186,6 +191,11 @@ static void kvmclock_vm_state_change(void *opaque, bool running,
         s->clock_valid = false;
 
         data.clock = s->clock;
+        if (s->flags & KVM_CLOCK_REALTIME) {
+            data.flags = s->flags;
+            data.realtime = s->realtime;
+        }
+
         ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
         if (ret < 0) {
             fprintf(stderr, "KVM_SET_CLOCK failed: %s\n", strerror(-ret));
@@ -259,6 +269,7 @@ static int kvmclock_pre_load(void *opaque)
     KVMClockState *s = opaque;
 
     s->clock_is_reliable = false;
+    s->flags = 0;
 
     return 0;
 }
@@ -290,12 +301,14 @@ static int kvmclock_pre_save(void *opaque)
 
 static const VMStateDescription kvmclock_vmsd = {
     .name = "kvmclock",
-    .version_id = 1,
+    .version_id = 2,
     .minimum_version_id = 1,
     .pre_load = kvmclock_pre_load,
     .pre_save = kvmclock_pre_save,
     .fields = (const VMStateField[]) {
         VMSTATE_UINT64(clock, KVMClockState),
+        VMSTATE_UINT64(realtime, KVMClockState),
+        VMSTATE_UINT32(flags, KVMClockState),
         VMSTATE_END_OF_LIST()
     },
     .subsections = (const VMStateDescription * const []) {
--
2.39.3
Taking advantage of 'KVM_VCPU_TSC_OFFSET' as well could further improve 'guest_tsc'.
Any suggestion on whether kvm-clock/guest_tsc should stop/continue counting
during the blackout? Any expectation or requirement by QEMU?
Thank you very much!
Dongli Zhang
On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
> Hi,
>
> Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
> elapsed time during downtime blackout?
>
> 1. guest_clock=T1, realtime=R1.
> 2. (qemu) stop
> 3. Wait for several seconds.
> 4. (qemu) cont
> 5. guest_clock=T2, realtime=R2.
>
> Should (T1 == T2), or (R2 - R1 == T2 - T1)?

Neither.

Realtime is something completely different and runs at a different rate
to the monotonic clock. In fact its rate compared to the monotonic
clock (and the TSC) is *variable* as NTP guides it.

In your example of stopping and continuing on the *same* host, the
guest TSC *offset* from the host's TSC should remain the same.

And the *precise* mathematical relationship that KVM advertises to the
guest as "how to turn a TSC value into nanoseconds since boot" should
also remain precisely the same.

KVM already lets you restore the TSC correctly. To restore KVM clock
correctly, you want something like KVM_SET_CLOCK_GUEST from
https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/

For cross machine migration, you *do* need to use a realtime clock
reference as that's the best you have (make sure you use TAI not UTC
and don't get affected by leap seconds or smearing). Use that to
restore the *TSC* as well as you can to make it appear to have kept
running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
the same host.

And use vmclock to advertise the wallclock time to the guest as
precisely as possible, even the cycle after a live migration.
Hi David,

Thank you very much for quick reply!

On 9/22/25 9:58 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
>> Hi,
>>
>> Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
>> elapsed time during downtime blackout?
>>
>> 1. guest_clock=T1, realtime=R1.
>> 2. (qemu) stop
>> 3. Wait for several seconds.
>> 4. (qemu) cont
>> 5. guest_clock=T2, realtime=R2.
>>
>> Should (T1 == T2), or (R2 - R1 == T2 - T1)?
>
> Neither.
>
> Realtime is something completely different and runs at a different rate
> to the monotonic clock. In fact its rate compared to the monotonic
> clock (and the TSC) is *variable* as NTP guides it.
>
> In your example of stopping and continuing on the *same* host, the
> guest TSC *offset* from the host's TSC should remain the same.
>
> And the *precise* mathematical relationship that KVM advertises to the
> guest as "how to turn a TSC value into nanoseconds since boot" should
> also remain precisely the same.

Does that mean:

Regarding "stop/cont" scenario, both kvm-clock and guest_tsc value should remain
the same, i.e.,

1. When "stop", kvm-clock=K1, guest_tsc=T1.
2. Suppose many hours passed.
3. When "cont", guest VM should see kvm-clock==K1 and guest_tsc==T1, by
refreshing both PVTI and tsc_offset at KVM.

As demonstrated in my test, currently guest_tsc doesn't stop counting during
blackout because of the lack of "MSR_IA32_TSC put" at
kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
fix it.

BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
kvm-clock before continuing the guest VM.

> KVM already lets you restore the TSC correctly. To restore KVM clock
> correctly, you want something like KVM_SET_CLOCK_GUEST from
> https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
>
> For cross machine migration, you *do* need to use a realtime clock
> reference as that's the best you have (make sure you use TAI not UTC
> and don't get affected by leap seconds or smearing). Use that to
> restore the *TSC* as well as you can to make it appear to have kept
> running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
> the same host.

Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
temporarily stop/cont the source/target VM.

Would you mean we expect something different for live migration, i.e.,

1. Live Migrate a source VM to a file.
2. Copy the file to another server.
3. Wait for 1 hour.
4. Migrate from the file to target VM.

Although it is equivalent to a one-hour downtime, we do need to count the
missing one-hour, correct?

That means: we have different expectations from stop/cont and live migration.

- Live Migration: any downtime should be counted with the help from realtime.
- stop/cont (savevm/loadvm): the value of kvm-clock/rdtsc should remain the same.

> And use vmclock to advertise the wallclock time to the guest as
> precisely as possible, even the cycle after a live migration.

Thank you very much for suggestion on KVM_SET_CLOCK_GUEST and vmclock!

Dongli Zhang
On Mon, 2025-09-22 at 10:31 -0700, Dongli Zhang wrote:
> On 9/22/25 9:58 AM, David Woodhouse wrote:
> > [snip]
> >
> > In your example of stopping and continuing on the *same* host, the
> > guest TSC *offset* from the host's TSC should remain the same.
> >
> > And the *precise* mathematical relationship that KVM advertises to the
> > guest as "how to turn a TSC value into nanoseconds since boot" should
> > also remain precisely the same.
>
> Does that mean:
>
> Regarding "stop/cont" scenario, both kvm-clock and guest_tsc value should remain
> the same, i.e.,
>
> 1. When "stop", kvm-clock=K1, guest_tsc=T1.
> 2. Suppose many hours passed.
> 3. When "cont", guest VM should see kvm-clock==K1 and guest_tsc==T1, by
> refreshing both PVTI and tsc_offset at KVM.

Assuming a modern host where the TSC just counts sanely at a consistent
rate and never deviates....

No. The PVTI should basically *never* change. Whatever the estimated
(not NTP skewed) frequency of the TSC is believed to be, the KVM clock
PVTI should indicate that at boot, telling the guest how to convert a
TSC value into 'monotonic nanoseconds since boot'. If it ever changes,
that's a KVM bug.

It should be saved and restored in precisely its native form, using the
KVM_[GS]ET_CLOCK_GUEST I referenced before. For both live update (same
host) and live migration (different host).

The TSC should also continue to count at exactly the same rate as the
host's TSC at all times. No breaks or discontinuities due to any kind
of 'steal time'. For live update that's easy as you just apply the same
*offset*. For live migration that's where you have to accept that it
depends on clock synchronization between your source and destination
hosts, which is probably based on realtime.

> [snip]
>
> Would you mean we expect something different for live migration, i.e.,
>
> 1. Live Migrate a source VM to a file.
> 2. Copy the file to another server.
> 3. Wait for 1 hour.
> 4. Migrate from the file to target VM.
>
> Although it is equivalent to a one-hour downtime, we do need to count the
> missing one-hour, correct?

I don't look at it as counting anything. The clock keeps running even
when I'm not looking at it. If I wake up and look at it again, there is
no 'counting' how long I was asleep...
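The 'precise mathematical relationship' between a TSC value and nanoseconds can be illustrated with the pvclock conversion. This is a minimal sketch, assuming the usual pvclock scheme of a shift followed by a 32.32 fixed-point multiply. The field names follow the pvclock time-info layout, but 'struct sketch_pvti' and 'sketch_pvclock_ns()' are hypothetical and carry only the fields needed for the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the per-vCPU pvclock time info (PVTI). */
struct sketch_pvti {
    uint64_t tsc_timestamp;      /* guest TSC at the reference point */
    uint64_t system_time;        /* nanoseconds at the reference point */
    uint32_t tsc_to_system_mul;  /* 32-bit fixed-point multiplier */
    int8_t   tsc_shift;          /* pre-scale applied to the TSC delta */
};

/* Convert a guest TSC reading into nanoseconds using the PVTI. */
uint64_t sketch_pvclock_ns(const struct sketch_pvti *p, uint64_t guest_tsc)
{
    assert(p->tsc_shift > -64 && p->tsc_shift < 64);

    uint64_t delta = guest_tsc - p->tsc_timestamp;

    if (p->tsc_shift < 0) {
        delta >>= -p->tsc_shift;
    } else {
        delta <<= p->tsc_shift;
    }

    /* (delta * mul) >> 32, split so the product never exceeds 64 bits. */
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;

    return p->system_time + hi * p->tsc_to_system_mul +
           ((lo * p->tsc_to_system_mul) >> 32);
}
```

For a 2 GHz TSC (multiplier 0x80000000, shift 0), a delta of 2,000,000,000 ticks maps to 1,000,000,000 ns. As long as these four values are restored bit for bit, the same TSC reading always converts to the same nanosecond value, which is the property argued for above.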
On 9/22/25 11:16 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 10:31 -0700, Dongli Zhang wrote:

[snip]

> Assuming a modern host where the TSC just counts sanely at a consistent
> rate and never deviates....
>
> No. The PVTI should basically *never* change. Whatever the estimated
> (not NTP skewed) frequency of the TSC is believed to be, the KVM clock
> PVTI should indicate that at boot, telling the guest how to convert a
> TSC value into 'monotonic nanoseconds since boot'. If it ever changes,
> that's a KVM bug.
>
> It should be saved and restored in precisely its native form, using the
> KVM_[GS]ET_CLOCK_GUEST I referenced before. For both live update (same
> host) and live migration (different host).
>
> The TSC should also continue to count at exactly the same rate as the
> host's TSC at all times. No breaks or discontinuities due to any kind
> of 'steal time'. For live update that's easy as you just apply the same
> *offset*. For live migration that's where you have to accept that it
> depends on clock synchronization between your source and destination
> hosts, which is probably based on realtime.

That means:

- Utilize KVM_[GS]ET_CLOCK_GUEST to avoid forward/backward drift due to the
change in PVTI data structure (by adjusting 'ka->kvmclock_offset').

- Utilize realtime as reference to keep clock/tsc running.

[snip]

>> Although it is equivalent to a one-hour downtime, we do need to count the
>> missing one-hour, correct?
>
> I don't look at it as counting anything. The clock keeps running even
> when I'm not looking at it. If I wake up and look at it again, there is
> no 'counting' how long I was asleep...

That means:

- stop/cont: clock/tsc stop running
- savevm/loadvm: clock/tsc stop running
- any live migration: clock/tsc continue running (equivalent)
- any live update (including QEMU cpr): clock/tsc continue running (equivalent)

However, there is another scenario that we 'stop' target VM on purpose before
any live migration. The 'autostart' is disabled. After live migration, target VM
won't autostart automatically, unless we issue 'cont'.

I assume this is classified as "any live migration" scenario. We still need to
keep clock/tsc running.

Thank you very much!

Dongli Zhang
On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
> On 9/22/25 11:16 AM, David Woodhouse wrote:
> > [snip]
> >
> > The TSC should also continue to count at exactly the same rate as the
> > host's TSC at all times. No breaks or discontinuities due to any kind
> > of 'steal time'. For live update that's easy as you just apply the same
> > *offset*. For live migration that's where you have to accept that it
> > depends on clock synchronization between your source and destination
> > hosts, which is probably based on realtime.
>
> That means:
>
> - Utilize KVM_[GS]ET_CLOCK_GUEST to avoid forward/backward drift due to the
> change in PVTI data structure (by adjusting 'ka->kvmclock_offset').

Ultimately for modern hardware I think I'd like to kill
ka->kvmclock_offset entirely but yeah, that's how it works right now I
think.

> - Utilize realtime as reference to keep clock/tsc running.

Hm, I don't like talking about 'running' vs. 'stopped'. The clock should
always be running. You try to keep it as *stable* as possible, even
across live migration. And for live migration, realtime is probably the
best you have so it's what you're stuck with.

When the guest reads their TSC, they should always get a value which is
as *close* as possible to their *original* host's TSC, minus the delta
of what that host's TSC was when they were first started (ignoring
scaling).

> > > As demonstrated in my test, currently guest_tsc doesn't stop counting during
> > > blackout because of the lack of "MSR_IA32_TSC put" at
> > > kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> > > fix it.
> > >
> > > BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> > > kvm-clock before continuing the guest VM.

Yeah, right now it's probably just introducing errors for a stop/start
of the VM.

> [snip]
>
> That means:
>
> - stop/cont: clock/tsc stop running
> - savevm/loadvm: clock/tsc stop running

What does "stop running" even mean here? You can never stop the clock
running. The only thing you can do is change its offset so that it
jumps back to an earlier value, when you resume a VM?
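For the cross-host case, the TSC offset on the destination can be derived from a shared time reference. The following is a minimal sketch under stated assumptions: no TSC scaling (the guest frequency equals the destination host frequency), source and destination clocks are synchronized, and the reference is TAI in nanoseconds. 'struct sketch_tsc_snapshot' and both functions are hypothetical names, not QEMU or KVM interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record taken on the source host when the guest is paused. */
struct sketch_tsc_snapshot {
    uint64_t guest_tsc;  /* guest-visible TSC at that moment */
    uint64_t tai_ns;     /* source host TAI time at that moment, in ns */
    uint64_t tsc_hz;     /* guest TSC frequency in Hz */
};

/* TSC ticks in elapsed_ns, split to avoid overflowing 64 bits. */
uint64_t sketch_ticks_elapsed(uint64_t elapsed_ns, uint64_t tsc_hz)
{
    const uint64_t ns_per_sec = 1000000000ULL;
    uint64_t secs = elapsed_ns / ns_per_sec;
    uint64_t rem = elapsed_ns % ns_per_sec;

    return secs * tsc_hz + (rem * tsc_hz) / ns_per_sec;
}

/*
 * Offset to apply on the destination so that host_tsc + offset equals the
 * value the guest would have read had it kept running on the source.
 * Unsigned arithmetic wraps modulo 2^64, so a "negative" offset works too.
 */
uint64_t sketch_dest_tsc_offset(const struct sketch_tsc_snapshot *src,
                                uint64_t dst_tai_ns, uint64_t dst_host_tsc)
{
    assert(dst_tai_ns >= src->tai_ns);

    uint64_t expected = src->guest_tsc +
        sketch_ticks_elapsed(dst_tai_ns - src->tai_ns, src->tsc_hz);

    return expected - dst_host_tsc;
}
```

In the one-hour file migration example, nothing is 'counted': the guest TSC after restore is simply what the source TSC would have shown at that moment, within the accuracy of the clock synchronization between the hosts.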
On 9/23/25 9:26 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
>> On 9/22/25 11:16 AM, David Woodhouse wrote:

[snip]

>>>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>>>> blackout because of the lack of "MSR_IA32_TSC put" at
>>>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>>>> fix it.
>>>>
>>>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>>>> kvm-clock before continuing the guest VM.
>
> Yeah, right now it's probably just introducing errors for a stop/start
> of the VM.

But that help can meet the expectation?

Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
the VM is continued.

This ensures that the clock value itself does not change between stop and cont.

However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.

As a result, when execution resumes, the guest TSC suddenly jumps forward.

[snip]

>> That means:
>>
>> - stop/cont: clock/tsc stop running
>> - savevm/loadvm: clock/tsc stop running
>
> What does "stop running" even mean here? You can never stop the clock
> running. The only thing you can do is change its offset so that it
> jumps back to an earlier value, when you resume a VM?

Yes, I meant "change its offset so that it jumps back to an earlier value." From
the VM's perspective, this is equivalent to "the clock was stopped."

Could you help explain why we treat stop/cont differently from live migration?

The live migration/update is same as stop/cont because the blackout phase
involves stopping a guest, and continuing execution in a different/same host.
You technically stop the guest in a way that's not controlled by the guest
(compared to say hibernation or suspend-to-idle), and then you continue.

Part of the reason I think 'stop'/'cont' ought to have same behavior as live
migration.

Thank you very much!

Dongli Zhang
On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
> On 9/23/25 9:26 AM, David Woodhouse wrote:
> > [snip]
> >
> > Yeah, right now it's probably just introducing errors for a stop/start
> > of the VM.
>
> But that help can meet the expectation?
>
> Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
> KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
> the VM is continued.

It saves the actual *value* of the clock. I would prefer to phrase that
as "it makes the clock jump backwards to the time at which the guest
was paused".

> This ensures that the clock value itself does not change between stop and cont.
>
> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
>
> As a result, when execution resumes, the guest TSC suddenly jumps forward.

Oh wow, that seems really broken. If we're going to make it experience
a time warp, we should at least be *consistent*.

So a guest which uses the TSC for timekeeping should be mostly
unaffected by this and its wallclock should still be accurate. A guest
which uses the KVM clock will be hosed by it.

I think we should fix this so that the KVM clock is unaffected too.

> [snip]
>
> Yes, I meant "change its offset so that it jumps back to an earlier value." From
> the VM's perspective, this is equivalent to "the clock was stopped."

Yeah... let's stop doing that :)

> Could you help explain why we treat stop/cont differently from live migration?

We shouldn't.

Hardware companies spent *years* learning the lesson about clocks. That
they should just keep counting. At the same rate. Unconditionally. Even
if the CPU is running slower. Or stopped. Or whatever. Just count. Do
not stop counting.

Let's not repeat the same mistakes.
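The inconsistency discussed in this message can be put into numbers using the 14-second pause from the first transcript. This is a minimal sketch, assuming an unscaled TSC at a hypothetical 2 GHz, a TSC offset that is left unchanged across stop/cont, and a KVM_SET_CLOCK call that writes back exactly the value saved at stop. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible TSC for a given host TSC and offset (no scaling). */
uint64_t sketch_guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;
}

/* Nanoseconds a clocksource=tsc guest sees pass across stop/cont. */
uint64_t sketch_elapsed_via_tsc(uint64_t host_tsc_stop, uint64_t host_tsc_cont,
                                uint64_t tsc_offset, uint64_t tsc_hz)
{
    assert(tsc_hz != 0);

    uint64_t ticks = sketch_guest_tsc(host_tsc_cont, tsc_offset) -
                     sketch_guest_tsc(host_tsc_stop, tsc_offset);

    /* Whole seconds first, then the remainder, to avoid overflow. */
    return (ticks / tsc_hz) * 1000000000ULL +
           ((ticks % tsc_hz) * 1000000000ULL) / tsc_hz;
}

/* Nanoseconds a kvm-clock guest sees pass across stop/cont. */
uint64_t sketch_elapsed_via_kvmclock(uint64_t clock_saved_at_stop,
                                     uint64_t clock_set_at_cont)
{
    return clock_set_at_cont - clock_saved_at_stop;
}
```

With the host TSC advancing by 28,000,000,000 ticks during the pause, the TSC path reports 14,000,000,000 ns while the kvm-clock path reports 0 ns. The two clocksources therefore disagree by the full pause duration after a single stop/cont, matching the two transcripts above.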
On 9/23/25 10:47 AM, David Woodhouse wrote:
> On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:

[snip]

>> This ensures that the clock value itself does not change between stop and cont.
>>
>> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
>>
>> As a result, when execution resumes, the guest TSC suddenly jumps forward.
>
> Oh wow, that seems really broken. If we're going to make it experience
> a time warp, we should at least be *consistent*.
>
> So a guest which uses the TSC for timekeeping should be mostly
> unaffected by this and its wallclock should still be accurate. A guest
> which uses the KVM clock will be hosed by it.
>
> I think we should fix this so that the KVM clock is unaffected too.

From my understanding of your reply, the kvm-clock/tsc should always be adjusted
whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).

This applies to:

- stop / cont
- savevm / loadvm
- live migration
- cpr

It is a bug if the clock jumps backwards to the time at which the guest was paused.

The time elapsed while the VM is paused should always be accounted for and
reflected in kvm-clock/tsc once the VM resumes.

Thank you very much!

Dongli Zhang
On Wed, 2025-09-24 at 13:53 -0700, Dongli Zhang wrote:
>
>
> On 9/23/25 10:47 AM, David Woodhouse wrote:
> > On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
> > >
> > >
> > > On 9/23/25 9:26 AM, David Woodhouse wrote:
> > > > On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
> > > > > On 9/22/25 11:16 AM, David Woodhouse wrote:
> > >
> > > [snip]
> > >
> > > > > >
> > > > > > >
> > > > > > > As demonstrated in my test, currently guest_tsc doesn't stop counting during
> > > > > > > blackout because of the lack of "MSR_IA32_TSC put" at
> > > > > > > kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> > > > > > > fix it.
> > > > > > >
> > > > > > > BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> > > > > > > kvm-clock before continuing the guest VM.
> > > >
> > > > Yeah, right now it's probably just introducing errors for a stop/start
> > > > of the VM.
> > >
> > > But that help can meet the expectation?
> > >
> > > Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
> > > KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
> > > the VM is continued.
> >
> > It saves the actual *value* of the clock. I would prefer to phrase that
> > as "it makes the clock jump backwards to the time at which the guest
> > was paused".
> >
> > > This ensures that the clock value itself does not change between stop and cont.
> > >
> > > However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
> > >
> > > As a result, when execution resumes, the guest TSC suddenly jumps forward.
> >
> > Oh wow, that seems really broken. If we're going to make it experience
> > a time warp, we should at least be *consistent*.
> >
> > So a guest which uses the TSC for timekeeping should be mostly
> > unaffected by this and its wallclock should still be accurate. A guest
> > which uses the KVM clock will be hosed by it.
> >
> > I think we should fix this so that the KVM clock is unaffected too.
>
> From my understanding of your reply, the kvm-clock/tsc should always be adjusted
> whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).

I think I agree, except I still hate the way you use the word
'adjusted'.

If I look at my clock, and then go to sleep for a while and look at the
clock again, nobody *adjusts* it. It just keeps running.

That's the effect we should always strive for, and that's how we should
think about it and talk about it.

It's difficult to talk about clocks because what does it mean for a
clock to be "unchanged"? Does it mean that it should return the same
time value? Or that it should continue to count consistently? I would
argue that we should *always* use language which assumes the latter.

Turning to physics for a clumsy analogy, it's about the frame of
reference. We're all on a moving train. I look at you in the seat
opposite me, I go to sleep for a while, and I wake up and you're still
there. Nobody has "adjusted" your position to accommodate for the
movement of the train while I was asleep.

> This applies to:
>
> - stop / cont
> - savevm / loadvm
> - live migration
> - cpr
>
> It is a bug if the clock jumps backwards to the time at which the guest was paused.
>
> The time elapsed while the VM is paused should always be accounted for and
> reflected in kvm-clock/tsc once the VM resumes.

In particular, in *all* but the live migration case, there should be
basically nothing to do. No addition, no subtraction. Only restoring
the *existing* relationships, precisely as they were before. That is
the TSC *offset* value, and the precise TSC→kvmclock parameters, all
bitwise *exactly* the same as before.

And the only thing that changes on live migration is that you have to
set the TSC offset such that the guest sees the values it *would* have
seen on the original host at any given moment in time... and doesn't
know it was kidnapped and moved onto a different train while it was
sleeping...?
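
The "restore the existing relationships, only recompute the TSC offset on migration" idea can be sketched as arithmetic. This is a rough model under stated assumptions, not KVM's implementation: the `struct pvti` subset and all three function names are hypothetical, TSC scaling is ignored (guest TSC = host TSC + offset), the two hosts' realtime clocks are assumed to be synchronized, and the destination timestamp is assumed not to precede the source one. The kvmclock read uses the usual pvclock formula (shift the TSC delta, multiply by a 32.32 fixed-point factor, add system_time). The parameters are never modified, so kvmclock follows the guest TSC exactly as before.

```c
#include <stdint.h>

/* Subset of the pvclock time info that matters for this sketch. */
struct pvti {
    uint64_t tsc_timestamp;     /* guest TSC when system_time was sampled */
    uint64_t system_time;       /* kvmclock ns at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns per (shifted) tick */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
};

/* kvmclock as the guest computes it from its TSC and unchanged parameters. */
uint64_t pvclock_read_ns(const struct pvti *p, uint64_t guest_tsc)
{
    uint64_t delta = guest_tsc - p->tsc_timestamp;
    uint64_t hi, lo;

    if (p->tsc_shift >= 0) {
        delta <<= p->tsc_shift;
    } else {
        delta >>= -p->tsc_shift;
    }
    /* (delta * mul) >> 32, split so the product cannot overflow 64 bits */
    hi = delta >> 32;
    lo = delta & 0xffffffffu;
    return p->system_time + hi * p->tsc_to_system_mul
           + ((lo * p->tsc_to_system_mul) >> 32);
}

/* Guest view of the TSC, ignoring scaling. */
uint64_t guest_tsc_from_host(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;
}

/* Offset for the destination host so that the guest TSC reads what it
 * would have read on the source at the same moment in realtime. */
uint64_t migration_tsc_offset(uint64_t src_guest_tsc, uint64_t src_realtime_ns,
                              uint64_t dst_realtime_ns, uint64_t dst_host_tsc,
                              uint64_t guest_tsc_hz)
{
    uint64_t elapsed_ns = dst_realtime_ns - src_realtime_ns;
    /* whole seconds and remainder handled separately to avoid overflow */
    uint64_t ticks = (elapsed_ns / 1000000000ULL) * guest_tsc_hz
                   + (elapsed_ns % 1000000000ULL) * guest_tsc_hz / 1000000000ULL;

    return src_guest_tsc + ticks - dst_host_tsc; /* wraps modulo 2^64 */
}
```

For stop/cont on the same host nothing in this model changes at all: the offset and the `struct pvti` contents stay bitwise identical, so both clocks simply keep running. Only the migration case calls migration_tsc_offset(), and kvmclock then advances by the realtime elapsed because it is derived from the guest TSC.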
On 9/25/25 1:44 AM, David Woodhouse wrote:
> On Wed, 2025-09-24 at 13:53 -0700, Dongli Zhang wrote:
>>
>>
>> On 9/23/25 10:47 AM, David Woodhouse wrote:
>>> On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
>>>>
>>>>
>>>> On 9/23/25 9:26 AM, David Woodhouse wrote:
>>>>> On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
>>>>>> On 9/22/25 11:16 AM, David Woodhouse wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>>>>>>>> blackout because of the lack of "MSR_IA32_TSC put" at
>>>>>>>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>>>>>>>> fix it.
>>>>>>>>
>>>>>>>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>>>>>>>> kvm-clock before continuing the guest VM.
>>>>>
>>>>> Yeah, right now it's probably just introducing errors for a stop/start
>>>>> of the VM.
>>>>
>>>> But that help can meet the expectation?
>>>>
>>>> Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
>>>> KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
>>>> the VM is continued.
>>>
>>> It saves the actual *value* of the clock. I would prefer to phrase that
>>> as "it makes the clock jump backwards to the time at which the guest
>>> was paused".
>>>
>>>> This ensures that the clock value itself does not change between stop and cont.
>>>>
>>>> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
>>>>
>>>> As a result, when execution resumes, the guest TSC suddenly jumps forward.
>>>
>>> Oh wow, that seems really broken. If we're going to make it experience
>>> a time warp, we should at least be *consistent*.
>>>
>>> So a guest which uses the TSC for timekeeping should be mostly
>>> unaffected by this and its wallclock should still be accurate. A guest
>>> which uses the KVM clock will be hosed by it.
>>>
>>> I think we should fix this so that the KVM clock is unaffected too.
>>
>> From my understanding of your reply, the kvm-clock/tsc should always be adjusted
>> whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).
>
> I think I agree, except I still hate the way you use the word
> 'adjusted'.
>
> If I look at my clock, and then go to sleep for a while and look at the
> clock again, nobody *adjusts* it. It just keeps running.
>
> That's the effect we should always strive for, and that's how we should
> think about it and talk about it.
>
> It's difficult to talk about clocks because what does it mean for a
> clock to be "unchanged"? Does it mean that it should return the same
> time value? Or that it should continue to count consistently? I would
> argue that we should *always* use language which assumes the latter.
>
> Turning to physics for a clumsy analogy, it's about the frame of
> reference. We're all on a moving train. I look at you in the seat
> opposite me, I go to sleep for a while, and I wake up and you're still
> there. Nobody has "adjusted" your position to accommodate for the
> movement of the train while I was asleep.
>

Thank you very much for explanation! I will use something like "keeps running".

>
>
>
>> This applies to:
>>
>> - stop / cont
>> - savevm / loadvm
>> - live migration
>> - cpr
>>
>> It is a bug if the clock jumps backwards to the time at which the guest was paused.
>>
>> The time elapsed while the VM is paused should always be accounted for and
>> reflected in kvm-clock/tsc once the VM resumes.
>
> In particular, in *all* but the live migration case, there should be
> basically nothing to do. No addition, no subtraction. Only restoring
> the *existing* relationships, precisely as they were before. That is
> the TSC *offset* value, and the precise TSC→kvmclock parameters, all
> bitwise *exactly* the same as before.
>
> And the only thing that changes on live migration is that you have to
> set the TSC offset such that the guest sees the values it *would* have
> seen on the original host at any given moment in time... and doesn't
> know it was kidnapped and moved onto a different train while it was
> sleeping...?
>

I see.

That means, only re-configure tsc_offset, while maintaining the tsc->kvmclock PVTI.
That's the reason you would like to remove 'kvm_arch->kvmclock_offset' entirely
as future work.

Thank you very much!

Dongli Zhang