Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?

Posted by Dongli Zhang 5 days, 22 hours ago
Hi,

Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
elapsed time during downtime blackout?

1. guest_clock=T1, realtime=R1.
2. (qemu) stop
3. Wait for several seconds.
4. (qemu) cont
5. guest_clock=T2, realtime=R2.

Should (T1 == T2), or (R2 - R1 == T2 - T1)?


For instance, suppose guest clocksource is 'tsc'. It is still incrementing
during QEMU downtime blackout.

[root@vm ~]# while true; do date; sleep 1; done
Tue Sep  9 15:28:37 PDT 2025
Tue Sep  9 15:28:38 PDT 2025
Tue Sep  9 15:28:39 PDT 2025
Tue Sep  9 15:28:40 PDT 2025
Tue Sep  9 15:28:41 PDT 2025
Tue Sep  9 15:28:42 PDT 2025
Tue Sep  9 15:28:43 PDT 2025 ===> (qemu) stop, wait for 14 seconds.
---> 14 seconds!
Tue Sep  9 15:28:57 PDT 2025 ===> (qemu) cont
Tue Sep  9 15:28:58 PDT 2025
Tue Sep  9 15:28:59 PDT 2025
Tue Sep  9 15:29:00 PDT 2025
Tue Sep  9 15:29:01 PDT 2025


However, 'kvm-clock' stops incrementing during the blackout.

[root@vm ~]# while true; do date; sleep 1; done
Tue Sep  9 15:35:59 PDT 2025
Tue Sep  9 15:36:00 PDT 2025
Tue Sep  9 15:36:01 PDT 2025
Tue Sep  9 15:36:02 PDT 2025
Tue Sep  9 15:36:03 PDT 2025 ===> (qemu) stop, wait for many seconds.
---> No gap!
Tue Sep  9 15:36:04 PDT 2025 ===> (qemu) cont
Tue Sep  9 15:36:05 PDT 2025
Tue Sep  9 15:36:06 PDT 2025
Tue Sep  9 15:36:07 PDT 2025
Tue Sep  9 15:36:08 PDT 2025
Tue Sep  9 15:36:09 PDT 2025
Tue Sep  9 15:36:10 PDT 2025
Tue Sep  9 15:36:11 PDT 2025
Tue Sep  9 15:36:12 PDT 2025


There are many use cases that can involve a long or short downtime blackout.

- stop/cont
- savevm/loadvm
- live migration, especially from/to a file.
- dump-guest-memory
- cpr?


KVM already exposes 'KVM_CLOCK_REALTIME' and 'KVM_VCPU_TSC_OFFSET' to help
count all elapsed time.

https://lore.kernel.org/all/20210916181538.968978-1-oupton@google.com/
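
A minimal user-space sketch of the adjustment, as I understand the
KVM_SET_CLOCK semantics from the series linked above: when KVM_CLOCK_REALTIME
is set, the saved clock is advanced by the host realtime that elapsed since the
save, and otherwise it is restored unchanged. The struct and function names
(model_clock_data, model_set_clock) are hypothetical stand-ins, not kernel or
QEMU API, and the flag constant is assumed to mirror the value in
<linux/kvm.h>.

```c
#include <stdint.h>

/* Assumed to mirror KVM_CLOCK_REALTIME from <linux/kvm.h>. */
#define MODEL_KVM_CLOCK_REALTIME (1u << 2)

/* Hypothetical stand-in for the struct kvm_clock_data fields used here. */
struct model_clock_data {
    uint64_t clock;     /* kvm-clock value in ns at save time */
    uint32_t flags;     /* MODEL_KVM_CLOCK_REALTIME if 'realtime' is valid */
    uint64_t realtime;  /* host CLOCK_REALTIME in ns at save time */
};

/*
 * Model of the clock value installed on restore. now_real_ns plays the role
 * of the host CLOCK_REALTIME at the moment KVM_SET_CLOCK is issued.
 */
static uint64_t model_set_clock(const struct model_clock_data *d,
                                uint64_t now_real_ns)
{
    uint64_t clock = d->clock;

    /* Only advance forward; a realtime that went backwards is ignored. */
    if ((d->flags & MODEL_KVM_CLOCK_REALTIME) && now_real_ns > d->realtime) {
        clock += now_real_ns - d->realtime;
    }
    return clock;
}
```

With clock=100 and realtime=1000 saved at 'stop', a restore at realtime 15000
yields 100 without the flag (T1 == T2) and 14100 with it (the clock advanced by
R2 - R1).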


This is a prototype to demonstrate how QEMU can count elapsed downtime by taking
advantage of 'KVM_CLOCK_REALTIME'.

From b97a514ac227645010ce3d1012af3a4943413844 Mon Sep 17 00:00:00 2001
From: Dongli Zhang <dongli.zhang@oracle.com>
Date: Thu, 18 Sep 2025 14:59:42 -0700
Subject: [PATCH 1/1] target/i386/kvm: take advantage of KVM_CLOCK_REALTIME

The Linux kernel commit c68dc1b577ea ("KVM: x86: Report host tsc and
realtime values in KVM_GET_CLOCK") introduced 'realtime' field and
KVM_CLOCK_REALTIME.

The 'realtime' value is saved through KVM_GET_CLOCK and restored via
KVM_SET_CLOCK. This enables the KVM clock to advance by the amount of
elapsed downtime realtime during operations such as live migration,
stop/cont, and savevm/loadvm.

This patch/feature allows QEMU to take advantage of KVM_CLOCK_REALTIME.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 hw/i386/kvm/clock.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index f56382717f..906346ce2f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -38,6 +38,8 @@ struct KVMClockState {
     /*< public >*/

     uint64_t clock;
+    uint64_t realtime;
+    uint32_t flags;
     bool clock_valid;

     /* whether the 'clock' value was obtained in the 'paused' state */
@@ -107,7 +109,10 @@ static void kvm_update_clock(KVMClockState *s)
         fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(-ret));
                 abort();
     }
+
     s->clock = data.clock;
+    s->flags = data.flags & KVM_CLOCK_REALTIME;
+    s->realtime = data.realtime;

     /* If kvm_has_adjust_clock_stable() is false, KVM_GET_CLOCK returns
      * essentially CLOCK_MONOTONIC plus a guest-specific adjustment.  This
@@ -186,6 +191,11 @@ static void kvmclock_vm_state_change(void *opaque, bool running,
         s->clock_valid = false;

         data.clock = s->clock;
+        if (s->flags & KVM_CLOCK_REALTIME) {
+            data.flags = s->flags;
+            data.realtime = s->realtime;
+        }
+
         ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
         if (ret < 0) {
             fprintf(stderr, "KVM_SET_CLOCK failed: %s\n", strerror(-ret));
@@ -259,6 +269,7 @@ static int kvmclock_pre_load(void *opaque)
     KVMClockState *s = opaque;

     s->clock_is_reliable = false;
+    s->flags = 0;

     return 0;
 }
@@ -290,12 +301,14 @@ static int kvmclock_pre_save(void *opaque)

 static const VMStateDescription kvmclock_vmsd = {
     .name = "kvmclock",
-    .version_id = 1,
+    .version_id = 2,
     .minimum_version_id = 1,
     .pre_load = kvmclock_pre_load,
     .pre_save = kvmclock_pre_save,
     .fields = (const VMStateField[]) {
         VMSTATE_UINT64(clock, KVMClockState),
+        VMSTATE_UINT64(realtime, KVMClockState),
+        VMSTATE_UINT32(flags, KVMClockState),
         VMSTATE_END_OF_LIST()
     },
     .subsections = (const VMStateDescription * const []) {
--
2.39.3




Taking advantage of 'KVM_VCPU_TSC_OFFSET' can further improve 'guest_tsc'.

Any suggestion on whether kvm-clock/guest_tsc should stop/continue counting
during the blackout? Any expectation or requirement by QEMU?

Thank you very much!

Dongli Zhang
Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by David Woodhouse 5 days, 21 hours ago
On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
> Hi,
> 
> Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
> elapsed time during downtime blackout?
> 
> 1. guest_clock=T1, realtime=R1.
> 2. (qemu) stop
> 3. Wait for several seconds.
> 4. (qemu) cont
> 5. guest_clock=T2, realtime=R2.
> 
> Should (T1 == T2), or (R2 - R1 == T2 - T1)?

Neither.

Realtime is something completely different and runs at a different rate
to the monotonic clock. In fact its rate compared to the monotonic
clock (and the TSC) is *variable* as NTP guides it.

In your example of stopping and continuing on the *same* host, the
guest TSC *offset* from the host's TSC should remain the same.

And the *precise* mathematical relationship that KVM advertises to the
guest as "how to turn a TSC value into nanoseconds since boot" should
also remain precisely the same.

KVM already lets you restore the TSC correctly. To restore KVM clock
correctly, you want something like KVM_SET_CLOCK_GUEST from
https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/

For cross machine migration, you *do* need to use a realtime clock
reference as that's the best you have (make sure you use TAI not UTC
and don't get affected by leap seconds or smearing). Use that to
restore the *TSC* as well as you can to make it appear to have kept
running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
the same host.

And use vmclock to advertise the wallclock time to the guest as
precisely as possible, even the cycle after a live migration.
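
The two cases above can be sketched as plain arithmetic. This is only a model
under stated assumptions: TSC scaling is ignored, the elapsed time between
source and destination is assumed to have been measured against a shared
TAI-based reference (not shown), and the helper names (guest_tsc,
dest_tsc_offset) are hypothetical, not KVM or QEMU API. It relies on the
GCC/Clang unsigned __int128 extension.

```c
#include <stdint.h>

/* Guest-visible TSC for a given host TSC and offset (scaling ignored). */
static uint64_t guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
{
    /* Unsigned addition wraps modulo 2^64, so "negative" offsets work. */
    return host_tsc + tsc_offset;
}

/*
 * Cross-host case: choose a destination offset so that the guest TSC appears
 * to have kept counting at tsc_khz for elapsed_ns since it was sampled on the
 * source host.
 */
static uint64_t dest_tsc_offset(uint64_t src_guest_tsc, uint64_t elapsed_ns,
                                uint64_t tsc_khz, uint64_t dst_host_tsc)
{
    /* ticks = elapsed_ns * tsc_khz / 10^6, with a 128-bit intermediate. */
    uint64_t ticks =
        (uint64_t)(((unsigned __int128)elapsed_ns * tsc_khz) / 1000000u);

    return src_guest_tsc + ticks - dst_host_tsc;
}
```

On the same host the offset is simply reused, so the guest TSC advances by
exactly the host TSC delta across the pause. Across hosts, the accuracy of the
result is bounded by how well elapsed_ns is known.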

Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by Dongli Zhang 5 days, 21 hours ago
Hi David,

Thank you very much for the quick reply!

On 9/22/25 9:58 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
>> Hi,
>>
>> Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
>> elapsed time during downtime blackout?
>>
>> 1. guest_clock=T1, realtime=R1.
>> 2. (qemu) stop
>> 3. Wait for several seconds.
>> 4. (qemu) cont
>> 5. guest_clock=T2, realtime=R2.
>>
>> Should (T1 == T2), or (R2 - R1 == T2 - T1)?
> 
> Neither.
> 
> Realtime is something completely different and runs at a different rate
> to the monotonic clock. In fact its rate compared to the monotonic
> clock (and the TSC) is *variable* as NTP guides it.
> 
> In your example of stopping and continuing on the *same* host, the
> guest TSC *offset* from the host's TSC should remain the same.
> 
> And the *precise* mathematical relationship that KVM advertises to the
> guest as "how to turn a TSC value into nanoseconds since boot" should
> also remain precisely the same.

Does that mean:

Regarding the "stop/cont" scenario, both the kvm-clock and guest_tsc values
should remain the same, i.e.,

1. When "stop", kvm-clock=K1, guest_tsc=T1.
2. Suppose many hours passed.
3. When "cont", guest VM should see kvm-clock==K1 and guest_tsc==T1, by
refreshing both PVTI and tsc_offset at KVM.


As demonstrated in my test, currently guest_tsc doesn't stop counting during
blackout because of the lack of "MSR_IA32_TSC put" at
kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
fix it.

BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
kvm-clock before continuing the guest VM.

> 
> KVM already lets you restore the TSC correctly. To restore KVM clock
> correctly, you want something like KVM_SET_CLOCK_GUEST from
> https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
> 
> For cross machine migration, you *do* need to use a realtime clock
> reference as that's the best you have (make sure you use TAI not UTC
> and don't get affected by leap seconds or smearing). Use that to
> restore the *TSC* as well as you can to make it appear to have kept
> running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
> the same host.

Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
temporarily stop/cont the source/target VM.

Do you mean we expect something different for live migration, i.e.,

1. Live Migrate a source VM to a file.
2. Copy the file to another server.
3. Wait for 1 hour.
4. Migrate from the file to target VM.

Although it is equivalent to a one-hour downtime, we do need to count the
missing one-hour, correct?


That means: we have different expectations from stop/cont and live migration.

- Live Migration: any downtime should be counted with the help from realtime.
- stop/cont (savevm/loadvm): the value of kvm-clock/rdtsc should remain the same.

> 
> And use vmclock to advertise the wallclock time to the guest as
> precisely as possible, even the cycle after a live migration.
> 

Thank you very much for the suggestions on KVM_SET_CLOCK_GUEST and vmclock!

Dongli Zhang
Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by David Woodhouse 5 days, 20 hours ago
On Mon, 2025-09-22 at 10:31 -0700, Dongli Zhang wrote:
> Hi David,
> 
> Thank you very much for quick reply!
> 
> On 9/22/25 9:58 AM, David Woodhouse wrote:
> > On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
> > > Hi,
> > > 
> > > Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
> > > elapsed time during downtime blackout?
> > > 
> > > 1. guest_clock=T1, realtime=R1.
> > > 2. (qemu) stop
> > > 3. Wait for several seconds.
> > > 4. (qemu) cont
> > > 5. guest_clock=T2, realtime=R2.
> > > 
> > > Should (T1 == T2), or (R2 - R1 == T2 - T1)?
> > 
> > Neither.
> > 
> > Realtime is something completely different and runs at a different rate
> > to the monotonic clock. In fact its rate compared to the monotonic
> > clock (and the TSC) is *variable* as NTP guides it.
> > 
> > In your example of stopping and continuing on the *same* host, the
> > guest TSC *offset* from the host's TSC should remain the same.
> > 
> > And the *precise* mathematical relationship that KVM advertises to the
> > guest as "how to turn a TSC value into nanoseconds since boot" should
> > also remain precisely the same.
> 
> Does that mean:
> 
> Regarding "stop/cont" scenario, both kvm-clock and guest_tsc value should remain
> the same, i.e.,
>
> 1. When "stop", kvm-clock=K1, guest_tsc=T1.
> 2. Suppose many hours passed.
> 3. When "cont", guest VM should see kvm-clock==K1 and guest_tsc==T1, by
> refreshing both PVTI and tsc_offset at KVM.

Assuming a modern host where the TSC just counts sanely at a consistent
rate and never deviates....

No. The PVTI should basically *never* change. Whatever the estimated
(not NTP skewed) frequency of the TSC is believed to be, the KVM clock
PVTI should indicate that at boot, telling the guest how to convert a
TSC value into 'monotonic nanoseconds since boot'. If it ever changes,
that's a KVM bug.

It should be saved and restored in precisely its native form, using the
KVM_[GS]ET_CLOCK_GUEST I referenced before. For both live update (same
host) and live migration (different host).

The TSC should also continue to count at exactly the same rate as the
host's TSC at all times. No breaks or discontinuities due to any kind
of 'steal time'. For live update that's easy as you just apply the same
*offset*. For live migration that's where you have to accept that it
depends on clock synchronization between your source and destination
hosts, which is probably based on realtime.
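
For reference, the "TSC value to nanoseconds" relationship carried by the PVTI
can be sketched as below. The field layout follows the pvclock per-vCPU time
info structure as I understand it; the struct name (pvti) and the helper
(pvti_tsc_to_ns) are hypothetical, and the version-based retry loop that a real
guest performs while reading the structure is omitted. It relies on the
GCC/Clang unsigned __int128 extension.

```c
#include <stdint.h>

/* Hypothetical mirror of the pvclock per-vCPU time info ("PVTI"). */
struct pvti {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* guest TSC when system_time was sampled */
    uint64_t system_time;        /* ns since boot at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point ns per (shifted) tick */
    int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
    uint8_t  flags;
    uint8_t  pad[2];
};

/* Convert a guest TSC reading into nanoseconds since boot. */
static uint64_t pvti_tsc_to_ns(const struct pvti *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->tsc_shift < 0) {
        delta >>= -p->tsc_shift;
    } else {
        delta <<= p->tsc_shift;
    }

    /* 64x32-bit multiply, keeping the integer part of the 32.32 product. */
    return p->system_time +
           (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}
```

If tsc_timestamp, system_time, the multiplier and the shift are preserved
exactly across a save/restore, the same TSC reading maps to the same
nanosecond value before and after, which is the invariant described above.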



> 
> As demonstrated in my test, currently guest_tsc doesn't stop counting during
> blackout because of the lack of "MSR_IA32_TSC put" at
> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> fix it.
> 
> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> kvm-clock before continuing the guest VM.
> 
> > 
> > KVM already lets you restore the TSC correctly. To restore KVM clock
> > correctly, you want something like KVM_SET_CLOCK_GUEST from
> > https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
> > 
> > For cross machine migration, you *do* need to use a realtime clock
> > reference as that's the best you have (make sure you use TAI not UTC
> > and don't get affected by leap seconds or smearing). Use that to
> > restore the *TSC* as well as you can to make it appear to have kept
> > running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
> > the same host.
> 
> Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
> temporarily stop/cont the source/target VM.
> 
> Would you mean we expect something different for live migration, i.e.,
> 
> 1. Live Migrate a source VM to a file.
> 2. Copy the file to another server.
> 3. Wait for 1 hour.
> 4. Migrate from the file to target VM.
> 
> Although it is equivalent to a one-hour downtime, we do need to count the
> missing one-hour, correct?

I don't look at it as counting anything. The clock keeps running even
when I'm not looking at it. If I wake up and look at it again, there is
no 'counting' how long I was asleep...

Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by Dongli Zhang 5 days, 19 hours ago

On 9/22/25 11:16 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 10:31 -0700, Dongli Zhang wrote:
>> Hi David,
>>
>> Thank you very much for quick reply!
>>
>> On 9/22/25 9:58 AM, David Woodhouse wrote:
>>> On Mon, 2025-09-22 at 09:37 -0700, Dongli Zhang wrote:
>>>> Hi,
>>>>
>>>> Would you mind helping confirm if kvm-clock/guest_tsc should stop counting
>>>> elapsed time during downtime blackout?
>>>>
>>>> 1. guest_clock=T1, realtime=R1.
>>>> 2. (qemu) stop
>>>> 3. Wait for several seconds.
>>>> 4. (qemu) cont
>>>> 5. guest_clock=T2, realtime=R2.
>>>>
>>>> Should (T1 == T2), or (R2 - R1 == T2 - T1)?
>>>
>>> Neither.
>>>
>>> Realtime is something completely different and runs at a different rate
>>> to the monotonic clock. In fact its rate compared to the monotonic
>>> clock (and the TSC) is *variable* as NTP guides it.
>>>
>>> In your example of stopping and continuing on the *same* host, the
>>> guest TSC *offset* from the host's TSC should remain the same.
>>>
>>> And the *precise* mathematical relationship that KVM advertises to the
>>> guest as "how to turn a TSC value into nanoseconds since boot" should
>>> also remain precisely the same.
>>
>> Does that mean:
>>
>> Regarding "stop/cont" scenario, both kvm-clock and guest_tsc value should remain
>> the same, i.e.,
>>
>> 1. When "stop", kvm-clock=K1, guest_tsc=T1.
>> 2. Suppose many hours passed.
>> 3. When "cont", guest VM should see kvm-clock==K1 and guest_tsc==T1, by
>> refreshing both PVTI and tsc_offset at KVM.
> 
> Assuming a modern host where the TSC just counts sanely at a consistent
> rate and never deviates....
> 
> No. The PVTI should basically *never* change. Whatever the estimated
> (not NTP skewed) frequency of the TSC is believed to be, the KVM clock
> PVTI should indicate that at boot, telling the guest how to convert a
> TSC value into 'monotonic nanoseconds since boot'. If it ever changes,
> that's a KVM bug.
> 
> It should be saved and restored in precisely its native form, using the
> KVM_[GS]ET_CLOCK_GUEST I referenced before. For both live update (same
> host) and live migration (different host).
> 
> The TSC should also continue to count at exactly the same rate as the
> host's TSC at all times. No breaks or discontinuities due to any kind
> of 'steal time'. For live update that's easy as you just apply the same
> *offset*. For live migration that's where you have to accept that it
> depends on clock synchronization between your source and destination
> hosts, which is probably based on realtime.

That means:

- Utilize KVM_[GS]ET_CLOCK_GUEST to avoid forward/backward drift due to the
change in PVTI data structure (by adjusting 'ka->kvmclock_offset').

- Utilize realtime as reference to keep clock/tsc running.

> 
> 
>>
>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>> blackout because of the lack of "MSR_IA32_TSC put" at
>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>> fix it.
>>
>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>> kvm-clock before continuing the guest VM.
>>
>>>
>>> KVM already lets you restore the TSC correctly. To restore KVM clock
>>> correctly, you want something like KVM_SET_CLOCK_GUEST from
>>> https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
>>>
>>> For cross machine migration, you *do* need to use a realtime clock
>>> reference as that's the best you have (make sure you use TAI not UTC
>>> and don't get affected by leap seconds or smearing). Use that to
>>> restore the *TSC* as well as you can to make it appear to have kept
>>> running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
>>> the same host.
>>
>> Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
>> temporarily stop/cont the source/target VM.
>>
>> Would you mean we expect something different for live migration, i.e.,
>>
>> 1. Live Migrate a source VM to a file.
>> 2. Copy the file to another server.
>> 3. Wait for 1 hour.
>> 4. Migrate from the file to target VM.
>>
>> Although it is equivalent to a one-hour downtime, we do need to count the
>> missing one-hour, correct?
> 
> I don't look at it as counting anything. The clock keeps running even
> when I'm not looking at it. If I wake up and look at it again, there is
> no 'counting' how long I was asleep...
> 

That means:

- stop/cont: clock/tsc stop running
- savevm/loadvm: clock/tsc stop running

- any live migration: clock/tsc continue running (equivalent)
- any live update (including QEMU cpr): clock/tsc continue running (equivalent)



However, there is another scenario in which we 'stop' the target VM on purpose
before live migration, i.e., 'autostart' is disabled.

After live migration, the target VM won't start automatically unless we issue
'cont'.

I assume this is classified as the "any live migration" scenario, so we still
need to keep clock/tsc running.

Thank you very much!

Dongli Zhang
Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by David Woodhouse 4 days, 22 hours ago
On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
> On 9/22/25 11:16 AM, David Woodhouse wrote:
> > Assuming a modern host where the TSC just counts sanely at a consistent
> > rate and never deviates....
> > 
> > No. The PVTI should basically *never* change. Whatever the estimated
> > (not NTP skewed) frequency of the TSC is believed to be, the KVM clock
> > PVTI should indicate that at boot, telling the guest how to convert a
> > TSC value into 'monotonic nanoseconds since boot'. If it ever changes,
> > that's a KVM bug.
> > 
> > It should be saved and restored in precisely its native form, using the
> > KVM_[GS]ET_CLOCK_GUEST I referenced before. For both live update (same
> > host) and live migration (different host).
> > 
> > The TSC should also continue to count at exactly the same rate as the
> > host's TSC at all times. No breaks or discontinuities due to any kind
> > of 'steal time'. For live update that's easy as you just apply the same
> > *offset*. For live migration that's where you have to accept that it
> > depends on clock synchronization between your source and destination
> > hosts, which is probably based on realtime.
> 
> That means:
> 
> - Utilize KVM_[GS]ET_CLOCK_GUEST to avoid forward/backward drift due to the
> change in PVTI data structure (by adjusting 'ka->kvmclock_offset').

Ultimately for modern hardware I think I'd like to kill
ka->kvmclock_offset entirely but yeah, that's how it works right now I
think.

> - Utilize realtime as reference to keep clock/tsc running.

Hm, I don't like talking about 'running' vs. 'stopped'. The clock
should always be running. You try to keep it as *stable* as possible,
even across live migration. And for live migration, realtime is
probably the best you have so it's what you're stuck with.

When the guest reads their TSC, they should always get a value which is
as *close* as possible to their *original* host's TSC, minus the delta
of what that host's TSC was when they were first started (ignoring
scaling).




> > 
> > 
> > > 
> > > As demonstrated in my test, currently guest_tsc doesn't stop counting during
> > > blackout because of the lack of "MSR_IA32_TSC put" at
> > > kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> > > fix it.
> > > 
> > > BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> > > kvm-clock before continuing the guest VM.

Yeah, right now it's probably just introducing errors for a stop/start
of the VM.

> > > > 
> > > > KVM already lets you restore the TSC correctly. To restore KVM clock
> > > > correctly, you want something like KVM_SET_CLOCK_GUEST from
> > > > https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
> > > > 
> > > > For cross machine migration, you *do* need to use a realtime clock
> > > > reference as that's the best you have (make sure you use TAI not UTC
> > > > and don't get affected by leap seconds or smearing). Use that to
> > > > restore the *TSC* as well as you can to make it appear to have kept
> > > > running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
> > > > the same host.
> > > 
> > > Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
> > > temporarily stop/cont the source/target VM.
> > > 
> > > Would you mean we expect something different for live migration, i.e.,
> > > 
> > > 1. Live Migrate a source VM to a file.
> > > 2. Copy the file to another server.
> > > 3. Wait for 1 hour.
> > > 4. Migrate from the file to target VM.
> > > 
> > > Although it is equivalent to a one-hour downtime, we do need to count the
> > > missing one-hour, correct?
> > 
> > I don't look at it as counting anything. The clock keeps running even
> > when I'm not looking at it. If I wake up and look at it again, there is
> > no 'counting' how long I was asleep...
> > 
> 
> That means:
> 
> - stop/cont: clock/tsc stop running
> - savevm/loadvm: clock/tsc stop running

What does "stop running" even mean here? You can never stop the clock
running. The only thing you can do is change its offset so that it
jumps back to an earlier value, when you resume a VM?

Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by Dongli Zhang 4 days, 21 hours ago

On 9/23/25 9:26 AM, David Woodhouse wrote:
> On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
>> On 9/22/25 11:16 AM, David Woodhouse wrote:

[snip]

>>>
>>>>
>>>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>>>> blackout because of the lack of "MSR_IA32_TSC put" at
>>>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>>>> fix it.
>>>>
>>>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>>>> kvm-clock before continuing the guest VM.
> 
> Yeah, right now it's probably just introducing errors for a stop/start
> of the VM.

But can that help meet the expectation?

Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
the VM is continued.

This ensures that the clock value itself does not change between stop and cont.

However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.

As a result, when execution resumes, the guest TSC suddenly jumps forward.
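
The mismatch can be modelled in a few lines. This is only a sketch of the
behavior described above, not QEMU code: the struct and helpers
(model_guest_view, model_after_cont, model_divergence_ns) are hypothetical, the
TSC is assumed to tick at a constant tsc_khz, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <stdint.h>

/* What the guest can observe at a given instant. */
struct model_guest_view {
    uint64_t kvmclock_ns;  /* kvm-clock reading in ns */
    uint64_t tsc;          /* guest TSC reading in ticks */
};

/* Guest view right after 'cont', following a pause of paused_ns. */
static struct model_guest_view
model_after_cont(struct model_guest_view at_stop, uint64_t paused_ns,
                 uint64_t tsc_khz)
{
    struct model_guest_view v;

    /* KVM_SET_CLOCK restores the saved value: no time appears to pass. */
    v.kvmclock_ns = at_stop.kvmclock_ns;

    /* The TSC offset is untouched, so the TSC kept counting meanwhile. */
    v.tsc = at_stop.tsc +
            (uint64_t)(((unsigned __int128)paused_ns * tsc_khz) / 1000000u);
    return v;
}

/* Nanoseconds by which the TSC ran ahead of kvm-clock across the pause. */
static uint64_t model_divergence_ns(struct model_guest_view at_stop,
                                    struct model_guest_view after,
                                    uint64_t tsc_khz)
{
    uint64_t tsc_ns = (uint64_t)(((unsigned __int128)(after.tsc - at_stop.tsc)
                                  * 1000000u) / tsc_khz);

    return tsc_ns - (after.kvmclock_ns - at_stop.kvmclock_ns);
}
```

For a 14-second pause on a 2 GHz TSC (tsc_khz = 2000000), the model reports
that the two clocks disagree by the full 14 seconds after 'cont'.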

> 
>>>>>
>>>>> KVM already lets you restore the TSC correctly. To restore KVM clock
>>>>> correctly, you want something like KVM_SET_CLOCK_GUEST from
>>>>> https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
>>>>>
>>>>> For cross machine migration, you *do* need to use a realtime clock
>>>>> reference as that's the best you have (make sure you use TAI not UTC
>>>>> and don't get affected by leap seconds or smearing). Use that to
>>>>> restore the *TSC* as well as you can to make it appear to have kept
>>>>> running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
>>>>> the same host.
>>>>
>>>> Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
>>>> temporarily stop/cont the source/target VM.
>>>>
>>>> Would you mean we expect something different for live migration, i.e.,
>>>>
>>>> 1. Live Migrate a source VM to a file.
>>>> 2. Copy the file to another server.
>>>> 3. Wait for 1 hour.
>>>> 4. Migrate from the file to target VM.
>>>>
>>>> Although it is equivalent to a one-hour downtime, we do need to count the
>>>> missing one-hour, correct?
>>>
>>> I don't look at it as counting anything. The clock keeps running even
>>> when I'm not looking at it. If I wake up and look at it again, there is
>>> no 'counting' how long I was asleep...
>>>
>>
>> That means:
>>
>> - stop/cont: clock/tsc stop running
>> - savevm/loadvm: clock/tsc stop running
> 
> What does "stop running" even mean here? You can never stop the clock
> running. The only thing you can do is change its offset so that it
> jumps back to an earlier value, when you resume a VM?
> 

Yes, I meant "change its offset so that it jumps back to an earlier value." From
the VM's perspective, this is equivalent to "the clock was stopped."


Could you help explain why we treat stop/cont differently from live migration?

Live migration/update is the same as stop/cont because the blackout phase
involves stopping a guest and continuing execution on a different/same host.
You technically stop the guest in a way that's not controlled by the guest
(compared to, say, hibernation or suspend-to-idle), and then you continue. That
is part of the reason I think 'stop'/'cont' ought to have the same behavior as
live migration.

Thank you very much!

Dongli Zhang
Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by David Woodhouse 4 days, 21 hours ago
On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
> 
> 
> On 9/23/25 9:26 AM, David Woodhouse wrote:
> > On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
> > > On 9/22/25 11:16 AM, David Woodhouse wrote:
> 
> [snip]
> 
> > > > 
> > > > > 
> > > > > As demonstrated in my test, currently guest_tsc doesn't stop counting during
> > > > > blackout because of the lack of "MSR_IA32_TSC put" at
> > > > > kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> > > > > fix it.
> > > > > 
> > > > > BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> > > > > kvm-clock before continuing the guest VM.
> > 
> > Yeah, right now it's probably just introducing errors for a stop/start
> > of the VM.
> 
> But that help can meet the expectation?
> 
> Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
> KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
> the VM is continued.

It saves the actual *value* of the clock. I would prefer to phrase that
as "it makes the clock jump backwards to the time at which the guest
was paused".

> This ensures that the clock value itself does not change between stop and cont.
> 
> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
> 
> As a result, when execution resumes, the guest TSC suddenly jumps forward.

Oh wow, that seems really broken. If we're going to make it experience
a time warp, we should at least be *consistent*.

So a guest which uses the TSC for timekeeping should be mostly
unaffected by this and its wallclock should still be accurate. A guest
which uses the KVM clock will be hosed by it.

I think we should fix this so that the KVM clock is unaffected too.

> > 
> > > > > > 
> > > > > > KVM already lets you restore the TSC correctly. To restore KVM clock
> > > > > > correctly, you want something like KVM_SET_CLOCK_GUEST from
> > > > > > https://lore.kernel.org/all/20240522001817.619072-4-dwmw2@infradead.org/
> > > > > > 
> > > > > > For cross machine migration, you *do* need to use a realtime clock
> > > > > > reference as that's the best you have (make sure you use TAI not UTC
> > > > > > and don't get affected by leap seconds or smearing). Use that to
> > > > > > restore the *TSC* as well as you can to make it appear to have kept
> > > > > > running consistently. And then KVM_SET_CLOCK_GUEST just as you would on
> > > > > > the same host.
> > > > > 
> > > > > Indeed QEMU Live Migration also relies on kvmclock_vm_state_change() to
> > > > > temporarily stop/cont the source/target VM.
> > > > > 
> > > > > Would you mean we expect something different for live migration, i.e.,
> > > > > 
> > > > > 1. Live Migrate a source VM to a file.
> > > > > 2. Copy the file to another server.
> > > > > 3. Wait for 1 hour.
> > > > > 4. Migrate from the file to target VM.
> > > > > 
> > > > > Although it is equivalent to a one-hour downtime, we do need to count the
> > > > > missing one-hour, correct?
> > > > 
> > > > I don't look at it as counting anything. The clock keeps running even
> > > > when I'm not looking at it. If I wake up and look at it again, there is
> > > > no 'counting' how long I was asleep...
> > > > 
> > > 
> > > That means:
> > > 
> > > - stop/cont: clock/tsc stop running
> > > - savevm/loadvm: clock/tsc stop running
> > 
> > What does "stop running" even mean here? You can never stop the clock
> > running. The only thing you can do is change its offset so that it
> > jumps back to an earlier value, when you resume a VM?
> > 
> 
> Yes, I meant "change its offset so that it jumps back to an earlier value." From
> the VM's perspective, this is equivalent to "the clock was stopped."

Yeah... let's stop doing that :)

> Could you help explain why we treat stop/cont differently from live migration?

We shouldn't. Hardware companies spent *years* learning the lesson
about clocks. That they should just keep counting. At the same rate.
Unconditionally. Even if the CPU is running slower. Or stopped. Or
whatever. Just count. Do not stop counting.

Let's not repeat the same mistakes.

Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by Dongli Zhang 3 days, 17 hours ago

On 9/23/25 10:47 AM, David Woodhouse wrote:
> On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
>>
>>
>> On 9/23/25 9:26 AM, David Woodhouse wrote:
>>> On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
>>>> On 9/22/25 11:16 AM, David Woodhouse wrote:
>>
>> [snip]
>>
>>>>>
>>>>>>
>>>>>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>>>>>> blackout because of the lack of "MSR_IA32_TSC put" at
>>>>>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>>>>>> fix it.
>>>>>>
>>>>>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>>>>>> kvm-clock before continuing the guest VM.
>>>
>>> Yeah, right now it's probably just introducing errors for a stop/start
>>> of the VM.
>>
>> But can that help meet the expectation?
>>
>> Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
>> KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
>> the VM is continued.
> 
> It saves the actual *value* of the clock. I would prefer to phrase that
> as "it makes the clock jump backwards to the time at which the guest
> was paused".
> 
>> This ensures that the clock value itself does not change between stop and cont.
>>
>> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
>>
>> As a result, when execution resumes, the guest TSC suddenly jumps forward.
> 
> Oh wow, that seems really broken. If we're going to make it experience
> a time warp, we should at least be *consistent*.
> 
> So a guest which uses the TSC for timekeeping should be mostly
> unaffected by this and its wallclock should still be accurate. A guest
> which uses the KVM clock will be hosed by it.
> 
> I think we should fix this so that the KVM clock is unaffected too.

From my understanding of your reply, the kvm-clock/tsc should always be adjusted
whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).

This applies to:

- stop / cont
- savevm / loadvm
- live migration
- cpr

It is a bug if the clock jumps backwards to the time at which the guest was paused.

The time elapsed while the VM is paused should always be accounted for and
reflected in kvm-clock/tsc once the VM resumes.
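
As a rough illustration of the two possible outcomes after stop/cont, here is a
toy model (not QEMU code; all function and variable names are hypothetical, and
times are plain seconds). It contrasts writing back the value saved at stop time
with keeping the guest clock's fixed relationship to the host clock.

```python
# Toy model of the two resume policies discussed in this thread.
# This is NOT QEMU code; every name here is hypothetical.

def resume_restoring_value(clock_at_stop, host_at_stop, host_now):
    """Policy 1: write back the clock value saved at stop time.

    The guest clock continues from where it was, so the paused interval
    is invisible to the guest (the clock "jumps back").
    """
    return clock_at_stop


def resume_keeping_relationship(clock_at_stop, host_at_stop, host_now):
    """Policy 2: keep the guest clock's fixed offset from the host clock.

    The guest clock reads as if it had kept running while paused.
    """
    offset = clock_at_stop - host_at_stop  # unchanged across the pause
    return host_now + offset


# Guest stopped at host time 1000 s with guest clock 43 s; resumed 14 s later.
t1, r1, r2 = 43, 1000, 1014
value_policy = resume_restoring_value(t1, r1, r2)
offset_policy = resume_keeping_relationship(t1, r1, r2)
```

With the first policy T2 == T1; with the second, T2 - T1 == R2 - R1.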

Thank you very much!

Dongli Zhang
Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by David Woodhouse 3 days, 6 hours ago
On Wed, 2025-09-24 at 13:53 -0700, Dongli Zhang wrote:
> 
> 
> On 9/23/25 10:47 AM, David Woodhouse wrote:
> > On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
> > > 
> > > 
> > > On 9/23/25 9:26 AM, David Woodhouse wrote:
> > > > On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
> > > > > On 9/22/25 11:16 AM, David Woodhouse wrote:
> > > 
> > > [snip]
> > > 
> > > > > > 
> > > > > > > 
> > > > > > > As demonstrated in my test, currently guest_tsc doesn't stop counting during
> > > > > > > blackout because of the lack of "MSR_IA32_TSC put" at
> > > > > > > kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
> > > > > > > fix it.
> > > > > > > 
> > > > > > > BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
> > > > > > > kvm-clock before continuing the guest VM.
> > > > 
> > > > Yeah, right now it's probably just introducing errors for a stop/start
> > > > of the VM.
> > > 
> > > But can that help meet the expectation?
> > > 
> > > Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
> > > KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
> > > the VM is continued.
> > 
> > It saves the actual *value* of the clock. I would prefer to phrase that
> > as "it makes the clock jump backwards to the time at which the guest
> > was paused".
> > 
> > > This ensures that the clock value itself does not change between stop and cont.
> > > 
> > > However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
> > > 
> > > As a result, when execution resumes, the guest TSC suddenly jumps forward.
> > 
> > Oh wow, that seems really broken. If we're going to make it experience
> > a time warp, we should at least be *consistent*.
> > 
> > So a guest which uses the TSC for timekeeping should be mostly
> > unaffected by this and its wallclock should still be accurate. A guest
> > which uses the KVM clock will be hosed by it.
> > 
> > I think we should fix this so that the KVM clock is unaffected too.
> 
> From my understanding of your reply, the kvm-clock/tsc should always be adjusted
> whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).

I think I agree, except I still hate the way you use the word
'adjusted'.

If I look at my clock, and then go to sleep for a while and look at the
clock again, nobody *adjusts* it. It just keeps running.

That's the effect we should always strive for, and that's how we should
think about it and talk about it.

It's difficult to talk about clocks because what does it mean for a
clock to be "unchanged"? Does it mean that it should return the same
time value? Or that it should continue to count consistently? I would
argue that we should *always* use language which assumes the latter.

Turning to physics for a clumsy analogy, it's about the frame of
reference. We're all on a moving train. I look at you in the seat
opposite me, I go to sleep for a while, and I wake up and you're still
there. Nobody has "adjusted" your position to accommodate for the
movement of the train while I was asleep.




> This applies to:
> 
> - stop / cont
> - savevm / loadvm
> - live migration
> - cpr
> 
> It is a bug if the clock jumps backwards to the time at which the guest was paused.
> 
> The time elapsed while the VM is paused should always be accounted for and
> reflected in kvm-clock/tsc once the VM resumes.

In particular, in *all* but the live migration case, there should be
basically nothing to do. No addition, no subtraction. Only restoring
the *existing* relationships, precisely as they were before. That is
the TSC *offset* value, and the precise TSC→kvmclock parameters, all
bitwise *exactly* the same as before.
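
To make the "nothing to do" point concrete, here is a minimal sketch of the
arithmetic, assuming the usual pvclock-style scaling (shift the TSC delta,
multiply by a 32-bit fixed-point factor, add a base time). The class, the
function names and the numbers are illustrative assumptions, not QEMU or KVM
code.

```python
from dataclasses import dataclass, replace

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class ClockParams:
    # Hypothetical container for the state that must survive stop/cont.
    tsc_offset: int         # guest_tsc = host_tsc + tsc_offset (mod 2**64)
    tsc_timestamp: int      # guest TSC value at which system_time was sampled
    system_time: int        # kvmclock nanoseconds at tsc_timestamp
    tsc_to_system_mul: int  # 32-bit fixed-point multiplier
    tsc_shift: int          # shift applied to the TSC delta before scaling


def guest_tsc(host_tsc, p):
    return (host_tsc + p.tsc_offset) & MASK64


def kvmclock_ns(host_tsc, p):
    delta = (guest_tsc(host_tsc, p) - p.tsc_timestamp) & MASK64
    delta = delta << p.tsc_shift if p.tsc_shift >= 0 else delta >> -p.tsc_shift
    return p.system_time + ((delta * p.tsc_to_system_mul) >> 32)


# Assumed 2 GHz TSC: mul = 2**31 with shift 0 gives 0.5 ns per tick.
params = ClockParams(tsc_offset=(-5_000_000_000) & MASK64,
                     tsc_timestamp=0, system_time=0,
                     tsc_to_system_mul=1 << 31, tsc_shift=0)

host_tsc_at_stop = 25_000_000_000
host_tsc_at_cont = host_tsc_at_stop + 14 * 2_000_000_000  # 14 s pause

saved = params             # "stop": remember the parameters
restored = replace(saved)  # "cont": write back an identical copy

before_ns = kvmclock_ns(host_tsc_at_stop, params)
after_ns = kvmclock_ns(host_tsc_at_cont, restored)
elapsed_ns = after_ns - before_ns
```

Because the host TSC kept counting and none of the parameters changed, the
guest sees the full 14 seconds without any addition or subtraction on resume.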

And the only thing that changes on live migration is that you have to
set the TSC offset such that the guest sees the values it *would* have
seen on the original host at any given moment in time... and doesn't
know it was kidnapped and moved onto a different train while it was
sleeping...?
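
A sketch of that offset calculation, under stated assumptions: both hosts have
synchronized realtime clocks, the guest TSC frequency is the same on both
sides, and a (host TSC, realtime) pair can be sampled together on each host.
The function name and all numbers are hypothetical.

```python
MASK64 = (1 << 64) - 1
NSEC_PER_SEC = 1_000_000_000


def destination_tsc_offset(src_host_tsc, src_tsc_offset, src_realtime_ns,
                           dst_host_tsc, dst_realtime_ns, guest_tsc_hz):
    """Offset for the destination so the guest TSC looks uninterrupted."""
    # Guest TSC as seen on the source at the sampled moment.
    guest_tsc_at_src = (src_host_tsc + src_tsc_offset) & MASK64
    # Ticks the guest TSC would have advanced on the source since then.
    elapsed_ns = dst_realtime_ns - src_realtime_ns
    elapsed_ticks = elapsed_ns * guest_tsc_hz // NSEC_PER_SEC
    expected_guest_tsc = (guest_tsc_at_src + elapsed_ticks) & MASK64
    # Pick the offset that makes the destination host TSC map onto it.
    return (expected_guest_tsc - dst_host_tsc) & MASK64


# Migration via a file, resumed one hour later on a host with another TSC.
guest_tsc_hz = 2_000_000_000
src_host_tsc = 25_000_000_000
src_tsc_offset = (-5_000_000_000) & MASK64
src_realtime_ns = 1_000 * NSEC_PER_SEC
dst_realtime_ns = src_realtime_ns + 3_600 * NSEC_PER_SEC
dst_host_tsc = 900_000_000_000

dst_tsc_offset = destination_tsc_offset(src_host_tsc, src_tsc_offset,
                                        src_realtime_ns, dst_host_tsc,
                                        dst_realtime_ns, guest_tsc_hz)
guest_tsc_on_dst = (dst_host_tsc + dst_tsc_offset) & MASK64
```

The TSC to kvmclock parameters are left untouched here; only the offset is
recomputed for the new host.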

Re: Should QEMU (accel=kvm) kvm-clock/guest_tsc stop counting during downtime blackout?
Posted by Dongli Zhang 2 days, 19 hours ago

On 9/25/25 1:44 AM, David Woodhouse wrote:
> On Wed, 2025-09-24 at 13:53 -0700, Dongli Zhang wrote:
>>
>>
>> On 9/23/25 10:47 AM, David Woodhouse wrote:
>>> On Tue, 2025-09-23 at 10:25 -0700, Dongli Zhang wrote:
>>>>
>>>>
>>>> On 9/23/25 9:26 AM, David Woodhouse wrote:
>>>>> On Mon, 2025-09-22 at 12:37 -0700, Dongli Zhang wrote:
>>>>>> On 9/22/25 11:16 AM, David Woodhouse wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> As demonstrated in my test, currently guest_tsc doesn't stop counting during
>>>>>>>> blackout because of the lack of "MSR_IA32_TSC put" at
>>>>>>>> kvmclock_vm_state_change(). Per my understanding, it is a bug and we may need to
>>>>>>>> fix it.
>>>>>>>>
>>>>>>>> BTW, kvmclock_vm_state_change() already utilizes KVM_SET_CLOCK to re-configure
>>>>>>>> kvm-clock before continuing the guest VM.
>>>>>
>>>>> Yeah, right now it's probably just introducing errors for a stop/start
>>>>> of the VM.
>>>>
>>>> But can that help meet the expectation?
>>>>
>>>> Thanks to KVM_GET_CLOCK and KVM_SET_CLOCK, QEMU saves the clock with
>>>> KVM_GET_CLOCK when the VM is stopped, and restores it with KVM_SET_CLOCK when
>>>> the VM is continued.
>>>
>>> It saves the actual *value* of the clock. I would prefer to phrase that
>>> as "it makes the clock jump backwards to the time at which the guest
>>> was paused".
>>>
>>>> This ensures that the clock value itself does not change between stop and cont.
>>>>
>>>> However, QEMU does not adjust the TSC offset via MSR_IA32_TSC during stop.
>>>>
>>>> As a result, when execution resumes, the guest TSC suddenly jumps forward.
>>>
>>> Oh wow, that seems really broken. If we're going to make it experience
>>> a time warp, we should at least be *consistent*.
>>>
>>> So a guest which uses the TSC for timekeeping should be mostly
>>> unaffected by this and its wallclock should still be accurate. A guest
>>> which uses the KVM clock will be hosed by it.
>>>
>>> I think we should fix this so that the KVM clock is unaffected too.
>>
>> From my understanding of your reply, the kvm-clock/tsc should always be adjusted
>> whenever a QEMU VM is paused and then resumed (i.e. via stop/cont).
> 
> I think I agree, except I still hate the way you use the word
> 'adjusted'.
> 
> If I look at my clock, and then go to sleep for a while and look at the
> clock again, nobody *adjusts* it. It just keeps running.
> 
> That's the effect we should always strive for, and that's how we should
> think about it and talk about it.
> 
> It's difficult to talk about clocks because what does it mean for a
> clock to be "unchanged"? Does it mean that it should return the same
> time value? Or that it should continue to count consistently? I would
> argue that we should *always* use language which assumes the latter.
> 
> Turning to physics for a clumsy analogy, it's about the frame of
> reference. We're all on a moving train. I look at you in the seat
> opposite me, I go to sleep for a while, and I wake up and you're still
> there. Nobody has "adjusted" your position to accommodate for the
> movement of the train while I was asleep.
> 

Thank you very much for the explanation!

I will use something like "keeps running".

> 
> 
> 
>> This applies to:
>>
>> - stop / cont
>> - savevm / loadvm
>> - live migration
>> - cpr
>>
>> It is a bug if the clock jumps backwards to the time at which the guest was paused.
>>
>> The time elapsed while the VM is paused should always be accounted for and
>> reflected in kvm-clock/tsc once the VM resumes.
> 
> In particular, in *all* but the live migration case, there should be
> basically nothing to do. No addition, no subtraction. Only restoring
> the *existing* relationships, precisely as they were before. That is
> the TSC *offset* value, and the precise TSC→kvmclock parameters, all
> bitwise *exactly* the same as before.
> 
> And the only thing that changes on live migration is that you have to
> set the TSC offset such that the guest sees the values it *would* have
> seen on the original host at any given moment in time... and doesn't
> know it was kidnapped and moved onto a different train while it was
> sleeping...?
> 

I see. That means we only re-configure tsc_offset, while keeping the
tsc->kvmclock PVTI parameters unchanged. That is why you would like to remove
'kvm_arch->kvmclock_offset' entirely as future work.

Thank you very much!

Dongli Zhang