Hi,
After performing a live migration on a QEMU guest OS that had been running for over 30 days,
we noticed that the guest OS time was more than 2 seconds behind the actual time.
After extensive debugging, we found that this issue is related to master_kernel_ns and master_cycle_now.
When the guest OS starts, the host initializes a pair of master_kernel_ns and master_cycle_now values.
After live migration, the host updates these values.
Our debugging showed that if the host does not update master_kernel_ns/master_cycle_now,
the guest OS time remains correct.
To illustrate how updating master_kernel_ns/master_cycle_now leads to the guest OS time drifting backward,
we applied the following debug patch:
The patch adds a KVM debugfs entry to trigger time calculations and print the results.
The patch runs on the host side, but it uses __pvclock_read_cycles() to simulate how the guest OS would read its clock.
Example Output:
# cat /sys/kernel/debug/kvm/946-13/pvclock
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 1893199569691
new: master_kernel_ns: 1908210098649
new: master_cycle_now: 4391329912268
new: ns: 1893199548401
tsc 4391329912368
kvmclock_offset -15010550291
diff: ns: 21290
Explanation of Parameters:
Input:
"old: master_kernel_ns:" The master_kernel_ns value recorded when the guest OS started (remains unchanged during testing).
"old: master_cycle_now:" The master_cycle_now value recorded when the guest OS started (remains unchanged during testing).
"new: master_kernel_ns:" The latest master_kernel_ns value at the time of reading.
"new: master_cycle_now:" The latest master_cycle_now value at the time of reading.
tsc: The rdtsc() value at the time of reading.
kvmclock_offset: The offset recorded by KVM_SET_CLOCK when the guest OS started (remains unchanged during testing).
Output:
"old: ns:" Time in nanoseconds calculated using the old master_kernel_ns/master_cycle_now.
"new: ns:" Time in nanoseconds calculated using the new master_kernel_ns/master_cycle_now.
"diff: ns:" (old ns - new ns), representing the time drift relative to the guest OS start time.
Test Script:
#!/bin/bash
qemu_pid=$(pidof qemu-system-x86_64)
while [ 1 ] ; do
echo "====================================="
echo "Guest OS running time: $(ps -p $qemu_pid -o etime= | awk '{print $1}')"
cat /sys/kernel/debug/kvm/*/pvclock
echo
sleep 10
done
Test Results:
Below are the first and last parts of a >2-hour test run.
As time progresses, the time drift calculated using the latest master_kernel_ns/master_cycle_now increases monotonically.
After 2 hours and 18 minutes, the guest OS time drifted by approximately 93 microseconds (roughly 11 ns of drift per second of runtime).
I have uploaded an image for a more intuitive visualization of the time drift:
https://postimg.cc/crCDWtD7
Is this a real problem?
If there is any fix patch, I’d be happy to test it. Thanks!
=====================================
guest os running time: 00:50
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 48092694964
new: master_kernel_ns: 63103244699
new: master_cycle_now: 147587790614
new: ns: 48092694425

tsc 147587790654
kvmclock_offset -15010550291
diff: ns: 539

=====================================
guest os running time: 01:00
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 58139026532
new: master_kernel_ns: 73149576143
new: master_cycle_now: 170694333104
new: ns: 58139025879

tsc 170694333168
kvmclock_offset -15010550291
diff: ns: 653

=====================================
guest os running time: 01:10
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 68183772122
new: master_kernel_ns: 83194321616
new: master_cycle_now: 193797227862
new: ns: 68183771357

tsc 193797227936
kvmclock_offset -15010550291
diff: ns: 765

=====================================
guest os running time: 01:20
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 78225289157
new: master_kernel_ns: 93235838545
new: master_cycle_now: 216892696976
new: ns: 78225288279

tsc 216892697034
kvmclock_offset -15010550291
diff: ns: 878

=====================================
guest os running time: 01:30
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 88268955340
new: master_kernel_ns: 103279504612
new: master_cycle_now: 239993109102
new: ns: 88268954349

tsc 239993109168
kvmclock_offset -15010550291
diff: ns: 991

=====================================
guest os running time: 01:40
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 98313212581
new: master_kernel_ns: 113323761740
new: master_cycle_now: 263094880668
new: ns: 98313211476

tsc 263094880732
kvmclock_offset -15010550291
diff: ns: 1105
.....
.....
=====================================
guest os running time: 02:17:11
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8229817213297
new: master_kernel_ns: 8244827670997
new: master_cycle_now: 18965537819524
new: ns: 8229817120748

tsc 18965537819622
kvmclock_offset -15010550291
diff: ns: 92549
=====================================
guest os running time: 02:17:21
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8239861074959
new: master_kernel_ns: 8254871532564
new: master_cycle_now: 18988638681302
new: ns: 8239860982297

tsc 18988638681358
kvmclock_offset -15010550291
diff: ns: 92662
=====================================
guest os running time: 02:17:31
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8249904622988
new: master_kernel_ns: 8264915080459
new: master_cycle_now: 19011738821632
new: ns: 8249904530213

tsc 19011738821736
kvmclock_offset -15010550291
diff: ns: 92775
=====================================
guest os running time: 02:17:41
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8259949369203
new: master_kernel_ns: 8274959826576
new: master_cycle_now: 19034841717872
new: ns: 8259949276315

tsc 19034841717942
kvmclock_offset -15010550291
diff: ns: 92888
=====================================
guest os running time: 02:17:51
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8269996849598
new: master_kernel_ns: 8285007306846
new: master_cycle_now: 19057950902658
new: ns: 8269996756597

tsc 19057950902756
kvmclock_offset -15010550291
diff: ns: 93001
=====================================
guest os running time: 02:18:02
old: master_kernel_ns: 15119778316
old: master_cycle_now: 37225912658
old: ns: 8280039094317
new: master_kernel_ns: 8295049551453
new: master_cycle_now: 19081048045430
new: ns: 8280039001203

tsc 19081048045526
kvmclock_offset -15010550291
diff: ns: 93114
pvclock debugfs patch
---
arch/x86/include/asm/kvm_host.h | 4 +++
arch/x86/kvm/x86.c | 29 +++++++++++++++-
b.sh | 1 +
virt/kvm/kvm_main.c | 75 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32ae3aa50c7e..5a82a69bfe7a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1411,6 +1411,10 @@ struct kvm_arch {
struct delayed_work kvmclock_update_work;
struct delayed_work kvmclock_sync_work;
+ u64 old_master_kernel_ns;
+ u64 old_master_cycle_now;
+ s64 old_kvmclock_offset;
+
struct kvm_xen_hvm_config xen_hvm_config;
/* reads protected by irq_srcu, writes by irq_lock */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab350bcd..a56511ed8c5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2819,7 +2819,7 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
* As with get_kvmclock_base_ns(), this counts from boot time, at the
* frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
*/
-static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
+int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
unsigned long seq;
@@ -2861,6 +2861,27 @@ static int do_monotonic(s64 *t, u64 *tsc_timestamp)
return mode;
}
+u64 mydebug_get_kvmclock_ns(u64 master_kernel_ns, u64 master_cycle_now, s64 kvmclock_offset, u64 tsc)
+{
+ struct pvclock_vcpu_time_info hv_clock;
+ u64 ret;
+
+ hv_clock.tsc_timestamp = master_cycle_now;
+ hv_clock.system_time = master_kernel_ns + kvmclock_offset;
+
+ /* both __this_cpu_read() and rdtsc() should be on the same cpu */
+ get_cpu();
+
+ kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
+ &hv_clock.tsc_shift,
+ &hv_clock.tsc_to_system_mul);
+ ret = __pvclock_read_cycles(&hv_clock, tsc);
+
+ put_cpu();
+
+ return ret;
+}
+
static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2988,6 +3009,10 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
host_tsc_clocksource = kvm_get_time_and_clockread(
&ka->master_kernel_ns,
&ka->master_cycle_now);
+ ka->old_master_kernel_ns = ka->master_kernel_ns;
+ ka->old_master_cycle_now = ka->master_cycle_now;
+ printk("MYDEBUG: old_master_kernel_ns = %llu, old_master_cycle_now = %llu\n",
+ ka->old_master_kernel_ns, ka->old_master_cycle_now);
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
&& !ka->backwards_tsc_observed
@@ -6989,6 +7014,8 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
else
now_raw_ns = get_kvmclock_base_ns();
ka->kvmclock_offset = data.clock - now_raw_ns;
+ ka->old_kvmclock_offset = ka->kvmclock_offset;
+ printk("MYDEBUG: old_kvmclock_offset = %lld\n", ka->old_kvmclock_offset);
kvm_end_pvclock_update(kvm);
return 0;
}
diff --git a/b.sh b/b.sh
new file mode 120000
index 000000000000..0ff9a93fd53f
--- /dev/null
+++ b/b.sh
@@ -0,0 +1 @@
+/home/mlin/build.upstream/b.sh
\ No newline at end of file
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba0327e2d0d3..d6b9a6e7275e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -399,6 +399,7 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
return mc->nobjs >= min ? 0 : -ENOMEM;
mc->objects[mc->nobjs++] = obj;
}
+
return 0;
}
@@ -998,6 +999,78 @@ static void kvm_destroy_vm_debugfs(struct kvm *kvm)
}
}
+extern int do_kvmclock_base(s64 *t, u64 *tsc_timestamp);
+extern u64 mydebug_get_kvmclock_ns(u64 master_kernel_ns, u64 master_cycle_now, s64 kvmclock_offset, u64 tsc);
+
+static ssize_t kvm_mydebug_pvclock_read(struct file *file, char __user *buf,
+ size_t len, loff_t *ppos)
+{
+ struct kvm *kvm = file->private_data;
+ struct kvm_arch *ka;
+ char buffer[256];
+ ssize_t ret, copied;
+ u64 new_master_kernel_ns;
+ u64 new_master_cycle_now;
+ u64 old_ns, new_ns;
+ u64 tsc;
+
+ if (!kvm) {
+ pr_err("file->private_data is NULL\n");
+ return -EINVAL;
+ }
+
+ ka = &kvm->arch;
+
+ do_kvmclock_base((s64 *)&new_master_kernel_ns, &new_master_cycle_now);
+
+ tsc = rdtsc();
+
+ old_ns = mydebug_get_kvmclock_ns(ka->old_master_kernel_ns, ka->old_master_cycle_now, ka->old_kvmclock_offset, tsc);
+ new_ns = mydebug_get_kvmclock_ns(new_master_kernel_ns, new_master_cycle_now, ka->old_kvmclock_offset, tsc);
+
+ ret = snprintf(buffer, sizeof(buffer),
+ "old: master_kernel_ns: %llu\n"
+ "old: master_cycle_now: %llu\n"
+ "old: ns: %llu\n"
+ "new: master_kernel_ns: %llu\n"
+ "new: master_cycle_now: %llu\n"
+ "new: ns: %llu\n\n"
+ "tsc %llu\n"
+ "kvmclock_offset %lld\n"
+ "diff: ns: %lld\n",
+ ka->old_master_kernel_ns, ka->old_master_cycle_now, old_ns,
+ new_master_kernel_ns, new_master_cycle_now, new_ns,
+ tsc, ka->old_kvmclock_offset,
+ (s64)(old_ns - new_ns)
+ );
+
+ if (ret < 0)
+ return ret;
+
+ if ((size_t)ret >= sizeof(buffer))
+ ret = sizeof(buffer) - 1;
+
+ if (*ppos >= ret)
+ return 0; /* EOF */
+
+ copied = min(len, (size_t)(ret - *ppos));
+
+ if (copy_to_user(buf, buffer + *ppos, copied)) {
+ pr_err("copy_to_user failed\n");
+ return -EFAULT;
+ }
+
+ *ppos += copied;
+
+ return copied;
+}
+
+static const struct file_operations kvm_pvclock_fops = {
+ .owner = THIS_MODULE,
+ .read = kvm_mydebug_pvclock_read,
+ .open = simple_open,
+};
+
static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
{
static DEFINE_MUTEX(kvm_debugfs_lock);
@@ -1063,6 +1136,8 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, const char *fdname)
&stat_fops_per_vm);
}
+ debugfs_create_file("pvclock", 0444, kvm->debugfs_dentry, kvm, &kvm_pvclock_fops);
+
kvm_arch_create_vm_debugfs(kvm);
return 0;
out_err:
+David
On Wed, Mar 19, 2025, Ming Lin wrote:
> Hi,
>
> After performing a live migration on a QEMU guest OS that had been running for over 30 days,
> we noticed that the guest OS time was more than 2 seconds behind the actual time.
>
> [...]
>
> Is this a real problem?
David can confirm, but I'm pretty sure the drift you are observing is addressed
by David's series to fix a plethora of kvmclock warts.
https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org
On Mon, 2025-03-24 at 15:54 -0700, Sean Christopherson wrote:
>
> David can confirm, but I'm pretty sure the drift you are observing is addressed
> by David's series to fix a plethora of kvmclock warts.
>
> https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org

Yes, that looks like exactly the problem my series is addressing. We
shouldn't update the pvclock so often. And if/when we do, we shouldn't
clamp it to some *other* clock which progresses at a different rate to
the pvclock, because that causes steps in the pvclock.
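To put a rough number on that step, a toy model (the rate mismatch is back-solved from the log above, ~93 µs over 02:18:02 of runtime; it is an illustration, not a measured value):

#include <stdio.h>

int main(void)
{
	/* Toy model: the guest pvclock and the host's CLOCK_MONOTONIC_RAW
	 * tick at very slightly different rates. Re-basing the pvclock on
	 * MONOTONIC_RAW (as each master clock update does) makes the
	 * accumulated difference visible as a step in guest time. */
	double rate_mismatch = 11.2e-9;	/* ns per ns, i.e. ~11.2 ns/s */
	double elapsed_s = 8282.0;	/* 02:18:02 of guest runtime */

	printf("step on update: ~%.0f ns\n", rate_mismatch * elapsed_s * 1e9);
	/* ~93 us, in line with the final "diff: ns: 93114" in the log */
	return 0;
}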
On Tue, Mar 25, 2025 at 4:07 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Mon, 2025-03-24 at 15:54 -0700, Sean Christopherson wrote:
> >
> > David can confirm, but I'm pretty sure the drift you are observing is addressed
> > by David's series to fix a plethora of kvmclock warts.
> >
> > https://lore.kernel.org/all/20240522001817.619072-1-dwmw2@infradead.org
>
> Yes, that looks like exactly the problem my series is addressing. We
> shouldn't update the pvclock so often. And if/when we do, we shouldn't
> clamp it to some *other* clock which progresses at a different rate to
> the pvclock, because that causes steps in the pvclock.
I applied the patch series on top of 6.9 cleanly and tested it with my
debug tool patch.
But it seems the time drift still increased monotonically.
Would you help take a look if the tool patch makes sense?
https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
The tool patch adds a KVM debugfs entry to trigger time calculations
and print the results.
See my first email for more detail.
Test Script:
#!/bin/bash
qemu_pid=$(pidof qemu-system-x86_64)
while [ 1 ] ; do
echo "====================================="
echo "Guest OS running time: $(ps -p $qemu_pid -o etime= | awk
'{print $1}')"
cat /sys/kernel/debug/kvm/*/pvclock
echo
sleep 10
done
Thanks,
Ming
On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> I applied the patch series on top of 6.9 cleanly and tested it with my
> debug tool patch.
> But it seems the time drift still increased monotonically.
>
> Would you help take a look if the tool patch makes sense?
> https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
>
> The tool patch adds a KVM debugfs entry to trigger time calculations
> and print the results.
> See my first email for more detail.

Your first message seemed to say that the problem occurred with live
migration. This message says "the time drift still increased
monotonically".

Trying to make sure I fully understand... the time drift between the
host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
monotonically *but* the guest only observes the change when its
master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?

Is this live migration from one VMM to another on the same host, so we
don't have to worry about the accuracy of the TSC itself? The guest TSC
remains consistent? And presumably your host does *have* a stable TSC,
and the guest's test case really ought to be checking the
PVCLOCK_TSC_STABLE_BIT to make sure of that?

If all the above assumptions/interpretations of mine are true, I still
think it's expected that your clock will jump on live migration
*unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
ioctls which were added in my patch series, specifically to preserve
the mathematical relationship between guest TSC and kvmclock across a
migration.
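For reference when modifying a VMM, a speculative sketch of that flow; the ioctl names are from David's series, but the payload (assumed here to be the guest-visible struct pvclock_vcpu_time_info) and the call sites are guesses, so the series itself is the authority:

/*
 * Speculative sketch only, not a working program: KVM_GET_CLOCK_GUEST and
 * KVM_SET_CLOCK_GUEST come from David's not-yet-merged series, and the
 * payload is assumed to be the guest-visible pvclock parameters.
 */
#include <sys/ioctl.h>

struct pvclock_vcpu_time_info;	/* as defined by the kernel's pvclock ABI */

/* Source host, during the final (paused) migration pass: capture the
 * exact TSC -> kvmclock relationship the guest currently sees. */
static int save_guest_clock(int vm_fd, struct pvclock_vcpu_time_info *pv)
{
	return ioctl(vm_fd, KVM_GET_CLOCK_GUEST, pv);
}

/* Destination host, before resuming vCPUs: reinstall that relationship
 * instead of letting KVM re-derive the clock from CLOCK_MONOTONIC_RAW. */
static int restore_guest_clock(int vm_fd, struct pvclock_vcpu_time_info *pv)
{
	return ioctl(vm_fd, KVM_SET_CLOCK_GUEST, pv);
}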
On Thu, Mar 27, 2025 at 1:10 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> > I applied the patch series on top of 6.9 cleanly and tested it with my
> > debug tool patch.
> > But it seems the time drift still increased monotonically.
> >
> > Would you help take a look if the tool patch makes sense?
> > https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
> >
> > The tool patch adds a KVM debugfs entry to trigger time calculations
> > and print the results.
> > See my first email for more detail.
>
> Your first message seemed to say that the problem occurred with live
> migration. This message says "the time drift still increased
> monotonically".

Yes, we discovered this issue in our production environment, where the
time inside the guest OS slowed down by more than 2 seconds. This
problem occurred both during live upgrades locally and live migrations
remotely. However, the issue is only noticeable after the guest OS has
been running for a long time, typically over 30 days.

Since 30 days is too long to wait, I wrote a debugfs tool to quickly
reproduce the original issue, but now I'm not sure if the tool is
working correctly.

>
> Trying to make sure I fully understand... the time drift between the
> host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
> monotonically *but* the guest only observes the change when its
> master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
> and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?

Yes, we are using the 5.4 kernel and have verified that the guest OS
time remains correct after live upgrades/migrations, as long as
master_kernel_ns / master_cycle_now are not updated (i.e., if the old
master_kernel_ns / master_cycle_now values are retained).

>
> Is this live migration from one VMM to another on the same host, so we
> don't have to worry about the accuracy of the TSC itself? The guest TSC
> remains consistent? And presumably your host does *have* a stable TSC,
> and the guest's test case really ought to be checking the
> PVCLOCK_TSC_STABLE_BIT to make sure of that?

The live migration is from one VMM to another on a remote host, and we
have also observed the same issue during live upgrades on the same host.

>
> If all the above assumptions/interpretations of mine are true, I still
> think it's expected that your clock will jump on live migration
> *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> ioctls which were added in my patch series, specifically to preserve
> the mathematical relationship between guest TSC and kvmclock across a
> migration.
>

We are planning to test the patches on a 6.9 kernel (where they can be
applied cleanly) and modify the live upgrade/migration code to use the
new KVM_[GS]ET_CLOCK_GUEST ioctls.

BTW, what is the plan for upstreaming these patches?

Thanks,
Ming
On Fri, 2025-03-28 at 11:30 -0700, Ming Lin wrote:
> >
> > Is this live migration from one VMM to another on the same host, so we
> > don't have to worry about the accuracy of the TSC itself? The guest TSC
> > remains consistent? And presumably your host does *have* a stable TSC,
> > and the guest's test case really ought to be checking the
> > PVCLOCK_TSC_STABLE_BIT to make sure of that?
>
> The live migration is from one VMM to another on a remote host, and we
> have also observed the same issue during live upgrades on the same host.

Moving to a remote host also requires that you get the guest TSC to be
reasonably synchronised on the destination. Which is another litany of
sadness, especially if you have TSC scaling in the mix. Or even if your
two "identical" hosts calculated a slightly different TSC frequency
when they measured it at first boot.

In that latter case, the mul/scale factors advertised to the guest in
the pvclock will be *slightly* different on the new host. Your test in
the guest would then fail if it requires that the pvclock be
*identical*. The actual criterion is that the result should be
identical at the time of the live migration (when the TSC frequency
effectively changes). The code currently requires that the old and new
TSC frequencies are within ±1kHz of each other.

> >
> > If all the above assumptions/interpretations of mine are true, I still
> > think it's expected that your clock will jump on live migration
> > *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> > ioctls which were added in my patch series, specifically to preserve
> > the mathematical relationship between guest TSC and kvmclock across a
> > migration.
>
> We are planning to test the patches on a 6.9 kernel (where they can be
> applied cleanly) and modify the live upgrade/migration code to use the new
> KVM_[GS]ET_CLOCK_GUEST ioctls.
>
> BTW, what is the plan for upstreaming these patches?

I need to find the time to rebase, rework and retest them. You're
doing some of the testing and increasing the motivation... I'll see if
I can get it done in the next week or three.
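To make the frequency-mismatch point concrete, a small sketch with a deliberately simplified scale calculation (the real kvm_get_time_scale() also picks a shift so the multiplier fits in 32 bits):

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for kvm_get_time_scale(): shift is effectively 0. */
static uint64_t ns_for(uint64_t tsc_delta, uint64_t tsc_hz)
{
	uint64_t mul = (1000000000ULL << 32) / tsc_hz; /* ns/cycle, scaled by 2^32 */

	return (tsc_delta * mul) >> 32;	/* fine for ~1 s deltas; overflows beyond */
}

int main(void)
{
	/* Hypothetical: two hosts whose boot-time TSC measurements differ by 1 kHz. */
	uint64_t hz_a = 2500000000ULL;
	uint64_t hz_b = 2500001000ULL;
	uint64_t one_sec = 2500000000ULL;	/* ~1 s worth of cycles on host A */

	printf("host A: %llu ns, host B: %llu ns\n",
	       (unsigned long long)ns_for(one_sec, hz_a),
	       (unsigned long long)ns_for(one_sec, hz_b));
	/*
	 * ~400 ns/s apart: the two pvclocks can be made to agree at the
	 * instant of migration, but not at every point afterwards, which
	 * is why the check is "identical at migration time", not
	 * "identical forever".
	 */
	return 0;
}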