Guest VMs can be provided with a para-virtualized clock source to
perform timekeeping. A KVM guest can map in a PV clock via the
MSR_KVM_SYSTEM_TIME/MSR_KVM_SYSTEM_TIME_NEW virtualized MSRs.
Whereas on a Xen guest this can be provided via the vcpu/shared
info pages.
These PV clocks both use a common structure which is mapped between
host <-> guest to provide the PVTI (paravirtual time information)
for the clock. This reference information is a guest TSC timestamp
and a host system time at a SINGULAR point in time.
If KVM decides to update the reference information due to a
KVM_REQ_MASTERCLOCK_UPDATE, a drift between the guest TSC and
the PV clock can be observed. This is exacerbated when the guest
TSC is also scaled.
If the reference guest TSC & system time within the structure stay
the same there is no potential for a drift between the TSC and PV
clock.
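For reference, the guest-side conversion from TSC to nanoseconds via the
PVTI is roughly the following (a simplified sketch of the pvclock ABI, not
the kernel's exact code; the real structure is pvclock_vcpu_time_info with
version/flags/padding fields omitted here):

#include <stdint.h>

struct pvti {                        /* simplified pvclock_vcpu_time_info */
	uint64_t tsc_timestamp;      /* guest TSC at the reference point */
	uint64_t system_time;        /* host time at the same point, in ns */
	uint32_t tsc_to_system_mul;  /* scale factor, 32.32 fixed point */
	int8_t   tsc_shift;
};

static uint64_t pvti_to_ns(const struct pvti *p, uint64_t guest_tsc)
{
	uint64_t delta = guest_tsc - p->tsc_timestamp;

	if (p->tsc_shift >= 0)
		delta <<= p->tsc_shift;
	else
		delta >>= -p->tsc_shift;

	/* system_time + ((delta * mul) >> 32) */
	return p->system_time +
	       (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}

If KVM regenerates the (tsc_timestamp, system_time) reference pair, any
imprecision in the regenerated values shows up as a small step in the value
returned above, which is the drift described here.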
This series adds two patches: one adding an API/ioctl to allow a VMM to
perform a correction/fixup of the PVTI structure when it knows that KVM
may have updated the KVM clock information, and a second to verify that
the drift is present and is then corrected.
Jack Allister (2):
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup
KVM: selftests: Add KVM/PV clock selftest to prove timer drift
correction
Documentation/virt/kvm/api.rst | 43 ++++
arch/x86/kvm/x86.c | 87 +++++++
include/uapi/linux/kvm.h | 3 +
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 223 ++++++++++++++++++
5 files changed, 357 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
base-commit: 8cb4a9a82b21623dbb4b3051dd30d98356cf95bc
--
2.40.1
Guest VMs can be provided with a para-virtualized clock source to perform
timekeeping. A KVM guest can map in a PV clock via the
MSR_KVM_SYSTEM_TIME/MSR_KVM_SYSTEM_TIME_NEW virtualized MSRs. Whereas on a
Xen guest this can be provided via the vcpu/shared info pages.

These PV clocks both use a common structure which is mapped between
host <-> guest to provide the PVTI (paravirtual time information) for the
clock. This reference information is a guest TSC timestamp and a host
system time at a singular point in time.

Upon a live-update of a host or live-migration of an instance the PVTI may
be recalculated by KVM. Using the existing KVM_[GS]ET_CLOCK functionality
the relationship between the TSC and PV clock cannot be precisely saved and
restored by userspace.

This series adds two patches: one to add a new interface allowing a
VMM/userspace to perform a correction of the PVTI structure, and a second
to verify the imprecision after a simulation of a live-update/migration and
then verify the correction is to within ±1ns.

v1: https://lore.kernel.org/all/20240408220705.7637-1-jalliste@amazon.com/

v2:
 - Moved new IOCTLs from vm to vcpu level.
 - Added extra error checks as suggested by Dongli Zhang / David Woodhouse.
 - Added on-demand calculation of PVTI if none is currently present in the vcpu.
 - Added proper synchronization for the PV clock during correction.
 - Added an option to test without TSC scaling in the selftest.
 - Updated commit messages to better explain the situation (thanks David).

Jack Allister (2):
  KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  KVM: selftests: Add KVM/PV clock selftest to prove timer correction

 Documentation/virt/kvm/api.rst                |  37 ++++
 arch/x86/kvm/x86.c                            | 124 +++++++++++
 include/uapi/linux/kvm.h                      |   3 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/pvclock_test.c       | 192 ++++++++++++++++++
 5 files changed, 357 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c

base-commit: 8cb4a9a82b21623dbb4b3051dd30d98356cf95bc
--
2.40.1
In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.
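Roughly speaking (a sketch of the relationship rather than the exact kernel
code), in master clock mode:

	kvmclock_ns = master_kernel_ns + kvmclock_offset
	              + to_ns(guest_tsc - kvm_read_l1_tsc(vcpu, master_cycle_now))

where to_ns() applies the vCPU's tsc_to_system_mul/tsc_shift scaling.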
The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.
So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.
Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.
To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.
The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate the difference.
Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.
On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).
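For illustration, the save side in a VMM might look something like this
(a sketch only; vm_fd/vcpu_fd, the migration stream handling and the helper
name are the VMM's own, not part of this patch; pvclock_vcpu_time_info
comes from <asm/pvclock-abi.h> as in the selftest):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <asm/pvclock-abi.h>

/* Capture both representations of the clock on the source host. */
static int save_clock_state(int vm_fd, int vcpu_fd,
			    struct pvclock_vcpu_time_info *pvti, bool *have_pvti,
			    struct kvm_clock_data *legacy)
{
	/* Precise form; the vCPU ioctl fails if not in master clock mode. */
	*have_pvti = (ioctl(vcpu_fd, KVM_GET_CLOCK_GUEST, pvti) == 0);

	/* Always capture the legacy clock too, for the fallback path. */
	return ioctl(vm_fd, KVM_GET_CLOCK, legacy);
}

On resume, KVM_SET_CLOCK_GUEST would be tried first and KVM_SET_CLOCK used
only if it fails (or migration refused, as above).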
(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)
Suggested-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Jack Allister <jalliste@amazon.com>
CC: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
Documentation/virt/kvm/api.rst | 37 ++++++++++
arch/x86/kvm/x86.c | 124 +++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 3 +
3 files changed, 164 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..80fcd93bba1b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
See KVM_SET_USER_MEMORY_REGION2 for additional details.
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks,
+in precisely the form advertised to the guest vCPU, which gives parameters
+for a direct conversion from a guest TSC value to nanoseconds.
+
+When the KVM clock is not in "master clock" mode, for example because the
+host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
+is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
+In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
+pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
+arithmetic relationship between guest TSC and KVM clock to be preserved by
+userspace across migration.
+
+When the KVM clock is not in "master clock" mode, and the KVM clock is actually
+defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
+may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
+choose to fail, denying migration to a host whose TSC is misbehaving.
+
5. The kvm_run structure
========================
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..d5cae3ead04d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5859,6 +5859,124 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}
+static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+ struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
+ struct pvclock_vcpu_time_info local_pvti = { 0 };
+ struct kvm_arch *ka = &v->kvm->arch;
+ uint64_t host_tsc, guest_tsc;
+ bool use_master_clock;
+ uint64_t kernel_ns;
+ unsigned int seq;
+
+ /*
+ * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
+ * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
+ */
+ if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
+ return -EINVAL;
+
+ /*
+ * Querying needs to be performed in a seqcount loop as it's possible
+ * another vCPU has triggered an update of the master clock. If so we
+ * should store the host TSC & time at this point.
+ */
+ do {
+ seq = read_seqcount_begin(&ka->pvclock_sc);
+ use_master_clock = ka->use_master_clock;
+ if (use_master_clock) {
+ host_tsc = ka->master_cycle_now;
+ kernel_ns = ka->master_kernel_ns;
+ }
+ } while (read_seqcount_retry(&ka->pvclock_sc, seq));
+
+ if (!use_master_clock)
+ return -EINVAL;
+
+ /*
+ * It's possible that this vCPU doesn't have a HVCLOCK configured
+ * but the other vCPUs may. If this is the case calculate based
+ * upon the time gathered in the seqcount but do not update the
+ * vCPU specific PVTI. If we have one, then use that.
+ */
+ if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
+ guest_tsc = kvm_read_l1_tsc(v, host_tsc);
+
+ local_pvti.tsc_timestamp = guest_tsc;
+ local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
+ } else {
+ local_pvti = *vcpu_pvti;
+ }
+
+ if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+ struct pvclock_vcpu_time_info dummy_pvti;
+ struct pvclock_vcpu_time_info orig_pvti;
+ struct kvm *kvm = v->kvm;
+ struct kvm_arch *ka = &kvm->arch;
+ uint64_t clock_orig, clock_dummy;
+ uint64_t host_tsc, guest_tsc;
+ int64_t kernel_ns;
+ int64_t correction;
+ int rc = 0;
+
+ /*
+ * If a constant TSC is not provided by the host then KVM will
+ * be using CLOCK_MONOTONIC_RAW, correction using this is not
+ * precise and as such we can never sync to the precision we
+ * are requiring.
+ */
+ if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
+ return -EINVAL;
+
+ if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
+ return -EFAULT;
+
+ kvm_hv_request_tsc_page_update(kvm);
+ kvm_start_pvclock_update(kvm);
+ pvclock_update_vm_gtod_copy(kvm);
+
+ if (!ka->use_master_clock) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Sample the kernel time and host TSC at a singular point.
+ * We then calculate the guest TSC using this exact point in time.
+ * From here we can then determine the delta using the
+ * PV time info requested from the user and what we currently have
+ * using the fixed point in time. This delta is then used as a
+ * correction factor to subtract from the clock offset.
+ */
+ if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ guest_tsc = kvm_read_l1_tsc(v, host_tsc);
+
+ dummy_pvti = orig_pvti;
+ dummy_pvti.tsc_timestamp = guest_tsc;
+ dummy_pvti.system_time = kernel_ns + ka->kvmclock_offset;
+
+ clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
+ clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
+
+ correction = clock_orig - clock_dummy;
+ ka->kvmclock_offset += correction;
+
+out:
+ kvm_end_pvclock_update(kvm);
+ return rc;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -6239,6 +6357,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
srcu_read_unlock(&vcpu->kvm->srcu, idx);
break;
}
+ case KVM_SET_CLOCK_GUEST:
+ r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
+ break;
+ case KVM_GET_CLOCK_GUEST:
+ r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
+ break;
#ifdef CONFIG_KVM_HYPERV
case KVM_GET_SUPPORTED_HV_CPUID:
r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};
+#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
+
#endif /* __LINUX_KVM_H */
--
2.40.1
On Wed, 2024-04-10 at 09:52 +0000, Jack Allister wrote:
>
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> + struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
> + struct pvclock_vcpu_time_info local_pvti = { 0 };
> + struct kvm_arch *ka = &v->kvm->arch;
> + uint64_t host_tsc, guest_tsc;
> + bool use_master_clock;
> + uint64_t kernel_ns;
> + unsigned int seq;
> +
> + /*
> + * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
> + * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
> + */
> + if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> + return -EINVAL;
> +
> + /*
> + * Querying needs to be performed in a seqcount loop as it's possible
> + * another vCPU has triggered an update of the master clock. If so we
> + * should store the host TSC & time at this point.
> + */
> + do {
> + seq = read_seqcount_begin(&ka->pvclock_sc);
> + use_master_clock = ka->use_master_clock;
> + if (use_master_clock) {
> + host_tsc = ka->master_cycle_now;
> + kernel_ns = ka->master_kernel_ns;
> + }
> + } while (read_seqcount_retry(&ka->pvclock_sc, seq));
> +
> + if (!use_master_clock)
> + return -EINVAL;
> +
> + /*
> + * It's possible that this vCPU doesn't have a HVCLOCK configured
> + * but the other vCPUs may. If this is the case calculate based
> + * upon the time gathered in the seqcount but do not update the
> + * vCPU specific PVTI. If we have one, then use that.
> + */
> + if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
|| !kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)
Otherwise you may be using out of date information.
> + guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> + local_pvti.tsc_timestamp = guest_tsc;
> + local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
This is missing the scale information in tsc_to_system_mul and
tsc_shift. Is there a reason we can't just call kvm_guest_time_update()
from here? (I think we looked at using it for the *SET* function, but
did we look at doing so for GET?)
> + } else {
> + local_pvti = *vcpu_pvti;
> + }
> +
> + if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> + struct pvclock_vcpu_time_info dummy_pvti;
> + struct pvclock_vcpu_time_info orig_pvti;
> + struct kvm *kvm = v->kvm;
> + struct kvm_arch *ka = &kvm->arch;
> + uint64_t clock_orig, clock_dummy;
> + uint64_t host_tsc, guest_tsc;
> + int64_t kernel_ns;
> + int64_t correction;
> + int rc = 0;
> +
> + /*
> + * If a constant TSC is not provided by the host then KVM will
> + * be using CLOCK_MONOTONIC_RAW, correction using this is not
> + * precise and as such we can never sync to the precision we
> + * are requiring.
> + */
> + if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> + return -EINVAL;
> +
> + if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> + return -EFAULT;
> +
> + kvm_hv_request_tsc_page_update(kvm);
> + kvm_start_pvclock_update(kvm);
> + pvclock_update_vm_gtod_copy(kvm);
> +
> + if (!ka->use_master_clock) {
> + rc = -EINVAL;
> + goto out;
> + }
> +
> + /*
> + * Sample the kernel time and host TSC at a singular point.
> + * We then calculate the guest TSC using this exact point in time.
> + * From here we can then determine the delta using the
> + * PV time info requested from the user and what we currently have
> + * using the fixed point in time. This delta is then used as a
> + * correction factor to subtract from the clock offset.
> + */
> + if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc)) {
> + rc = -EFAULT;
> + goto out;
> + }
> +
> + guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> + dummy_pvti = orig_pvti;
> + dummy_pvti.tsc_timestamp = guest_tsc;
> + dummy_pvti.system_time = kernel_ns + ka->kvmclock_offset;
> +
> + clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> + clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
>
In both cases here you're using the scale information directly from
userspace... that you forgot to fill in for them earlier. I think we
should have a sanity check on it, to ensure that it matches the TSC
frequency of the vCPU?
> + correction = clock_orig - clock_dummy;
> + ka->kvmclock_offset += correction;
> +
> +out:
> + kvm_end_pvclock_update(kvm);
> + return rc;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -6239,6 +6357,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
> srcu_read_unlock(&vcpu->kvm->srcu, idx);
> break;
> }
> + case KVM_SET_CLOCK_GUEST:
> + r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> + break;
> + case KVM_GET_CLOCK_GUEST:
> + r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> + break;
> #ifdef CONFIG_KVM_HYPERV
> case KVM_GET_SUPPORTED_HV_CPUID:
> r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
> __u64 reserved[6];
> };
>
> +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd6, struct pvclock_vcpu_time_info)
> +
> #endif /* __LINUX_KVM_H */
On 10/04/2024 10:52, Jack Allister wrote:
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based on
> a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
>
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which is
> necessarily imprecise.
>
> So on live update, the guest TSC can remain cycle accurate at precisely the
> same offset from the host TSC, but there is no way for userspace to restore
> the KVM clock accurately.
>
> Even on live migration to a new host, where the accuracy of the guest time-
> keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest TSC,
> and not suffer an *additional* discontinuity.
>
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
> save and restore the actual PV clock info in pvclock_vcpu_time_info.
>
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed through
> the user-provided pvclock structure to generate the *intended* KVM clock
> value at that point in time, and through the *actual* KVM clock calculation.
> Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.
>
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
>
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
> *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
> migration data (unless the intent is to refuse to resume on a host with bad
> TSC).
>
> (It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
> non-masterclock mode, as that mode is necessarily imprecise anyway. The
> explicit fallback allows userspace to deliberately fail migration to a host
> with misbehaving TSC where master clock mode wouldn't be active.)
>
> Suggested-by: David Woodhouse <dwmw2@infradead.org>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> CC: Paul Durrant <paul@xen.org>
> CC: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> Documentation/virt/kvm/api.rst | 37 ++++++++++
> arch/x86/kvm/x86.c | 124 +++++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 3 +
> 3 files changed, 164 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..80fcd93bba1b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock not is in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL.
EINVAL doesn't seem appropriate. ENOTSUP perhaps? Same for getting the
clock info I suppose.
> Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
> 5. The kvm_run structure
> ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 47d9f03b7778..d5cae3ead04d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5859,6 +5859,124 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> + struct pvclock_vcpu_time_info *vcpu_pvti = &v->arch.hv_clock;
> + struct pvclock_vcpu_time_info local_pvti = { 0 };
> + struct kvm_arch *ka = &v->kvm->arch;
> + uint64_t host_tsc, guest_tsc;
> + bool use_master_clock;
> + uint64_t kernel_ns;
> + unsigned int seq;
> +
> + /*
> + * CLOCK_MONOTONIC_RAW is not suitable for GET/SET API,
> + * see kvm_vcpu_ioctl_set_clock_guest equivalent comment.
> + */
> + if (!static_cpu_has(X86_FEATURE_CONSTANT_TSC))
> + return -EINVAL;
> +
> + /*
> + * Querying needs to be performed in a seqcount loop as it's possible
> + * another vCPU has triggered an update of the master clock. If so we
> + * should store the host TSC & time at this point.
> + */
> + do {
> + seq = read_seqcount_begin(&ka->pvclock_sc);
> + use_master_clock = ka->use_master_clock;
> + if (use_master_clock) {
> + host_tsc = ka->master_cycle_now;
> + kernel_ns = ka->master_kernel_ns;
> + }
> + } while (read_seqcount_retry(&ka->pvclock_sc, seq));
You could bail from the loop if `use_master_clock` is false, couldn't you?
> +
> + if (!use_master_clock)
> + return -EINVAL;
> +
> + /*
> + * It's possible that this vCPU doesn't have a HVCLOCK configured
> + * but the other vCPUs may. If this is the case calculate based
> + * upon the time gathered in the seqcount but do not update the
> + * vCPU specific PVTI. If we have one, then use that.
Given this is a per-vCPU ioctl, why not fail in the case the vCPU
doesn't have HVCLOCK configured? Or is your intention that a GET/SET
should always work if TSC is stable?
> + */
> + if (!vcpu_pvti->tsc_timestamp && !vcpu_pvti->system_time) {
> + guest_tsc = kvm_read_l1_tsc(v, host_tsc);
> +
> + local_pvti.tsc_timestamp = guest_tsc;
> + local_pvti.system_time = kernel_ns + ka->kvmclock_offset;
> + } else {
> + local_pvti = *vcpu_pvti;
> + }
> +
> + if (copy_to_user(argp, &local_pvti, sizeof(local_pvti)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
On 10 April 2024 11:29:13 BST, Paul Durrant <xadimgnik@gmail.com> wrote:
>On 10/04/2024 10:52, Jack Allister wrote:
>> + * It's possible that this vCPU doesn't have a HVCLOCK configured
>> + * but the other vCPUs may. If this is the case calculate based
>> + * upon the time gathered in the seqcount but do not update the
>> + * vCPU specific PVTI. If we have one, then use that.
>
>Given this is a per-vCPU ioctl, why not fail in the case the vCPU doesn't have HVCLOCK configured? Or is your intention that a GET/SET should always work if TSC is stable?

It definitely needs to work for SET even when the vCPU hasn't been run yet (and doesn't have a hvclock in vcpu->arch.hv_clock). I think it should ideally work for GET too.

I did try arguing that if the vCPU hasn't set up its pvclock then why would it care if it's inaccurate? But there's a pathological case of AMP where one vCPU is dedicated to an RTOS or something, and only the *other* vCPUs bring up their pvclock.

This of course brings you to the question of why we have it as a per-vCPU ioctl at all? It only needs to be done *once* to get/set the KVM-wide clock. And a function of *this* vCPU's TSC. And the point is that if we're in use_master_clock mode, that's consistent across *all* vCPUs.

There would be a bunch of additional complexity in making it a VM ioctl though, especially around the question of what to do if userspace tries to restore it when there *aren't* any vCPUs yet. So we didn't do that.
On 10/04/2024 13:09, David Woodhouse wrote:
> On 10 April 2024 11:29:13 BST, Paul Durrant <xadimgnik@gmail.com> wrote:
>> On 10/04/2024 10:52, Jack Allister wrote:
>>> + * It's possible that this vCPU doesn't have a HVCLOCK configured
>>> + * but the other vCPUs may. If this is the case calculate based
>>> + * upon the time gathered in the seqcount but do not update the
>>> + * vCPU specific PVTI. If we have one, then use that.
>>
>> Given this is a per-vCPU ioctl, why not fail in the case the vCPU doesn't have HVCLOCK configured? Or is your intention that a GET/SET should always work if TSC is stable?
>
> It definitely needs to work for SET even when the vCPU hasn't been run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).

So would it make sense to set up hvclock earlier?
On Wed, 2024-04-10 at 13:43 +0100, Paul Durrant wrote:
> On 10/04/2024 13:09, David Woodhouse wrote:
> > On 10 April 2024 11:29:13 BST, Paul Durrant <xadimgnik@gmail.com>
> > wrote:
> > > On 10/04/2024 10:52, Jack Allister wrote:
> > > > + * It's possible that this vCPU doesn't have a HVCLOCK
> > > > configured
> > > > + * but the other vCPUs may. If this is the case
> > > > calculate based
> > > > + * upon the time gathered in the seqcount but do not
> > > > update the
> > > > + * vCPU specific PVTI. If we have one, then use that.
> > >
> > > Given this is a per-vCPU ioctl, why not fail in the case the vCPU
> > > doesn't have HVCLOCK configured? Or is your intention that a
> > > GET/SET should always work if TSC is stable?
> >
> > It definitely needs to work for SET even when the vCPU hasn't been
> > run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).
>
> So would it make sense to set up hvclock earlier?
Yeah, and I think we can do so just by calling kvm_guest_time_update().
The GET function can look like this:
static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;
/*
* If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
* never been generated at all, call kvm_guest_time_update() to do so.
* Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
* having been written.
*/
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
if (kvm_guest_time_update(v))
return -EINVAL;
}
/*
* PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
* KVM clock is defined in terms of the guest TSC. Otherwise, it is
* is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
* use the legacy KVM_[GS]ET_CLOCK to migrate it.
*/
if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
return -EINVAL;
if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
return -EFAULT;
return 0;
}
And the SET function doesn't even *need* the existing vCPU's hv_clock,
because we know damn well that the number of TSC cycles elapsed between
the reference time point and... erm... the reference time point... is
zero.
And everything *else* the hv_clock was being used for, either in Jack's
version or my own (where I used it for checking PVCLOCK_TSC_STABLE_BIT
and even used my new hvclock_to_hz() on it), can be done differently
too.
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks
(Try not to look at the 'Improve accuracy of KVM clock' one. It'll just
make you sad. Let Jack and me get to the end of the TODO list and you
can have all the sadness in one go like pulling a band-aid off.)
static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
struct pvclock_vcpu_time_info user_hv_clock;
struct kvm *kvm = v->kvm;
struct kvm_arch *ka = &kvm->arch;
uint64_t curr_tsc_hz, user_tsc_hz;
uint64_t user_clk_ns;
uint64_t guest_tsc;
int rc = 0;
if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
return -EFAULT;
if (!user_hv_clock.tsc_to_system_mul)
return -EINVAL;
user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
user_hv_clock.tsc_shift);
kvm_hv_request_tsc_page_update(kvm);
kvm_start_pvclock_update(kvm);
pvclock_update_vm_gtod_copy(kvm);
/*
* If not in use_master_clock mode, do not allow userspace to set
* the clock in terms of the guest TSC. Userspace should either
* fail the migration (to a host with suboptimal TSCs), or should
* knowingly restore the KVM clock using KVM_SET_CLOCK instead.
*/
if (!ka->use_master_clock) {
rc = -EINVAL;
goto out;
}
curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
if (unlikely(curr_tsc_hz == 0)) {
rc = -EINVAL;
goto out;
}
if (kvm_caps.has_tsc_control)
curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
v->arch.l1_tsc_scaling_ratio);
/*
* The scaling factors in the hv_clock do not depend solely on the
* TSC frequency *requested* by userspace. They actually use the
* host TSC frequency that was measured/detected by the host kernel,
* scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
*
* So a sanity check that they *precisely* match would have false
* negatives. Allow for a discrepancy of 1 kHz either way.
*/
if (user_tsc_hz < curr_tsc_hz - 1000 ||
user_tsc_hz > curr_tsc_hz + 1000) {
rc = -ERANGE;
goto out;
}
/*
* The call to pvclock_update_vm_gtod_copy() has created a new time
* reference point in ka->master_cycle_now and ka->master_kernel_ns.
*
* Calculate the guest TSC at that moment, and the corresponding KVM
* clock value according to user_hv_clock. The value according to the
* current hv_clock will of course be ka->master_kernel_ns since no
* TSC cycles have elapsed.
*
* Adjust ka->kvmclock_offset to the delta, so that both definitions
* of the clock give precisely the same reading at the reference time.
*/
guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
out:
kvm_end_pvclock_update(kvm);
return rc;
}
A VM's KVM/PV clock has an inherent relationship to its (guest) TSC. When
either the host system live-updates or the VM is live-migrated this pairing
of the two clock sources should stay the same.
In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
in either a live-update/migration scenario.
This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.
The following steps are made to verify there is a delta in the relationship
and that it can be corrected:
1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).
The guest then records a single TSC reference point in time and uses it to
calculate 3 KVM clock values utilizing the 3 PVTI recorded prior. Let's
call each clock value CLK[0-2].
In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.
A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
whereas without scaling this is observed at ~180ns.
With the correction it should be possible to achieve a delta of ±1ns.
An option to enable guest TSC scaling is available via invoking the tester
with -s/--scale-tsc.
Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1
Signed-off-by: Jack Allister <jalliste@amazon.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++++++++++
2 files changed, 193 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 741c7dc16afc..02ee1205bbed 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
new file mode 100644
index 000000000000..376ffb730a53
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2024, Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <asm/pvclock.h>
+#include <asm/pvclock-abi.h>
+#include <sys/stat.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+enum {
+ STAGE_FIRST_BOOT,
+ STAGE_UNCORRECTED,
+ STAGE_CORRECTED
+};
+
+#define KVMCLOCK_GPA 0xc0000000ull
+#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(vm_paddr_t pvti_pa)
+{
+ /*
+ * We need a way to trigger KVM to update the fields
+ * in the PV time info. The easiest way to do this is
+ * to temporarily switch to the old KVM system time
+ * method and then switch back to the new one.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+}
+
+static void guest_code(vm_paddr_t pvti_pa)
+{
+ struct pvclock_vcpu_time_info *pvti_va =
+ (struct pvclock_vcpu_time_info *)pvti_pa;
+
+ struct pvclock_vcpu_time_info pvti_boot;
+ struct pvclock_vcpu_time_info pvti_uncorrected;
+ struct pvclock_vcpu_time_info pvti_corrected;
+ uint64_t cycles_boot;
+ uint64_t cycles_uncorrected;
+ uint64_t cycles_corrected;
+ uint64_t tsc_guest;
+
+ /*
+ * Setup the KVMCLOCK in the guest & store the original
+ * PV time structure that is used.
+ */
+ wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+ pvti_boot = *pvti_va;
+ GUEST_SYNC(STAGE_FIRST_BOOT);
+
+ /*
+ * Trigger an update of the PVTI, if we calculate
+ * the KVM clock using this structure we'll see
+ * a delta from the TSC.
+ */
+ trigger_pvti_update(pvti_pa);
+ pvti_uncorrected = *pvti_va;
+ GUEST_SYNC(STAGE_UNCORRECTED);
+
+ /*
+ * The test should have triggered the correction by this
+ * point in time. We have a copy of each of the PVTI structs
+ * at each stage now.
+ *
+ * Let's sample the timestamp at a SINGLE point in time and
+ * then calculate what the KVM clock would be using the PVTI
+ * from each stage.
+ *
+ * Then return each of these values to the tester.
+ */
+ pvti_corrected = *pvti_va;
+ tsc_guest = rdtsc();
+
+ cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
+ cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+ cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+ GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
+ cycles_corrected, 0);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ struct pvclock_vcpu_time_info pvti_before;
+ uint64_t before, uncorrected, corrected;
+ int64_t delta_uncorrected, delta_corrected;
+ struct ucall uc;
+ uint64_t ucall_reason;
+
+ /* Loop through each stage of the test. */
+ while (true) {
+
+ /* Start/restart the running vCPU code. */
+ vcpu_run(vcpu);
+ TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+ /* Retrieve and verify our stage. */
+ ucall_reason = get_ucall(vcpu, &uc);
+ TEST_ASSERT(ucall_reason == UCALL_SYNC,
+ "Unhandled ucall reason=%lu",
+ ucall_reason);
+
+ /* Run host specific code relating to stage. */
+ switch (uc.args[1]) {
+ case STAGE_FIRST_BOOT:
+ /* Store the KVM clock values before an update. */
+ vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+ /* Sleep for a set amount of time to increase delta. */
+ sleep(5);
+ break;
+
+ case STAGE_UNCORRECTED:
+ /* Restore the KVM clock values. */
+ vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
+ break;
+
+ case STAGE_CORRECTED:
+ /* Query the clock information and verify delta. */
+ before = uc.args[2];
+ uncorrected = uc.args[3];
+ corrected = uc.args[4];
+
+ delta_uncorrected = before - uncorrected;
+ delta_corrected = before - corrected;
+
+ pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
+ before, uncorrected, corrected);
+
+ pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
+ delta_uncorrected, delta_corrected);
+
+ TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
+ "larger than expected delta detected = %ld", delta_corrected);
+ return;
+ }
+ }
+}
+
+static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+ unsigned int gpages;
+
+ gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ KVMCLOCK_GPA, 1, gpages, 0);
+ virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
+
+ vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
+}
+
+static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
+{
+ uint64_t tsc_khz;
+
+ tsc_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+ pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
+ tsc_khz /= 2;
+ vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+}
+
+int main(int argc, char *argv[])
+{
+ struct kvm_vcpu *vcpu;
+ struct kvm_vm *vm;
+ bool scale_tsc;
+
+ scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
+ !strncmp(argv[1], "--scale-tsc", 10));
+
+ TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+ vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+ configure_pvclock(vm, vcpu);
+
+ if (scale_tsc)
+ configure_scaled_tsc(vcpu);
+
+ run_test(vm, vcpu);
+
+ return 0;
+}
--
2.40.1
On 4/10/24 02:52, Jack Allister wrote:
> A VM's KVM/PV clock has an inherent relationship to it's TSC (guest). When
> either the host system live-updates or the VM is live-migrated this pairing
> of the two clock sources should stay the same.
>
[snip]
> +
> +int main(int argc, char *argv[])
> +{
> + struct kvm_vcpu *vcpu;
> + struct kvm_vm *vm;
> + bool scale_tsc;
> +
> + scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
> + !strncmp(argv[1], "--scale-tsc", 10));
> +
> + TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
> +
> + vm = vm_create_with_one_vcpu(&vcpu, guest_code);
> +
> + configure_pvclock(vm, vcpu);
> +
> + if (scale_tsc)
> + configure_scaled_tsc(vcpu);
> +
> + run_test(vm, vcpu);
> +
> + return 0;
> +}
David suggested making scale_tsc automatic, based on the availability of
the feature.
Therefore, if there is going to be a v3, I suggest adding kvm_vm_free()
explicitly (similar to close(fd)).
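(For illustration only, the end of main() would then be something like:)

	run_test(vm, vcpu);

	kvm_vm_free(vm);

	return 0;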
Thank you very much!
Dongli Zhang
On 10/04/2024 10:52, Jack Allister wrote:
> A VM's KVM/PV clock has an inherent relationship to it's TSC (guest). When
> either the host system live-updates or the VM is live-migrated this pairing
> of the two clock sources should stay the same.
>
> In reality this is not the case without some correction taking place. Two
> new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
> perform a correction on the PVTI (PV time information) structure held by
> KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
> in either a live-update/migration scenario.
>
> This test proves that without the necessary fixup there is a perceived
> change in the guest TSC & KVM/PV clock relationship before and after a LU/
> LM takes place.
>
> The following steps are made to verify there is a delta in the relationship
> and that it can be corrected:
>
> 1. PVTI is sampled by guest at boot (let's call this PVTI0).
> 2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
> 3. PVTI is sampled by guest after change (PVTI1).
> 4. Correction is requested by usermode to KVM using PVTI0.
> 5. PVTI is sampled by guest after correction (PVTI2).
>
> The guest the records a singular TSC reference point in time and uses it to
> calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
> call each clock value CLK[0-2].
>
> In a perfect world CLK[0-2] should all be the same value if the KVM clock
> & TSC relationship is preserved across the LU/LM (or faked in this test),
> however it is not.
>
> A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
> (and the inaccuracies associated with that). A delta of ~3500ns can be
> observed if guest TSC scaling to half host TSC frequency is also enabled,
> where as without scaling this is observed at ~180ns.
>
> With the correction it should be possible to achieve a delta of ±1ns.
>
> An option to enable guest TSC scaling is available via invoking the tester
> with -s/--scale-tsc.
>
> Example of the test output below:
> * selftests: kvm: pvclock_test
> * scaling tsc from 2999999KHz to 1499999KHz
> * before=5038374946 uncorrected=5038371437 corrected=5038374945
> * delta_uncorrected=3509 delta_corrected=1
>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> CC: David Woodhouse <dwmw2@infradead.org>
> CC: Paul Durrant <paul@xen.org>
> CC: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++++++++++
> 2 files changed, 193 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
>
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 741c7dc16afc..02ee1205bbed 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
> TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
> TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
> +TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
> TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
> TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
> TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
> diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> new file mode 100644
> index 000000000000..376ffb730a53
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright © 2024, Amazon.com, Inc. or its affiliates.
> + *
> + * Tests for pvclock API
> + * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
> + */
> +#include <asm/pvclock.h>
> +#include <asm/pvclock-abi.h>
> +#include <sys/stat.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +
> +enum {
> + STAGE_FIRST_BOOT,
> + STAGE_UNCORRECTED,
> + STAGE_CORRECTED
> +};
> +
> +#define KVMCLOCK_GPA 0xc0000000ull
> +#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
> +
> +static void trigger_pvti_update(vm_paddr_t pvti_pa)
> +{
> + /*
> + * We need a way to trigger KVM to update the fields
> + * in the PV time info. The easiest way to do this is
> + * to temporarily switch to the old KVM system time
> + * method and then switch back to the new one.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> +}
> +
> +static void guest_code(vm_paddr_t pvti_pa)
> +{
> + struct pvclock_vcpu_time_info *pvti_va =
> + (struct pvclock_vcpu_time_info *)pvti_pa;
> +
> + struct pvclock_vcpu_time_info pvti_boot;
> + struct pvclock_vcpu_time_info pvti_uncorrected;
> + struct pvclock_vcpu_time_info pvti_corrected;
> + uint64_t cycles_boot;
> + uint64_t cycles_uncorrected;
> + uint64_t cycles_corrected;
> + uint64_t tsc_guest;
> +
> + /*
> + * Setup the KVMCLOCK in the guest & store the original
> + * PV time structure that is used.
> + */
> + wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
> + pvti_boot = *pvti_va;
> + GUEST_SYNC(STAGE_FIRST_BOOT);
> +
> + /*
> + * Trigger an update of the PVTI, if we calculate
> + * the KVM clock using this structure we'll see
> + * a delta from the TSC.
> + */
> + trigger_pvti_update(pvti_pa);
> + pvti_uncorrected = *pvti_va;
> + GUEST_SYNC(STAGE_UNCORRECTED);
> +
> + /*
> + * The test should have triggered the correction by this
> + * point in time. We have a copy of each of the PVTI structs
> + * at each stage now.
> + *
> + * Let's sample the timestamp at a SINGLE point in time and
> + * then calculate what the KVM clock would be using the PVTI
> + * from each stage.
> + *
> + * Then return each of these values to the tester.
> + */
> + pvti_corrected = *pvti_va;
> + tsc_guest = rdtsc();
> +
> + cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
> + cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
> + cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
> +
> + GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
> + cycles_corrected, 0);
> +}
> +
> +static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> + struct pvclock_vcpu_time_info pvti_before;
> + uint64_t before, uncorrected, corrected;
> + int64_t delta_uncorrected, delta_corrected;
> + struct ucall uc;
> + uint64_t ucall_reason;
> +
> + /* Loop through each stage of the test. */
> + while (true) {
> +
Unnecessary blank line. Otherwise LGTM so with that fixed...
Reviewed-by: Paul Durrant <paul@xen.org>