arch/x86/kvm/x86.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the
host TSC is constant, in which case the actual TSC frequency will never
change and thus capturing the "max" TSC during initialization is
unnecessary, KVM can simply use tsc_khz during VM creation.
On CPUs with constant TSC, but not a hardware-specified TSC frequency,
snapshotting max_tsc_khz and using that to set a VM's default TSC
frequency can lead to KVM thinking it needs to manually scale the guest's
TSC if refining the TSC completes after KVM snapshots tsc_khz. The
actual frequency never changes, only the kernel's calculation of what
that frequency is changes. On systems without hardware TSC scaling, this
either puts KVM into "always catchup" mode (extremely inefficient), or
prevents creating VMs altogether.
Ideally, KVM would not be able to race with TSC refinement, or would have
a hook into tsc_refine_calibration_work() to get an alert when refinement
is complete. Avoiding the race altogether isn't practical as refinement
takes a relative eternity; it's deliberately put on a work queue outside
of the normal boot sequence to avoid unnecessarily delaying boot.
Adding a hook is doable, but somewhat gross due to KVM's ability to be
built as a module. And if the TSC is constant, which is likely the case
for every VMX/SVM-capable CPU produced in the last decade, the race can
be hit if and only if userspace is able to create a VM before TSC
refinement completes; refinement is slow, but not that slow.
For now, punt on a proper fix, as not taking a snapshot can help some
uses cases and not taking a snapshot is arguably correct irrespective of
the race with refinement.
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Anton Romanov <romanton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6552360d8888..81d9d84dc59f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8727,13 +8727,15 @@ static int kvmclock_cpu_online(unsigned int cpu)
static void kvm_timer_init(void)
{
- max_tsc_khz = tsc_khz;
-
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
#ifdef CONFIG_CPU_FREQ
struct cpufreq_policy *policy;
int cpu;
+#endif
+ max_tsc_khz = tsc_khz;
+
+#ifdef CONFIG_CPU_FREQ
cpu = get_cpu();
policy = cpufreq_cpu_get(cpu);
if (policy) {
@@ -11160,7 +11162,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
kvm_vcpu_mtrr_init(vcpu);
vcpu_load(vcpu);
- kvm_set_tsc_khz(vcpu, max_tsc_khz);
+ kvm_set_tsc_khz(vcpu, max_tsc_khz ? : tsc_khz);
kvm_vcpu_reset(vcpu, false);
kvm_init_mmu(vcpu);
vcpu_put(vcpu);
base-commit: f4bc051fc91ab9f1d5225d94e52d369ef58bec58
--
2.35.1.574.g5d30c73bfb-goog
On 2/25/22 02:39, Sean Christopherson wrote: > Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the > host TSC is constant, in which case the actual TSC frequency will never > change and thus capturing the "max" TSC during initialization is > unnecessary, KVM can simply use tsc_khz during VM creation. > > On CPUs with constant TSC, but not a hardware-specified TSC frequency, > snapshotting max_tsc_khz and using that to set a VM's default TSC > frequency can lead to KVM thinking it needs to manually scale the guest's > TSC if refining the TSC completes after KVM snapshots tsc_khz. The > actual frequency never changes, only the kernel's calculation of what > that frequency is changes. On systems without hardware TSC scaling, this > either puts KVM into "always catchup" mode (extremely inefficient), or > prevents creating VMs altogether. > > Ideally, KVM would not be able to race with TSC refinement, or would have > a hook into tsc_refine_calibration_work() to get an alert when refinement > is complete. Avoiding the race altogether isn't practical as refinement > takes a relative eternity; it's deliberately put on a work queue outside > of the normal boot sequence to avoid unnecessarily delaying boot. > > Adding a hook is doable, but somewhat gross due to KVM's ability to be > built as a module. And if the TSC is constant, which is likely the case > for every VMX/SVM-capable CPU produced in the last decade, the race can > be hit if and only if userspace is able to create a VM before TSC > refinement completes; refinement is slow, but not that slow. > > For now, punt on a proper fix, as not taking a snapshot can help some > uses cases and not taking a snapshot is arguably correct irrespective of > the race with refinement. > > Cc: Suleiman Souhlal <suleiman@google.com> > Cc: Anton Romanov <romanton@google.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> Queued, but I'd rather have a subject that calls out that max_tsc_khz needs a replacement at vCPU creation time. In fact, the real change (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject mentions only the change in kvm_timer_init(). What do you think of "KVM: x86: Use current rather than max TSC frequency if it is constant"? Pao
On Fri, Feb 25, 2022 at 4:10 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 2/25/22 02:39, Sean Christopherson wrote: > > Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the > > host TSC is constant, in which case the actual TSC frequency will never > > change and thus capturing the "max" TSC during initialization is > > unnecessary, KVM can simply use tsc_khz during VM creation. > > > > On CPUs with constant TSC, but not a hardware-specified TSC frequency, > > snapshotting max_tsc_khz and using that to set a VM's default TSC > > frequency can lead to KVM thinking it needs to manually scale the guest's > > TSC if refining the TSC completes after KVM snapshots tsc_khz. The > > actual frequency never changes, only the kernel's calculation of what > > that frequency is changes. On systems without hardware TSC scaling, this > > either puts KVM into "always catchup" mode (extremely inefficient), or > > prevents creating VMs altogether. > > > > Ideally, KVM would not be able to race with TSC refinement, or would have > > a hook into tsc_refine_calibration_work() to get an alert when refinement > > is complete. Avoiding the race altogether isn't practical as refinement > > takes a relative eternity; it's deliberately put on a work queue outside > > of the normal boot sequence to avoid unnecessarily delaying boot. > > > > Adding a hook is doable, but somewhat gross due to KVM's ability to be > > built as a module. And if the TSC is constant, which is likely the case > > for every VMX/SVM-capable CPU produced in the last decade, the race can > > be hit if and only if userspace is able to create a VM before TSC > > refinement completes; refinement is slow, but not that slow. > > > > For now, punt on a proper fix, as not taking a snapshot can help some > > uses cases and not taking a snapshot is arguably correct irrespective of > > the race with refinement. > > > > Cc: Suleiman Souhlal <suleiman@google.com> > > Cc: Anton Romanov <romanton@google.com> > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > Queued, but I'd rather have a subject that calls out that max_tsc_khz > needs a replacement at vCPU creation time. In fact, the real change > (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject > mentions only the change in kvm_timer_init(). > > What do you think of "KVM: x86: Use current rather than max TSC > frequency if it is constant"? > > Pao Ping. This said "queued" but I don't think this ever landed. What's the status of this? Paolo, does this need more work?
On 4/12/22 19:38, Anton Romanov wrote: >> Queued, but I'd rather have a subject that calls out that max_tsc_khz >> needs a replacement at vCPU creation time. In fact, the real change >> (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject >> mentions only the change in kvm_timer_init(). >> >> What do you think of "KVM: x86: Use current rather than max TSC >> frequency if it is constant"? > > Ping. This said "queued" but I don't think this ever landed. > What's the status of this? > Paolo, does this need more work? The features in my second pull request were rejected by Linus so they will be in 5.19. I'm going to open kvm/next today and the patches will be there. Unfortunately he hasn't replied to my rebuttal at https://lore.kernel.org/kvm/30ffdecc-6ecd-5194-14ec-40e8b818889a@redhat.com/#t so I have no idea what his opinion is. I'll try to get more stuff in early for the next releases. Paolo
On Fri, 2022-02-25 at 13:10 +0100, Paolo Bonzini wrote: > > Queued, but I'd rather have a subject that calls out that max_tsc_khz > needs a replacement at vCPU creation time. In fact, the real change > (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject > mentions only the change in kvm_timer_init(). In https://lore.kernel.org/kvm/e7be32b06676c7ebf415d9deea5faf50aa8c0785.camel@infradead.org/T/ last night I was coming round to the idea that we might want a KVM-wide default frequency which is settable from userspace and is used instead of max_tsc_khz anyway. I also have questions about the use case for the above patch.... if this is a clean boot and you're just starting to host guests, surely we can wait for the time it takes for the TSC synchronization to complete? And if this is a live update scenario, where we pause the guests, kexec into a new kernel, then resume the "migrated" guests again... why in $DEITY's name isn't the precise TSC frequency being handed over from kernel#1 to kernel#2 over the kexec so that it's known from the start?
On Fri, Feb 25, 2022, David Woodhouse wrote: > On Fri, 2022-02-25 at 13:10 +0100, Paolo Bonzini wrote: > > > > Queued, but I'd rather have a subject that calls out that max_tsc_khz > > needs a replacement at vCPU creation time. In fact, the real change > > (and bug, and fix) is in kvm_arch_vcpu_create(), while the subject > > mentions only the change in kvm_timer_init(). > > In > https://lore.kernel.org/kvm/e7be32b06676c7ebf415d9deea5faf50aa8c0785.camel@infradead.org/T/ > last night I was coming round to the idea that we might want a KVM-wide > default frequency which is settable from userspace and is used instead > of max_tsc_khz anyway. > > I also have questions about the use case for the above patch.... if > this is a clean boot and you're just starting to host guests, surely we > can wait for the time it takes for the TSC synchronization to complete? KVM is built into the kernel in their case, the vmx_init() => kvm_init() gets automatically called during boot. The VMs aren't started until well after synchronization has completed, but KVM has already snapshotted the "bad" value.
On Fri, 2022-02-25 at 16:21 +0000, Sean Christopherson wrote: > > I also have questions about the use case for the above patch.... if > > this is a clean boot and you're just starting to host guests, surely we > > can wait for the time it takes for the TSC synchronization to complete? > > KVM is built into the kernel in their case, the vmx_init() => kvm_init() gets > automatically called during boot. The VMs aren't started until well after > synchronization has completed, but KVM has already snapshotted the "bad" value. Gotcha. So even when we put my patch in front, to snapshot a value into kvm->arch.default_tsc_khz, that's happening later at VM creation time so should also be snapshotting the *good* value. And at least if it snapshots the bad value, all the vCPUs will be *consistent*.
On Fri, 2022-02-25 at 01:39 +0000, Sean Christopherson wrote:
> @@ -11160,7 +11162,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
> kvm_vcpu_mtrr_init(vcpu);
> vcpu_load(vcpu);
> - kvm_set_tsc_khz(vcpu, max_tsc_khz);
> + kvm_set_tsc_khz(vcpu, max_tsc_khz ? : tsc_khz);
> kvm_vcpu_reset(vcpu, false);
> kvm_init_mmu(vcpu);
> vcpu_put(vcpu);
>
Hm, now if you hit that race you end up potentially giving *different*
frequencies to different vCPUs in a single guest, depending on when
they were created.
How about this... (and as noted, I think I want to add an explicit KVM
ioctl to set kvm->arch.default_tsc_khz for subsequently created vCPUs).
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a3385db39d3e..e4696a578f41 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1119,6 +1119,8 @@ struct kvm_arch {
u64 cur_tsc_generation;
int nr_vcpus_matched_tsc;
+ u32 default_tsc_khz;
+
seqcount_raw_spinlock_t pvclock_sc;
bool use_master_clock;
u64 master_kernel_ns;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83accd3e7502..686891966c15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2601,6 +2601,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
* kvm_clock stable after CPU hotplug
*/
synchronizing = true;
+ data = kvm->arch.last_tsc_write;
} else {
u64 tsc_exp = kvm->arch.last_tsc_write +
nsec_to_cycles(vcpu, elapsed);
@@ -8728,22 +8729,22 @@ static int kvmclock_cpu_online(unsigned int cpu)
static void kvm_timer_init(void)
{
- max_tsc_khz = tsc_khz;
-
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
-#ifdef CONFIG_CPU_FREQ
- struct cpufreq_policy *policy;
- int cpu;
-
- cpu = get_cpu();
- policy = cpufreq_cpu_get(cpu);
- if (policy) {
- if (policy->cpuinfo.max_freq)
- max_tsc_khz = policy->cpuinfo.max_freq;
- cpufreq_cpu_put(policy);
+ max_tsc_khz = tsc_khz;
+
+ if (IS_ENABLED(CONFIG_CPU_FREQ)) {
+ struct cpufreq_policy *policy;
+ int cpu;
+
+ cpu = get_cpu();
+ policy = cpufreq_cpu_get(cpu);
+ if (policy) {
+ if (policy->cpuinfo.max_freq)
+ max_tsc_khz = policy->cpuinfo.max_freq;
+ cpufreq_cpu_put(policy);
+ }
+ put_cpu();
}
- put_cpu();
-#endif
cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
}
@@ -11165,7 +11166,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
kvm_xen_init_vcpu(vcpu);
kvm_vcpu_mtrr_init(vcpu);
vcpu_load(vcpu);
- kvm_set_tsc_khz(vcpu, max_tsc_khz);
+ kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
kvm_vcpu_reset(vcpu, false);
kvm_init_mmu(vcpu);
vcpu_put(vcpu);
@@ -11614,6 +11615,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
pvclock_update_vm_gtod_copy(kvm);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+ kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
kvm->arch.guest_can_read_msr_platform_info = true;
#if IS_ENABLED(CONFIG_HYPERV)
On Fri, Feb 25, 2022, David Woodhouse wrote: > On Fri, 2022-02-25 at 01:39 +0000, Sean Christopherson wrote: > > @@ -11160,7 +11162,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT; > > kvm_vcpu_mtrr_init(vcpu); > > vcpu_load(vcpu); > > - kvm_set_tsc_khz(vcpu, max_tsc_khz); > > + kvm_set_tsc_khz(vcpu, max_tsc_khz ? : tsc_khz); > > kvm_vcpu_reset(vcpu, false); > > kvm_init_mmu(vcpu); > > vcpu_put(vcpu); > > > > Hm, now if you hit that race you end up potentially giving *different* > frequencies to different vCPUs in a single guest, depending on when > they were created. Yep. Though the race is much harder to hit (userspace vs TSC refinement). The existing race being hit is essentially do_initcalls() vs. TSC refinement. > How about this... (and as noted, I think I want to add an explicit KVM > ioctl to set kvm->arch.default_tsc_khz for subsequently created vCPUs). This wouldn't necessarily help. E.g. assuming userspace knows the actual TSC frequency, creating a vCPU before refinement completes might put the vCPU in "always catchup" purgatory. To really fix the race, KVM needs a notification that refinement completed (or failed). KVM could simply refuse to create vCPUs until it got the notification. In the non-constant case, KVM would also need to refresh max_tsc_khz.
On Fri, 2022-02-25 at 16:40 +0000, Sean Christopherson wrote: > On Fri, Feb 25, 2022, David Woodhouse wrote: > > On Fri, 2022-02-25 at 01:39 +0000, Sean Christopherson wrote: > > > @@ -11160,7 +11162,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > > vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT; > > > kvm_vcpu_mtrr_init(vcpu); > > > vcpu_load(vcpu); > > > - kvm_set_tsc_khz(vcpu, max_tsc_khz); > > > + kvm_set_tsc_khz(vcpu, max_tsc_khz ? : tsc_khz); > > > kvm_vcpu_reset(vcpu, false); > > > kvm_init_mmu(vcpu); > > > vcpu_put(vcpu); > > > > > > > Hm, now if you hit that race you end up potentially giving *different* > > frequencies to different vCPUs in a single guest, depending on when > > they were created. > > Yep. Though the race is much harder to hit (userspace vs TSC refinement). The > existing race being hit is essentially do_initcalls() vs. TSC refinement. > > > How about this... (and as noted, I think I want to add an explicit KVM > > ioctl to set kvm->arch.default_tsc_khz for subsequently created vCPUs). > > This wouldn't necessarily help. E.g. assuming userspace knows the actual TSC > frequency, creating a vCPU before refinement completes might put the vCPU in > "always catchup" purgatory. Right. But at least they'd be *consistent*. I was actually making that change anyway, for the benefit of VMs where we are intentionally scaling to a known, different, TSC frequency — which is currently completely hosed when all the vCPUs set it for themselves because the TSC sync then fails. > To really fix the race, KVM needs a notification that refinement completed (or > failed). KVM could simply refuse to create vCPUs until it got the notification. > In the non-constant case, KVM would also need to refresh max_tsc_khz. Hm, would the world be a better place if we knew that the delta between the unrefined and refined TSC values was always within the tolerance of tsc_tolerance_ppm for which we wouldn't bother scaling anyway?
© 2016 - 2026 Red Hat, Inc.