Fix a hilarious/revolting performance regression (relative to older CPU
generations) in xstate_required_size() that pops up due to CPUID _in the
host_ taking 3x-4x longer on Emerald Rapids than Skylake.
The issue rears its head on nested virtualization transitions, as KVM
(unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
multiple times per transition. And calculating XSAVE sizes, especially
for vCPUs with a decent number of supported XSAVE features and compacted
format support, can add up to thousands of cycles.
To fix the immediate issue, cache the CPUID output at kvm.ko load. The
information is static for a given CPU, i.e. doesn't need to be re-read
from hardware every time. That's patch 1, and eliminates pretty much all
of the meaningful overhead.
Patch 2 is a minor cleanup to try and make the code easier to read.
Patch 3 fixes a wart in CPUID emulation where KVM does a moderately
expensive (though cheap compared to CPUID, lol) MSR lookup that is likely
unnecessary for the vast majority of VMs.
Patches 4 and 5 address the problem of KVM doing runtime CPUID updates
multiple times for each nested VM-Enter and VM-Exit, at least half of
which are completely unnecessary (CPUID is a mandatory intercept on both
Intel and AMD, so ensuring dynamic CPUID bits are up-to-date while running
L2 is pointless). The idea is fairly simple: lazily do the CPUID updates
by deferring them until something might actually consume the relevant guest
bits.
This applies on the cpu_caps overhaul[*], as patches 3-5 would otherwise
conflict, and I didn't want to think about how safe patch 5 is without
the rework.
That said, patch 1, which is the most important and tagged for stable,
applies cleanly on 6.1, 6.6, and 6.12 (and the backport for 5.15 and
earlier shouldn't be too horrific).
Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
or ucode bug. For a number of leaves, KVM can emulate CPUID faster than
the CPU can execute the instruction. I.e. the entire VM-Exit => emulate
=> VM-Enter sequence takes less time than executing CPUID on bare metal.
Which seems absolutely insane. But, it shouldn't impact guest performance,
so that's someone else's problem, at least for now.
[*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com
Sean Christopherson (5):
KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
KVM: x86: Use for-loop to iterate over XSTATE size entries
KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM
or HLE
KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature
bit
KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID
emulation
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 63 ++++++++++++++++++++++++---------
arch/x86/kvm/cpuid.h | 10 +++++-
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/smm.c | 2 +-
arch/x86/kvm/svm/sev.c | 2 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 22 +++++++++---
9 files changed, 78 insertions(+), 28 deletions(-)
base-commit: 06a4919e729761be751366716c00fb8c3f51d37e
--
2.47.0.338.g60cca15819-goog
On Tue, 10 Dec 2024 17:32:57 -0800, Sean Christopherson wrote:
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> The issue rears its head on nested virtualization transitions, as KVM
> (unnecessarily) performs runtime CPUID updates, including XSAVE sizes,
> multiple times per transition. And calculating XSAVE sizes, especially
> for vCPUs with a decent number of supported XSAVE features and compacted
> format support, can add up to thousands of cycles.
>
> [...]
Applied 2-5 to kvm-x86 misc, with a changelog that doesn't incorrectly state
that CPUID is a mandatory intercept on AMD.
[1/5] KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
(no commit info)
[2/5] KVM: x86: Use for-loop to iterate over XSTATE size entries
https://github.com/kvm-x86/linux/commit/aa93b6f96f64
[3/5] KVM: x86: Apply TSX_CTRL_CPUID_CLEAR if and only if the vCPU has RTM or HLE
https://github.com/kvm-x86/linux/commit/7e9f735e7ac4
[4/5] KVM: x86: Query X86_FEATURE_MWAIT iff userspace owns the CPUID feature bit
https://github.com/kvm-x86/linux/commit/a487f6797c88
[5/5] KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
https://github.com/kvm-x86/linux/commit/93da6af3ae56
--
https://github.com/kvm-x86/linux/tree/next
On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> [...]
>
> Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> or ucode bug. For a number of leaves, KVM can emulate CPUID faster than
> the CPU can execute the instruction. I.e. the entire VM-Exit => emulate
> => VM-Enter sequence takes less time than executing CPUID on bare metal.
> Which seems absolutely insane. But, it shouldn't impact guest performance,
> so that's someone else's problem, at least for now.

Virtualization aside, perhaps Linux should set
MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate the
CPUID instruction in the kernel? :)
On Mon, Dec 16, 2024, Jim Mattson wrote:
> On Tue, Dec 10, 2024 at 5:33 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > [...]
> >
> > Side topic, I can't help but wonder if the CPUID latency on EMR is a CPU
> > or ucode bug. For a number of leaves, KVM can emulate CPUID faster than
> > the CPU can execute the instruction. I.e. the entire VM-Exit => emulate
> > => VM-Enter sequence takes less time than executing CPUID on bare metal.
> > Which seems absolutely insane. But, it shouldn't impact guest performance,
> > so that's someone else's problem, at least for now.
>
> Virtualization aside, perhaps Linux should set
> MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0[bit 0] on EMR, and emulate
> the CPUID instruction in the kernel? :)

Heh, that thought crossed my mind too.
On 12/11/24 02:32, Sean Christopherson wrote:
> Fix a hilarious/revolting performance regression (relative to older CPU
> generations) in xstate_required_size() that pops up due to CPUID _in the
> host_ taking 3x-4x longer on Emerald Rapids than Skylake.
>
> [...]
>
> To fix the immediate issue, cache the CPUID output at kvm.ko load. The
> information is static for a given CPU, i.e. doesn't need to be re-read
> from hardware every time. That's patch 1, and eliminates pretty much all
> of the meaningful overhead.

Queued this one, thanks!

Paolo
On Tue, Dec 10, 2024, Sean Christopherson wrote:
> CPUID is a mandatory intercept on both Intel and AMD

Jim pointed out that CPUID is NOT a mandatory intercept on AMD, and so while
the code is correct, the changelogs are not.