This series is the result of the recent PUCK discussion[*] on optimizing the
XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
turns out the only reason they're swapped in the fastpath is because of a
hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
handler would call schedule() from an atomic context. The resulting #GP due to
trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
values before handling #MCs from the guest.
Thankfully, the #MC mess has long since been cleaned up, so it's once again
safe to swap XCR0/XSS outside of the fastpath (but when IRQs are disabled!).
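Roughly speaking, the end state looks like the below.  This is a hand-waving
sketch to illustrate the idea, not the literal diff; the flow, the helper
names, and the run_flags plumbing are approximations, and the vast majority
of vcpu_enter_guest() is elided.

  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  {
          fastpath_t exit_fastpath;
          u64 run_flags = 0;

          /* Request processing, event injection, etc. elided. */

          local_irq_disable();

          /* Load guest XCR0/XSS (and PKRU) once, with IRQs disabled... */
          kvm_load_guest_xsave_state(vcpu);

          for (;;) {
                  exit_fastpath = kvm_x86_call(vcpu_run)(vcpu, run_flags);
                  if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
                          break;

                  /* Handle the fastpath exit and immediately re-enter. */
          }

          /* ...and restore the host values only after the loop breaks out. */
          kvm_load_host_xsave_state(vcpu);

          /* Remaining exit processing elided. */

          local_irq_enable();

          return 1;
  }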
As for what may be contributing to the SAP HANA performance improvements when
enabling PKU, my instincts again appear to be spot on. As predicted, the
fastpath savings are ~300 cycles on Intel (~500 on AMD). I.e. if the guest
is literally doing _nothing_ but generating fastpath exits, it will see a
~25% improvement. There's basically zero chance the uplift seen with enabling
PKU is due to eliding XCR0 loads; my guess is that the guest actually uses
protection keys to optimize something.
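For reference, the cost being elided on every entry+exit is essentially an
XSETBV plus a WRMSR (and potentially a WRPKRU).  Paraphrasing
kvm_load_guest_xsave_state() from memory, with the exact field and helper
names being approximations:

  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
  {
          if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {
                  /* XSETBV isn't cheap; skip it if guest == host. */
                  if (vcpu->arch.xcr0 != kvm_host.xcr0)
                          xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);

                  /* Ditto for the WRMSR to IA32_XSS. */
                  if (kvm_caps.supported_xss &&
                      vcpu->arch.ia32_xss != kvm_host.xss)
                          wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
          }

          /* PKRU swap (WRPKRU) elided here; that's patch 4's problem. */
  }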
Why does kvm_load_guest_xsave_state() show up in perf? Probably because it's
the only visible symbol other than vmx_vmexit() (and vmx_vcpu_run() when not
hammering the fastpath). E.g. running perf top on a running VM instance yields
these numbers with various guest workloads (the middle one is running
mmu_stress_test in the guest, which hammers on mmu_lock in L0). But other than
doing INVD (handled in the fastpath) in a tight loop, there's no perceived perf
improvement from the guest.
Overhead Shared Object Symbol
15.65% [kernel] [k] vmx_vmexit
6.78% [kernel] [k] kvm_vcpu_halt
5.15% [kernel] [k] __srcu_read_lock
4.73% [kernel] [k] kvm_load_guest_xsave_state
4.69% [kernel] [k] __srcu_read_unlock
4.65% [kernel] [k] read_tsc
4.44% [kernel] [k] vmx_sync_pir_to_irr
4.03% [kernel] [k] kvm_apic_has_interrupt

45.52% [kernel] [k] queued_spin_lock_slowpath
24.40% [kernel] [k] vmx_vmexit
2.84% [kernel] [k] queued_write_lock_slowpath
1.92% [kernel] [k] vmx_vcpu_run
1.40% [kernel] [k] vcpu_run
1.00% [kernel] [k] kvm_load_guest_xsave_state
0.84% [kernel] [k] kvm_load_host_xsave_state
0.72% [kernel] [k] mmu_try_to_unsync_pages
0.68% [kernel] [k] __srcu_read_lock
0.65% [kernel] [k] try_get_folio

17.78% [kernel] [k] vmx_vmexit
5.08% [kernel] [k] vmx_vcpu_run
4.24% [kernel] [k] vcpu_run
4.21% [kernel] [k] _raw_spin_lock_irqsave
2.99% [kernel] [k] kvm_load_guest_xsave_state
2.51% [kernel] [k] rcu_note_context_switch
2.47% [kernel] [k] ktime_get_update_offsets_now
2.21% [kernel] [k] kvm_load_host_xsave_state
2.16% [kernel] [k] fput
[*] https://drive.google.com/corp/drive/folders/1DCdvqFGudQc7pxXjM7f35vXogTf9uhD4
Sean Christopherson (4):
KVM: SVM: Handle #MCs in guest outside of fastpath
KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run
loop
KVM: x86: Load guest/host PKRU outside of the fastpath run loop
arch/x86/kvm/svm/svm.c | 20 ++++++++--------
arch/x86/kvm/vmx/main.c | 13 ++++++++++-
arch/x86/kvm/vmx/tdx.c | 3 ---
arch/x86/kvm/vmx/vmx.c | 7 ------
arch/x86/kvm/x86.c | 51 ++++++++++++++++++++++++++++-------------
arch/x86/kvm/x86.h | 2 --
6 files changed, 56 insertions(+), 40 deletions(-)
base-commit: 4cc167c50eb19d44ac7e204938724e685e3d8057
--
2.51.1.930.gacf6e81ea2-goog
On Thu, 30 Oct 2025 15:42:42 -0700, Sean Christopherson wrote:
> This series is the result of the recent PUCK discussion[*] on optimizing the
> XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
> initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
> turns out the only reason they're swapped in the fastpath is because of a
> hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
> handler would call schedule() from an atomic context. The resulting #GP due to
> trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
> values before handling #MCs from the guest.
>
> [...]
Applied to kvm-x86 misc, thanks!
[1/4] KVM: SVM: Handle #MCs in guest outside of fastpath
https://github.com/kvm-x86/linux/commit/6e640bb5caab
[2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
https://github.com/kvm-x86/linux/commit/8934c592bcbf
[3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
https://github.com/kvm-x86/linux/commit/3377a9233d30
[4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
https://github.com/kvm-x86/linux/commit/7df3021b622f
--
https://github.com/kvm-x86/linux/tree/next
On Mon, Nov 10, 2025, Sean Christopherson wrote:
> On Thu, 30 Oct 2025 15:42:42 -0700, Sean Christopherson wrote:
> > This series is the result of the recent PUCK discussion[*] on optimizing the
> > XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
> > initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
> > turns out the only reason they're swapped in the fastpath is because of a
> > hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
> > handler would call schedule() from an atomic context. The resulting #GP due to
> > trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
> > values before handling #MCs from the guest.
> >
> > [...]
>
> Applied to kvm-x86 misc, thanks!
>
> [1/4] KVM: SVM: Handle #MCs in guest outside of fastpath
>       https://github.com/kvm-x86/linux/commit/6e640bb5caab
> [2/4] KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
>       https://github.com/kvm-x86/linux/commit/8934c592bcbf
> [3/4] KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run loop
>       https://github.com/kvm-x86/linux/commit/3377a9233d30
> [4/4] KVM: x86: Load guest/host PKRU outside of the fastpath run loop
>       https://github.com/kvm-x86/linux/commit/7df3021b622f

I've dropped these for now as patch 2 broke TDX.  I'll send a v2 shortly.
On Thu, 2025-10-30 at 15:42 -0700, Sean Christopherson wrote:
> Sean Christopherson (4):
>   KVM: SVM: Handle #MCs in guest outside of fastpath
>   KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
>   KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run
>     loop
>   KVM: x86: Load guest/host PKRU outside of the fastpath run loop

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

Interesting analysis.
> On Oct 30, 2025, at 6:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> This series is the result of the recent PUCK discussion[*] on optimizing the
> XCR0/XSS loads that are currently done on every VM-Enter and VM-Exit. My
> initial thought that swapping XCR0/XSS outside of the fastpath was spot on;
> turns out the only reason they're swapped in the fastpath is because of a
> hack-a-fix that papered over an egregious #MC handling bug where the kernel #MC
> handler would call schedule() from an atomic context. The resulting #GP due to
> trying to swap FPU state with a guest XCR0/XSS was "fixed" by loading the host
> values before handling #MCs from the guest.
>
> Thankfully, the #MC mess has long since been cleaned up, so it's once again
> safe to swap XCR0/XSS outside of the fastpath (but when IRQs are disabled!).

Thank you for doing the diligence on this, I appreciate it!

> As for what may be contributing to the SAP HANA performance improvements when
> enabling PKU, my instincts again appear to be spot on. As predicted, the
> fastpath savings are ~300 cycles on Intel (~500 on AMD). I.e. if the guest
> is literally doing _nothing_ but generating fastpath exits, it will see a
> ~25% improvement. There's basically zero chance the uplift seen with enabling
> PKU is due to eliding XCR0 loads; my guess is that the guest actually uses
> protection keys to optimize something.

Every little bit counts, that's a healthy percentage speedup for fast path
stuff, especially on AMD.

> Why does kvm_load_guest_xsave_state() show up in perf? Probably because it's
> the only visible symbol other than vmx_vmexit() (and vmx_vcpu_run() when not
> hammering the fastpath). E.g. running perf top on a running VM instance yields
> these numbers with various guest workloads (the middle one is running
> mmu_stress_test in the guest, which hammers on mmu_lock in L0). But other than
> doing INVD (handled in the fastpath) in a tight loop, there's no perceived perf
> improvement from the guest.

nit: it'd be nice if these bits were labeled with what they were from (the
middle one you called out above, but what's the first and third one)

> Overhead Shared Object Symbol
> 15.65% [kernel] [k] vmx_vmexit
> 6.78% [kernel] [k] kvm_vcpu_halt
> 5.15% [kernel] [k] __srcu_read_lock
> 4.73% [kernel] [k] kvm_load_guest_xsave_state
> 4.69% [kernel] [k] __srcu_read_unlock
> 4.65% [kernel] [k] read_tsc
> 4.44% [kernel] [k] vmx_sync_pir_to_irr
> 4.03% [kernel] [k] kvm_apic_has_interrupt
>
> 45.52% [kernel] [k] queued_spin_lock_slowpath
> 24.40% [kernel] [k] vmx_vmexit
> 2.84% [kernel] [k] queued_write_lock_slowpath
> 1.92% [kernel] [k] vmx_vcpu_run
> 1.40% [kernel] [k] vcpu_run
> 1.00% [kernel] [k] kvm_load_guest_xsave_state
> 0.84% [kernel] [k] kvm_load_host_xsave_state
> 0.72% [kernel] [k] mmu_try_to_unsync_pages
> 0.68% [kernel] [k] __srcu_read_lock
> 0.65% [kernel] [k] try_get_folio
>
> 17.78% [kernel] [k] vmx_vmexit
> 5.08% [kernel] [k] vmx_vcpu_run
> 4.24% [kernel] [k] vcpu_run
> 4.21% [kernel] [k] _raw_spin_lock_irqsave
> 2.99% [kernel] [k] kvm_load_guest_xsave_state
> 2.51% [kernel] [k] rcu_note_context_switch
> 2.47% [kernel] [k] ktime_get_update_offsets_now
> 2.21% [kernel] [k] kvm_load_host_xsave_state
> 2.16% [kernel] [k] fput
>
> [*] https://drive.google.com/drive/folders/1DCdvqFGudQc7pxXjM7f35vXogTf9uhD4
>
> Sean Christopherson (4):
>   KVM: SVM: Handle #MCs in guest outside of fastpath
>   KVM: VMX: Handle #MCs on VM-Enter/TD-Enter outside of the fastpath
>   KVM: x86: Load guest/host XCR0 and XSS outside of the fastpath run
>     loop
>   KVM: x86: Load guest/host PKRU outside of the fastpath run loop
>
>  arch/x86/kvm/svm/svm.c  | 20 ++++++++--------
>  arch/x86/kvm/vmx/main.c | 13 ++++++++++-
>  arch/x86/kvm/vmx/tdx.c  |  3 ---
>  arch/x86/kvm/vmx/vmx.c  |  7 ------
>  arch/x86/kvm/x86.c      | 51 ++++++++++++++++++++++++++++-------------
>  arch/x86/kvm/x86.h      |  2 --
>  6 files changed, 56 insertions(+), 40 deletions(-)
>
> base-commit: 4cc167c50eb19d44ac7e204938724e685e3d8057
> --
> 2.51.1.930.gacf6e81ea2-goog

Had one conversation starter comment on patch 4, but otherwise, LGTM for the
entire series, thanks again for the help!

Reviewed-By: Jon Kohler <jon@nutanix.com>