This series optimizes the KVM mmu_notifier .change_pte() handler in the
x86 TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code
and prefetching the notified new PFN into SPTEs directly in the handler.
As noted in [1], .change_pte() has been dead code on x86 for 10+ years.
Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
prepare for the optimization in patch 2.
Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
directly in the handler, using the PFN info carried by .change_pte(), so
that each vCPU write that triggers .change_pte() does not incur two
VM-Exits and two TDP page faults.
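
To illustrate the intended effect, here is a minimal user-space toy model
of the two behaviours (illustrative only, not kernel code; the flat
gfn->pfn arrays stand in for the primary and secondary MMUs):

#include <stdint.h>
#include <stdio.h>

#define NPAGES 16

static uint64_t primary[NPAGES];    /* primary MMU:   gfn -> pfn */
static uint64_t secondary[NPAGES];  /* secondary MMU: gfn -> pfn, 0 = not present */
static unsigned long tdp_faults;

/* Old behaviour: the notification only zaps the stale SPTE, so the
 * vCPU's next access takes another TDP page fault to re-map it. */
static void change_pte_zap_only(uint64_t gfn, uint64_t new_pfn)
{
	primary[gfn] = new_pfn;
	secondary[gfn] = 0;
}

/* Patch-2 idea: the notification already carries the new PFN, so
 * install it into the secondary MMU directly and skip the refault. */
static void change_pte_prefetch(uint64_t gfn, uint64_t new_pfn)
{
	primary[gfn] = new_pfn;
	secondary[gfn] = new_pfn;
}

/* Guest access through the secondary MMU; a miss is a TDP page fault. */
static uint64_t guest_read(uint64_t gfn)
{
	if (!secondary[gfn]) {
		tdp_faults++;
		secondary[gfn] = primary[gfn];
	}
	return secondary[gfn];
}

int main(void)
{
	primary[1] = 0x41;
	guest_read(1);                 /* initial fault maps gfn 1 */

	change_pte_zap_only(1, 0x42);  /* e.g. COW resolved to a new pfn */
	guest_read(1);                 /* refault: zap-only faults again */

	change_pte_prefetch(1, 0x43);
	guest_read(1);                 /* no fault: SPTE was prefetched */

	printf("TDP faults taken: %lu\n", tdp_faults); /* prints 2, not 3 */
	return 0;
}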
base-commit: fdf0eaf11452 + Sean's patch "KVM: Wrap kvm_{gfn,hva}_range.pte
in a per-action union" [2]
[1]: https://lore.kernel.org/lkml/ZMAO6bhan9l6ybQM@google.com/
[2]: https://lore.kernel.org/lkml/20230729004144.1054885-1-seanjc@google.com/
Yan Zhao (2):
  KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 TDP MMU
  KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte()
    handler
arch/x86/kvm/mmu/tdp_mmu.c | 101 +++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 33 deletions(-)
--
2.17.1
On Tue, Aug 08, 2023, Yan Zhao wrote:
> This series optimizes the KVM mmu_notifier .change_pte() handler in the
> x86 TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code
> and prefetching the notified new PFN into SPTEs directly in the handler.
>
> As noted in [1], .change_pte() has been dead code on x86 for 10+ years.
> Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
> prepare for the optimization in patch 2.

If we're going to officially kill the long-dead attempt at optimizing KSM,
I'd strongly prefer to rip out .change_pte() entirely, i.e. kill it off in
all architectures and remove it from mmu_notifiers. The only reason I
haven't proposed such patches is because I didn't want it to backfire and
lead to someone trying to resurrect the optimizations for KSM.

> Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
> directly in the handler, using the PFN info carried by .change_pte(), so
> that each vCPU write that triggers .change_pte() does not incur two
> VM-Exits and two TDP page faults.

IMO, prefaulting guest memory as writable is better handled by userspace,
e.g. by using QEMU's prealloc option. It's more coarse grained, but at a
minimum it's sufficient for improving guest boot time, e.g. by
preallocating memory below 4GiB.

And we can do even better, e.g. by providing a KVM ioctl() to allow
userspace to prefault memory not just into the primary MMU, but also into
KVM's MMU. Such an ioctl() is basically mandatory for TDX; we just need
to morph the support being added by TDX into a generic ioctl()[*].

Prefaulting guest memory as writable into the primary MMU should achieve
far better performance than hooking .change_pte(), as it will avoid the
mmu_notifier invalidation, e.g. won't trigger taking mmu_lock for write
and the resulting remote TLB flush(es). And a KVM ioctl() to prefault
into KVM's MMU should eliminate page fault VM-Exits entirely.

Explicit prefaulting isn't perfect, but IMO the value added by prefetching
in .change_pte() isn't enough to justify carrying the hook and the code
in KVM.

[*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com
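
To make the prefault suggestion concrete, below is a hypothetical
userspace sketch of such an ioctl(). The structure name, layout, and
ioctl number are illustrative assumptions only; no such uAPI existed in
this thread. (The QEMU option referenced above is the prealloc=on
property of memory backends, e.g. -object memory-backend-ram,...,prealloc=on.)

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>   /* for KVMIO */

/* Hypothetical uAPI: name, layout, and number are illustrative only. */
struct kvm_prefault_memory {
	uint64_t gpa;    /* guest physical address to start prefaulting at */
	uint64_t size;   /* number of bytes to prefault */
	uint64_t flags;  /* reserved; e.g. could request write access */
};

#define KVM_PREFAULT_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_prefault_memory)

/* Prefault all guest memory below 4GiB into KVM's MMU, so that guest
 * boot takes no page-fault VM-Exits for that range. */
static int prefault_low_4g(int vcpu_fd)
{
	struct kvm_prefault_memory range = {
		.gpa   = 0,
		.size  = 4ULL << 30,
		.flags = 0,
	};

	return ioctl(vcpu_fd, KVM_PREFAULT_MEMORY, &range);
}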
On Wed, Aug 16, 2023 at 11:18:03AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Yan Zhao wrote:
> > This series optimizes the KVM mmu_notifier .change_pte() handler in the
> > x86 TDP MMU (i.e. kvm_tdp_mmu_set_spte_gfn()) by removing old dead code
> > and prefetching the notified new PFN into SPTEs directly in the handler.
> >
> > As noted in [1], .change_pte() has been dead code on x86 for 10+ years.
> > Patch 1 drops the dead code in the x86 TDP MMU to save CPU cycles and to
> > prepare for the optimization in patch 2.
>
> If we're going to officially kill the long-dead attempt at optimizing KSM,
> I'd strongly prefer to rip out .change_pte() entirely, i.e. kill it off in
> all architectures and remove it from mmu_notifiers. The only reason I
> haven't proposed such patches is because I didn't want it to backfire and
> lead to someone trying to resurrect the optimizations for KSM.
>
> > Patch 2 optimizes the TDP MMU's .change_pte() handler to prefetch SPTEs
> > directly in the handler, using the PFN info carried by .change_pte(), so
> > that each vCPU write that triggers .change_pte() does not incur two
> > VM-Exits and two TDP page faults.
>
> IMO, prefaulting guest memory as writable is better handled by userspace,
> e.g. by using QEMU's prealloc option. It's more coarse grained, but at a
> minimum it's sufficient for improving guest boot time, e.g. by
> preallocating memory below 4GiB.
>
> And we can do even better, e.g. by providing a KVM ioctl() to allow
> userspace to prefault memory not just into the primary MMU, but also into
> KVM's MMU. Such an ioctl() is basically mandatory for TDX; we just need
> to morph the support being added by TDX into a generic ioctl()[*].
>
> Prefaulting guest memory as writable into the primary MMU should achieve
> far better performance than hooking .change_pte(), as it will avoid the
> mmu_notifier invalidation, e.g. won't trigger taking mmu_lock for write
> and the resulting remote TLB flush(es). And a KVM ioctl() to prefault
> into KVM's MMU should eliminate page fault VM-Exits entirely.
>
> Explicit prefaulting isn't perfect, but IMO the value added by prefetching
> in .change_pte() isn't enough to justify carrying the hook and the code
> in KVM.
>
> [*] https://lore.kernel.org/all/ZMFYhkSPE6Zbp8Ea@google.com

Hi Sean,
As I didn't describe the full picture of patch 2 well in the cover letter,
may I ask you to take a look at patch 2 to see if you like it? (in case
you have only read the cover letter)

What I observed is that each vCPU write to a COW page in the primary MMU
leads to two TDP page faults. So I simply update the secondary MMU during
the first TDP page fault to avoid the second one.
It's not a blind prefetch (I check the vCPU to ensure the notification is
triggered by a vCPU operation as much as possible), and it can benefit
guests that don't explicitly request that memory be prefaulted as writable.

Thanks
Yan
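
The "not a blind prefetch" guard Yan describes might look roughly like the
following kernel-style sketch. It is simplified and hypothetical, not the
actual patch; kvm_get_running_vcpu() is an existing KVM helper that returns
the vCPU running on the current CPU, or NULL.

#include <linux/kvm_host.h>

/*
 * Hypothetical sketch: prefetch only when the .change_pte()
 * notification fired in the context of a vCPU of this VM, e.g. a COW
 * break taken while handling that vCPU's own TDP page fault; otherwise
 * fall back to just zapping the stale SPTE.
 */
static bool change_pte_is_vcpu_triggered(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	return vcpu && vcpu->kvm == kvm;
}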