[PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes

Paolo Bonzini posted 12 patches 4 years, 4 months ago
There is a newer version of this series
Posted by Paolo Bonzini 4 years, 4 months ago
The TDP MMU has a performance regression compared to the legacy MMU
when CR0 changes often.  This was reported for the grsecurity kernel,
which uses CR0.WP to implement kernel W^X.  In that case, each change to
CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
nested, this can even cause the L1 to hardly make progress, as the L0
hypervisor is overwhelmed by the amount of MMU work that is needed.

Initially, my plan for this was to pull kvm_mmu_unload from
kvm_mmu_reset_context into kvm_init_mmu.  Therefore I started by separating
the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.) from the shadow
page table format.  Right now the "MMU role" is a messy mix of the two
and, whenever something is different between the MMU and the CPU, it is
stored as an extra field in struct kvm_mmu; for extra bonus complication,
sometimes the same thing is stored in both the role and an extra field.
The aim was to keep kvm_mmu_unload only if the MMU role changed, and
drop it if the CPU role changed.
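
As an illustration of that initial plan, here is a minimal userspace sketch.
All types and names (toy_cpu_role, toy_mmu_role, toy_init_mmu) are made up
for this example and are not KVM's real role unions; which bit lives in
which role is an assumption based only on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified roles; not the layout of KVM's real role unions. */
struct toy_cpu_role {		/* guest CPU setup */
	bool cr0_pg, cr0_wp, efer_lma, smm, guest_mode;
};

struct toy_mmu_role {		/* shadow page table format */
	uint8_t level;		/* paging levels of the shadow tables */
	bool direct;		/* no guest page tables to shadow */
};

struct toy_mmu {
	struct toy_cpu_role cpu_role;
	struct toy_mmu_role mmu_role;
	unsigned int unloads;	/* how often the roots were dropped */
};

bool toy_mmu_role_equal(struct toy_mmu_role a, struct toy_mmu_role b)
{
	return a.level == b.level && a.direct == b.direct;
}

/* Record the new roles; drop the roots only if the table format changed. */
bool toy_init_mmu(struct toy_mmu *mmu, struct toy_cpu_role cpu,
		  struct toy_mmu_role fmt)
{
	bool unload;

	assert(mmu);
	unload = !toy_mmu_role_equal(mmu->mmu_role, fmt);
	mmu->cpu_role = cpu;
	mmu->mmu_role = fmt;
	if (unload)
		mmu->unloads++;
	return unload;
}
```

In this toy model a CR0.WP toggle changes only the CPU role, so the roots
survive; a change in the number of paging levels still drops them.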

I even posted that cleanup, but it occurred to me later that even
a conditional kvm_mmu_unload in kvm_init_mmu would be overkill.
kvm_mmu_unload is only needed in the rare cases where a TLB flush is
needed (e.g. CR0.PG changing from 1 to 0) or where the guest page table
interpretation changes in a way not captured by the role (that is, CPUID
changes).  But the implementation of fast PGD switching is subtle
and requires a call to kvm_mmu_new_pgd (and therefore knowing the
new MMU role) before kvm_init_mmu, therefore kvm_mmu_reset_context
chickens out and drops all the roots.
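
The two cases named above can be written down as a predicate.  This is only
a sketch: toy_vcpu_state and cpuid_generation are hypothetical, and since
CR0.PG going from 1 to 0 is given only as an example of a change that needs
a TLB flush, the check below is not meant to be exhaustive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state relevant to this decision. */
struct toy_vcpu_state {
	bool cr0_pg;
	bool cr0_wp;
	/* Made-up counter, bumped whenever the guest CPUID is updated. */
	unsigned int cpuid_generation;
};

/* Return true if going from 'cur' to 'next' should drop the roots. */
bool toy_needs_unload(const struct toy_vcpu_state *cur,
		      const struct toy_vcpu_state *next)
{
	assert(cur && next);

	/* Paging turned off: one example of a change that needs a TLB flush. */
	if (cur->cr0_pg && !next->cr0_pg)
		return true;

	/* CPUID changed: guest page tables may be interpreted differently. */
	if (cur->cpuid_generation != next->cpuid_generation)
		return true;

	/* Everything else, including a CR0.WP toggle, keeps the roots. */
	return false;
}
```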

Therefore, the meat of this series is a reorganization of fast PGD
switching; it makes it possible to call kvm_mmu_new_pgd *after*
the MMU has been set up, just using the MMU role instead of
kvm_mmu_calc_root_page_role.
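
To make the idea concrete, here is a small userspace model of a root cache
whose lookup uses the role already stored in the MMU, so it can run after
the MMU has been set up.  Everything here (toy_root, toy_mmu, toy_new_pgd,
a cache of three previous roots) is an assumption made for the sketch and
does not reproduce the code in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_PREV_ROOTS 3	/* cache size; an assumption of this sketch */

struct toy_root {
	uint64_t pgd;		/* guest page table base this root shadows */
	uint32_t role;		/* role the root was built with */
	bool valid;
};

struct toy_mmu {
	uint32_t role;		/* already set up by the (hypothetical) init step */
	struct toy_root root;	/* current root */
	struct toy_root prev[TOY_NUM_PREV_ROOTS];
};

bool toy_root_usable(const struct toy_root *r, uint64_t pgd, uint32_t role)
{
	return r->valid && r->pgd == pgd && r->role == role;
}

/*
 * Switch to new_pgd, matching cached roots against mmu->role.  Returns
 * true if an existing root could be reused, false if the caller has to
 * build a new root.
 */
bool toy_new_pgd(struct toy_mmu *mmu, uint64_t new_pgd)
{
	struct toy_root tmp;
	int i;

	assert(mmu);
	if (toy_root_usable(&mmu->root, new_pgd, mmu->role))
		return true;

	/*
	 * Walk the cache, pushing the previous candidate one slot down on
	 * each step, so that the most recently used root ends up in prev[0].
	 */
	for (i = 0; i < TOY_NUM_PREV_ROOTS; i++) {
		tmp = mmu->prev[i];
		mmu->prev[i] = mmu->root;
		mmu->root = tmp;
		if (toy_root_usable(&mmu->root, new_pgd, mmu->role))
			return true;
	}

	/* Miss: the oldest entry fell out of the cache. */
	mmu->root.valid = false;
	return false;
}
```

A root cached for the same PGD but built with a different role is not
reused, which is why the lookup has to know the role in the first place.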

Patches 1 to 3 are bugfixes found while working on the series.

Patches 4 and 5 add more sanity checks that triggered a lot during
development.

Patches 6 and 7 are related cleanups.  In particular patch 7 makes
the cache lookup code a bit more pleasant.

Patches 8 and 9 rework the fast PGD switching.  Patches 10 and
11 are cleanups enabled by the rework, and the only survivors
of the CPU role patchset.

Finally, patch 12 optimizes kvm_mmu_reset_context.

Paolo


Paolo Bonzini (12):
  KVM: x86: host-initiated EFER.LME write affects the MMU
  KVM: MMU: move MMU role accessors to header
  KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  KVM: MMU: rename kvm_mmu_reload
  KVM: x86: use struct kvm_mmu_root_info for mmu->root
  KVM: MMU: do not consult levels when freeing roots
  KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  KVM: MMU: load new PGD after the shadow MMU is initialized
  KVM: MMU: remove kvm_mmu_calc_root_page_role
  KVM: x86: do not unload MMU roots on all role changes

 arch/x86/include/asm/kvm_host.h |   3 +-
 arch/x86/kvm/mmu.h              |  28 +++-
 arch/x86/kvm/mmu/mmu.c          | 253 ++++++++++++++++----------------
 arch/x86/kvm/mmu/mmu_audit.c    |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +-
 arch/x86/kvm/svm/nested.c       |   6 +-
 arch/x86/kvm/vmx/nested.c       |   8 +-
 arch/x86/kvm/vmx/vmx.c          |   2 +-
 arch/x86/kvm/x86.c              |  39 +++--
 11 files changed, 190 insertions(+), 159 deletions(-)

-- 
2.31.1

Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
Posted by Sean Christopherson 4 years, 4 months ago
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> The TDP MMU has a performance regression compared to the legacy MMU
> when CR0 changes often.  This was reported for the grsecurity kernel,
> which uses CR0.WP to implement kernel W^X.  In that case, each change to
> CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
> nested, this can even cause the L1 to hardly make progress, as the L0
> hypervisor is overwhelmed by the amount of MMU work that is needed.

FWIW, my flushing/zapping series fixes this by doing the teardown in an async
worker.  There's even a selftest for this exact case :-)

https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com
Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
Posted by Paolo Bonzini 4 years, 4 months ago
On 2/9/22 18:07, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> The TDP MMU has a performance regression compared to the legacy MMU
>> when CR0 changes often.  This was reported for the grsecurity kernel,
>> which uses CR0.WP to implement kernel W^X.  In that case, each change to
>> CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
>> nested, this can even cause the L1 to hardly make progress, as the L0
>> hypervisor is overwhelmed by the amount of MMU work that is needed.
> 
> FWIW, my flushing/zapping series fixes this by doing the teardown in an async
> worker.  There's even a selftest for this exact case :-)
> 
> https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com

I'll check it out (it's next on my list as soon as I finally push 
kvm/{master,next}, which in turn was blocked by this work).

But not zapping the roots is even better---especially when KVM is nested 
and the TDP MMU's page table rebuild is very heavy on L0.  I'm not sure 
if there are any (cumulative) stats that capture the optimization, but 
if not I'll add them.

Paolo

Re: [PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes
Posted by Sean Christopherson 4 years, 4 months ago
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> On 2/9/22 18:07, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > The TDP MMU has a performance regression compared to the legacy MMU
> > > when CR0 changes often.  This was reported for the grsecurity kernel,
> > > which uses CR0.WP to implement kernel W^X.  In that case, each change to
> > > CR0.WP unloads the MMU and causes a lot of unnecessary work.  When running
> > > nested, this can even cause the L1 to hardly make progress, as the L0
> > > hypervisor is overwhelmed by the amount of MMU work that is needed.
> > 
> > FWIW, my flushing/zapping series fixes this by doing the teardown in an async
> > worker.  There's even a selftest for this exact case :-)
> > 
> > https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com
> 
> I'll check it out (it's next on my list as soon as I finally push
> kvm/{master,next}, which in turn was blocked by this work).

No rush, I need to spin a new version (rebase, and hopefully drop unnecessarily
complex behavior).

> But not zapping the roots is even better

No argument there :-)