KVM: x86: Revert SLOT_ZAP_ALL quirk

[PATCH 0/4] KVM: x86: Revert SLOT_ZAP_ALL quirk

Posted by Sean Christopherson 1 year, 4 months ago

Revert the entire KVM_X86_QUIRK_SLOT_ZAP_ALL series, as the code is buggy
for shadow MMUs, and I'm not convinced a quirk is actually the right way
forward.  I'm not totally opposed to it (obviously, given that I suggested
it at one point), but I would prefer to give ourselves ample time to sort
out exactly how we want to move forward, i.e. not rush something in to
unhose v6.12.

Sean Christopherson (4):
  Revert "KVM: selftests: Test memslot move in memslot_perf_test with
    quirk disabled"
  Revert "KVM: selftests: Allow slot modification stress test with quirk
    disabled"
  Revert "KVM: selftests: Test slot move/delete with slot zap quirk
    enabled/disabled"
  Revert "KVM: x86/mmu: Introduce a quirk to control memslot zap
    behavior"

 Documentation/virt/kvm/api.rst                |  8 -----
 arch/x86/include/asm/kvm_host.h               |  3 +-
 arch/x86/include/uapi/asm/kvm.h               |  1 -
 arch/x86/kvm/mmu/mmu.c                        | 34 +------------------
 .../kvm/memslot_modification_stress_test.c    | 19 ++---------
 .../testing/selftests/kvm/memslot_perf_test.c | 12 +------
 .../selftests/kvm/set_memory_region_test.c    | 29 +++++-----------
 7 files changed, 13 insertions(+), 93 deletions(-)


base-commit: 3f8df6285271d9d8f17d733433e5213a63b83a0b
-- 
2.46.1.824.gd892dcdcdd-goog

Re: [PATCH 0/4] KVM: x86: Revert SLOT_ZAP_ALL quirk

Posted by Paolo Bonzini 1 year, 4 months ago

On Fri, Sep 27, 2024 at 2:18 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Revert the entire KVM_X86_QUIRK_SLOT_ZAP_ALL series, as the code is buggy
> for shadow MMUs, and I'm not convinced a quirk is actually the right way
> forward.  I'm not totally opposed to it (obviously, given that I suggested
> it at one point), but I would prefer to give ourselves ample time to sort
> out exactly how we want to move forward, i.e. not rush something in to
> unhose v6.12.

Yeah, the code is buggy but I think it's safe enough to use code like the
one you wrote back in 2019; untested patch follows:

------------------------------- 8< ------------------------
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri, 27 Sep 2024 06:25:35 -0400
Subject: [PATCH] KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU

As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant
pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs,
need to be zapped.  All of the accounting for a shadow page is tied to the
memslot, i.e. the shadow page holds a reference to the memslot, for all
intents and purposes.  Deleting the memslot without removing all relevant
shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled,
results in NULL pointer derefs when tearing down the VM.

Reintroduce from that commit the code that walks the whole memslot when
there are active shadow MMU pages.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e081f785fb23..6843535905fb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7049,14 +7049,42 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
  	kvm_mmu_zap_all(kvm);
  }

-/*
- * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
- *
- * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
- * case scenario we'll have unused shadow pages lying around until they
- * are recycled due to age or when the VM is destroyed.
- */
-static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
+						struct kvm_memory_slot *slot,
+						bool flush)
+{
+	LIST_HEAD(invalid_list);
+	unsigned long i;
+
+	if (list_empty(&kvm->arch.active_mmu_pages))
+		goto out_flush;
+
+	/*
+	 * Since accounting information is stored in struct kvm_arch_memory_slot,
+	 * deleting shadow pages (e.g. in unaccount_shadowed()) requires that all
+	 * gfns with a shadow page have a corresponding memslot.  Do so before
+	 * the memslot goes away.
+	 */
+	for (i = 0; i < slot->npages; i++) {
+		struct kvm_mmu_page *sp;
+		gfn_t gfn = slot->base_gfn + i;
+
+		for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
+			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+
+		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+			flush = false;
+			cond_resched_rwlock_write(&kvm->mmu_lock);
+		}
+	}
+
+out_flush:
+	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+}
+
+static void kvm_mmu_zap_memslot(struct kvm *kvm,
+				struct kvm_memory_slot *slot)
  {
  	struct kvm_gfn_range range = {
  		.slot = slot,
@@ -7064,11 +7097,11 @@ static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *s
  		.end = slot->base_gfn + slot->npages,
  		.may_block = true,
  	};
+	bool flush;

  	write_lock(&kvm->mmu_lock);
-	if (kvm_unmap_gfn_range(kvm, &range))
-		kvm_flush_remote_tlbs_memslot(kvm, slot);
-
+	flush = kvm_unmap_gfn_range(kvm, &range);
+	kvm_mmu_zap_memslot_pages_and_flush(kvm, slot, flush);
  	write_unlock(&kvm->mmu_lock);
  }

@@ -7084,7 +7117,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
  	if (kvm_memslot_flush_zap_all(kvm))
  		kvm_mmu_zap_all_fast(kvm);
  	else
-		kvm_mmu_zap_memslot_leafs(kvm, slot);
+		kvm_mmu_zap_memslot(kvm, slot);
  }

  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
--------------------------------------------------

(Not too sure about using the sp_has_gptes() test, which is why I haven't
posted this yet).
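
The accounting dependency described in the changelog above can be modeled in userspace. This is only a toy sketch: every `toy_*` name is hypothetical, and it merely assumes, as the changelog states, that per-gfn shadow page accounting lives in the memslot, so undoing the accounting requires the memslot lookup to still succeed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define TOY_MAX_SLOTS 4

/* Toy memslot (hypothetical): the accounting lives inside the slot itself. */
struct toy_memslot {
	bool valid;
	gfn_t base_gfn;
	uint64_t npages;
	unsigned int nr_shadowed;	/* shadow pages accounted to this slot */
};

struct toy_vm {
	struct toy_memslot slots[TOY_MAX_SLOTS];
};

/* Return the valid slot covering gfn, or NULL if no slot covers it. */
static struct toy_memslot *toy_gfn_to_memslot(struct toy_vm *vm, gfn_t gfn)
{
	for (size_t i = 0; i < TOY_MAX_SLOTS; i++) {
		struct toy_memslot *s = &vm->slots[i];

		if (s->valid && gfn >= s->base_gfn &&
		    gfn < s->base_gfn + s->npages)
			return s;
	}
	return NULL;
}

/* Account one shadow page for gfn; fails if no memslot covers the gfn. */
static bool toy_account_shadowed(struct toy_vm *vm, gfn_t gfn)
{
	struct toy_memslot *s = toy_gfn_to_memslot(vm, gfn);

	if (!s)
		return false;
	s->nr_shadowed++;
	return true;
}

/*
 * Undo the accounting for gfn.  Returns false when the slot is gone, which
 * is the point where code that assumes the slot exists would dereference a
 * NULL pointer.
 */
static bool toy_unaccount_shadowed(struct toy_vm *vm, gfn_t gfn)
{
	struct toy_memslot *s = toy_gfn_to_memslot(vm, gfn);

	if (!s || !s->nr_shadowed)
		return false;
	s->nr_shadowed--;
	return true;
}
```

In this model, unaccounting after the slot has been invalidated fails, while unaccounting before the slot goes away succeeds; the latter ordering is what walking the memslot before deletion is meant to guarantee.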

With respect to the choice of API, the quirk is at least good for
testing; this was already proven, I guess.

Also I think it's safe to enable it for SEV/SEV-ES VM types: they
pretty much depend on NPT (see sev_hardware_setup), and with the
TDP MMU it should always be better to kill the PTEs for the memslot
(even if invalidating the whole MMU is cheap) to avoid having to
fault all the remainder of the memory back in.  So I think the current
version of kvm_memslot_flush_zap_all() is better than using e.g.
kvm_arch_has_private_mem().

The only straggler is software-protected VMs, which I don't care
too much about; but if anything it's better to make them closer to
SNP and TDX VM types.

For now I think I'll send the existing kvm/next to Linus and we
can sort it out next week, as the weekend (and the closure of the
merge window) is impending...

Paolo

Re: [PATCH 0/4] KVM: x86: Revert SLOT_ZAP_ALL quirk

Posted by Sean Christopherson 1 year, 4 months ago

On Fri, Sep 27, 2024, Paolo Bonzini wrote:
> On Fri, Sep 27, 2024 at 2:18 AM Sean Christopherson <seanjc@google.com> wrote:
> > 
> > Revert the entire KVM_X86_QUIRK_SLOT_ZAP_ALL series, as the code is buggy
> > for shadow MMUs, and I'm not convinced a quirk is actually the right way
> > forward.  I'm not totally opposed to it (obviously, given that I suggested
> > it at one point), but I would prefer to give ourselves ample time to sort
> > out exactly how we want to move forward, i.e. not rush something in to
> > unhose v6.12.
> 
> Yeah, the code is buggy but I think it's safe enough to use code like the
> one you wrote back in 2019; untested patch follows:

...

> (Not too sure about using the sp_has_gptes() test, which is why I haven't
> posted this yet).

Heh, I was going to ask about that too.  Luckily I read ahead :-)

To be 100% safe, I think the zap needs to purge everything, even invalid SPs.
I doubt it would ever cause problems to leave dangling invalid SPs, but I don't
love the idea of avoiding UAF purely by relying on KVM not consuming stale info.

The other thing that makes my head hurt is how gfns are tracked for direct SPs in
the shadow MMU, i.e. the effect of direct_map() and the guest hugepage case
(it would be weird, but legal, for the guest to create a hugepage that straddles
a memslot boundary) rounding the gfn for the level when creating SPs.

Hmm, but I suppose that's an argument against being paranoid for the !sp_has_gptes()
case, as KVM already creates SPs with a target gfn that isn't covered by a memslot.
Blech.
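
The rounding concern can be illustrated with plain arithmetic. This is a minimal userspace sketch, not kernel code: the `toy_*` helpers are hypothetical and only assume the usual x86 layout of 9 address bits per paging level, i.e. 512 4KiB pages per 2MiB mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Toy stand-in for a memslot (hypothetical): only the gfn range matters. */
struct toy_memslot {
	gfn_t base_gfn;
	uint64_t npages;
};

/* 4KiB pages covered by one mapping: 1 at level 1, 512 at level 2, ... */
static uint64_t toy_pages_per_hpage(int level)
{
	assert(level >= 1 && level <= 5);
	return 1ULL << ((level - 1) * 9);
}

/* Round a gfn down to the start of the region mapped at the given level. */
static gfn_t toy_gfn_round_for_level(gfn_t gfn, int level)
{
	return gfn & ~(toy_pages_per_hpage(level) - 1);
}

/* True if gfn falls inside the slot's [base_gfn, base_gfn + npages) range. */
static bool toy_slot_contains(const struct toy_memslot *slot, gfn_t gfn)
{
	return gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages;
}
```

For a slot with base gfn 0x1100 and 0x400 pages, gfn 0x1180 is inside the slot, but rounding it for a 2MiB mapping (level 2) yields gfn 0x1000, which is below the slot's base. A walk over the slot's own gfn range would therefore never visit the gfn that such an SP is keyed on.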

> With respect to the choice of API, the quirk is at least good for
> testing; this was already proven, I guess.

True.  I do think the documentation should be updated to be less prescriptive,
i.e. to give KVM wiggle room.  Disabling the quirk should only _allow_ KVM to
do a targeted/partial zap, it shouldn't _force_ KVM to do so.
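
A rough sketch of the "allow, don't force" semantics, as a userspace model rather than KVM code.  All `toy_*`/`TOY_*` names are hypothetical, and the fallback condition (shadow pages being present) is just one assumed example of a case where KVM might still prefer the full zap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical quirk bit, standing in for KVM_X86_QUIRK_SLOT_ZAP_ALL. */
#define TOY_QUIRK_SLOT_ZAP_ALL	(1u << 0)

struct toy_vm {
	uint32_t disabled_quirks;	/* quirks userspace has disabled */
	bool has_shadow_pages;		/* assumed reason to fall back */
};

enum toy_zap {
	TOY_ZAP_ALL,		/* invalidate the whole MMU */
	TOY_ZAP_MEMSLOT,	/* zap only the affected memslot */
};

static enum toy_zap toy_choose_zap(const struct toy_vm *vm)
{
	/* Quirk enabled: userspace is promised the zap-everything behavior. */
	if (!(vm->disabled_quirks & TOY_QUIRK_SLOT_ZAP_ALL))
		return TOY_ZAP_ALL;

	/*
	 * Quirk disabled: a targeted zap is permitted but not required, so
	 * the model is still free to zap everything when it chooses to.
	 */
	if (vm->has_shadow_pages)
		return TOY_ZAP_ALL;

	return TOY_ZAP_MEMSLOT;
}
```

The point of the model is the asymmetry: with the quirk enabled the result is always TOY_ZAP_ALL, whereas with the quirk disabled either result is acceptable and userspace must not depend on which one it gets.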

> Also I think it's safe to enable it for SEV/SEV-ES VM types: they
> pretty much depend on NPT (see sev_hardware_setup), and with the
> TDP MMU it should always be better to kill the PTEs for the memslot
> (even if invalidating the whole MMU is cheap) to avoid having to
> fault all the remainder of the memory back in.  So I think the current
> version of kvm_memslot_flush_zap_all() is better than using e.g.
> kvm_arch_has_private_mem().

In practice, you're probably right.  Realistically, the only memslot removal that
would be problematic is the deletion of a large memslot, at which point SEV+ VMs
are in for a world of hurt no matter what.

> The only straggler is software-protected VMs, which I don't care
> too much about; but if anything it's better to make them closer to
> SNP and TDX VM types.
> 
> For now I think I'll send the existing kvm/next to Linus and we
> can sort it out next week, as the weekend (and the closure of the
> merge window) is impending...

Works for me.  Thanks!