[PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior

Posted by Yan Zhao 1 year, 5 months ago
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behaves as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.

The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled:  Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
                 moving/deleting memory slot ("zap-slot-leafs-only").

"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.

Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].

By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.
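
For reference, a minimal userspace sketch of opting out of the quirk via
KVM_CAP_DISABLE_QUIRKS2 could look like the following (error handling
omitted; vm_fd and the helper name are illustrative, and the headers are
assumed to carry this patch's uapi addition):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Disable KVM_X86_QUIRK_SLOT_ZAP_ALL for one VM, i.e. request the
 * "zap-slot-leafs-only" behavior on memslot move/deletion.
 */
static int disable_slot_zap_all_quirk(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DISABLE_QUIRKS2,
                .args[0] = KVM_X86_QUIRK_SLOT_ZAP_ALL,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

With this, specific KVM_X86_DEFAULT_VM VMs known to be unaffected by the
bug can opt into the more precise zapping.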

For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without the user's opt-in, regardless of whether the user
opts for "zap-all".
This is because it is assumed, until proven otherwise, that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, because TDX must always have "zap-slot-leafs-only" selected.
In TDX's case a memslot's GPA range can be a mixture of "private" and
"shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
   at least leaf-entries beyond the deleting memslot scope. However, TDX
   demands that the root page of the private page table remains unchanged,
   with leaf entries being zapped before non-leaf entries, and any dropped
   private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
   pages still being mapped when the memslot is gone. This may affect even
   other processes if later the gmem fd was whole punched, causing the
   pages being freed on the host while still mapped in the TD, because
   there's no pgoff to the gfn information to zap the private page table
   after memslot is gone.

So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).

Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing the special handling of APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, because
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
   it be moved. It is only deleted by KVM when APICv is permanently
   inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
   APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoiding a KVM_REQ_APIC_PAGE_RELOAD request to all vCPUs saves
   costly IPIs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 Documentation/virt/kvm/api.rst  |  8 +++++++
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/mmu/mmu.c          | 42 ++++++++++++++++++++++++++++++++-
 4 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a71d91978d9e..7c1248142a5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7997,6 +7997,14 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
                                     guest CPUID on writes to MISC_ENABLE if
                                     KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
                                     disabled.
+
+KVM_X86_QUIRK_SLOT_ZAP_ALL          By default, KVM invalidates all SPTEs in
+                                    a fast way for memslot deletion when VM type
+                                    is KVM_X86_DEFAULT_VM.
+                                    When this quirk is disabled or when VM type
+                                    is other than KVM_X86_DEFAULT_VM, KVM zaps
+                                    only leaf SPTEs that are within the range of
+                                    the memslot being deleted.
 =================================== ============================================
 
 7.32 KVM_CAP_MAX_VCPU_ID
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9bb2e164c523..a40577911744 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2327,7 +2327,8 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 	 KVM_X86_QUIRK_OUT_7E_INC_RIP |		\
 	 KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT |	\
 	 KVM_X86_QUIRK_FIX_HYPERCALL_INSN |	\
-	 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
+	 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS |	\
+	 KVM_X86_QUIRK_SLOT_ZAP_ALL)
 
 /*
  * KVM previously used a u32 field in kvm_run to indicate the hypercall was
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 988b5204d636..6a399859c42d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -438,6 +438,7 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT	(1 << 4)
 #define KVM_X86_QUIRK_FIX_HYPERCALL_INSN	(1 << 5)
 #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS	(1 << 6)
+#define KVM_X86_QUIRK_SLOT_ZAP_ALL		(1 << 7)
 
 #define KVM_STATE_NESTED_FORMAT_VMX	0
 #define KVM_STATE_NESTED_FORMAT_SVM	1
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1432deb75cbb..be3a82a32295 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6915,10 +6915,50 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
 	kvm_mmu_zap_all(kvm);
 }
 
+/*
+ * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
+ *
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_gfn_range range = {
+		.slot = slot,
+		.start = slot->base_gfn,
+		.end = slot->base_gfn + slot->npages,
+		.may_block = true,
+	};
+	bool flush = false;
+
+	write_lock(&kvm->mmu_lock);
+
+	if (kvm_memslots_have_rmaps(kvm))
+		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
+
+	if (tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
+
+	if (flush)
+		kvm_flush_remote_tlbs_memslot(kvm, slot);
+
+	write_unlock(&kvm->mmu_lock);
+}
+
+static inline bool kvm_memslot_flush_zap_all(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_DEFAULT_VM &&
+	       kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL);
+}
+
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (kvm_memslot_flush_zap_all(kvm))
+		kvm_mmu_zap_all_fast(kvm);
+	else
+		kvm_mmu_zap_memslot_leafs(kvm, slot);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
-- 
2.43.2
Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Posted by Sean Christopherson 1 year, 2 months ago
On Wed, Jul 03, 2024, Yan Zhao wrote:
> Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
> KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
> VMs. Make sure KVM behaves as if the quirk is always disabled for
> non-KVM_X86_DEFAULT_VM VMs.
 
...

> Suggested-by: Kai Huang <kai.huang@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>

Bad Sean, bad.

> +/*
> + * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
> + *
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> +	struct kvm_gfn_range range = {
> +		.slot = slot,
> +		.start = slot->base_gfn,
> +		.end = slot->base_gfn + slot->npages,
> +		.may_block = true,
> +	};
> +	bool flush = false;
> +
> +	write_lock(&kvm->mmu_lock);
> +
> +	if (kvm_memslots_have_rmaps(kvm))
> +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);

This, and Paolo's merged variant, break shadow paging.  As was tried in commit
4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
all shadow pages, i.e. non-leaf SPTEs, need to be zapped.  All of the accounting
for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
to the memslot, for all intents and purposes.  Deleting the memslot without removing
all relevant shadow pages results in NULL pointer derefs when tearing down the VM.

Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com

Rather than trying to get this functional for shadow paging (which includes nested
TDP), I think we should scrap the quirk idea and simply make this the behavior for
S-EPT and nothing else.

 BUG: kernel NULL pointer dereference, address: 00000000000000b0
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 6085f43067 P4D 608c080067 PUD 608c081067 PMD 0 
 Oops: Oops: 0000 [#1] SMP NOPTI
 CPU: 79 UID: 0 PID: 187063 Comm: set_memory_regi Tainted: G        W          6.11.0-smp--24867312d167-cpl #395
 Tainted: [W]=WARN
 Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
 RIP: 0010:__kvm_mmu_prepare_zap_page+0x3a9/0x7b0 [kvm]
 Code:  <48> 8b 8e b0 00 00 00 48 8b 96 e0 00 00 00 48 c1 e9 09 48 29 c8 8b
 RSP: 0018:ff314a25b19f7c28 EFLAGS: 00010212
 Call Trace:
  <TASK>
  kvm_arch_flush_shadow_all+0x7a/0xf0 [kvm]
  kvm_mmu_notifier_release+0x6c/0xb0 [kvm]
  mmu_notifier_unregister+0x85/0x140
  kvm_put_kvm+0x263/0x410 [kvm]
  kvm_vm_release+0x21/0x30 [kvm]
  __fput+0x8d/0x2c0
  __se_sys_close+0x71/0xc0
  do_syscall_64+0x83/0x160
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Posted by Yan Zhao 1 year, 2 months ago
On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> On Wed, Jul 03, 2024, Yan Zhao wrote:
> > Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
> > KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
> VMs. Make sure KVM behaves as if the quirk is always disabled for
> > non-KVM_X86_DEFAULT_VM VMs.
>  
> ...
> 
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> 
> Bad Sean, bad.
> 
> > +/*
> > + * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
> > + *
> > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > + * case scenario we'll have unused shadow pages lying around until they
> > + * are recycled due to age or when the VM is destroyed.
> > + */
> > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > +	struct kvm_gfn_range range = {
> > +		.slot = slot,
> > +		.start = slot->base_gfn,
> > +		.end = slot->base_gfn + slot->npages,
> > +		.may_block = true,
> > +	};
> > +	bool flush = false;
> > +
> > +	write_lock(&kvm->mmu_lock);
> > +
> > +	if (kvm_memslots_have_rmaps(kvm))
> > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> 
> This, and Paolo's merged variant, break shadow paging.  As was tried in commit
> 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> all shadow pages, i.e. non-leaf SPTEs, need to be zapped.  All of the accounting
> for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> to the memslot, for all intents and purposes.  Deleting the memslot without removing
> all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
> 
> Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
> 
> Rather than trying to get this functional for shadow paging (which includes nested
> TDP), I think we should scrap the quirk idea and simply make this the behavior for
> S-EPT and nothing else.
Ok. Thanks for identifying this error. Will change code to this way.
BTW: update some findings regarding to the previous bug with Nvidia GPU
assignment:
I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
reproducible when only leaf entries of memslot are zapped.
(no more detailed info due to limited time to debug).


> 
>  BUG: kernel NULL pointer dereference, address: 00000000000000b0
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 6085f43067 P4D 608c080067 PUD 608c081067 PMD 0 
>  Oops: Oops: 0000 [#1] SMP NOPTI
>  CPU: 79 UID: 0 PID: 187063 Comm: set_memory_regi Tainted: G        W          6.11.0-smp--24867312d167-cpl #395
>  Tainted: [W]=WARN
>  Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
>  RIP: 0010:__kvm_mmu_prepare_zap_page+0x3a9/0x7b0 [kvm]
>  Code:  <48> 8b 8e b0 00 00 00 48 8b 96 e0 00 00 00 48 c1 e9 09 48 29 c8 8b
>  RSP: 0018:ff314a25b19f7c28 EFLAGS: 00010212
>  Call Trace:
>   <TASK>
>   kvm_arch_flush_shadow_all+0x7a/0xf0 [kvm]
>   kvm_mmu_notifier_release+0x6c/0xb0 [kvm]
>   mmu_notifier_unregister+0x85/0x140
>   kvm_put_kvm+0x263/0x410 [kvm]
>   kvm_vm_release+0x21/0x30 [kvm]
>   __fput+0x8d/0x2c0
>   __se_sys_close+0x71/0xc0
>   do_syscall_64+0x83/0x160
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Posted by Yan Zhao 1 year ago
> BTW: update some findings regarding to the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of memslot are zapped.
> (no more detailed info due to limited time to debug).
+Alex, Weijiang, and Kevin

Some updates on the Nvidia GPU assignment issue.
The good news is that I may have identified the root cause of this issue.
However, given the root cause, I'm not 100% sure that the issue I observed is
the same one reported by Alex. So it still needs Alex's confirmation and help
to verify it in the original environment.

== My Environment ==
With help from Weijiang, I was able to reproduce the issue using a
GeForce GT 640 on a KBL desktop.
Besides the GeForce GT 640 assigned to the guest, this KBL desktop has an Intel
IGD device, which is used by the host OS.
The guest OS is win10. Guest workloads: a video player + furmark + passmark.

I can observe error patterns that are very similar to those described by Alex
at [1] on kernel tags before v5.19-rc1.
- I can observe the error patterns on kernel tag v5.3-rc4.
  (It uses the zap-only-memslot logic, and Alex reported at [2] that this
   version was affected by the issue.)
- From tag v5.3-rc6 to v5.19-rc1, zap-only-memslot was reverted.
  From tag v5.4-rc8, commit b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
  was introduced. (Though if I check out this commit directly, the kernel
  version is 5.4.0-rc6.)
  I can reproduce the issue on those kernel versions by adding back and forcing
  the zap-only-memslot logic, and setting kvm.nx_huge_pages=N.
  (Previously Weijiang found out that with kvm.nx_huge_pages=Y, the issue was
   not reproducible [3]).
- If I switched back to zap-all in all those versions, the error patterns were
  not observable.

== Root Cause ==
It turns out that with commit fc0051cb9590 ("iommu/vt-d: Check domain
force_snooping against attached devices"), the issue is not reproducible.
(I only bisected kernel tags. This commit first appeared in tag v5.19-rc1.)

Further analysis (with Kevin's help) shows that after commit fc0051cb9590
("iommu/vt-d: Check domain force_snooping against attached devices"), VFIO
always detected the NVidia GPU device as a coherent DMA device. Prior to that
commit, VFIO detected the NVidia GPU device as a non-coherent DMA device by
querying cache coherency from the Intel IOMMU driver, which, however,
incorrectly reported failure if any IOMMU lacked snoop control support.

As a result, if the machine had an Intel IGD device,
- on the Intel IOMMU driver side, it would not enforce snoop for the assigned
  NVidia GPU device in the IOMMU SLPT.
- on KVM's side, KVM also found that kvm_arch_has_noncoherent_dma() was true
  and would emulate guest WBINVD.


In KVM's vmx_get_mt_mask(), with non-coherent DMA devices attached,
(using the code in tag v5.3-rc4 as an example):
- when guest CD=1 && kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED),
  the EPT memtype is MTRR_TYPE_WRBACK | VMX_EPT_IPAT_BIT;
- when CD=0, the EPT memtype is guest MTRR type (without VMX_EPT_IPAT_BIT).

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
        u8 cache;
        u64 ipat = 0;

        if (is_mmio) {
                cache = MTRR_TYPE_UNCACHABLE;
                goto exit;
        }

        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
                ipat = VMX_EPT_IPAT_BIT;
                cache = MTRR_TYPE_WRBACK;
                goto exit;
        }

        if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
                ipat = VMX_EPT_IPAT_BIT;
                if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
                        cache = MTRR_TYPE_WRBACK;
                else
                        cache = MTRR_TYPE_UNCACHABLE;
                goto exit;
        }

        cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);

exit:
        return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
}

However, with this vmx_get_mt_mask() implementation, KVM did not zap EPT on CD
toggling.
So if I applied patch [4], the error patterns previously observed were
immediately gone and the guest OS appeared quite stable.

Or if I changed vmx_get_mt_mask() as shown below, the issue was not reproducible
even if KVM did not zap EPT for CD toggling and update_mtrr().

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

        return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
}


So, my conclusion is that the Nvidia GPU assignment issue was caused by the lack
of EPT zapping when the guest toggles CD. (The CD toggling occurs per-vCPU
during guest bootup for enabling guest MTRRs.)
The lack of EPT zapping was previously masked by the zap-all operations for
memslot deletions during guest bootup. However, the error became apparent
when only the memslot's EPT entries were zapped. (The guest may have accessed
a GPA while CD=1, creating an EPT entry whose memtype is no longer correct
after CD=0.)

The ITLB_MULTIHIT mitigation [3] splits non-executable huge pages in EPT to
create executable 4k pages. e.g., I can observe GFNs 0xa00, 0xc00 were mapped as
2M initially with EPT memtype=WB. They were then mapped as 2M + EPT
memtype=WB+IPAT when guest CD=1. After some seconds during guest boot, they were
split to 4K + EPT memtype=WB. The split may also mitigate the lack of zapping
for CD toggling to a great extent.
In my environment, the guest appeared quite stable with
"zap-only-memslot + kvm.nx_huge_pages=Y". However, the benchmarks sometimes
still showed around 10 errors in that case, compared to 1000+ errors with
"zap-only-memslot + kvm.nx_huge_pages=N".

== Request Help ==
So, Alex, do you recall if there was an IGD device in your original environment?
If so and if that environment is still available, could you please help verify
if patch [4] resolves the issue?

Thank you and your help is greatly appreciated!

[1] https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mc45b9f909731d70551b4e10cff5a58d34a155e71
[2] https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/
[3] https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#m1839c85392a7a022df9e507876bb241c022c4f06
[4]
From e41a78c95ea3478be04dcb35f374084231a08a5f Mon Sep 17 00:00:00 2001
From: Yan Zhao <yan.y.zhao@intel.com>
Date: Sat, 23 Nov 2024 21:06:42 -0800
Subject: [PATCH] KVM: x86: Zap EPT on CD changes when KVM has non-coherent DMA

Always zap EPT on CD changes when a VM has non-coherent DMA devices
attached, no matter whether the quirk KVM_X86_QUIRK_CD_NW_CLEARED is turned
on or not.

Previously, when kvm_arch_has_noncoherent_dma() is true, EPT is zapped on
CD toggling only if the quirk KVM_X86_QUIRK_CD_NW_CLEARED is off.

However, EPT should also be zapped when the quirk KVM_X86_QUIRK_CD_NW_CLEARED
is on, because the EPT memtype would switch between
- "MTRR_TYPE_WRBACK | VMX_EPT_IPAT_BIT", and
- "guest MTRR type (without VMX_EPT_IPAT_BIT)".

Fixes: 879ae1880449 ("KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/x86.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93b0bd45ac73..3e874cfaf059 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -792,8 +792,7 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
                kvm_mmu_reset_context(vcpu);

        if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-           kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
-           !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+           kvm_arch_has_noncoherent_dma(vcpu->kvm))
                kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);

        return 0;

base-commit: d45331b00ddb179e291766617259261c112db872
--
2.27.0
Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Posted by Sean Christopherson 1 year, 2 months ago
On Tue, Sep 24, 2024, Yan Zhao wrote:
> On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> > > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > +{
> > > +	struct kvm_gfn_range range = {
> > > +		.slot = slot,
> > > +		.start = slot->base_gfn,
> > > +		.end = slot->base_gfn + slot->npages,
> > > +		.may_block = true,
> > > +	};
> > > +	bool flush = false;
> > > +
> > > +	write_lock(&kvm->mmu_lock);
> > > +
> > > +	if (kvm_memslots_have_rmaps(kvm))
> > > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> > 
> > This, and Paolo's merged variant, break shadow paging.  As was tried in commit
> > 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> > all shadow pages, i.e. non-leaf SPTEs, need to be zapped.  All of the accounting
> > for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> > to the memslot, for all intents and purposes.  Deleting the memslot without removing
> > all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
> > 
> > Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> > https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
> > 
> > Rather than trying to get this functional for shadow paging (which includes nested
> > TDP), I think we should scrap the quirk idea and simply make this the behavior for
> > S-EPT and nothing else.
> Ok. Thanks for identifying this error. Will change code to this way.

For now, I think a full revert of the entire series makes sense.  Irrespective of
this bug, I don't think KVM should commit to specific implementation behavior,
i.e. KVM shouldn't explicitly say only leaf SPTEs are zapped.  The quirk docs
should instead say that if the quirk is disabled, KVM will only guarantee that
the deleted memslot will be inaccessible.  That way, KVM can still do a fast zap
when it makes sense, e.g. for large memslots, do a complete fast zap, and for
small memslots, do a targeted zap of the TDP MMU but a fast zap of any shadow
MMUs.
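
Something like the below, maybe (a rough, untested sketch; the helper name
and the size cutoff are completely arbitrary):

static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
        struct kvm_gfn_range range = {
                .slot = slot,
                .start = slot->base_gfn,
                .end = slot->base_gfn + slot->npages,
                .may_block = true,
        };
        bool flush;

        /*
         * Fall back to the fast zap for big memslots, and whenever shadow
         * MMUs are in use, as shadow page accounting is tied to the memslot.
         * The fast zap invalidates the TDP MMU roots as well.  The cutoff of
         * 1M pages (4GiB of 4KiB pages) is arbitrary.
         */
        if (kvm_memslots_have_rmaps(kvm) || slot->npages >= (1UL << 20)) {
                kvm_mmu_zap_all_fast(kvm);
                return;
        }

        /* Small, TDP-MMU-only case: zap just the leaf SPTEs in the slot. */
        write_lock(&kvm->mmu_lock);
        flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false);
        if (flush)
                kvm_flush_remote_tlbs_memslot(kvm, slot);
        write_unlock(&kvm->mmu_lock);
}

The exact heuristic doesn't matter much; the point is that userspace is only
promised that the deleted memslot becomes inaccessible.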

> BTW: update some findings regarding to the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of memslot are zapped.
> (no more detailed info due to limited time to debug).

Heh, I've given up hope on ever finding a root cause for that issue.
Re: [PATCH v2 1/4] KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Posted by Yan Zhao 1 year, 2 months ago
On Tue, Sep 24, 2024 at 12:45:52AM -0700, Sean Christopherson wrote:
> On Tue, Sep 24, 2024, Yan Zhao wrote:
> > On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> > > > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > > +{
> > > > +	struct kvm_gfn_range range = {
> > > > +		.slot = slot,
> > > > +		.start = slot->base_gfn,
> > > > +		.end = slot->base_gfn + slot->npages,
> > > > +		.may_block = true,
> > > > +	};
> > > > +	bool flush = false;
> > > > +
> > > > +	write_lock(&kvm->mmu_lock);
> > > > +
> > > > +	if (kvm_memslots_have_rmaps(kvm))
> > > > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> > > 
> > > This, and Paolo's merged variant, break shadow paging.  As was tried in commit
> > > 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> > > all shadow pages, i.e. non-leaf SPTEs, need to be zapped.  All of the accounting
> > > for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> > > to the memslot, for all intents and purposes.  Deleting the memslot without removing
> > > all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
> > > 
> > > Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> > > https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
> > > 
> > > Rather than trying to get this functional for shadow paging (which includes nested
> > > TDP), I think we should scrap the quirk idea and simply make this the behavior for
> > > S-EPT and nothing else.
> > Ok. Thanks for identifying this error. Will change code to this way.
> 
> For now, I think a full revert of the entire series makes sense.  Irrespective of
> this bug, I don't think KVM should commit to specific implementation behavior,
> i.e. KVM shouldn't explicitly say only leaf SPTEs are zapped.  The quirk docs
> should instead say that if the quirk is disabled, KVM will only guarantee that
> the delete memslot will be inaccessible.  That way, KVM can still do a fast zap
> when it makes sense, e.g. for large memslots, do a complete fast zap, and for
> small memslots, do a targeted zap of the TDP MMU but a fast zap of any shadow
> MMUs.
For TDX, could we do as below after the full revert of this series?

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,                              
                                   struct kvm_memory_slot *slot)                 
{                                                                                
        kvm_mmu_zap_all_fast(kvm); ==> this will skip mirror root
        kvm_mmu_zap_memslot_mirror_leafs(kvm, slot);  ==> zap memslot leaf entries
	                                                  in mirror root
}  

> 
> > BTW: update some findings regarding to the previous bug with Nvidia GPU
> > assignment:
> > I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> > reproducible when only leaf entries of memslot are zapped.
> > (no more detailed info due to limited time to debug).
> 
> Heh, I've given up hope on ever finding a root cause for that issue.