This all started from an offhand comment wondering if we might be dropping
A/D bits. Turns out, yes, we are dropping A/D bits (and W bits) in the
TDP MMU.
Attempting to prove that bug exposed two other notable goofs, the first of
which has been lurking for a decade, give or take:
Patch 1 - KVM treats _all_ MMU-writable SPTEs as volatile, even though
KVM never clears WRITABLE outside of MMU lock. As a result,
the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.
Patch 4 - KVM sends literally all EPT violations down the fast page
fault handler. The bug was introduced when KVM started
honoring EPTP12's A/D enabling, circa 2016.
Neither fix seems to have an easily-measurable effect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
The rest of the series is cleanups, including a v2 of my RET_PF_CONTINUE
patch, which would otherwise conflict in mildly annoying ways.
The first two "DO NOT MERGE" patches are effectively a PoC showing that
sending indirect MMUs down the fast page fault handler can work, it's
just ineffective. I went down this rabbit hole after I saw a massive
performance drop in the legacy MMU from patch 3 and thought it might be due
to detecting spurious faults out of mmu_lock. The TDP MMU didn't exhibit
the same problem, so it just _had_ to be write-contention on mmu_lock.
Right!?!?
Wrong. Turns out that spurious faults are a non-issue, but inverting the
"ad_disabled" flag in the shadow page role and preventing reuse? Yep,
that's a wee bit of an issue. I included the patches here to publicly
document that this has been tried, it works, it's just likely not a win.
The "DO NOT MERGE" selftest is my attempt to prove that the TDP MMU can
drop W/A/D bits. It does work in the sense that it can trigger dropped
bits, but it's fairly worthless as a selftest because it requires an
explicit WARN in KVM :-/ Posted mostly as a reminder to myself that
the primary MMU write-protects on writeback (this isn't the first time
I've gone spelunking in that code...).
E.g. it can trigger the WARN on the WRITABLE and DIRTY bits with this
patch applied on top of the series:
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..4f1ec56460ad 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -32,6 +32,8 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
u64 new_spte, int level)
{
+ u64 current_spte;
+
/*
* Atomically write the SPTE if it is a shadow-present, leaf SPTE with
* volatile bits, i.e. has bits that can be set outside of mmu_lock.
@@ -46,7 +48,13 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
*/
if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
spte_has_volatile_bits(old_spte))
- return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
+ ;
+
+ current_spte = kvm_tdp_mmu_read_spte(sptep);
+ WARN_ONCE((old_spte ^ current_spte) & current_spte &
+ (PT_WRITABLE_MASK | PT_ACCESSED_MASK | PT_DIRTY_MASK),
+ "KVM: Dropped SPTE W/A/D bits: 0x%llx\n",
+ (old_spte ^ current_spte) & current_spte);
__kvm_tdp_mmu_write_spte(sptep, new_spte);
return old_spte;
------------[ cut here ]------------
KVM: Dropped SPTE W/A/D bits: 0x2
WARNING: CPU: 5 PID: 874 at arch/x86/kvm/mmu/tdp_iter.h:56 __tdp_mmu_set_spte+0x18b/0x1a0 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 5 PID: 874 Comm: volatile_spte_t Not tainted 5.18.0-rc1+ #830
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__tdp_mmu_set_spte+0x18b/0x1a0 [kvm]
Call Trace:
<TASK>
tdp_mmu_zap_leafs+0xbc/0x180 [kvm]
kvm_tdp_mmu_zap_leafs+0x74/0x90 [kvm]
kvm_unmap_gfn_range+0xe8/0x100 [kvm]
kvm_mmu_notifier_invalidate_range_start+0x11c/0x2c0 [kvm]
__mmu_notifier_invalidate_range_start+0x7e/0x190
change_protection+0x657/0xb30
mprotect_fixup+0x1a1/0x310
do_mprotect_pkey+0x1f9/0x3a0
__x64_sys_mprotect+0x1b/0x20
do_syscall_64+0x31/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
---[ end trace 0000000000000000 ]---
Sean Christopherson (12):
KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo
A/D)
KVM: x86/mmu: Move shadow-present check out of
spte_has_volatile_bits()
KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile
bits
KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use
KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"
KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"
KVM: x86/mmu: Make all page fault handlers internal to the MMU
KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults
KVM: x86/mmu: Expand and clean up page fault stats
DO NOT MERGE: KVM: x86/mmu: Always send !PRESENT faults down the fast
path
DO NOT MERGE: KVM: x86/mmu: Use fast_page_fault() to detect spurious
shadow MMU faults
DO NOT MERGE: KVM: selftests: Attempt to detect lost dirty bits
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/kvm/mmu.h | 87 --------
arch/x86/kvm/mmu/mmu.c | 211 ++++++++++--------
arch/x86/kvm/mmu/mmu_internal.h | 123 +++++++++-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 15 +-
arch/x86/kvm/mmu/spte.c | 28 +++
arch/x86/kvm/mmu/spte.h | 15 +-
arch/x86/kvm/mmu/tdp_iter.h | 34 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 92 +++++---
arch/x86/kvm/x86.c | 24 +-
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 4 +
.../selftests/kvm/volatile_spte_test.c | 208 +++++++++++++++++
14 files changed, 600 insertions(+), 248 deletions(-)
create mode 100644 tools/testing/selftests/kvm/volatile_spte_test.c
base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
being volatile (unless they're volatile for other reasons, e.g. A/D bits).
KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
cleared just because the SPTE is MMU-writable.
Rename the wrapper of MMU-writable to be more literal, as the previous
name, spte_can_locklessly_be_made_writable(), is wrong and misleading.
Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 17 +++++++++--------
arch/x86/kvm/mmu/spte.h | 2 +-
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff218..612316768e8e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -481,13 +481,15 @@ static bool spte_has_volatile_bits(u64 spte)
* also, it can help us to get a stable is_writable_pte()
* to ensure tlb flush is not missed.
*/
- if (spte_can_locklessly_be_made_writable(spte) ||
- is_access_track_spte(spte))
+ if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
+ return true;
+
+ if (is_access_track_spte(spte))
return true;
if (spte_ad_enabled(spte)) {
- if ((spte & shadow_accessed_mask) == 0 ||
- (is_writable_pte(spte) && (spte & shadow_dirty_mask) == 0))
+ if (!(spte & shadow_accessed_mask) ||
+ (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
return true;
}
@@ -554,7 +556,7 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
* we always atomically update it, see the comments in
* spte_has_volatile_bits().
*/
- if (spte_can_locklessly_be_made_writable(old_spte) &&
+ if (is_mmu_writable_spte(old_spte) &&
!is_writable_pte(new_spte))
flush = true;
@@ -1192,7 +1194,7 @@ static bool spte_write_protect(u64 *sptep, bool pt_protect)
u64 spte = *sptep;
if (!is_writable_pte(spte) &&
- !(pt_protect && spte_can_locklessly_be_made_writable(spte)))
+ !(pt_protect && is_mmu_writable_spte(spte)))
return false;
rmap_printk("spte %p %llx\n", sptep, *sptep);
@@ -3171,8 +3173,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* be removed in the fast path only if the SPTE was
* write-protected for dirty-logging or access tracking.
*/
- if (fault->write &&
- spte_can_locklessly_be_made_writable(spte)) {
+ if (fault->write && is_mmu_writable_spte(spte)) {
new_spte |= PT_WRITABLE_MASK;
/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ad8ce3c5d083..570699682f6d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -398,7 +398,7 @@ static inline void check_spte_writable_invariants(u64 spte)
"kvm: Writable SPTE is not MMU-writable: %llx", spte);
}
-static inline bool spte_can_locklessly_be_made_writable(u64 spte)
+static inline bool is_mmu_writable_spte(u64 spte)
{
return spte & shadow_mmu_writable_mask;
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers. Well, caller, since only one of its two callers
doesn't already do the shadow-present check.
Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change. Unlike the legacy MMU, the TDP MMU uses a single path for
clearing leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates,
the TDP MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(), and calling is_last_spte() without first
calling is_shadow_present_pte() is at best odd, and at worst a violation
of KVM's loosely defined SPTE rules.
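For reference, a rough sketch of the ordering the TDP MMU ends up using
(see the tdp_iter.h change later in this series), where the shadow-present
check gates both the is_last_spte() and volatile-bit checks:

	if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
	    spte_has_volatile_bits(old_spte))
		return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);

	__kvm_tdp_mmu_write_spte(sptep, new_spte);
	return old_spte;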
Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present. Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 29 ++---------------------------
arch/x86/kvm/mmu/spte.c | 28 ++++++++++++++++++++++++++++
arch/x86/kvm/mmu/spte.h | 2 ++
3 files changed, 32 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 612316768e8e..65b723201738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -470,32 +470,6 @@ static u64 __get_spte_lockless(u64 *sptep)
}
#endif
-static bool spte_has_volatile_bits(u64 spte)
-{
- if (!is_shadow_present_pte(spte))
- return false;
-
- /*
- * Always atomically update spte if it can be updated
- * out of mmu-lock, it can ensure dirty bit is not lost,
- * also, it can help us to get a stable is_writable_pte()
- * to ensure tlb flush is not missed.
- */
- if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
- return true;
-
- if (is_access_track_spte(spte))
- return true;
-
- if (spte_ad_enabled(spte)) {
- if (!(spte & shadow_accessed_mask) ||
- (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
- return true;
- }
-
- return false;
-}
-
/* Rules for using mmu_spte_set:
* Set the sptep from nonpresent to present.
* Note: the sptep being assigned *must* be either not present
@@ -590,7 +564,8 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
u64 old_spte = *sptep;
int level = sptep_to_sp(sptep)->role.level;
- if (!spte_has_volatile_bits(old_spte))
+ if (!is_shadow_present_pte(old_spte) ||
+ !spte_has_volatile_bits(old_spte))
__update_clear_spte_fast(sptep, 0ull);
else
old_spte = __update_clear_spte_slow(sptep, 0ull);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 3d611f07eee8..800b857b3a53 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -90,6 +90,34 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
E820_TYPE_RAM);
}
+/*
+ * Returns true if the SPTE has bits that may be set without holding mmu_lock.
+ * The caller is responsible for checking if the SPTE is shadow-present, and
+ * for determining whether or not the caller cares about non-leaf SPTEs.
+ */
+bool spte_has_volatile_bits(u64 spte)
+{
+ /*
+ * Always atomically update spte if it can be updated
+ * out of mmu-lock, it can ensure dirty bit is not lost,
+ * also, it can help us to get a stable is_writable_pte()
+ * to ensure tlb flush is not missed.
+ */
+ if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
+ return true;
+
+ if (is_access_track_spte(spte))
+ return true;
+
+ if (spte_ad_enabled(spte)) {
+ if (!(spte & shadow_accessed_mask) ||
+ (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+ return true;
+ }
+
+ return false;
+}
+
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 570699682f6d..098d7d144627 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -412,6 +412,8 @@ static inline u64 get_mmio_spte_generation(u64 spte)
return gen;
}
+bool spte_has_volatile_bits(u64 spte);
+
bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
if mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock. If a vCPU uses the to-be-modified SPTE
to write a page, the CPU can cache the translation as WRITABLE in the TLB
despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
Accessed/Dirty bits and not properly tag the backing page.
Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
KVM doesn't consume the Accessed bit of non-leaf SPTEs.
Dropping the Dirty and/or Writable bits is most problematic for dirty
logging, as doing so can result in a missed TLB flush and eventually a
missed dirty page. In the unlikely event that the only dirty page(s) is
one whose SPTE was clobbered, clear_dirty_gfn_range() will see the SPTE
as not dirty (based on the Dirty or Writable bit, depending on the
method) and so will neither update the SPTE nor flush. If the SPTE is cached in the
TLB as writable before it is clobbered, the guest can continue writing
the associated page without ever taking a write-protect fault.
For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
bit is effectively ignored because the primary MMU will mark that page
dirty when the write-protection is lifted, e.g. when KVM faults the page
back in for write.
The Accessed bit is a complete non-issue. Aside from being unused for
non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
Accessed bit may be dropped anyways.
Lastly, the Writable bit is also problematic as an extension of the Dirty
bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
!DIRTY && WRITABLE. If KVM fixes an MMU-writable, but !WRITABLE, SPTE
out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
the SPTE being !WRITABLE when it is checked by KVM. But that all depends
on the Dirty bit being problematic in the first place.
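As a rough sketch (illustration only, not from the patch), the
lost-Dirty-bit race with a plain, non-atomic write looks something like:

	/*
	 * vCPU (hardware A/D assist)        KVM task (mmu_lock held for write)
	 * --------------------------        ----------------------------------
	 *                                   old_spte = kvm_tdp_mmu_read_spte(sptep);
	 * write to the page; the CPU
	 * sets the Dirty bit in *sptep
	 *                                   __kvm_tdp_mmu_write_spte(sptep, new_spte);
	 *                                   ... the CPU's Dirty bit is lost, and
	 *                                   handle_changed_spte() operates on a
	 *                                   stale old_spte
	 */

Using kvm_tdp_mmu_write_spte_atomic(), i.e. XCHG, instead returns the
just-clobbered value, so the CPU-set Dirty bit can be propagated to the
dirty logging bookkeeping rather than silently dropped.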
Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/tdp_iter.h | 34 ++++++++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 82 ++++++++++++++++++++++++-------------
2 files changed, 85 insertions(+), 31 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index b1eaf6ec0e0b..f0af385c56e0 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -6,6 +6,7 @@
#include <linux/kvm_host.h>
#include "mmu.h"
+#include "spte.h"
/*
* TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
@@ -17,9 +18,38 @@ static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
{
return READ_ONCE(*rcu_dereference(sptep));
}
-static inline void kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 val)
+
+static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
{
- WRITE_ONCE(*rcu_dereference(sptep), val);
+ return xchg(rcu_dereference(sptep), new_spte);
+}
+
+static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
+{
+ WRITE_ONCE(*rcu_dereference(sptep), new_spte);
+}
+
+static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
+ u64 new_spte, int level)
+{
+ /*
+ * Atomically write the SPTE if it is a shadow-present, leaf SPTE with
+ * volatile bits, i.e. has bits that can be set outside of mmu_lock.
+ * The Writable bit can be set by KVM's fast page fault handler, and
+ * Accessed and Dirty bits can be set by the CPU.
+ *
+ * Note, non-leaf SPTEs do have Accessed bits and those bits are
+ * technically volatile, but KVM doesn't consume the Accessed bit of
+ * non-leaf SPTEs, i.e. KVM doesn't care if it clobbers the bit. This
+ * logic needs to be reassessed if KVM were to use non-leaf Accessed
+ * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs.
+ */
+ if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
+ spte_has_volatile_bits(old_spte))
+ return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
+
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return old_spte;
}
/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 566548a3efa7..e9033cce8aeb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -426,9 +426,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
tdp_mmu_unlink_sp(kvm, sp, shared);
for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
- u64 *sptep = rcu_dereference(pt) + i;
+ tdp_ptep_t sptep = pt + i;
gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
- u64 old_child_spte;
+ u64 old_spte;
if (shared) {
/*
@@ -440,8 +440,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* value to the removed SPTE value.
*/
for (;;) {
- old_child_spte = xchg(sptep, REMOVED_SPTE);
- if (!is_removed_spte(old_child_spte))
+ old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE);
+ if (!is_removed_spte(old_spte))
break;
cpu_relax();
}
@@ -455,23 +455,43 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
* are guarded by the memslots generation, not by being
* unreachable.
*/
- old_child_spte = READ_ONCE(*sptep);
- if (!is_shadow_present_pte(old_child_spte))
+ old_spte = kvm_tdp_mmu_read_spte(sptep);
+ if (!is_shadow_present_pte(old_spte))
continue;
/*
- * Marking the SPTE as a removed SPTE is not
- * strictly necessary here as the MMU lock will
- * stop other threads from concurrently modifying
- * this SPTE. Using the removed SPTE value keeps
- * the two branches consistent and simplifies
- * the function.
+ * Use the common helper instead of a raw WRITE_ONCE as
+ * the SPTE needs to be updated atomically if it can be
+ * modified by a different vCPU outside of mmu_lock.
+ * Even though the parent SPTE is !PRESENT, the TLB
+ * hasn't yet been flushed, and both Intel and AMD
+ * document that A/D assists can use upper-level PxE
+ * entries that are cached in the TLB, i.e. the CPU can
+ * still access the page and mark it dirty.
+ *
+ * No retry is needed in the atomic update path as the
+ * sole concern is dropping a Dirty bit, i.e. no other
+ * task can zap/remove the SPTE as mmu_lock is held for
+ * write. Marking the SPTE as a removed SPTE is not
+ * strictly necessary for the same reason, but using
+ * the removed SPTE value keeps the shared/exclusive
+ * paths consistent and allows the handle_changed_spte()
+ * call below to hardcode the new value to REMOVED_SPTE.
+ *
+ * Note, even though dropping a Dirty bit is the only
+ * scenario where a non-atomic update could result in a
+ * functional bug, simply checking the Dirty bit isn't
+ * sufficient as a fast page fault could read the upper
+ * level SPTE before it is zapped, and then make this
+ * target SPTE writable, resume the guest, and set the
+ * Dirty bit between reading the SPTE above and writing
+ * it here.
*/
- WRITE_ONCE(*sptep, REMOVED_SPTE);
+ old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
+ REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
- old_child_spte, REMOVED_SPTE, level,
- shared);
+ old_spte, REMOVED_SPTE, level, shared);
}
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -667,14 +687,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
KVM_PAGES_PER_HPAGE(iter->level));
/*
- * No other thread can overwrite the removed SPTE as they
- * must either wait on the MMU lock or use
- * tdp_mmu_set_spte_atomic which will not overwrite the
- * special removed SPTE value. No bookkeeping is needed
- * here since the SPTE is going from non-present
- * to non-present.
+ * No other thread can overwrite the removed SPTE as they must either
+ * wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
+ * overwrite the special removed SPTE value. No bookkeeping is needed
+ * here since the SPTE is going from non-present to non-present. Use
+ * the raw write helper to avoid an unnecessary check on volatile bits.
*/
- kvm_tdp_mmu_write_spte(iter->sptep, 0);
+ __kvm_tdp_mmu_write_spte(iter->sptep, 0);
return 0;
}
@@ -699,10 +718,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
* unless performing certain dirty logging operations.
* Leaving record_dirty_log unset in that case prevents page
* writes from being double counted.
+ *
+ * Returns the old SPTE value, which _may_ be different than @old_spte if the
+ * SPTE had volatile bits.
*/
-static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
- u64 old_spte, u64 new_spte, gfn_t gfn, int level,
- bool record_acc_track, bool record_dirty_log)
+static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+ u64 old_spte, u64 new_spte, gfn_t gfn, int level,
+ bool record_acc_track, bool record_dirty_log)
{
lockdep_assert_held_write(&kvm->mmu_lock);
@@ -715,7 +737,7 @@ static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
*/
WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
- kvm_tdp_mmu_write_spte(sptep, new_spte);
+ old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
@@ -724,6 +746,7 @@ static void __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
if (record_dirty_log)
handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
new_spte, level);
+ return old_spte;
}
static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
@@ -732,9 +755,10 @@ static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
{
WARN_ON_ONCE(iter->yielded);
- __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, iter->old_spte,
- new_spte, iter->gfn, iter->level,
- record_acc_track, record_dirty_log);
+ iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
+ iter->old_spte, new_spte,
+ iter->gfn, iter->level,
+ record_acc_track, record_dirty_log);
}
static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Check for A/D bits being disabled instead of the access tracking mask
being non-zero when deciding whether or not to attempt to fix a page
fault via the fast path. Originally, the access tracking mask was
non-zero if and only if A/D bits were disabled by _KVM_ (including not
being supported by hardware), but that hasn't been true since nVMX was
fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
KVM to not use A/D bits while running L2 despite KVM using them while
running L1.
In other words, don't attempt the fast path just because EPT is enabled.
Note, attempting the fast path for all !PRESENT faults can "fix" a very,
_VERY_ tiny percentage of faults out of mmu_lock by detecting that the
fault is spurious, i.e. has been fixed by a different vCPU, but again the
odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM
gets less than 10 successes out of 30k+ faults, and that's likely one of
the more favorable scenarios. Disabling dirty logging can likely lead to
a rash of collisions between vCPUs for some workloads that operate on a
common set of pages, but penalizing _all_ !PRESENT faults for that one
case is unlikely to be a net positive, not to mention that that problem
is best solved by not zapping in the first place.
The number of spurious faults does scale with the number of vCPUs, e.g. a
255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
path (again out of 30k), but that's all of 0.2% of faults. Using legacy
shadow paging does get more spurious faults, and a few more detected out
of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
faults that are reflected into the guest), i.e. the extra detections are
purely due to the sheer number of faults observed.
On the other hand, getting a "negative" in the fast path takes in the
neighborhood of 150-250 cycles. So while it is tempting to keep/extend
the current behavior, such a change needs to come with hard numbers
showing that it's actually a win in the grand scheme, or any scheme for
that matter.
Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 45 ++++++++++++++++++++++++--------------
arch/x86/kvm/mmu/spte.h | 11 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
3 files changed, 41 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 65b723201738..dfd1cfa9c08c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3013,19 +3013,20 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
/*
* #PF can be fast if:
- * 1. The shadow page table entry is not present, which could mean that
- * the fault is potentially caused by access tracking (if enabled).
- * 2. The shadow page table entry is present and the fault
- * is caused by write-protect, that means we just need change the W
- * bit of the spte which can be done out of mmu-lock.
*
- * However, if access tracking is disabled we know that a non-present
- * page must be a genuine page fault where we have to create a new SPTE.
- * So, if access tracking is disabled, we return true only for write
- * accesses to a present page.
+ * 1. The shadow page table entry is not present and A/D bits are
+ * disabled _by KVM_, which could mean that the fault is potentially
+ * caused by access tracking (if enabled). If A/D bits are enabled
+ * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
+ * bits for L2 and employ access tracking, but the fast page fault
+ * mechanism only supports direct MMUs.
+ * 2. The shadow page table entry is present, the access is a write,
+ * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
+ * the fault was caused by a write-protection violation. If the
+ * SPTE is MMU-writable (determined later), the fault can be fixed
+ * by setting the Writable bit, which can be done out of mmu_lock.
*/
-
- return shadow_acc_track_mask != 0 || (fault->write && fault->present);
+ return !kvm_ad_enabled() || (fault->write && fault->present);
}
/*
@@ -3140,13 +3141,25 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
new_spte = spte;
- if (is_access_track_spte(spte))
+ /*
+ * KVM only supports fixing page faults outside of MMU lock for
+ * direct MMUs, nested MMUs are always indirect, and KVM always
+ * uses A/D bits for non-nested MMUs. Thus, if A/D bits are
+ * enabled, the SPTE can't be an access-tracked SPTE.
+ */
+ if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
new_spte = restore_acc_track_spte(new_spte);
/*
- * Currently, to simplify the code, write-protection can
- * be removed in the fast path only if the SPTE was
- * write-protected for dirty-logging or access tracking.
+ * To keep things simple, only SPTEs that are MMU-writable can
+ * be made fully writable outside of mmu_lock, e.g. only SPTEs
+ * that were write-protected for dirty-logging or access
+ * tracking are handled here. Don't bother checking if the
+ * SPTE is writable to prioritize running with A/D bits enabled.
+ * The is_access_allowed() check above handles the common case
+ * of the fault being spurious, and the SPTE is known to be
+ * shadow-present, i.e. except for access tracking restoration
+ * making the new SPTE writable, the check is wasteful.
*/
if (fault->write && is_mmu_writable_spte(spte)) {
new_spte |= PT_WRITABLE_MASK;
@@ -4751,7 +4764,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
role.efer_nx = true;
role.smm = cpu_role.base.smm;
role.guest_mode = cpu_role.base.guest_mode;
- role.ad_disabled = (shadow_accessed_mask == 0);
+ role.ad_disabled = !kvm_ad_enabled();
role.level = kvm_mmu_get_tdp_level(vcpu);
role.direct = true;
role.has_4_byte_gpte = false;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 098d7d144627..43ec7a8641b3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -220,6 +220,17 @@ static inline bool is_shadow_present_pte(u64 pte)
return !!(pte & SPTE_MMU_PRESENT_MASK);
}
+/*
+ * Returns true if A/D bits are supported in hardware and are enabled by KVM.
+ * When enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can
+ * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
+ * scenario where KVM is using A/D bits for L1, but not L2.
+ */
+static inline bool kvm_ad_enabled(void)
+{
+ return !!shadow_accessed_mask;
+}
+
static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
{
return sp->role.ad_disabled;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e9033cce8aeb..a2eda3e55697 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1135,7 +1135,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool account_nx,
bool shared)
{
- u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
+ u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
int ret = 0;
if (shared) {
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
faults in the access tracking case, and drop the exec/NX check that
becomes redundant as a result. No sane hardware will generate an access
that is both an instruction fetch and a write, i.e. checking for that
combination is a waste of cycles. If hardware goes off the rails, or KVM
runs under a misguided hypervisor, spuriously running through the fast
path is benign (KVM has been unknowingly doing exactly that for years).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dfd1cfa9c08c..f1618d8289ce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3001,16 +3001,14 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
{
/*
- * Do not fix the mmio spte with invalid generation number which
- * need to be updated by slow page fault path.
+ * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
+ * reach the common page fault handler if the SPTE has an invalid MMIO
+ * generation number. Refreshing the MMIO generation needs to go down
+ * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag!
*/
if (fault->rsvd)
return false;
- /* See if the page fault is due to an NX violation */
- if (unlikely(fault->exec && fault->present))
- return false;
-
/*
* #PF can be fast if:
*
@@ -3026,7 +3024,14 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
* SPTE is MMU-writable (determined later), the fault can be fixed
* by setting the Writable bit, which can be done out of mmu_lock.
*/
- return !kvm_ad_enabled() || (fault->write && fault->present);
+ if (!fault->present)
+ return !kvm_ad_enabled();
+
+ /*
+ * Note, instruction fetches and writes are mutually exclusive, ignore
+ * the "exec" flag.
+ */
+ return fault->write;
}
/*
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
kvm_faultin_pfn() to signal that the page fault handler should continue
doing its thing. Aside from being gross and inefficient, using a boolean
return to signal continue vs. stop makes it extremely difficult to add
more helpers and/or move existing code to a helper.
E.g. hypothetically, if nested MMUs were to gain a separate page fault
handler in the future, everything up to the "is self-modifying PTE" check
can be shared by all shadow MMUs, but communicating up the stack whether
to continue on or stop becomes a nightmare.
More concretely, proposed support for private guest memory ran into a
similar issue, where it'll be forced to forego a helper in order to yield
sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
No functional change intended.
Cc: David Matlack <dmatlack@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 51 ++++++++++++++-------------------
arch/x86/kvm/mmu/mmu_internal.h | 9 +++++-
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 6 ++--
4 files changed, 35 insertions(+), 32 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f1618d8289ce..f1e8d71e6f7c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2970,14 +2970,12 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
return -EFAULT;
}
-static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
- unsigned int access, int *ret_val)
+static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ unsigned int access)
{
/* The pfn is invalid, report the error! */
- if (unlikely(is_error_pfn(fault->pfn))) {
- *ret_val = kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn);
- return true;
- }
+ if (unlikely(is_error_pfn(fault->pfn)))
+ return kvm_handle_bad_page(vcpu, fault->gfn, fault->pfn);
if (unlikely(!fault->slot)) {
gva_t gva = fault->is_tdp ? 0 : fault->addr;
@@ -2989,13 +2987,11 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
* touching the shadow page tables as attempting to install an
* MMIO SPTE will just be an expensive nop.
*/
- if (unlikely(!enable_mmio_caching)) {
- *ret_val = RET_PF_EMULATE;
- return true;
- }
+ if (unlikely(!enable_mmio_caching))
+ return RET_PF_EMULATE;
}
- return false;
+ return RET_PF_CONTINUE;
}
static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
@@ -3903,7 +3899,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}
-static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r)
+static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -3914,7 +3910,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* be zapped before KVM inserts a new MMIO SPTE for the gfn.
*/
if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
- goto out_retry;
+ return RET_PF_RETRY;
if (!kvm_is_visible_memslot(slot)) {
/* Don't expose private memslots to L2. */
@@ -3922,7 +3918,7 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
- return false;
+ return RET_PF_CONTINUE;
}
/*
* If the APIC access page exists but is disabled, go directly
@@ -3931,10 +3927,8 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
* when the AVIC is re-enabled.
*/
if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
- !kvm_apicv_activated(vcpu->kvm)) {
- *r = RET_PF_EMULATE;
- return true;
- }
+ !kvm_apicv_activated(vcpu->kvm))
+ return RET_PF_EMULATE;
}
async = false;
@@ -3942,26 +3936,23 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
fault->write, &fault->map_writable,
&fault->hva);
if (!async)
- return false; /* *pfn has correct page already */
+ return RET_PF_CONTINUE; /* *pfn has correct page already */
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
trace_kvm_async_pf_doublefault(fault->addr, fault->gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
- goto out_retry;
- } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn))
- goto out_retry;
+ return RET_PF_RETRY;
+ } else if (kvm_arch_setup_async_pf(vcpu, fault->addr, fault->gfn)) {
+ return RET_PF_RETRY;
+ }
}
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, NULL,
fault->write, &fault->map_writable,
&fault->hva);
- return false;
-
-out_retry:
- *r = RET_PF_RETRY;
- return true;
+ return RET_PF_CONTINUE;
}
/*
@@ -4016,10 +4007,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- if (kvm_faultin_pfn(vcpu, fault, &r))
+ r = kvm_faultin_pfn(vcpu, fault);
+ if (r != RET_PF_CONTINUE)
return r;
- if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r))
+ r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
+ if (r != RET_PF_CONTINUE)
return r;
r = RET_PF_RETRY;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..c0e502b17ef7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -143,6 +143,7 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
/*
* Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
*
+ * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
@@ -151,9 +152,15 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
*
* Any names added to this enum should be exported to userspace for use in
* tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
+ *
+ * Note, all values must be greater than or equal to zero so as not to encroach
+ * on -errno return values. Somewhat arbitrarily use '0' for CONTINUE, which
+ * will allow for efficient machine code when checking for CONTINUE, e.g.
+ * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
*/
enum {
- RET_PF_RETRY = 0,
+ RET_PF_CONTINUE = 0,
+ RET_PF_RETRY,
RET_PF_EMULATE,
RET_PF_INVALID,
RET_PF_FIXED,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 12247b96af01..ae86820cef69 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -54,6 +54,7 @@
{ PFERR_RSVD_MASK, "RSVD" }, \
{ PFERR_FETCH_MASK, "F" }
+TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index b025decf610d..7f8f1c8dbed2 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -838,10 +838,12 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();
- if (kvm_faultin_pfn(vcpu, fault, &r))
+ r = kvm_faultin_pfn(vcpu, fault);
+ if (r != RET_PF_CONTINUE)
return r;
- if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r))
+ r = handle_abnormal_pfn(vcpu, fault, walker.pte_access);
+ if (r != RET_PF_CONTINUE)
return r;
/*
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs. This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu.h | 87 -------------------------------
arch/x86/kvm/mmu/mmu.c | 19 +++++++
arch/x86/kvm/mmu/mmu_internal.h | 90 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 19 -------
4 files changed, 108 insertions(+), 107 deletions(-)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e..461052bef896 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -117,93 +117,6 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
vcpu->arch.mmu->root_role.level);
}
-struct kvm_page_fault {
- /* arguments to kvm_mmu_do_page_fault. */
- const gpa_t addr;
- const u32 error_code;
- const bool prefetch;
-
- /* Derived from error_code. */
- const bool exec;
- const bool write;
- const bool present;
- const bool rsvd;
- const bool user;
-
- /* Derived from mmu and global state. */
- const bool is_tdp;
- const bool nx_huge_page_workaround_enabled;
-
- /*
- * Whether a >4KB mapping can be created or is forbidden due to NX
- * hugepages.
- */
- bool huge_page_disallowed;
-
- /*
- * Maximum page size that can be created for this fault; input to
- * FNAME(fetch), __direct_map and kvm_tdp_mmu_map.
- */
- u8 max_level;
-
- /*
- * Page size that can be created based on the max_level and the
- * page size used by the host mapping.
- */
- u8 req_level;
-
- /*
- * Page size that will be created based on the req_level and
- * huge_page_disallowed.
- */
- u8 goal_level;
-
- /* Shifted addr, or result of guest page table walk if addr is a gva. */
- gfn_t gfn;
-
- /* The memslot containing gfn. May be NULL. */
- struct kvm_memory_slot *slot;
-
- /* Outputs of kvm_faultin_pfn. */
- kvm_pfn_t pfn;
- hva_t hva;
- bool map_writable;
-};
-
-int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
-
-extern int nx_huge_pages;
-static inline bool is_nx_huge_page_enabled(void)
-{
- return READ_ONCE(nx_huge_pages);
-}
-
-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch)
-{
- struct kvm_page_fault fault = {
- .addr = cr2_or_gpa,
- .error_code = err,
- .exec = err & PFERR_FETCH_MASK,
- .write = err & PFERR_WRITE_MASK,
- .present = err & PFERR_PRESENT_MASK,
- .rsvd = err & PFERR_RSVD_MASK,
- .user = err & PFERR_USER_MASK,
- .prefetch = prefetch,
- .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
- .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
-
- .max_level = KVM_MAX_HUGEPAGE_LEVEL,
- .req_level = PG_LEVEL_4K,
- .goal_level = PG_LEVEL_4K,
- };
-#ifdef CONFIG_RETPOLINE
- if (fault.is_tdp)
- return kvm_tdp_page_fault(vcpu, &fault);
-#endif
- return vcpu->arch.mmu->page_fault(vcpu, &fault);
-}
-
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f1e8d71e6f7c..8b8b62d2a903 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3899,6 +3899,25 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
}
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+ int r;
+
+ if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
+ work->wakeup_all)
+ return;
+
+ r = kvm_mmu_reload(vcpu);
+ if (unlikely(r))
+ return;
+
+ if (!vcpu->arch.mmu->root_role.direct &&
+ work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+ return;
+
+ kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+}
+
static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_memory_slot *slot = fault->slot;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c0e502b17ef7..c0c85cbfa159 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -140,8 +140,70 @@ void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
u64 start_gfn, u64 pages);
unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
+extern int nx_huge_pages;
+static inline bool is_nx_huge_page_enabled(void)
+{
+ return READ_ONCE(nx_huge_pages);
+}
+
+struct kvm_page_fault {
+ /* arguments to kvm_mmu_do_page_fault. */
+ const gpa_t addr;
+ const u32 error_code;
+ const bool prefetch;
+
+ /* Derived from error_code. */
+ const bool exec;
+ const bool write;
+ const bool present;
+ const bool rsvd;
+ const bool user;
+
+ /* Derived from mmu and global state. */
+ const bool is_tdp;
+ const bool nx_huge_page_workaround_enabled;
+
+ /*
+ * Whether a >4KB mapping can be created or is forbidden due to NX
+ * hugepages.
+ */
+ bool huge_page_disallowed;
+
+ /*
+ * Maximum page size that can be created for this fault; input to
+ * FNAME(fetch), __direct_map and kvm_tdp_mmu_map.
+ */
+ u8 max_level;
+
+ /*
+ * Page size that can be created based on the max_level and the
+ * page size used by the host mapping.
+ */
+ u8 req_level;
+
+ /*
+ * Page size that will be created based on the req_level and
+ * huge_page_disallowed.
+ */
+ u8 goal_level;
+
+ /* Shifted addr, or result of guest page table walk if addr is a gva. */
+ gfn_t gfn;
+
+ /* The memslot containing gfn. May be NULL. */
+ struct kvm_memory_slot *slot;
+
+ /* Outputs of kvm_faultin_pfn. */
+ kvm_pfn_t pfn;
+ hva_t hva;
+ bool map_writable;
+};
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
/*
- * Return values of handle_mmio_page_fault, mmu.page_fault, and fast_page_fault().
+ * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
+ * and of course kvm_mmu_do_page_fault().
*
* RET_PF_CONTINUE: So far, so good, keep handling the page fault.
* RET_PF_RETRY: let CPU fault again on the address.
@@ -167,6 +229,32 @@ enum {
RET_PF_SPURIOUS,
};
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u32 err, bool prefetch)
+{
+ struct kvm_page_fault fault = {
+ .addr = cr2_or_gpa,
+ .error_code = err,
+ .exec = err & PFERR_FETCH_MASK,
+ .write = err & PFERR_WRITE_MASK,
+ .present = err & PFERR_PRESENT_MASK,
+ .rsvd = err & PFERR_RSVD_MASK,
+ .user = err & PFERR_USER_MASK,
+ .prefetch = prefetch,
+ .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+
+ .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+ .req_level = PG_LEVEL_4K,
+ .goal_level = PG_LEVEL_4K,
+ };
+#ifdef CONFIG_RETPOLINE
+ if (fault.is_tdp)
+ return kvm_tdp_page_fault(vcpu, &fault);
+#endif
+ return vcpu->arch.mmu->page_fault(vcpu, &fault);
+}
+
int kvm_mmu_max_mapping_level(struct kvm *kvm,
const struct kvm_memory_slot *slot, gfn_t gfn,
kvm_pfn_t pfn, int max_level);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccda..7663c35a5c70 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12356,25 +12356,6 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
}
EXPORT_SYMBOL_GPL(kvm_set_rflags);
-void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
-{
- int r;
-
- if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
- work->wakeup_all)
- return;
-
- r = kvm_mmu_reload(vcpu);
- if (unlikely(r))
- return;
-
- if (!vcpu->arch.mmu->root_role.direct &&
- work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
- return;
-
- kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
-}
-
static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
{
BUILD_BUG_ON(!is_power_of_2(ASYNC_PF_PER_VCPU));
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
path for TDP page faults. The generated code is identical, and the #ifdef
makes it dangerously difficult to extend the logic (guess who forgot to
add an "else" inside the #ifdef and ran through the page fault handler
twice).
No functional or binary change intended.
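Purely for illustration (hypothetical code, not from this patch), the
hazard being referenced is roughly:

	#ifdef CONFIG_RETPOLINE
		if (fault.is_tdp)
			r = kvm_tdp_page_fault(vcpu, &fault);
	#endif
		/* Oops: without an "else", TDP faults are handled a second time. */
		r = vcpu->arch.mmu->page_fault(vcpu, &fault);

whereas the IS_ENABLED() form naturally forces an if/else once a later
patch switches to capturing the return value:

	if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
		r = kvm_tdp_page_fault(vcpu, &fault);
	else
		r = vcpu->arch.mmu->page_fault(vcpu, &fault);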
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu_internal.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index c0c85cbfa159..9caa747ee033 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -248,10 +248,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
};
-#ifdef CONFIG_RETPOLINE
- if (fault.is_tdp)
+
+ if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
return kvm_tdp_page_fault(vcpu, &fault);
-#endif
+
return vcpu->arch.mmu->page_fault(vcpu, &fault);
}
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that are fixed in the fast path, and last but not least, track the
number of faults that are taken.
Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.
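E.g. a hypothetical helper (not part of this patch) that derives that
rough count from the new stats:

	/*
	 * Hypothetical, for illustration only: approximate the number of
	 * emulated faults caused by writes to write-protected shadow pages,
	 * i.e. emulated faults that did not create an MMIO SPTE.
	 */
	static u64 approx_write_protect_emulations(struct kvm_vcpu *vcpu)
	{
		return vcpu->stat.pf_emulate - vcpu->stat.pf_mmio_spte_created;
	}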
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/mmu/mmu.c | 7 +++++--
arch/x86/kvm/mmu/mmu_internal.h | 28 ++++++++++++++++++++++++++--
arch/x86/kvm/mmu/paging_tmpl.h | 1 -
arch/x86/kvm/mmu/tdp_mmu.c | 8 +-------
arch/x86/kvm/x86.c | 5 +++++
6 files changed, 42 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f164c6c1514a..c5fb4115176d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1269,7 +1269,12 @@ struct kvm_vm_stat {
struct kvm_vcpu_stat {
struct kvm_vcpu_stat_generic generic;
+ u64 pf_taken;
u64 pf_fixed;
+ u64 pf_emulate;
+ u64 pf_spurious;
+ u64 pf_fast;
+ u64 pf_mmio_spte_created;
u64 pf_guest;
u64 tlb_flush;
u64 invlpg;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8b8b62d2a903..744c06bd7017 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2660,6 +2660,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
*sptep, write_fault, gfn);
if (unlikely(is_noslot_pfn(pfn))) {
+ vcpu->stat.pf_mmio_spte_created++;
mark_mmio_spte(vcpu, sptep, gfn, pte_access);
return RET_PF_EMULATE;
}
@@ -2943,7 +2944,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return ret;
direct_pte_prefetch(vcpu, it.sptep);
- ++vcpu->stat.pf_fixed;
return ret;
}
@@ -3206,6 +3206,9 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
trace_fast_page_fault(vcpu, fault, sptep, spte, ret);
walk_shadow_page_lockless_end(vcpu);
+ if (ret != RET_PF_INVALID)
+ vcpu->stat.pf_fast++;
+
return ret;
}
@@ -5311,7 +5314,7 @@ static void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
write_unlock(&vcpu->kvm->mmu_lock);
}
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
+int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_PF;
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 9caa747ee033..bd2a26897b97 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -248,11 +248,35 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
.req_level = PG_LEVEL_4K,
.goal_level = PG_LEVEL_4K,
};
+ int r;
+
+ /*
+ * Async #PF "faults", a.k.a. prefetch faults, are not faults from the
+ * guest perspective and have already been counted at the time of the
+ * original fault.
+ */
+ if (!prefetch)
+ vcpu->stat.pf_taken++;
if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
- return kvm_tdp_page_fault(vcpu, &fault);
+ r = kvm_tdp_page_fault(vcpu, &fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
- return vcpu->arch.mmu->page_fault(vcpu, &fault);
+ /*
+ * Similar to above, prefetch faults aren't truly spurious, and the
+ * async #PF path doesn't do emulation. Do count faults that are fixed
+ * by the async #PF handler though, otherwise they'll never be counted.
+ */
+ if (r == RET_PF_FIXED)
+ vcpu->stat.pf_fixed++;
+ else if (prefetch)
+ ;
+ else if (r == RET_PF_EMULATE)
+ vcpu->stat.pf_emulate++;
+ else if (r == RET_PF_SPURIOUS)
+ vcpu->stat.pf_spurious++;
+ return r;
}
int kvm_mmu_max_mapping_level(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 7f8f1c8dbed2..db80f7ccaa4e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -723,7 +723,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
return ret;
FNAME(pte_prefetch)(vcpu, gw, it.sptep);
- ++vcpu->stat.pf_fixed;
return ret;
out_gpte_changed:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a2eda3e55697..8089beb312d1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1099,6 +1099,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
if (unlikely(is_mmio_spte(new_spte))) {
+ vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
ret = RET_PF_EMULATE;
@@ -1107,13 +1108,6 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
rcu_dereference(iter->sptep));
}
- /*
- * Increase pf_fixed in both RET_PF_EMULATE and RET_PF_FIXED to be
- * consistent with legacy MMU behavior.
- */
- if (ret != RET_PF_SPURIOUS)
- vcpu->stat.pf_fixed++;
-
return ret;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7663c35a5c70..a6441b281fb3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -266,7 +266,12 @@ const struct kvm_stats_header kvm_vm_stats_header = {
const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
KVM_GENERIC_VCPU_STATS(),
+ STATS_DESC_COUNTER(VCPU, pf_taken),
STATS_DESC_COUNTER(VCPU, pf_fixed),
+ STATS_DESC_COUNTER(VCPU, pf_emulate),
+ STATS_DESC_COUNTER(VCPU, pf_spurious),
+ STATS_DESC_COUNTER(VCPU, pf_fast),
+ STATS_DESC_COUNTER(VCPU, pf_mmio_spte_created),
STATS_DESC_COUNTER(VCPU, pf_guest),
STATS_DESC_COUNTER(VCPU, tlb_flush),
STATS_DESC_COUNTER(VCPU, invlpg),
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Posted for posterity, and to show that it's possible to funnel indirect
page faults down the fast path.
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++---------------------
1 file changed, 22 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 744c06bd7017..7ba88907d032 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3006,26 +3006,25 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
return false;
/*
- * #PF can be fast if:
- *
- * 1. The shadow page table entry is not present and A/D bits are
- * disabled _by KVM_, which could mean that the fault is potentially
- * caused by access tracking (if enabled). If A/D bits are enabled
- * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
- * bits for L2 and employ access tracking, but the fast page fault
- * mechanism only supports direct MMUs.
- * 2. The shadow page table entry is present, the access is a write,
- * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
- * the fault was caused by a write-protection violation. If the
- * SPTE is MMU-writable (determined later), the fault can be fixed
- * by setting the Writable bit, which can be done out of mmu_lock.
+ * Unconditionally send !PRESENT page faults (except for emulated MMIO)
+ * through the fast path. There are two scenarios where the fast path
+ * can resolve the fault. The first is if the fault is spurious, i.e.
+ * a different vCPU has faulted in the page, which applies to all MMUs.
+ * The second scenario is if KVM marked the SPTE !PRESENT for access
+ * tracking (due to lack of EPT A/D bits), in which case KVM can fix
+ * the fault after logging the access.
*/
if (!fault->present)
- return !kvm_ad_enabled();
+ return true;
/*
- * Note, instruction fetches and writes are mutually exclusive, ignore
- * the "exec" flag.
+ * Skip the fast path if the fault is due to a protection violation and
+ * the access isn't a write. Write-protection violations can be fixed
+ * by KVM, e.g. if memory is write-protected for dirty logging, but all
+ * other protection violations are in the domain of a third party, i.e.
+ * either the primary MMU or the guest's page tables, and thus are
+ * extremely unlikely to be resolved by KVM. Note, instruction fetches
+ * and writes are mutually exclusive, ignore the "exec" flag.
*/
return fault->write;
}
@@ -3041,12 +3040,13 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
/*
* Theoretically we could also set dirty bit (and flush TLB) here in
* order to eliminate unnecessary PML logging. See comments in
- * set_spte. But fast_page_fault is very unlikely to happen with PML
- * enabled, so we do not do this. This might result in the same GPA
- * to be logged in PML buffer again when the write really happens, and
- * eventually to be called by mark_page_dirty twice. But it's also no
- * harm. This also avoids the TLB flush needed after setting dirty bit
- * so non-PML cases won't be impacted.
+ * set_spte. But a write-protection violation that can be fixed outside
+ * of mmu_lock is very unlikely to happen with PML enabled, so we don't
+ * do this. This might result in the same GPA being logged in the PML
+ * buffer again when the write really happens, and eventually being
+ * sent to mark_page_dirty() twice, but that's a minor performance blip
+ * and not a functional issue. This also avoids the TLB flush needed
+ * after setting dirty bit so non-PML cases won't be impacted.
*
* Compare with set_spte where instead shadow_dirty_mask is set.
*/
--
2.36.0.rc2.479.g8af0fa9b8e-goog
Sounds good in theory, but in practice it's really, really rare to detect
a spurious fault outside of mmu_lock. The window is teeny tiny, so more
likely than not, spurious faults won't be detected until the slow path,
not to mention spurious faults on !PRESENT pages are rare in and of
themselves.
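For reference, the lockless "did someone else already fix this?" check boils
down to something like the sketch below. fast_pf_check_spurious() is a
made-up name for illustration, not an upstream function; it just reuses the
walk_shadow_page_lockless_begin()/end(), fast_pf_get_last_sptep() and
is_access_allowed() helpers that fast_page_fault() already relies on.
static int fast_pf_check_spurious(struct kvm_vcpu *vcpu,
				  struct kvm_page_fault *fault)
{
	int ret = RET_PF_INVALID;
	u64 *sptep;
	u64 spte;
	walk_shadow_page_lockless_begin(vcpu);
	sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
	/*
	 * If the current SPTE already allows the access, some other task,
	 * e.g. a different vCPU hitting the same page, beat this vCPU to
	 * the fix and there's nothing left to do.
	 */
	if (sptep && is_shadow_present_pte(spte) &&
	    is_access_allowed(fault, spte))
		ret = RET_PF_SPURIOUS;
	walk_shadow_page_lockless_end(vcpu);
	return ret;
}
The other vCPU has to install/fix the SPTE in the window between this vCPU
taking the fault and re-reading the SPTE locklessly, which is why the hit
rate is so abysmal.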
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++----------
arch/x86/kvm/mmu/paging_tmpl.h | 8 ++++++++
2 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7ba88907d032..850d58793307 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2994,7 +2994,7 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
return RET_PF_CONTINUE;
}
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm_page_fault *fault, bool direct_mmu)
{
/*
* Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
@@ -3025,8 +3025,12 @@ static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
* either the primary MMU or the guest's page tables, and thus are
* extremely unlikely to be resolved by KVM. Note, instruction fetches
* and writes are mutually exclusive, ignore the "exec" flag.
+ *
+ * KVM doesn't support resolving write-protection violations outside of
+ * mmu_lock for indirect MMUs as the gfn is not stable for indirect
+ * shadow pages. See Documentation/virt/kvm/locking.rst for details.
*/
- return fault->write;
+ return fault->write && direct_mmu;
}
/*
@@ -3097,7 +3101,8 @@ static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
/*
* Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
*/
-static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ bool direct_mmu)
{
struct kvm_mmu_page *sp;
int ret = RET_PF_INVALID;
@@ -3105,7 +3110,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
u64 *sptep = NULL;
uint retry_count = 0;
- if (!page_fault_can_be_fast(fault))
+ if (!page_fault_can_be_fast(fault, direct_mmu))
return ret;
walk_shadow_page_lockless_begin(vcpu);
@@ -3140,6 +3145,14 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
break;
}
+ /*
+ * KVM doesn't support fixing SPTEs outside of mmu_lock for
+ * indirect MMUs as the gfn isn't stable for indirect shadow
+ * pages. See Documentation/virt/kvm/locking.rst for details.
+ */
+ if (!direct_mmu)
+ break;
+
new_spte = spte;
/*
@@ -3185,11 +3198,6 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
!is_access_allowed(fault, new_spte))
break;
- /*
- * Currently, fast page fault only works for direct mapping
- * since the gfn is not stable for indirect shadow page. See
- * Documentation/virt/kvm/locking.rst to get more detail.
- */
if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) {
ret = RET_PF_FIXED;
break;
@@ -4018,7 +4026,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (page_fault_handle_page_track(vcpu, fault))
return RET_PF_EMULATE;
- r = fast_page_fault(vcpu, fault);
+ r = fast_page_fault(vcpu, fault, true);
if (r != RET_PF_INVALID)
return r;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index db80f7ccaa4e..d33b01a2714e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -812,6 +812,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_RETRY;
}
+ /* See if the fault has already been resolved by a different vCPU. */
+ r = fast_page_fault(vcpu, fault, false);
+ if (r == RET_PF_SPURIOUS)
+ return r;
+
+ /* Indirect page faults should never be fixed in the fast path. */
+ WARN_ON_ONCE(r != RET_PF_INVALID);
+
fault->gfn = walker.gfn;
fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
--
2.36.0.rc2.479.g8af0fa9b8e-goog
A failed attempt to detect improper dropping of Writable and/or Dirty
bits. Doesn't work because the primary MMU write-protects its PTEs when
file writeback occurs, i.e. KVM's dirty bits are meaningless as far as
file-backed guest memory is concerned.
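For the morbidly curious, the writeback behavior is easy to observe from
plain userspace, no KVM required. The quick-and-dirty demo below is not part
of the series; the file name is arbitrary and it assumes an ext4-like
filesystem that implements page_mkwrite(). The write issued after fdatasync()
should take a fresh minor fault (visible via ru_minflt) because writeback
write-protected the PTE, i.e. the page gets re-dirtied through the primary
MMU regardless of what state KVM's Dirty bits are in.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>
static long minflt(void)
{
	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}
int main(void)
{
	long before, after;
	char *map;
	int fd;
	fd = open("writeback_demo", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, 4096))
		return 1;
	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	map[0] = 1;		/* fault in the page and dirty it */
	fdatasync(fd);		/* writeback cleans + write-protects the PTE */
	before = minflt();
	map[0] = 2;		/* new write fault => page_mkwrite() => dirty */
	after = minflt();
	printf("minor faults taken by the post-writeback write: %ld\n",
	       after - before);
	munmap(map, 4096);
	close(fd);
	unlink("writeback_demo");
	return 0;
}
On a typical ext4 setup that prints 1; filesystems that don't do dirty
accounting for shared mappings (tmpfs and friends) will likely print 0,
hence the filesystem caveat above.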
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
tools/testing/selftests/kvm/.gitignore | 1 +
tools/testing/selftests/kvm/Makefile | 4 +
.../selftests/kvm/volatile_spte_test.c | 208 ++++++++++++++++++
3 files changed, 213 insertions(+)
create mode 100644 tools/testing/selftests/kvm/volatile_spte_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 56140068b763..3307444d9fda 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -70,3 +70,4 @@
/steal_time
/kvm_binary_stats_test
/system_counter_offset_test
+/volatile_spte_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index af582d168621..bc0907de6638 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -103,6 +103,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test
TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += volatile_spte_test
TEST_GEN_PROGS_aarch64 += aarch64/arch_timer
TEST_GEN_PROGS_aarch64 += aarch64/debug-exceptions
@@ -122,6 +123,7 @@ TEST_GEN_PROGS_aarch64 += rseq_test
TEST_GEN_PROGS_aarch64 += set_memory_region_test
TEST_GEN_PROGS_aarch64 += steal_time
TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
+TEST_GEN_PROGS_aarch64 += volatile_spte_test
TEST_GEN_PROGS_s390x = s390x/memop
TEST_GEN_PROGS_s390x += s390x/resets
@@ -134,6 +136,7 @@ TEST_GEN_PROGS_s390x += kvm_page_table_test
TEST_GEN_PROGS_s390x += rseq_test
TEST_GEN_PROGS_s390x += set_memory_region_test
TEST_GEN_PROGS_s390x += kvm_binary_stats_test
+TEST_GEN_PROGS_s390x += volatile_spte_test
TEST_GEN_PROGS_riscv += demand_paging_test
TEST_GEN_PROGS_riscv += dirty_log_test
@@ -141,6 +144,7 @@ TEST_GEN_PROGS_riscv += kvm_create_max_vcpus
TEST_GEN_PROGS_riscv += kvm_page_table_test
TEST_GEN_PROGS_riscv += set_memory_region_test
TEST_GEN_PROGS_riscv += kvm_binary_stats_test
+TEST_GEN_PROGS_riscv += volatile_spte_test
TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
LIBKVM += $(LIBKVM_$(UNAME_M))
diff --git a/tools/testing/selftests/kvm/volatile_spte_test.c b/tools/testing/selftests/kvm/volatile_spte_test.c
new file mode 100644
index 000000000000..a4277216eb3d
--- /dev/null
+++ b/tools/testing/selftests/kvm/volatile_spte_test.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <syscall.h>
+#include <sys/ioctl.h>
+#include <sys/sysinfo.h>
+#include <asm/barrier.h>
+#include <linux/atomic.h>
+#include <linux/rseq.h>
+#include <linux/unistd.h>
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+#define VCPU_ID 0
+
+#define PAGE_SIZE 4096
+
+#define NR_ITERATIONS 1000
+
+#define MEM_FILE_NAME "volatile_spte_test_mem"
+#define MEM_FILE_MEMSLOT 1
+#define MEM_FILE_DATA_PATTERN 0xa5a5a5a5a5a5a5a5ul
+
+static const uint64_t gpa = (4ull * (1 << 30));
+
+static uint64_t *hva;
+
+static pthread_t mprotect_thread;
+static atomic_t rendezvous;
+static bool done;
+
+static void guest_code(void)
+{
+ uint64_t *gva = (uint64_t *)gpa;
+
+ while (!READ_ONCE(done)) {
+ WRITE_ONCE(*gva, 0);
+ GUEST_SYNC(0);
+
+ WRITE_ONCE(*gva, MEM_FILE_DATA_PATTERN);
+ GUEST_SYNC(1);
+ }
+}
+
+static void *mprotect_worker(void *ign)
+{
+ int i, r;
+
+ i = 0;
+ while (!READ_ONCE(done)) {
+ for ( ; atomic_read(&rendezvous) != 1; i++)
+ cpu_relax();
+
+ usleep((i % 10) + 1);
+
+ r = mprotect(hva, PAGE_SIZE, PROT_NONE);
+ TEST_ASSERT(!r, "Failed to mprotect file (hva = %lx), errno = %d (%s)",
+ (unsigned long)hva, errno, strerror(errno));
+
+ atomic_inc(&rendezvous);
+ }
+ return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+ uint64_t bitmap = -1ull, val;
+ int i, r, fd, nr_writes;
+ struct kvm_regs regs;
+ struct ucall ucall;
+ struct kvm_vm *vm;
+
+ vm = vm_create_default(VCPU_ID, 0, guest_code);
+ vcpu_regs_get(vm, VCPU_ID, &regs);
+ ucall_init(vm, NULL);
+
+ pthread_create(&mprotect_thread, NULL, mprotect_worker, 0);
+
+ fd = open(MEM_FILE_NAME, O_RDWR | O_CREAT, 0644);
+ TEST_ASSERT(fd >= 0, "Failed to open '%s', errno = %d (%s)",
+ MEM_FILE_NAME, errno, strerror(errno));
+
+ r = ftruncate(fd, PAGE_SIZE);
+ TEST_ASSERT(fd >= 0, "Failed to ftruncate '%s', errno = %d (%s)",
+ MEM_FILE_NAME, errno, strerror(errno));
+
+ hva = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ TEST_ASSERT(hva != MAP_FAILED, "Failed to map file, errno = %d (%s)",
+ errno, strerror(errno));
+
+ vm_set_user_memory_region(vm, MEM_FILE_MEMSLOT, KVM_MEM_LOG_DIRTY_PAGES,
+ gpa, PAGE_SIZE, hva);
+ virt_pg_map(vm, gpa, gpa);
+
+ for (i = 0, nr_writes = 0; i < NR_ITERATIONS; i++) {
+ fdatasync(fd);
+
+ vcpu_run(vm, VCPU_ID);
+ ASSERT_EQ(*hva, 0);
+ ASSERT_EQ(get_ucall(vm, VCPU_ID, &ucall), UCALL_SYNC);
+ ASSERT_EQ(ucall.args[1], 0);
+
+ /*
+ * The original hope/intent was to detect dropped Dirty bits by
+ * checking for missed file writeback. Sadly, the kernel is
+ * too smart and write-protects the primary MMU's PTEs, which
+ * zaps KVM's SPTEs and ultimately causes the folio/page to get
+ * marked dirty by the primary MMU when KVM re-faults on
+ * the page.
+ *
+ * Triggering swap _might_ be a way to detect failure, as swap
+ * is treated differently than "normal" files.
+ *
+ * RIP: 0010:kvm_unmap_gfn_range+0xf1/0x100 [kvm]
+ * Call Trace:
+ * <TASK>
+ * kvm_mmu_notifier_invalidate_range_start+0x11c/0x2c0 [kvm]
+ * __mmu_notifier_invalidate_range_start+0x7e/0x190
+ * page_mkclean_one+0x226/0x250
+ * rmap_walk_file+0x213/0x430
+ * folio_mkclean+0x95/0xb0
+ * folio_clear_dirty_for_io+0x5d/0x1c0
+ * mpage_submit_page+0x1f/0x70
+ * mpage_process_page_bufs+0xf8/0x110
+ * mpage_prepare_extent_to_map+0x1e3/0x420
+ * ext4_writepages+0x277/0xca0
+ * do_writepages+0xd1/0x190
+ * filemap_fdatawrite_wbc+0x62/0x90
+ * file_write_and_wait_range+0xa3/0xe0
+ * ext4_sync_file+0xdb/0x340
+ * do_fsync+0x38/0x70
+ * __x64_sys_fdatasync+0x13/0x20
+ * do_syscall_64+0x31/0x50
+ * entry_SYSCALL_64_after_hwframe+0x44/0xae
+ * </TASK>
+ *
+ * RIP: 0010:__folio_mark_dirty+0x266/0x310
+ * Call Trace:
+ * <TASK>
+ * mark_buffer_dirty+0xe7/0x140
+ * __block_commit_write.isra.0+0x59/0xc0
+ * block_page_mkwrite+0x15a/0x170
+ * ext4_page_mkwrite+0x485/0x620
+ * do_page_mkwrite+0x54/0x150
+ * __handle_mm_fault+0xe2a/0x1600
+ * handle_mm_fault+0xbd/0x280
+ * do_user_addr_fault+0x192/0x600
+ * exc_page_fault+0x6c/0x140
+ * asm_exc_page_fault+0x1e/0x30
+ * </TASK>
+ */
+ /* fdatasync(fd); */
+
+ /*
+ * Clear the dirty log to coerce KVM into write-protecting the
+ * SPTE (or into clearing dirty bits when using PML).
+ */
+ kvm_vm_clear_dirty_log(vm, MEM_FILE_MEMSLOT, &bitmap, 0, 1);
+
+ atomic_inc(&rendezvous);
+
+ usleep(i % 10);
+
+ r = _vcpu_run(vm, VCPU_ID);
+
+ while (atomic_read(&rendezvous) != 2)
+ cpu_relax();
+
+ atomic_set(&rendezvous, 0);
+
+ fdatasync(fd);
+ mprotect(hva, PAGE_SIZE, PROT_READ | PROT_WRITE);
+
+ val = READ_ONCE(*hva);
+ if (r) {
+ TEST_ASSERT(!val, "Memory should be zero, write faulted\n");
+ vcpu_regs_set(vm, VCPU_ID, &regs);
+ continue;
+ }
+ nr_writes++;
+ TEST_ASSERT(val == MEM_FILE_DATA_PATTERN,
+ "Memory doesn't match data pattern, want 0x%lx, got 0x%lx",
+ MEM_FILE_DATA_PATTERN, val);
+ ASSERT_EQ(get_ucall(vm, VCPU_ID, &ucall), UCALL_SYNC);
+ ASSERT_EQ(ucall.args[1], 1);
+ }
+
+ printf("%d of %d iterations wrote memory\n", nr_writes, NR_ITERATIONS);
+
+ atomic_inc(&rendezvous);
+ WRITE_ONCE(done, true);
+
+ pthread_join(mprotect_thread, NULL);
+
+ kvm_vm_free(vm);
+
+ return 0;
+}
+
--
2.36.0.rc2.479.g8af0fa9b8e-goog