[PATCH v2 0/2] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

Isaku Yamahata posted 2 patches 1 month, 2 weeks ago
non-present => present change.
[PATCH v2 0/2] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change
Posted by Isaku Yamahata 1 month, 2 weeks ago
This patch is for kvm-coco-queue.   Invoke the leaf spte change hook only when
non-present => present change.

Patches:
1: Add WARN_ON_ONCE()
2: Bug fix

v2:
- Split out WARN_ON_ONCE() into an independent patch
- Removed pr_debug()

v1: https://lore.kernel.org/kvm/6eecc450d0326c9bedfbb34096a0279410923c8d.1726182754.git.isaku.yamahata@intel.com/

As the test requires the base TDX KVM patch series and the TDX KVM kselftesets,
this patch series doesn't include it.  [1] suggested to require the AD bits
supports for TDX.  Don't include the following chage because this patch series
is for kvm-coco-queue.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 109a7440aa68..5499edb80b9a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3275,7 +3275,7 @@ static int __init __tdx_bringup(void)
        const struct tdx_sys_info_td_conf *td_conf;
        int r, i;
 
-       if (!tdp_mmu_enabled || !enable_mmio_caching)
+       if (!tdp_mmu_enabled || !enable_mmio_caching || !enable_ept_ad_bits)
                return -EOPNOTSUPP;
 
        if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {


Observation of the issue
------------------------
An issue was reported on an out-of-tree kernel with an older version of TDX
support, but that uses similar code to the current existing mirror/external
MMU support.  It can't be triggered without future patches.  In the out of
tree kernel it appears as a KVM_BUG_ON() getting hit in TDX code when
set_external_spte() was called on a PTE that was already present.
Investigation turned up that:

 1. It was due to the AD state machine trying to update the dirty bit
 2. A problem in the currently queued mirror/external PTE, but that can't be
    hit until more TDX support is added.

In more detail, the spotted issue was caused by two vcpus simultaneously
faulting for same GFN:

vcpu0                                   vcpu1
-----                                   -----
ept violation for GFN                   ept violation for GFN
encounter !present mirror PTE
set_external_spte()
re-enter TD, and accept GFN
                                        encounter present mirror PTE
                                        trigger AD state machine transition
                                        see old/new PTEs are different
                                        call set_external_spte() again

The TDX external TDP backend couldn't handle the second call to
set_external_spte().  In recent TDX development branches, the
TDH.MEM.PAGE.AUG() SEAMCALL() hook would fail unexpectedly with
TDX_EPT_ENTRY_STATE_INCORRECT + SPTE PENDING or PRESENT.  However, the
callback assumes that it's called only when non-present => present leaf
SPTE change.  It shows as a triggering a KVM_BUG_ON() with a stacktrace
like:

  WARNING: CPU: 4 PID: 3700 at tdx_mem_page_aug+0x16d/0x560 [kvm_intel]
  IP: 0010:tdx_mem_page_aug+0x16d/0x560 [kvm_intel]

  Call Trace:
  set_external_spte_present+0x244/0x7e0
  tdp_mmu_map_handle_target_level+0x460/0xd60
  kvm_tdp_mmu_map+0x9c6/0xdc0
  kvm_tdp_mmu_page_fault+0x32b/0x3e0
  kvm_mmu_do_page_fault+0x4e5/0x6a0
  kvm_mmu_page_fault+0x1a0/0x5e0
  vcpu_enter_guest+0x1f5f/0x3900
  vcpu_run+0x133/0xb60
  kvm_arch_vcpu_ioctl_run+0x8ef/0x1650
  kvm_vcpu_ioctl+0x4f9/0xb70
  __x64_sys_ioctl+0x132/0x1a0
  do_syscall_64+0xc1/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Design of MMU mirror/external
-----------------------------
In current mirror/external MMU support, the set_external_spte() allows
external PTEs to be set as present, but doesn't allow configuration of any
other PTE bits.  remove_external_spte() allows external PTEs to be zapped.

Based on the fact that external PTE callbacks are exposed to essentially
configure two PTE states (present or not present), and that mirror PTEs
record whether the external PTE is present or not, future patches were
planned to introduce a TDX backend for mirror/external page tables to not
expect set_external_spte() calls on already present external PTEs.  To
accommodate this future code, steps were taken in the mirror/external
support in the core MMU to prevent permission bits changing on such PTEs.

However, the AD state machine scenario was missed.  When the multiple vCPUs
fault in the same GFN simultaneously, the hook is called multiple times
with some bits changed while both old/new SPTEs have the present bits.  The
leaf SPTE is a complex state machine in general, and the value can change
with software available bits and hardware bits.

Solution
--------
There are several options to address this:
- Tame the SPTE state machine so that SPTE change won't happen for mirrored
  page table.
  PRO: It's conceptually natural to disable D bit support because mirrored
       page table doesn't support AD bits.
  CON: Not only D bit but also other bits can be modified.
       setting strict kvm_page_table.role.ad_disabled = true doesn't work.
       It requires to change the SPTE more deeply.

- Add a check to the TDP MMU so that it won't call the hook unnecessarily
  PRO: Direct fix that shares set_external_spte() expectations in core
       MMU code.
  CON: Hard code in the TDP MMU when the hook is needed.

- Pass old and new SPTE values to the hooks, add a check to the backend
  PRO: The backend can determine whether the callback is needed.
  CON: The KVM MMU implementation details leak to the backend because
       The SPTE encoding is specific to the KVM MMU implementation.
       And it's subject to change in the future.
       For example, is_shadow_present_pte() needs to be exported from the
       KVM MMU to the backend.

The first choice is out because it doesn't easily handle the problem.
Choose the second option over the third choice because the current
interesting change is only non-present => present because we don't support
dirty page tracking by write protect and large page yet by TDX KVM for now.
(TDX supports them.) They're future work for TDX KVM.

Care only about the leaf SPTE because non-leaf doesn't have a state
machine.

[1] https://lore.kernel.org/kvm/ZuOCXarfAwPjYj19@google.com/

Isaku Yamahata (2):
  KVM: x86/tdp_mmu: Add WARN_ON_ONCE() in
    tdp_mmu_map_handle_target_level()
  KVM: x86/tdp_mmu: Trigger the callback only when an interesting change

 arch/x86/kvm/mmu/spte.h    | 11 +++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++++++++++++++++----
 2 files changed, 31 insertions(+), 4 deletions(-)


base-commit: 909f9d422f59f863d7b6e4e2c6e57abb97a27d4d
-- 
2.45.2
[PATCH v2 1/2] KVM: x86/tdp_mmu: Add WARN_ON_ONCE() in tdp_mmu_map_handle_target_level()
Posted by Isaku Yamahata 1 month, 2 weeks ago
Add WARN_ON_ONCE() in tdp_mmu_map_handle_target_level() to check SPTE
validity made by make_spte() suggested at [1].

The possible SPTE change at present leaf is,
- Non leaf => leaf (large page).
- Read fault (SPTE doesn't have write permission) => write fault.
  Hardening permission, write protect for example, goes through zapping.
- Access tracking when AD bits aren't supported.
- AD bit change.  This should be removed when make_spte() is fixed [1].

[1] https://lore.kernel.org/kvm/ZuOCXarfAwPjYj19@google.com/
  One idea would be to WARN and skip setting the SPTE in
  tdp_mmu_map_handle_target_level().  I.e. WARN and ignore 1=>0 transitions
  for Writable and Dirty bits, and then drop the TLB flush (assuming the
  above patch lands).

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v2:
- Split out into an independent patch to make common TDP MMU logic. (Sean)
---
 arch/x86/kvm/mmu/spte.h    | 11 +++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 14 +++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a72f0e3bde17..85fd99c92960 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -214,6 +214,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  */
 #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
 
+#define AD_SPTE_IGNORE_CHANGE_MASK				\
+	(PT_WRITABLE_MASK |                                     \
+	 shadow_host_writable_mask | shadow_mmu_writable_mask | \
+	 shadow_dirty_mask | shadow_accessed_mask)
+
+#define  ACCESS_TRACK_SPTE_IGNORE_CHANGE_MASK	\
+	(AD_SPTE_IGNORE_CHANGE_MASK |		\
+	 shadow_acc_track_mask |		\
+	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<	\
+	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT))
+
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 37b3769a5d32..1da3df517522 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1186,11 +1186,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
-	else
+	else {
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
 					 fault->pfn, iter->old_spte, fault->prefetch, true,
 					 fault->map_writable, &new_spte);
 
+		WARN_ON_ONCE(kvm_ad_enabled() &&
+			     is_shadow_present_pte(iter->old_spte) &&
+			     is_large_pte(iter->old_spte) == is_large_pte(new_spte) &&
+			     ~AD_SPTE_IGNORE_CHANGE_MASK &
+			     (iter->old_spte ^ new_spte));
+		WARN_ON_ONCE(!kvm_ad_enabled() &&
+			     is_shadow_present_pte(iter->old_spte) &&
+			     is_large_pte(iter->old_spte) == is_large_pte(new_spte) &&
+			     ~ACCESS_TRACK_SPTE_IGNORE_CHANGE_MASK &
+			     (iter->old_spte ^ new_spte));
+	}
+
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
 	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
-- 
2.45.2
[PATCH v2 2/2] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change
Posted by Isaku Yamahata 1 month, 2 weeks ago
Call the set_external_spte() hook only when the present bit is changed from
non-present to present.

Observation of the issue
------------------------
An issue was reported on an out-of-tree kernel with an older version of TDX
support, but that uses similar code to the current existing mirror/external
MMU support.  It can't be triggered without future patches.  In the out of
tree kernel it appears as a KVM_BUG_ON() getting hit in TDX code when
set_external_spte() was called on a PTE that was already present.
Investigation turned up that:

 1. It was due to the AD state machine trying to update the dirty bit
 2. A problem in the currently queued mirror/external PTE, but that can't be
    hit until more TDX support is added.

In more detail, the spotted issue was caused by two vcpus simultaneously
faulting for same GFN:

vcpu0                                   vcpu1
-----                                   -----
ept violation for GFN                   ept violation for GFN
encounter !present mirror PTE
set_external_spte()
re-enter TD, and accept GFN
                                        encounter present mirror PTE
                                        trigger AD state machine transition
                                        see old/new PTEs are different
                                        call set_external_spte() again

The TDX external TDP backend couldn't handle the second call to
set_external_spte().  In recent TDX development branches, the
TDH.MEM.PAGE.AUG() SEAMCALL() hook would fail unexpectedly with
TDX_EPT_ENTRY_STATE_INCORRECT + SPTE PENDING or PRESENT.  However, the
callback assumes that it's called only when non-present => present leaf
SPTE change.  It shows as a triggering a KVM_BUG_ON() with a stacktrace
like:

  WARNING: CPU: 4 PID: 3700 at tdx_mem_page_aug+0x16d/0x560 [kvm_intel]
  IP: 0010:tdx_mem_page_aug+0x16d/0x560 [kvm_intel]

  Call Trace:
  set_external_spte_present+0x244/0x7e0
  tdp_mmu_map_handle_target_level+0x460/0xd60
  kvm_tdp_mmu_map+0x9c6/0xdc0
  kvm_tdp_mmu_page_fault+0x32b/0x3e0
  kvm_mmu_do_page_fault+0x4e5/0x6a0
  kvm_mmu_page_fault+0x1a0/0x5e0
  vcpu_enter_guest+0x1f5f/0x3900
  vcpu_run+0x133/0xb60
  kvm_arch_vcpu_ioctl_run+0x8ef/0x1650
  kvm_vcpu_ioctl+0x4f9/0xb70
  __x64_sys_ioctl+0x132/0x1a0
  do_syscall_64+0xc1/0x1d0
  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Design of MMU mirror/external
-----------------------------
In current mirror/external MMU support, the set_external_spte() allows
external PTEs to be set as present, but doesn't allow configuration of any
other PTE bits.  remove_external_spte() allows external PTEs to be zapped.

Based on the fact that external PTE callbacks are exposed to essentially
configure two PTE states (present or not present), and that mirror PTEs
record whether the external PTE is present or not, future patches were
planned to introduce a TDX backend for mirror/external page tables to not
expect set_external_spte() calls on already present external PTEs.  To
accommodate this future code, steps were taken in the mirror/external
support in the core MMU to prevent permission bits changing on such PTEs.

However, the AD state machine scenario was missed.  When the multiple vCPUs
fault in the same GFN simultaneously, the hook is called multiple times
with some bits changed while both old/new SPTEs have the present bits.  The
leaf SPTE is a complex state machine in general, and the value can change
with software available bits and hardware bits.

Solution
--------
There are several options to address this:
- Tame the SPTE state machine so that SPTE change won't happen for mirrored
  page table.
  PRO: It's conceptually natural to disable D bit support because mirrored
       page table doesn't support AD bits.
  CON: Not only D bit but also other bits can be modified.
       setting strict kvm_page_table.role.ad_disabled = true doesn't work.
       It requires to change the SPTE more deeply.

- Add a check to the TDP MMU so that it won't call the hook unnecessarily
  PRO: Direct fix that shares set_external_spte() expectations in core
       MMU code.
  CON: Hard code in the TDP MMU when the hook is needed.

- Pass old and new SPTE values to the hooks, add a check to the backend
  PRO: The backend can determine whether the callback is needed.
  CON: The KVM MMU implementation details leak to the backend because
       The SPTE encoding is specific to the KVM MMU implementation.
       And it's subject to change in the future.
       For example, is_shadow_present_pte() needs to be exported from the
       KVM MMU to the backend.

The first choice is out because it doesn't easily handle the problem.
Choose the second option over the third choice because the current
interesting change is only non-present => present because we don't support
dirty page tracking by write protect and large page yet by TDX KVM for now.
(TDX supports them.) They're future work for TDX KVM.

Care only about the leaf SPTE because non-leaf doesn't have a state
machine.

Reported-by: Sagi Shahar <sagis@google.com>
Fixes: b6bcd88ad43a ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
[log heavily edited by Rick]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v2:
- Removed KVM_BUG_ON() into another patch (Sean)
- Removed pr_debug() (Sean)

v1:
https://lore.kernel.org/kvm/6eecc450d0326c9bedfbb34096a0279410923c8d.1726182754.git.isaku.yamahata@intel.com/
---
 arch/x86/kvm/mmu/tdp_mmu.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1da3df517522..c7f251549973 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -503,8 +503,6 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
 	int ret = 0;
 
-	KVM_BUG_ON(was_present, kvm);
-
 	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * We need to lock out other updates to the SPTE until the external
@@ -519,10 +517,16 @@ static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sp
 	 * external page table, or leaf.
 	 */
 	if (is_leaf) {
-		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+		/*
+		 * SPTE is state machine with software available bits used.
+		 * Check if the change is interesting to the backend.
+		 */
+		if (!was_present)
+			ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
 	} else {
 		void *external_spt = get_external_spt(gfn, new_spte, level);
 
+		KVM_BUG_ON(was_present, kvm);
 		KVM_BUG_ON(!external_spt, kvm);
 		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
 	}
-- 
2.45.2