From: James Houghton <jthoughton@google.com>
To: Andrew Morton, Paolo Bonzini
Cc: Ankit Agrawal, Axel Rasmussen, Catalin Marinas, David Matlack,
	David Rientjes, James Houghton, James Morse, Jason Gunthorpe,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Sean Christopherson, Shaoqin Huang, Suzuki K Poulose,
	Wei Xu, Will Deacon, Yu Zhao, Zenghui Yu, kvmarm@lists.linux.dev,
	kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH v6 02/11] KVM: x86: Relax locking for kvm_test_age_gfn and kvm_age_gfn
Date: Wed, 24 Jul 2024 01:10:27 +0000
Message-ID: <20240724011037.3671523-3-jthoughton@google.com>
In-Reply-To: <20240724011037.3671523-1-jthoughton@google.com>
References: <20240724011037.3671523-1-jthoughton@google.com>

Walk the TDP MMU in an RCU read-side critical section. This requires a
way to do RCU-safe walking of the tdp_mmu_roots; do this with a new
macro. The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for
the fact that kvm_age_gfn can now locklessly update the Accessed bit
and the R/X bits.

If the cmpxchg for marking the spte for access tracking fails, we
simply retry if the spte is still a leaf PTE. If it isn't, we return
false to continue the walk.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: David Matlack
---
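As a review aid, here is a self-contained userspace sketch of the two
lockless update patterns the patch relies on: an atomic AND to clear
the Accessed bit when A/D bits are enabled, and a cmpxchg retry loop
for access-tracked SPTEs. The bit layout and the helper names
(SPTE_*, clear_accessed_atomic(), age_spte()) are made up purely for
illustration; the kernel code uses atomic64_fetch_and() and
__tdp_mmu_set_spte_atomic() rather than C11 atomics.

/*
 * Illustrative only: a userspace model of the lockless SPTE aging
 * done by this patch. Not kernel code.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SPTE_PRESENT	(1ull << 0)	/* stand-in for is_shadow_present_pte() */
#define SPTE_ACCESSED	(1ull << 1)	/* stand-in for shadow_accessed_mask */
#define SPTE_RX		(3ull << 2)	/* stand-in for the R/X permission bits */

/*
 * A/D bits enabled: clear Accessed with a single atomic AND, as
 * tdp_mmu_clear_spte_bits_atomic() does. Concurrent setters of other
 * bits (hardware A/D updates, fast page faults) cannot be lost.
 */
static uint64_t clear_accessed_atomic(_Atomic uint64_t *sptep)
{
	return atomic_fetch_and(sptep, ~(uint64_t)SPTE_ACCESSED);
}

/*
 * Access tracking: cmpxchg and retry while the entry is still a
 * present leaf, mirroring the retry/goto logic in age_gfn_range().
 */
static bool age_spte(_Atomic uint64_t *sptep)
{
	uint64_t old = atomic_load(sptep);

	for (;;) {
		if (!(old & SPTE_ACCESSED))
			return false;	/* nothing to do */
		if (!(old & SPTE_PRESENT))
			return false;	/* zapped under us: continue the walk */

		uint64_t new = old & ~(SPTE_ACCESSED | SPTE_RX);
		if (atomic_compare_exchange_strong(sptep, &old, new))
			return true;	/* the old SPTE was accessed */
		/* cmpxchg failed; 'old' now holds the current value. Retry. */
	}
}

int main(void)
{
	_Atomic uint64_t spte = SPTE_PRESENT | SPTE_ACCESSED | SPTE_RX;

	printf("aged (access-tracking path): %d\n", age_spte(&spte));

	atomic_store(&spte, SPTE_PRESENT | SPTE_ACCESSED);
	printf("was accessed (A/D path): %d\n",
	       !!(clear_accessed_atomic(&spte) & SPTE_ACCESSED));
	return 0;
}

In both paths a concurrent writer cannot be lost: the AND leaves
unrelated bits intact, and a failed cmpxchg reloads the current value
before deciding whether to retry or give up.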
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 ++++-
 arch/x86/kvm/mmu/tdp_iter.h     | 27 +++++++------
 arch/x86/kvm/mmu/tdp_mmu.c      | 67 +++++++++++++++++++++++++--------
 5 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..096988262005 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1456,6 +1456,7 @@ struct kvm_arch {
	 * tdp_mmu_page set.
	 *
	 * For reads, this list is protected by:
+	 *	RCU alone or
	 *	the MMU lock in read mode + RCU or
	 *	the MMU lock in write mode
	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4287a8071a3a..6ac43074c5e9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM
	depends on X86_LOCAL_APIC
	select KVM_COMMON
	select KVM_GENERIC_MMU_NOTIFIER
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
	select HAVE_KVM_IRQCHIP
	select HAVE_KVM_PFNCACHE
	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..7b93ce8f0680 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1633,8 +1633,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
	bool young = false;

-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}

	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1646,8 +1649,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
	bool young = false;

-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+		write_unlock(&kvm->mmu_lock);
+	}

	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..510936a8455a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
	return xchg(rcu_dereference(sptep), new_spte);
 }

+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 {
	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -32,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }

 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and R/X bits can be
+ * cleared by age_gfn_range.
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
@@ -46,8 +54,7 @@
 static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
 {
	return is_shadow_present_pte(old_spte) &&
-	       is_last_spte(old_spte, level) &&
-	       spte_has_volatile_bits(old_spte);
+	       is_last_spte(old_spte, level);
 }

 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
					  u64 mask, int level)
 {
-	atomic64_t *sptep_atomic;
-
-	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-	}
+	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);

	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
	return old_spte;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..3f13b2db53de 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,6 +29,11 @@ static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,

	return true;
 }
+static __always_inline bool kvm_lockdep_assert_rcu_read_lock_held(void)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return true;
+}

 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 {
@@ -178,6 +183,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
			 ((_only_valid) && (_root)->role.invalid))) {	\
		} else

+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id)			\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if (kvm_lockdep_assert_rcu_read_lock_held() &&		\
+		    (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id)) {	\
+		} else
+
 #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)

@@ -1224,6 +1238,27 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
	return ret;
 }

+static __always_inline bool kvm_tdp_mmu_handle_gfn_lockless(
+		struct kvm *kvm,
+		struct kvm_gfn_range *range,
+		tdp_handler_t handler)
+{
+	struct kvm_mmu_page *root;
+	struct tdp_iter iter;
+	bool ret = false;
+
+	rcu_read_lock();
+
+	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id) {
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+			ret |= handler(kvm, &iter, range);
+	}
+
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  * if any of the GFNs in the range have been accessed.
@@ -1237,28 +1272,30 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 {
	u64 new_spte;

+retry:
	/* If we have a non-accessed entry we don't need to change the pte. */
	if (!is_accessed_spte(iter->old_spte))
		return false;

	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+						shadow_accessed_mask);
		new_spte = iter->old_spte & ~shadow_accessed_mask;
	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
+		new_spte = mark_spte_for_access_track(iter->old_spte);
+		if (__tdp_mmu_set_spte_atomic(iter, new_spte)) {
+			/*
+			 * The cmpxchg failed. If the spte is still a
+			 * last-level spte, we can safely retry.
+			 */
+			if (is_shadow_present_pte(iter->old_spte) &&
+			    is_last_spte(iter->old_spte, iter->level))
+				goto retry;
+			/* Otherwise, continue walking. */
+			return false;
+		}
		if (is_writable_pte(iter->old_spte))
			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
-		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
	}

	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1268,7 +1305,7 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,

 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+	return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, age_gfn_range);
 }

 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1279,7 +1316,7 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,

 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+	return kvm_tdp_mmu_handle_gfn_lockless(kvm, range, test_age_gfn);
 }

 /*
-- 
2.46.0.rc1.232.g9752f9e123-goog