From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier,
 Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn()
Date: Tue, 4 Feb 2025 00:40:31 +0000
Message-ID: <20250204004038.1680123-5-jthoughton@google.com>
In-Reply-To: <20250204004038.1680123-1-jthoughton@google.com>
References: <20250204004038.1680123-1-jthoughton@google.com>

Walk the TDP MMU in an RCU read-side critical section, without holding
mmu_lock, when harvesting and potentially updating age information on
SPTEs. This requires a way to do RCU-safe walking of the tdp_mmu_roots;
do this with a new macro. The SPTE modifications are now always done
atomically.

spte_has_volatile_bits() no longer checks the Accessed bit at all. The
bit can now be set and cleared without taking the mmu_lock, but dropping
Accessed bit updates is already tolerated: the TLB is not invalidated
after clearing the Accessed bit.

If the cmpxchg that marks the SPTE for access tracking fails, leave the
SPTE as-is and report the page as young; if the SPTE is being actively
modified, it is most likely young anyway.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao
Signed-off-by: James Houghton
Reviewed-by: David Matlack
Reviewed-by: James Houghton
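As a standalone illustration of the tolerated-failure aging pattern
described above, here is a minimal userspace sketch using C11 atomics.
TOY_PTE_ACCESSED and the toy_* helpers are invented for this example;
they are not KVM's SPTE layout or functions. It only shows the shape of
the idea: clear the accessed bit atomically, and if a racing update wins
the cmpxchg, keep the old value and simply report the page as young.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_PTE_ACCESSED	(1ull << 5)	/* made-up "accessed" bit */

/* A/D-style aging: one atomic fetch-and-clear, no lock needed. */
static bool toy_age_pte(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_fetch_and(pte, ~TOY_PTE_ACCESSED);

	return old & TOY_PTE_ACCESSED;
}

/*
 * Access-tracking-style aging: one cmpxchg attempt. If a concurrent
 * update wins the race, keep the old value and still report "young",
 * mirroring how a failed cmpxchg is tolerated in the patch.
 */
static bool toy_age_pte_cmpxchg(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_load(pte);

	if (!(old & TOY_PTE_ACCESSED))
		return false;

	atomic_compare_exchange_strong(pte, &old, old & ~TOY_PTE_ACCESSED);
	return true;
}

int main(void)
{
	_Atomic uint64_t pte = TOY_PTE_ACCESSED | 0x1000;

	printf("young (fetch_and): %d\n", toy_age_pte(&pte));
	printf("young (cmpxchg):   %d\n", toy_age_pte_cmpxchg(&pte));
	return 0;
}

In the patch itself the equivalent roles are played by
tdp_mmu_clear_spte_bits_atomic() and __tdp_mmu_set_spte_atomic(); the
sketch only demonstrates why a lost cmpxchg is harmless for aging.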
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 +++++++--
 arch/x86/kvm/mmu/spte.c         | 10 +++++++--
 arch/x86/kvm/mmu/tdp_iter.h     |  9 +++++----
 arch/x86/kvm/mmu/tdp_mmu.c      | 36 +++++++++++++++++++++++----------
 6 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f378cd43241c..0e44fc1cec0d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1479,6 +1479,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *  RCU alone or
 	 *  the MMU lock in read mode + RCU or
 	 *  the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ea2c4f21c1ca..f0a60e59c884 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,6 +22,7 @@ config KVM_X86
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+	select KVM_MMU_NOTIFIER_AGING_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a45ae60e84ab..7779b49f386d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1592,8 +1592,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1605,8 +1608,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 22551e2f1d00..e984b440c0f0 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -142,8 +142,14 @@ bool spte_has_volatile_bits(u64 spte)
 		return true;
 
 	if (spte_ad_enabled(spte)) {
-		if (!(spte & shadow_accessed_mask) ||
-		    (is_writable_pte(spte) && !(spte & shadow_dirty_mask)))
+		/*
+		 * Do not check the Accessed bit. It can be set (by the CPU)
+		 * and cleared (by kvm_tdp_mmu_age_spte()) without holding
+		 * the mmu_lock, but when clearing the Accessed bit, we do
+		 * not invalidate the TLB, so we can already miss Accessed bit
+		 * updates.
+		 */
+		if (is_writable_pte(spte) && !(spte & shadow_dirty_mask))
 			return true;
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 9135b035fa40..05e9d678aac9 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
+ * cleared by age_gfn_range().
  *
  * Note, non-leaf SPTEs do have Accessed bits and those bits are
  * technically volatile, but KVM doesn't consume the Accessed bit of
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 046b6ba31197..c9778c3e6ecd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -193,6 +193,19 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		 !tdp_mmu_root_match((_root), (_types)))) {		\
 	} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_write().
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id, _types)		\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) ||	\
+		    !tdp_mmu_root_match((_root), (_types))) {		\
+		} else
+
 #define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id)	\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
 
@@ -1332,21 +1345,22 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
  * from the clear_young() or clear_flush_young() notifier, which uses the
  * return value to determine if the page has been accessed.
  */
-static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
+static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
 {
 	u64 new_spte;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+						shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
 		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
+		/*
+		 * It is safe for the following cmpxchg to fail. Leave the
+		 * Accessed bit set, as the spte is most likely young anyway.
+		 */
+		if (__tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
+			return;
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
@@ -1371,9 +1385,9 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 	 * valid roots!
 	 */
 	WARN_ON(types & ~KVM_VALID_ROOTS);
-	__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
-		guard(rcu)();
 
+	guard(rcu)();
+	for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id, types) {
 		tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end) {
 			if (!is_accessed_spte(iter.old_spte))
 				continue;
@@ -1382,7 +1396,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 				return true;
 
 			ret = true;
-			kvm_tdp_mmu_age_spte(&iter);
+			kvm_tdp_mmu_age_spte(kvm, &iter);
 		}
 	}
 
-- 
2.48.1.362.g079036d154-goog
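For readers unfamiliar with RCU-protected list walks, the new
for_each_tdp_mmu_root_rcu() macro boils down to list_for_each_entry_rcu()
inside an RCU read-side critical section. Below is a standalone sketch of
that pattern using userspace RCU (liburcu); the toy_root structure and
toy_roots list are invented for the demo, and the exact headers and link
flag (urcu.h, urcu/rculist.h, -lurcu or -lurcu-memb) depend on the
installed liburcu version.

#include <stdio.h>
#include <urcu.h>		/* rcu_read_lock()/rcu_read_unlock() */
#include <urcu/list.h>		/* CDS_LIST_HEAD() */
#include <urcu/rculist.h>	/* cds_list_add_rcu(), cds_list_for_each_entry_rcu() */

struct toy_root {
	int as_id;
	struct cds_list_head link;
};

static CDS_LIST_HEAD(toy_roots);

int main(void)
{
	struct toy_root a = { .as_id = 0 }, b = { .as_id = 1 };
	struct toy_root *root;

	rcu_register_thread();

	/* A writer publishes roots with the RCU-aware list helpers. */
	cds_list_add_rcu(&a.link, &toy_roots);
	cds_list_add_rcu(&b.link, &toy_roots);

	/* A reader walks the list with only rcu_read_lock() held, no mutex. */
	rcu_read_lock();
	cds_list_for_each_entry_rcu(root, &toy_roots, link)
		printf("visiting root, as_id=%d\n", root->as_id);
	rcu_read_unlock();

	rcu_unregister_thread();
	return 0;
}

The kernel-side macro additionally filters roots by address-space id and
root type, which the sketch omits.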