Reply-To: Sean Christopherson
Date: Tue, 19 Mar 2024 17:50:23 -0700
In-Reply-To: <20240320005024.3216282-1-seanjc@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240320005024.3216282-1-seanjc@google.com>
X-Mailer: git-send-email 2.44.0.291.gc1ea87d7ee-goog
Message-ID: <20240320005024.3216282-4-seanjc@google.com>
Subject: [RFC PATCH 3/4] KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs
From: Sean Christopherson
To: Paolo Bonzini, Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, David Hildenbrand,
    David Matlack, David Stevens, Matthew Wilcox

Mark folios as accessed only when zapping leaf SPTEs, which is a rough
heuristic for "only in response to an mmu_notifier invalidation".  Page
aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to
be precise for correctness, and re-marking folios as accessed when zapping
entire roots or when zapping collapsible SPTEs is expensive and adds very
little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
marks folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked accessed when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.  Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disables dirty logging is a side effect of
KVM behavior, not a deliberate goal.

Mark folios accessed when the primary MMU might be invalidating mappings,
i.e. keep the kvm_set_pfn_accessed() calls on that path instead of
dropping them entirely, as such zaps are not KVM-initiated and might
actually be related to page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Signed-off-by: Sean Christopherson
---
 Documentation/virt/kvm/locking.rst | 76 +++++++++++++++---------------
 arch/x86/kvm/mmu/mmu.c             |  4 +-
 arch/x86/kvm/mmu/tdp_mmu.c         |  7 ++-
 3 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..8b3bb9fe60bf 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -138,49 +138,51 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn.
 
 2) Dirty bit tracking
 
-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
 spte is read-only and the Accessed bit has already been set since the
 Accessed bit and Dirty bit can not be lost.
 
 But it is not true after fast page fault since the spte can be marked
 writable between reading spte and updating spte. Like below case:
 
-+------------------------------------------------------------------------+
-| At the beginning::                                                      |
-|                                                                         |
-|          spte.W = 0                                                     |
-|          spte.Accessed = 1                                              |
-+------------------------------------+-----------------------------------+
-| CPU 0:                             | CPU 1:                            |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits()::   |                                   |
-|                                    |                                   |
-|  old_spte = *spte;                 |                                   |
-|                                    |                                   |
-|                                    |                                   |
-|  /* 'if' condition is satisfied. */|                                   |
-|  if (old_spte.Accessed == 1 &&     |                                   |
-|       old_spte.W == 0)             |                                   |
-|      spte = 0ull;                  |                                   |
-+------------------------------------+-----------------------------------+
-|                                    | on fast page fault path::         |
-|                                    |                                   |
-|                                    |    spte.W = 1                     |
-|                                    |                                   |
-|                                    | memory write on the spte::        |
-|                                    |                                   |
-|                                    |    spte.Dirty = 1                 |
-+------------------------------------+-----------------------------------+
-|  ::                                |                                   |
-|                                    |                                   |
-|   else                             |                                   |
-|     old_spte = xchg(spte, 0ull)    |                                   |
-|     if (old_spte.Accessed == 1)    |                                   |
-|     kvm_set_pfn_accessed(spte.pfn);|                                   |
-|     if (old_spte.Dirty == 1)       |                                   |
-|      kvm_set_pfn_dirty(spte.pfn);  |                                   |
-|      OOPS!!!                       |                                   |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning::                                                       |
+|                                                                          |
+|          spte.W = 0                                                      |
+|          spte.Accessed = 1                                               |
++-------------------------------------+-----------------------------------+
+| CPU 0:                              | CPU 1:                            |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update()::              |                                   |
+|                                     |                                   |
+|  old_spte = *spte;                  |                                   |
+|                                     |                                   |
+|                                     |                                   |
+|  /* 'if' condition is satisfied. */ |                                   |
+|  if (old_spte.Accessed == 1 &&      |                                   |
+|       old_spte.W == 0)              |                                   |
+|      spte = new_spte;               |                                   |
++-------------------------------------+-----------------------------------+
+|                                     | on fast page fault path::         |
+|                                     |                                   |
+|                                     |    spte.W = 1                     |
+|                                     |                                   |
+|                                     | memory write on the spte::        |
+|                                     |                                   |
+|                                     |    spte.Dirty = 1                 |
++-------------------------------------+-----------------------------------+
+|  ::                                 |                                   |
+|                                     |                                   |
+|   else                              |                                   |
+|     old_spte = xchg(spte, new_spte);|                                   |
+|     if (old_spte.Accessed &&        |                                   |
+|         !new_spte.Accessed)         |                                   |
+|        flush = true;                |                                   |
+|     if (old_spte.Dirty &&           |                                   |
+|         !new_spte.Dirty)            |                                   |
+|        flush = true;                |                                   |
+|        OOPS!!!                      |                                   |
++-------------------------------------+-----------------------------------+
 
 The Dirty bit is lost in this case.
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2240b94ff6..0a6c6619d213 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -539,10 +539,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * to guarantee consistency between TLB and page tables.
 	 */
 
-	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
 		flush = true;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5866a664f46e..340d5af454c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -520,10 +520,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-
-	if (was_leaf && is_accessed_spte(old_spte) &&
-	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
 /*
@@ -841,6 +837,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_iter_set_spte(kvm, &iter, 0);
 
+		if (is_accessed_spte(iter.old_spte))
+			kvm_set_pfn_accessed(spte_to_pfn(iter.old_spte));
+
 		/*
 		 * Zappings SPTEs in invalid roots doesn't require a TLB flush,
 		 * see kvm_tdp_mmu_zap_invalidated_roots() for details.
-- 
2.44.0.291.gc1ea87d7ee-goog
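
For readers who want to see the intended policy in isolation: the net
effect of the two tdp_mmu.c hunks is that the Accessed state harvested
from a zapped leaf SPTE now reaches the backing folio only on the
leaf-zap path (the mmu_notifier-invalidation-driven path), not on root
teardown or other KVM-internal zaps.  Below is a rough, standalone C
model of that policy; it is not kernel code and not part of the patch to
be applied, and struct spte_model, zap_leafs_new(), and the other names
are invented purely for illustration.

/*
 * Standalone model (NOT kernel code) of the policy in this patch: the
 * Accessed bit harvested from a zapped leaf SPTE is propagated to the
 * backing page only on the leaf-zap (invalidation-driven) path, not when
 * tearing down an entire root.  All names are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct page_model { bool accessed; };          /* stand-in for a folio */
struct spte_model { bool present, leaf, accessed; struct page_model *pg; };

/* Old behavior: any leaf removal pushed the Accessed bit to the page. */
static void zap_spte_old(struct spte_model *s)
{
	if (s->present && s->leaf && s->accessed)
		s->pg->accessed = true;        /* kvm_set_pfn_accessed() */
	s->present = false;
}

/* New behavior: the generic removal path no longer touches the page... */
static void zap_spte_new(struct spte_model *s)
{
	s->present = false;
}

/* ...only the leaf-zap path (mmu_notifier-style) propagates Accessed. */
static void zap_leafs_new(struct spte_model *s)
{
	if (s->present && s->leaf && s->accessed)
		s->pg->accessed = true;
	zap_spte_new(s);
}

int main(void)
{
	struct page_model pg = { .accessed = false };
	struct spte_model spte = { true, true, true, &pg };

	/* New behavior, root teardown: aging info is intentionally dropped. */
	zap_spte_new(&spte);
	printf("root zap (new):  page accessed = %d\n", pg.accessed);  /* 0 */

	/* New behavior, invalidation-driven leaf zap: Accessed is preserved. */
	spte = (struct spte_model){ true, true, true, &pg };
	zap_leafs_new(&spte);
	printf("leaf zap (new):  page accessed = %d\n", pg.accessed);  /* 1 */

	/* Old behavior for contrast: every removal marked the page accessed. */
	pg.accessed = false;
	spte = (struct spte_model){ true, true, true, &pg };
	zap_spte_old(&spte);
	printf("any zap (old):   page accessed = %d\n", pg.accessed);  /* 1 */
	return 0;
}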