From nobody Sat Feb  7 17:09:32 2026
Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com
 [209.85.128.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1E44D8F5A
	for <linux-kernel@vger.kernel.org>; Wed, 20 Mar 2024 00:50:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.128.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1710895840; cv=none;
 b=q4WlZrZPF/wOXHiJo/2pp4HswJ5AVOWxNZ+9kNBK31tly5UaFNn3liJHQkGdsTKZdJDiROsr00E+ypnJpFhJbN1VJt/hn8lGCL9Un8ly5e1oSGuw5DIgf6SGoIdugAIdsHrseGxbHEOr+BwxBTk0RKksM38divji1gVoIIT5ePk=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1710895840; c=relaxed/simple;
	bh=W0mOjtlu5f54bLTmPCZiIAQr5uTb3v00ZxlbrR8mWrE=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=sJWoQ3yx7mMziDchdEW5sZ+/4Y5F+IzKzONJEm5Nb2QQ9H1R5aUzPBRCc+DDsV2SqZiADFVNBT8q5JETxPfrZmErUMyOMUM3sK+im8ohqhfWk46lejKEsNx7b2CoeX5Ho5HJdoxsaz01YqCbiSBaS8EKcERlCvg/JPvcOPkOm0M=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=wF2u6Bwc; arc=none smtp.client-ip=209.85.128.202
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="wF2u6Bwc"
Received: by mail-yw1-f202.google.com with SMTP id
 00721157ae682-610b83ff92bso62316807b3.1
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Mar 2024 17:50:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1710895838; x=1711500638;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:from:to:cc:subject:date:message-id:reply-to;
        bh=ZEyM3WpVtlcPduz6nOsAULxgTTwc2EHZ5Latop+Y9uo=;
        b=wF2u6BwcW3L7nfp7pGXhVxxnu7ehekCKu8wa9WTfDLS+Rdb1R1gkb9GEgxfqFe5Eqq
         1y0vEDWfEir529+lR/0Ai1sfrAcNvzP0ykano21Rq6AB0PZvM16EWu0rat4xxeTobPq8
         cYtplhEhvkueQu/qrPXEuntxfWSue4HZEF34GRbpNebkAwZbueOsI+8Dezbmlx2XPVek
         TqcwQkHZo4BSqXxYPhUNpBY/4UhQ739d9N1dJWFA4cTeFBk4VtfpbDMhkXlrQGaRg8io
         7m4I2VxVS1KiCPY2rYK4/cofh83ef3I9E5c5kGnFvr9JFE/AtNOlBfGEfOeTbz9dU4RU
         +LTA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710895838; x=1711500638;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=ZEyM3WpVtlcPduz6nOsAULxgTTwc2EHZ5Latop+Y9uo=;
        b=qHPGYs5N/3xeDdaIZcOjNb3g9PT/2w+4hQ6wTcN91F0sBvj47VevQ7TgLb1F81XM2e
         JPsdCQJu7binnl9HV2fW23iyNoeqdDKoohhZ7jU41M28T+t/GJEhGGp3hvMENzcH6AeS
         HLerkWS8FWUU/lDAt+e7acx9BRD9k1gMsM4uAwD4pbL9s55FVZ4QHdtKxXx6q0I31OHO
         3Da6WdABUrN1vaJRITxcguJ9C97GnaHrs7DQUoBAwB32WUE64IQOB4MFeb5rPRIyJBoe
         6Z0gOXxv0MFZVSsSPCF4WUe4a3Zrof5CviaI/MC6cbTy0VfT3ISyh7+sxvLEDz3vnSIv
         DgNg==
X-Forwarded-Encrypted: i=1;
 AJvYcCWLWis/pSY0mwIWPh2dJUj3K0H8jg0gMLAK7L5G0eCJNhOr5rxy28w0uXozBAgdYI0BRFBzovg4rIrK5bHPnIQkHyBW0PfgDu9GNzvY
X-Gm-Message-State: AOJu0Yw7i3Mz3cvIvGKR4Ed34vyNtcTTikJH9TBb5r1ytenwQaX1X7pI
	gKzAygdCGJPcr+5L9QvX84di4clLDpErkr/Q9kKCUtlvnliAaXTURv1CWlb9b84OJ+ao5fa2Hyj
	9Xw==
X-Google-Smtp-Source: 
 AGHT+IGvX6kMEJQVGFiJQPwTYc+9UFzG3TlVdYPiKT28fIxYPdRocSUe0xURLQF6VnlT10vdKJ1/SWO/KWA=
X-Received: from zagreus.c.googlers.com
 ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37])
 (user=seanjc job=sendgmr) by 2002:a05:6902:2203:b0:dbe:a0c2:df25 with SMTP id
 dm3-20020a056902220300b00dbea0c2df25mr888093ybb.8.1710895838254; Tue, 19 Mar
 2024 17:50:38 -0700 (PDT)
Reply-To: Sean Christopherson <seanjc@google.com>
Date: Tue, 19 Mar 2024 17:50:21 -0700
In-Reply-To: <20240320005024.3216282-1-seanjc@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20240320005024.3216282-1-seanjc@google.com>
X-Mailer: git-send-email 2.44.0.291.gc1ea87d7ee-goog
Message-ID: <20240320005024.3216282-2-seanjc@google.com>
Subject: [RFC PATCH 1/4] KVM: x86/mmu: Skip the "try unsync" path iff the old
 SPTE was a leaf SPTE
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
 Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>, David Matlack <dmatlack@google.com>,
	David Stevens <stevensd@chromium.org>, Matthew Wilcox <willy@infradead.org>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Apply make_spte()'s optimization to skip trying to unsync shadow pages if
and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs
are always writable, i.e. could trigger a false positive and incorrectly
lead to KVM creating a SPTE without write-protecting or marking shadow
pages unsync.

This bug only affects the TDP MMU, as the shadow MMU only overwrites a
shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be
unsync).  Specifically, mmu_set_spte() drops any non-leaf SPTEs *before*
calling make_spte(), whereas the TDP MMU can do a direct replacement of a
page table with the leaf SPTE.

Opportunistically update the comment to explain why skipping the unsync
stuff is safe, as opposed to simply saying "it's someone else's problem".

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..b4c1119cc48b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -206,12 +206,20 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_=
page *sp,
 		spte |=3D PT_WRITABLE_MASK | shadow_mmu_writable_mask;
=20
 		/*
-		 * Optimization: for pte sync, if spte was writable the hash
-		 * lookup is unnecessary (and expensive). Write protection
-		 * is responsibility of kvm_mmu_get_page / kvm_mmu_sync_roots.
-		 * Same reasoning can be applied to dirty page accounting.
+		 * When overwriting an existing leaf SPTE, and the old SPTE was
+		 * writable, skip trying to unsync shadow pages as any relevant
+		 * shadow pages must already be unsync, i.e. the hash lookup is
+		 * unnecessary (and expensive).
+		 *
+		 * The same reasoning applies to dirty page/folio accounting;
+		 * KVM will mark the folio dirty using the old SPTE, thus
+		 * there's no need to immediately mark the new SPTE as dirty.
+		 *
+		 * Note, both cases rely on KVM not changing PFNs without first
+		 * zapping the old SPTE, which is guaranteed by both the shadow
+		 * MMU and the TDP MMU.
 		 */
-		if (is_writable_pte(old_spte))
+		if (is_last_spte(old_spte, level) && is_writable_pte(old_spte))
 			goto out;
=20
 		/*
--=20
2.44.0.291.gc1ea87d7ee-goog
From nobody Sat Feb  7 17:09:32 2026
Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com
 [209.85.219.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 283F4BA42
	for <linux-kernel@vger.kernel.org>; Wed, 20 Mar 2024 00:50:41 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.219.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1710895843; cv=none;
 b=mhygpvLWwh3FKypKFsD6cHeJZ7NDLh+QkRinFTxYfhD5+6hbXqwkZveL/3iscp4VFHHDcgn63y48qqymTcO+V/o2WwF/OAWlyxUgXWb30VGgPnv44r4G6ZyWNFAyDH0YpzAkNDuONN1IM8d07YX4hxebjUzZJMtJ8gRj2boUdGU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1710895843; c=relaxed/simple;
	bh=6Q073mfOFIAzI7qEkWQQriqWiadvjFFWhpEnWTr6mGk=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=Q0omE7u3b8QJuH7iC3wPlT5EdBAlShBUZw5+27OodSobMIuVevmFJVaXExRyboIcw2TLQdZk0E2gqbKr0ETIBO2c6PwyX343c3vLCxCWdEIrh3caeWjSCCo+c1VETNK33wMEhursqBLd/jNCa/FUpXr58/VKxCxk2n7jQ+aAI50=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=WttWwMgx; arc=none smtp.client-ip=209.85.219.202
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="WttWwMgx"
Received: by mail-yb1-f202.google.com with SMTP id
 3f1490d57ef6-dbf618042daso9464100276.0
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Mar 2024 17:50:40 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1710895840; x=1711500640;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:from:to:cc:subject:date:message-id:reply-to;
        bh=tCQ96HrZSAyrK2o8U8Il1owTWIbaazqQ6Mo14S4OZD0=;
        b=WttWwMgxtGpVvtn2QKUbc7ntn5HXROz/8C4aE1U/UxEpopfsZFknIgPCcyod2MQG97
         k+nm18Nb6jbK6xMi3DeEYfNxM+mqCrYbxTt6GaeZbkFl08MiL72ZKy3ve4V5oZW2Jp2c
         Qba7SRICjI8NtfQUp5IOLM482eA65wyC1Cll9kyH8pQmC+XaiCUdxrtyi6983Qhf+ZZk
         YtQ58+S2pWl1+j2uHa87a31LJeGGQLZCLMABSnKfvcE9W4DWN5EcVqrTAK1fmud5TnNW
         03FGUDRa3u0dXtyTPReewmpDQl9XdbSiPDA5TwaRG7dcO4iembcbCp6DGilOkM2ceTGC
         u9aA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710895840; x=1711500640;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=tCQ96HrZSAyrK2o8U8Il1owTWIbaazqQ6Mo14S4OZD0=;
        b=mpu3AtOmkkgyJnAgjEfZn773MVJ52iHfV0gm5WsK/V9fhxiXTFUJo7fOE6wj2nbaQX
         C/ibtXCIG7WNIs8cWUMxmN5NIiVBi2R6RuTwY7o3N7L8tAYnv8y5VUBd9V679FhrHL50
         qxVs5laDtJnTsObe0lpddTVCp7RvSwitUSNeobeEo2qC9gxc5YNvrK5lTSKBiPVPVnlY
         b6GfDMTeSIB5yz5IIBiBXjEXHY33rqfXSp4ZIiIMf9Zg2Q01CYnjXIES7lCngb9UnxeP
         L+ce7e5/MjSGMdjDQRD1uI4l3SIcDyJFRtAET2hOqB4pPy2y/A8zYm4UgzVoHErOfm1V
         H8lQ==
X-Forwarded-Encrypted: i=1;
 AJvYcCWUc+ryI5jHOPUxfaUBhc7BIrgsXkXns1e889NOOnnWH10o71d0+xK8+JMEyM4i5P/evOVLjuQJYS626QNaI/36ZoCR3XADRybGsa2o
X-Gm-Message-State: AOJu0Ywtzy/eHP1e6Os+bWgG4BsuoFnfVHrjLTdMTrwMChjBwcJfjyqd
	KngISNbpDyq5EUYGwLrHbLhCUH0U4ALSrDr8XWtUNKflEOWEvIsUWLuiSe7DQkA0soyqZGyxtsS
	hyQ==
X-Google-Smtp-Source: 
 AGHT+IGdMqF4vD2v9QPypYeBdDU6pK+MEXTo/9biMAJ+aCjODjdQpdDK43t7qA4l+VUTFWj/TlgTKU/7WCE=
X-Received: from zagreus.c.googlers.com
 ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37])
 (user=seanjc job=sendgmr) by 2002:a05:6902:e0d:b0:dcd:b593:6503 with SMTP id
 df13-20020a0569020e0d00b00dcdb5936503mr974695ybb.2.1710895840275; Tue, 19 Mar
 2024 17:50:40 -0700 (PDT)
Reply-To: Sean Christopherson <seanjc@google.com>
Date: Tue, 19 Mar 2024 17:50:22 -0700
In-Reply-To: <20240320005024.3216282-1-seanjc@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20240320005024.3216282-1-seanjc@google.com>
X-Mailer: git-send-email 2.44.0.291.gc1ea87d7ee-goog
Message-ID: <20240320005024.3216282-3-seanjc@google.com>
Subject: [RFC PATCH 2/4] KVM: x86/mmu: Mark folio dirty when creating SPTE,
 not when zapping/modifying
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
 Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>, David Matlack <dmatlack@google.com>,
	David Stevens <stevensd@chromium.org>, Matthew Wilcox <willy@infradead.org>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Mark pages/folios dirty when creating SPTEs to map PFNs into the guest,
not when zapping or modifying SPTEs, as marking folios dirty when zapping
or modifying SPTEs can be extremely inefficient.  E.g. when KVM is zapping
collapsible SPTEs to reconstitute a hugepage after disbling dirty logging,
KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are
guaranteed to be in a single folio (the SPTE couldn't potentially be huge
if that weren't the case).  The problem only becomes worse for 1GiB
HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.

Marking a folio dirty when mapping is functionally safe as KVM drops all
relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures
that the guest can't dirty a folio after access has been removed.

And because KVM already marks folios dirty when zapping/modifying SPTEs
for KVM reasons, i.e. not in response to an mmu_notifier invalidation,
there is no danger of "prematurely" marking a folio dirty.  E.g. if a
filesystems cleans a folio without first removing write access, then there
already exists races where KVM could mark a folio dirty before remote TLBs
are flushed, i.e. before guest writes are guaranteed to stop.  Furthermore,
x86 is literally the only architecture that marks folios dirty on the
backend; every other KVM architecture marks folios dirty at map time.

x86's unique behavior likely stems from the fact that x86's MMU predates
mmu_notifiers.  Long, long ago, before mmu_notifiers were added, marking
pages dirty when zapping SPTEs was logical, and perhaps even necessary, as
KVM held references to pages, i.e. kept a page's refcount elevated while
the page was mapped into the guest.  At the time, KVM's rmap_remove()
simply did:

        if (is_writeble_pte(*spte))
                kvm_release_pfn_dirty(pfn);
        else
                kvm_release_pfn_clean(pfn);

i.e. dropped the refcount and marked the page dirty at the same time.
After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU:
dont hold pagecount reference for mapped sptes pages") removed the
refcount logic, but kept the dirty logic, i.e. converted the above to:

	if (is_writeble_pte(*spte))
		kvm_release_pfn_dirty(pfn);

And for KVM x86, that's essentially how things have stayed over the last
~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at
zap/modification time, e.g. the behavior was blindly carried forward to
the TDP MMU.

Practically speaking, the only downside to marking a folio dirty during
mapping is that KVM could trigger writeback of memory that was never
actually written.  Except that can't actually happen if KVM marks folios
during if and only if a writable SPTE is created (as done here), because
KVM always marks writable SPTEs as dirty during make_spte().  See commit
9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa
2015.

Note, KVM's access tracking logic for prefetched SPTEs is a bit odd.  If a
guest PTE is dirty and writable, KVM will create a writable SPTE, but then
mark the SPTE for access tracking.  Which isn't wrong, just a bit odd, as
it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.

To keep things simple, mark the folio dirty before access tracking comes
into play, as an access-tracked SPTE can be restored in the fast page
fault path, i.e. without holding mmu_lock.  While writing SPTEs and
accessing memslots outside of mmu_lock is safe, marking a folio dirty is
not.  E.g. if the fast path gets interrupted _just_ after setting a SPTE,
the primary MMU could theoretically invalidate and free a folio before KVM
marks it dirty.  Unlike the shadow MMU, which waits for CPUs to respond to
an IPI, the TDP MMU only guarantees the page tables themselves won't be
freed (via RCU).

Opportunistically update a few stale comments.

Cc: David Hildenbrand <david@redhat.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Stevens <stevensd@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c         | 29 ++++-------------------------
 arch/x86/kvm/mmu/paging_tmpl.h |  7 +++----
 arch/x86/kvm/mmu/spte.c        | 13 ++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.c     | 12 ------------
 4 files changed, 17 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..bd2240b94ff6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -544,10 +544,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 	}
=20
-	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
+	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
 		flush =3D true;
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-	}
=20
 	return flush;
 }
@@ -590,9 +588,6 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u=
64 *sptep)
 	if (is_accessed_spte(old_spte))
 		kvm_set_pfn_accessed(pfn);
=20
-	if (is_dirty_spte(old_spte))
-		kvm_set_pfn_dirty(pfn);
-
 	return old_spte;
 }
=20
@@ -623,13 +618,6 @@ static bool mmu_spte_age(u64 *sptep)
 		clear_bit((ffs(shadow_accessed_mask) - 1),
 			  (unsigned long *)sptep);
 	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
-
 		spte =3D mark_spte_for_access_track(spte);
 		mmu_spte_update_no_track(sptep, spte);
 	}
@@ -1263,16 +1251,6 @@ static bool spte_clear_dirty(u64 *sptep)
 	return mmu_spte_update(sptep, spte);
 }
=20
-static bool spte_wrprot_for_clear_dirty(u64 *sptep)
-{
-	bool was_writable =3D test_and_clear_bit(PT_WRITABLE_SHIFT,
-					       (unsigned long *)sptep);
-	if (was_writable && !spte_ad_enabled(*sptep))
-		kvm_set_pfn_dirty(spte_to_pfn(*sptep));
-
-	return was_writable;
-}
-
 /*
  * Gets the GFN ready for another round of dirty logging by clearing the
  *	- D bit on ad-enabled SPTEs, and
@@ -1288,7 +1266,8 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struc=
t kvm_rmap_head *rmap_head,
=20
 	for_each_rmap_spte(rmap_head, &iter, sptep)
 		if (spte_ad_need_write_protect(*sptep))
-			flush |=3D spte_wrprot_for_clear_dirty(sptep);
+			flush |=3D test_and_clear_bit(PT_WRITABLE_SHIFT,
+						    (unsigned long *)sptep);
 		else
 			flush |=3D spte_clear_dirty(sptep);
=20
@@ -3392,7 +3371,7 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *=
vcpu,
 	 * harm. This also avoids the TLB flush needed after setting dirty bit
 	 * so non-PML cases won't be impacted.
 	 *
-	 * Compare with set_spte where instead shadow_dirty_mask is set.
+	 * Compare with make_spte() where instead shadow_dirty_mask is set.
 	 */
 	if (!try_cmpxchg64(sptep, &old_spte, new_spte))
 		return false;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4d4e98fe4f35..ec24a6679153 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -554,7 +554,6 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_=
mmu_page *sp,
 		return false;
=20
 	mmu_set_spte(vcpu, slot, spte, pte_access, gfn, pfn, NULL);
-	kvm_release_pfn_clean(pfn);
 	return true;
 }
=20
@@ -891,9 +890,9 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, s=
truct kvm_mmu *mmu,
=20
 /*
  * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn=
()) is
- * safe because:
- * - The spte has a reference to the struct page, so the pfn for a given g=
fn
- *   can't change unless all sptes pointing to it are nuked first.
+ * safe because SPTEs are protected by mmu_notifiers and memslot generatio=
ns, so
+ * the pfn for a given gfn can't change unless all SPTEs pointing to the g=
fn are
+ * nuked first.
  *
  * Returns
  * < 0: failed to sync spte
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b4c1119cc48b..490966bc893c 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -212,8 +212,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_pa=
ge *sp,
 		 * unnecessary (and expensive).
 		 *
 		 * The same reasoning applies to dirty page/folio accounting;
-		 * KVM will mark the folio dirty using the old SPTE, thus
-		 * there's no need to immediately mark the new SPTE as dirty.
+		 * KVM marked the folio dirty when the old SPTE was created,
+		 * thus there's no need to mark the folio dirty again.
 		 *
 		 * Note, both cases rely on KVM not changing PFNs without first
 		 * zapping the old SPTE, which is guaranteed by both the shadow
@@ -235,8 +235,10 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_p=
age *sp,
 		}
 	}
=20
-	if (pte_access & ACC_WRITE_MASK)
+	if (pte_access & ACC_WRITE_MASK) {
 		spte |=3D spte_shadow_dirty_mask(spte);
+		kvm_set_pfn_dirty(pfn);
+	}
=20
 out:
 	if (prefetch)
@@ -246,6 +248,11 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_p=
age *sp,
 		  "spte =3D 0x%llx, level =3D %d, rsvd bits =3D 0x%llx", spte, level,
 		  get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
=20
+	/*
+	 * Mark the memslot dirty *after* modifying it for access tracking.
+	 * Unlike folios, memslots can be safely marked dirty out of mmu_lock,
+	 * i.e. in the fast page fault handler.
+	 */
 	if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
 		/* Enforced by kvm_mmu_hugepage_adjust. */
 		WARN_ON_ONCE(level > PG_LEVEL_4K);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..5866a664f46e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -511,10 +511,6 @@ static void handle_changed_spte(struct kvm *kvm, int a=
s_id, gfn_t gfn,
 	if (is_leaf !=3D was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
=20
-	if (was_leaf && is_dirty_spte(old_spte) &&
-	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
-
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.  Note the WARN on the PFN changing without the
@@ -1224,13 +1220,6 @@ static bool age_gfn_range(struct kvm *kvm, struct td=
p_iter *iter,
 							 iter->level);
 		new_spte =3D iter->old_spte & ~shadow_accessed_mask;
 	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
-		if (is_writable_pte(iter->old_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
-
 		new_spte =3D mark_spte_for_access_track(iter->old_spte);
 		iter->old_spte =3D kvm_tdp_mmu_write_spte(iter->sptep,
 							iter->old_spte, new_spte,
@@ -1652,7 +1641,6 @@ static void clear_dirty_pt_masked(struct kvm *kvm, st=
ruct kvm_mmu_page *root,
 		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
 					       iter.old_spte,
 					       iter.old_spte & ~dbit);
-		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
 	}
=20
 	rcu_read_unlock();
--=20
2.44.0.291.gc1ea87d7ee-goog
From nobody Sat Feb  7 17:09:32 2026
Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com
 [209.85.215.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B1E631388
	for <linux-kernel@vger.kernel.org>; Wed, 20 Mar 2024 00:50:42 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.215.201
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1710895844; cv=none;
 b=YK9bd6sM/wD7vU17sflaLfKkV9PlmJIUfxKt//tlQxP4BPloJmw/qvBkvFFBoj28zdqUPyWUAL/BKlSKH+j9D9IBhM2Q/r66WiPks5mk8JVWs8YJIXEWTFcEsjVG3A2waA3dw9h+6/lxaCKHODLhMTwFSxCxHJFmEz8duZwgZFk=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1710895844; c=relaxed/simple;
	bh=e9vCkrtfJaABLYAyKrQ0HWBYDJdjKHVlxWPsqtjWiU8=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=mdW66Mr7X/R65I1mE11JMAJTX00+FIdC/EIVlL1PoQ4ni+FdI3El86CPg4r5EHXP+BaKDJlyoXYEQliQwR1dk7InUfYHegIxgc3von6ptDLRrp0doQiY08Vzk7js/sj1ATdHKq42zo8HRDMYoYU8qagGZdE70RL4UCFPy3CS2wc=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=mh901u8t; arc=none smtp.client-ip=209.85.215.201
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="mh901u8t"
Received: by mail-pg1-f201.google.com with SMTP id
 41be03b00d2f7-5cfd6ba1c11so4430678a12.0
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Mar 2024 17:50:42 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1710895842; x=1711500642;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:from:to:cc:subject:date:message-id:reply-to;
        bh=73z0LxtdomfWG0GJE09D1ttIG4ILmbtU6RU20yVkqf4=;
        b=mh901u8t7A54U2oZkNub4mTJqcIJ7r6ar6ZLym0UGnJFrCgbUOfjM0E6Gq15UQFSQy
         RSK1Rnk9FPCboZydal4qJhq3WZGrOsoVEKozytT+WM/BIUY/P+esReMCKKtB5T/hWxeY
         24KJT1c3Q0KescU5r1QTa0ef21/Jo7AVT5Z+ID1lGYTORA6LKTDoh/X0X9O9j6XL5Rco
         gCkmxglKkv11TCT86KQWTOI5YKyWPRs3WzkvaV94tnDMl958WaAH6aoP2yDoSSdlqenE
         PaDgIHnxkJDqFF1UF8nKNn3DuYRvp5NzLJKWucICAHy5DLqeviJXvKsIZcmXVMSEqeZ6
         2j9g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710895842; x=1711500642;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=73z0LxtdomfWG0GJE09D1ttIG4ILmbtU6RU20yVkqf4=;
        b=bs10lZ0wZeP77hPH+x+SmWorN5FreVhVRNevWi3IPUKD1Fq71rdLUA+otcqHoXTwxe
         sbjFNRfnd3Jwud/2uSqNtVySj5CLvX4TrTrsxwnfqsG006OyOlykj+bgEpVgaKwu1kR+
         UF216GzEvIPx+LDVxQlp1OzzOvFTw8VrX1MdXt1FTwbnRX00bzUeUDcnkpJw0YgCMFxM
         YkOvcTtFJClQZaN5hTsqblTUWIUFcmN/vUHS+M9mRCr1gVCXl4SKGcvqnPm//zTNKI8v
         A/OPIgzmPIOGuW67roux+Huo/BK/6BZtevrqah0cnossfbG3KfRunGLUxyO6pzPmmGA3
         +xKQ==
X-Forwarded-Encrypted: i=1;
 AJvYcCUbJpR4hV0vU0A5YNJa1VIV2EZjJem8jWsRTyyb0sArmeSoqoP79psK03MOg1vz0H3j/4scE4mP/fqKKB6p3YiMxQ8K6iYH5feMzeAt
X-Gm-Message-State: AOJu0YzDdPbuvJKOBJQ6e60JUoU89upEhr07RxxZvNfa4R0J/mF/irPG
	Znz3ogwUH/o3gCppv9PBy1dq1wt5TicQJb+QyQM4c14aPNwR/kLADv8Z4alC/JmlWppK70arzcQ
	/7g==
X-Google-Smtp-Source: 
 AGHT+IF1elppJ1tODfZ0EG8u4kLfEw8DDLQ91IUuYrs1E3zIe4DEs7GG6DnoA43xvrWkoC/VwXNmeT6UbY4=
X-Received: from zagreus.c.googlers.com
 ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37])
 (user=seanjc job=sendgmr) by 2002:a05:6a02:488:b0:5e8:57aa:3609 with SMTP id
 bw8-20020a056a02048800b005e857aa3609mr6141pgb.9.1710895841989; Tue, 19 Mar
 2024 17:50:41 -0700 (PDT)
Reply-To: Sean Christopherson <seanjc@google.com>
Date: Tue, 19 Mar 2024 17:50:23 -0700
In-Reply-To: <20240320005024.3216282-1-seanjc@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20240320005024.3216282-1-seanjc@google.com>
X-Mailer: git-send-email 2.44.0.291.gc1ea87d7ee-goog
Message-ID: <20240320005024.3216282-4-seanjc@google.com>
Subject: [RFC PATCH 3/4] KVM: x86/mmu: Mark page/folio accessed only when
 zapping leaf SPTEs
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
 Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>, David Matlack <dmatlack@google.com>,
	David Stevens <stevensd@chromium.org>, Matthew Wilcox <willy@infradead.org>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Mark folios as accessed only when zapping leaf SPTEs, which is a rough
heuristic for "only in response to an mmu_notifier invalidation".  Page
aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to
be precise for correctness, and re-marking folios as accessed when zapping
entire roots or when zapping collapsible SPTEs is expensive and adds very
little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
makes folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked access when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.  Zapping collapsible SPTEs is a similar story; marking folios
accessed just because userspace disable dirty logging is a side effect of
KVM behavior, not a deliberate goal.

Mark folios accessed when the primary MMU might be invalidating mappings,
e.g. instead of completely dropping calls to kvm_set_pfn_accessed(), as
such zappings are not KVM initiated, i.e. might actually be related to
page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/locking.rst | 76 +++++++++++++++---------------
 arch/x86/kvm/mmu/mmu.c             |  4 +-
 arch/x86/kvm/mmu/tdp_mmu.c         |  7 ++-
 3 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/lo=
cking.rst
index 02880d5552d5..8b3bb9fe60bf 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -138,49 +138,51 @@ Then, we can ensure the dirty bitmaps is correctly se=
t for a gfn.
=20
 2) Dirty bit tracking
=20
-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
 spte is read-only and the Accessed bit has already been set since the
 Accessed bit and Dirty bit can not be lost.
=20
 But it is not true after fast page fault since the spte can be marked
 writable between reading spte and updating spte. Like below case:
=20
-+------------------------------------------------------------------------+
-| At the beginning::                                                     |
-|                                                                        |
-|	spte.W =3D 0                                                       |
-|	spte.Accessed =3D 1                                                |
-+------------------------------------+-----------------------------------+
-| CPU 0:                             | CPU 1:                            |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits()::   |                                   |
-|                                    |                                   |
-|  old_spte =3D *spte;                 |                                  =
 |
-|                                    |                                   |
-|                                    |                                   |
-|  /* 'if' condition is satisfied. */|                                   |
-|  if (old_spte.Accessed =3D=3D 1 &&     |                                =
   |
-|       old_spte.W =3D=3D 0)             |                                =
   |
-|     spte =3D 0ull;                   |                                  =
 |
-+------------------------------------+-----------------------------------+
-|                                    | on fast page fault path::         |
-|                                    |                                   |
-|                                    |    spte.W =3D 1                    =
 |
-|                                    |                                   |
-|                                    | memory write on the spte::        |
-|                                    |                                   |
-|                                    |    spte.Dirty =3D 1                =
 |
-+------------------------------------+-----------------------------------+
-|  ::                                |                                   |
-|                                    |                                   |
-|   else                             |                                   |
-|     old_spte =3D xchg(spte, 0ull)    |                                  =
 |
-|   if (old_spte.Accessed =3D=3D 1)      |                                =
   |
-|     kvm_set_pfn_accessed(spte.pfn);|                                   |
-|   if (old_spte.Dirty =3D=3D 1)         |                                =
   |
-|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
-|     OOPS!!!                        |                                   |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning::                                                      |
+|                                                                         |
+|	spte.W =3D 0                                                            =
  |
+|	spte.Accessed =3D 1                                                     =
  |
++-------------------------------------+-----------------------------------+
+| CPU 0:                              | CPU 1:                            |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update()::              |                                   |
+|                                     |                                   |
+|  old_spte =3D *spte;                  |                                 =
  |
+|                                     |                                   |
+|                                     |                                   |
+|  /* 'if' condition is satisfied. */ |                                   |
+|  if (old_spte.Accessed =3D=3D 1 &&      |                               =
    |
+|       old_spte.W =3D=3D 0)              |                               =
    |
+|     spte =3D new_spte;                |                                 =
  |
++-------------------------------------+-----------------------------------+
+|                                     | on fast page fault path::         |
+|                                     |                                   |
+|                                     |    spte.W =3D 1                   =
  |
+|                                     |                                   |
+|                                     | memory write on the spte::        |
+|                                     |                                   |
+|                                     |    spte.Dirty =3D 1               =
  |
++-------------------------------------+-----------------------------------+
+|  ::                                 |                                   |
+|                                     |                                   |
+|   else                              |                                   |
+|     old_spte =3D xchg(spte, new_spte);|                                 =
  |
+|   if (old_spte.Accessed &&          |                                   |
+|       !new_spte.Accessed)           |                                   |
+|     flush =3D true;                   |                                 =
  |
+|   if (old_spte.Dirty &&             |                                   |
+|       !new_spte.Dirty)              |                                   |
+|     flush =3D true;                   |                                 =
  |
+|     OOPS!!!                         |                                   |
++-------------------------------------+-----------------------------------+
=20
 The Dirty bit is lost in this case.
=20
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd2240b94ff6..0a6c6619d213 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -539,10 +539,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * to guarantee consistency between TLB and page tables.
 	 */
=20
-	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
 		flush =3D true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	}
=20
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
 		flush =3D true;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5866a664f46e..340d5af454c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -520,10 +520,6 @@ static void handle_changed_spte(struct kvm *kvm, int a=
s_id, gfn_t gfn,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-
-	if (was_leaf && is_accessed_spte(old_spte) &&
-	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
=20
 /*
@@ -841,6 +837,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct k=
vm_mmu_page *root,
=20
 		tdp_mmu_iter_set_spte(kvm, &iter, 0);
=20
+		if (is_accessed_spte(iter.old_spte))
+			kvm_set_pfn_accessed(spte_to_pfn(iter.old_spte));
+
 		/*
 		 * Zappings SPTEs in invalid roots doesn't require a TLB flush,
 		 * see kvm_tdp_mmu_zap_invalidated_roots() for details.
--=20
2.44.0.291.gc1ea87d7ee-goog
From nobody Sat Feb  7 17:09:32 2026
Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com
 [209.85.210.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id AE4B57464
	for <linux-kernel@vger.kernel.org>; Wed, 20 Mar 2024 00:50:44 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.210.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1710895846; cv=none;
 b=YKbohzvjQgPsrVmX9P1Y9MbPz5usQM9rHXZIJXqf9+rL/gNpJUxll3wRi4BQ/R2SS/nLAmmBCJcWNKuE76+bZK3CFletaXyn0kwrvQCKk60KT6pg6z1MwZB3UZ+BPx2KkcFb8bmKcoq3lxyVMrlIeh9owqJ+UPSiBPtbueA4USk=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1710895846; c=relaxed/simple;
	bh=LfHjr0gh4Ja0nQSAEja4M7wubj4py+zeRyGFa6gzCSY=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=VXnZK+rzQVFT7pFl3CytqHXLNDsEf5roAfJ7qwVbNjYlyvFayXNHjVbAYtbXiK/8xolFZr88SiczCwmaoK9NMfiDfYHpnA+p060wZsCkTbCoZ5zrgYlby6hmmn9zLnYjkrOVlIJBRmpZ5SpKhaCOKbXqZgbMQYfRnOhNI4ShnuA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=rtkMtioc; arc=none smtp.client-ip=209.85.210.202
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="rtkMtioc"
Received: by mail-pf1-f202.google.com with SMTP id
 d2e1a72fcca58-6e6fd225b40so3892639b3a.1
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Mar 2024 17:50:44 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1710895844; x=1711500644;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:from:to:cc:subject:date:message-id:reply-to;
        bh=5a2bqtKTV93JecoVjcrHiTik+s3Ay6FzfBzzZuxLT4w=;
        b=rtkMtiocLZwFxqLmZSpy2mFrf9rX0C05T4oKiYa3+sGRAbOVnVavD/0EZT4OH0S7Cm
         FtaqW0M7746+XFOhgxpuUOmsV7y4bsKeGmWeJyd6ayKGHEwVN8etALq0ST3+w/Y/7ejO
         wZnYr5QLYiyU/CsL1pYTbeIb8azH3cxE2k5n1cRwpnzYP/mtvPUZV6pzZAad0Kk6nAnz
         P348D6ET4aM2IXX20U5ikbjUiHkSgsracB22KI2hZtPF51WudjJiTDkIcDBQAaWCqTD3
         IhzMqZ8YWidjBrF8+WZEkdrMq034XFK0GmXMg+OQ0uhG0L0EbvgsJcCsos89Ef0BYlE2
         tAKg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710895844; x=1711500644;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=5a2bqtKTV93JecoVjcrHiTik+s3Ay6FzfBzzZuxLT4w=;
        b=TPpsfKAGzhEwnsEqoOsj5MjmvsvZs10SH30xT2yiUIzqeODbvyC3oaKPdR7caN+VdN
         CKvZc8/3teUYHTDG72joOZoHtARHWFfMEsPvhNUouy247yxb2bKnQtTtrwOpuA3C97Vx
         aGsC6R/n9wmmJdYfK5iOT7ZDVkP0+BOtQaVTjWnxhz80OhLR1e10d0SnR8mMQzCDHO+V
         cOsNlxWRyKIoBE/O2QNOs965PHAvu5T6rCYZg6BQ7FS+feqlKAKLZUmgwb5nZ7q7aZFu
         CeApdPoy0LkcVBsbsgev+M/I9urLdSXK0g65vA3LytpMPuN02e6+Rlv/lbeCCZF/I5yq
         ipRA==
X-Forwarded-Encrypted: i=1;
 AJvYcCWakiFl/jqNNLpVUOj5ETJCRaI8rn12JsJB1uHie6/vtox5+ezC3EYpxD70W8OzMaOy/S1zEZPV+aER1nliMsC6FzArpPkLaNoHG8kw
X-Gm-Message-State: AOJu0YymY9D0mq+KIj7aWAT0G/z6OtpEAKA3Fvsip43nGs8GhB7LgQU1
	oOThRff2qUivoWM5NJuijugqR633Aqrycml7RbDYJxMSO0p4Le9MTGC31lQIUeiSCARDSi/YlD9
	FPQ==
X-Google-Smtp-Source: 
 AGHT+IEaIA2aymRjDeAbUoUA5PBT+xAo8Cuxn++SVSqL/D4x7uk8+Uter3hx2zwaGXU73twu2UQ+PhDpSrE=
X-Received: from zagreus.c.googlers.com
 ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37])
 (user=seanjc job=sendgmr) by 2002:a05:6a00:a0d:b0:6e7:956e:4388 with SMTP id
 p13-20020a056a000a0d00b006e7956e4388mr3227pfh.0.1710895843865; Tue, 19 Mar
 2024 17:50:43 -0700 (PDT)
Reply-To: Sean Christopherson <seanjc@google.com>
Date: Tue, 19 Mar 2024 17:50:24 -0700
In-Reply-To: <20240320005024.3216282-1-seanjc@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20240320005024.3216282-1-seanjc@google.com>
X-Mailer: git-send-email 2.44.0.291.gc1ea87d7ee-goog
Message-ID: <20240320005024.3216282-5-seanjc@google.com>
Subject: [RFC PATCH 4/4] KVM: x86/mmu: Don't force flush if SPTE update clears
 Accessed bit
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
 Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>, David Matlack <dmatlack@google.com>,
	David Stevens <stevensd@chromium.org>, Matthew Wilcox <willy@infradead.org>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Don't force a TLB if mmu_spte_update() clears Accessed bit, as access
tracking tolerates false negatives, as evidenced by the mmu_notifier hooks
that explicit test and age SPTEs without doing a TLB flush.

In practice, this is very nearly a nop.  spte_write_protect() and
spte_clear_dirty() never clear the Accessed bit.  make_spte() always
sets the Accessed bit for !prefetch scenarios.  FNAME(sync_spte) only sets
SPTE if the protection bits are changing, i.e. if a flush will be needed
regardless of the Accessed bits.  And FNAME(pte_prefetch) sets SPTE if and
only if the old SPTE is !PRESENT.

That leaves kvm_arch_async_page_ready() as the one path that will generate
a !ACCESSED SPTE *and* overwrite a PRESENT SPTE.  And that's very arguably
a bug, as clobbering a valid SPTE in that case is nonsensical.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0a6c6619d213..77d1072b130d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -515,37 +515,24 @@ static u64 mmu_spte_update_no_track(u64 *sptep, u64 n=
ew_spte)
  * TLBs must be flushed. Otherwise rmap_write_protect will find a read-only
  * spte, even though the writable spte might be cached on a CPU's TLB.
  *
+ * Remote TLBs also need to be flushed if the Dirty bit is cleared, as fal=
se
+ * negatives are not acceptable, e.g. if KVM is using D-bit based PML on V=
MX.
+ *
+ * Don't flush if the Accessed bit is cleared, as access tracking tolerates
+ * false negatives, and the one path that does care about TLB flushes,
+ * kvm_mmu_notifier_clear_flush_young(), uses mmu_spte_update_no_track().
+ *
  * Returns true if the TLB needs to be flushed
  */
 static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 {
-	bool flush =3D false;
 	u64 old_spte =3D mmu_spte_update_no_track(sptep, new_spte);
=20
 	if (!is_shadow_present_pte(old_spte))
 		return false;
=20
-	/*
-	 * For the spte updated out of mmu-lock is safe, since
-	 * we always atomically update it, see the comments in
-	 * spte_has_volatile_bits().
-	 */
-	if (is_mmu_writable_spte(old_spte) &&
-	      !is_writable_pte(new_spte))
-		flush =3D true;
-
-	/*
-	 * Flush TLB when accessed/dirty states are changed in the page tables,
-	 * to guarantee consistency between TLB and page tables.
-	 */
-
-	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
-		flush =3D true;
-
-	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
-		flush =3D true;
-
-	return flush;
+	return (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
+	       (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte));
 }
=20
 /*
--=20
2.44.0.291.gc1ea87d7ee-goog