Reply-To: Sean Christopherson
Date: Fri, 26 Jul 2024 16:51:17 -0700
In-Reply-To: <20240726235234.228822-1-seanjc@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Mime-Version: 1.0
References: <20240726235234.228822-1-seanjc@google.com>
X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog
Message-ID: <20240726235234.228822-9-seanjc@google.com>
Subject: [PATCH v12 08/84] KVM: x86/mmu: Mark page/folio accessed only when
 zapping leaf SPTEs
From: Sean Christopherson
To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Tianrui Zhao, Bibo Mao,
 Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt,
 Albert Ou, Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
 Sean Christopherson
Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, loongarch@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
 linux-kernel@vger.kernel.org, David Matlack, David Stevens
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Mark folios as accessed only when zapping leaf SPTEs, which is a rough
heuristic for "only in response to an mmu_notifier invalidation".  Page
aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to
be precise for correctness, and re-marking folios as accessed when zapping
entire roots or when zapping collapsible SPTEs is expensive and adds very
little value.

E.g. when a VM is dying, all of its memory is being freed; marking folios
accessed at that time provides no known value.  Similarly, because KVM
marks folios as accessed when creating SPTEs, marking all folios as
accessed when userspace happens to delete a memslot doesn't add value.
The folio was marked accessed when the old SPTE was created, and will be
marked accessed yet again if a vCPU accesses the pfn again after reloading
a new root.

Zapping collapsible SPTEs is a similar story; marking folios accessed just
because userspace disables dirty logging is a side effect of KVM behavior,
not a deliberate goal.

As an intermediate step, a.k.a. bisection point, towards *never* marking
folios accessed when dropping SPTEs, mark folios accessed when the primary
MMU might be invalidating mappings, as such zappings are not KVM initiated,
i.e. might actually be related to page aging and LRU activity.

Note, x86 is the only KVM architecture that "double dips"; every other
arch marks pfns as accessed only when mapping into the guest, not when
mapping into the guest _and_ when removing from the guest.

Signed-off-by: Sean Christopherson
---
 Documentation/virt/kvm/locking.rst | 76 +++++++++++++++---------------
 arch/x86/kvm/mmu/mmu.c             |  4 +-
 arch/x86/kvm/mmu/tdp_mmu.c         |  7 ++-
 3 files changed, 43 insertions(+), 44 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 02880d5552d5..8b3bb9fe60bf 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -138,49 +138,51 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn.
 
 2) Dirty bit tracking
 
-In the origin code, the spte can be fast updated (non-atomically) if the
+In the original code, the spte can be fast updated (non-atomically) if the
 spte is read-only and the Accessed bit has already been set since the
 Accessed bit and Dirty bit can not be lost.
 
 But it is not true after fast page fault since the spte can be marked
 writable between reading spte and updating spte. Like below case:
 
-+------------------------------------------------------------------------+
-| At the beginning::                                                     |
-|                                                                        |
-|       spte.W = 0                                                       |
-|       spte.Accessed = 1                                                |
-+------------------------------------+-----------------------------------+
-| CPU 0:                             | CPU 1:                            |
-+------------------------------------+-----------------------------------+
-| In mmu_spte_clear_track_bits()::   |                                   |
-|                                    |                                   |
-|  old_spte = *spte;                 |                                   |
-|                                    |                                   |
-|                                    |                                   |
-|  /* 'if' condition is satisfied. */|                                   |
-|  if (old_spte.Accessed == 1 &&     |                                   |
-|       old_spte.W == 0)             |                                   |
-|     spte = 0ull;                   |                                   |
-+------------------------------------+-----------------------------------+
-|                                    | on fast page fault path::         |
-|                                    |                                   |
-|                                    |    spte.W = 1                     |
-|                                    |                                   |
-|                                    | memory write on the spte::        |
-|                                    |                                   |
-|                                    |    spte.Dirty = 1                 |
-+------------------------------------+-----------------------------------+
-|  ::                                |                                   |
-|                                    |                                   |
-|   else                             |                                   |
-|     old_spte = xchg(spte, 0ull)    |                                   |
-|   if (old_spte.Accessed == 1)      |                                   |
-|     kvm_set_pfn_accessed(spte.pfn);|                                   |
-|   if (old_spte.Dirty == 1)         |                                   |
-|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
-|     OOPS!!!                        |                                   |
-+------------------------------------+-----------------------------------+
++-------------------------------------------------------------------------+
+| At the beginning::                                                      |
+|                                                                         |
+|       spte.W = 0                                                        |
+|       spte.Accessed = 1                                                 |
++-------------------------------------+-----------------------------------+
+| CPU 0:                              | CPU 1:                            |
++-------------------------------------+-----------------------------------+
+| In mmu_spte_update()::              |                                   |
+|                                     |                                   |
+|  old_spte = *spte;                  |                                   |
+|                                     |                                   |
+|                                     |                                   |
+|  /* 'if' condition is satisfied. */ |                                   |
+|  if (old_spte.Accessed == 1 &&      |                                   |
+|       old_spte.W == 0)              |                                   |
+|     spte = new_spte;                |                                   |
++-------------------------------------+-----------------------------------+
+|                                     | on fast page fault path::         |
+|                                     |                                   |
+|                                     |    spte.W = 1                     |
+|                                     |                                   |
+|                                     | memory write on the spte::        |
+|                                     |                                   |
+|                                     |    spte.Dirty = 1                 |
++-------------------------------------+-----------------------------------+
+|  ::                                 |                                   |
+|                                     |                                   |
+|   else                              |                                   |
+|     old_spte = xchg(spte, new_spte);|                                   |
+|   if (old_spte.Accessed &&          |                                   |
+|       !new_spte.Accessed)           |                                   |
+|     flush = true;                   |                                   |
+|   if (old_spte.Dirty &&             |                                   |
+|       !new_spte.Dirty)              |                                   |
+|     flush = true;                   |                                   |
+|     OOPS!!!                         |                                   |
++-------------------------------------+-----------------------------------+
 
 The Dirty bit is lost in this case.
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2e6daa6d1cc0..58b70328b20c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -542,10 +542,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 	 * to guarantee consistency between TLB and page tables.
 	 */
 
-	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
+	if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte))
 		flush = true;
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
-	}
 
 	if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))
 		flush = true;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7ac43d1ce918..d1de5f28c445 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -520,10 +520,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
-
-	if (was_leaf && is_accessed_spte(old_spte) &&
-	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
 static inline int __must_check __tdp_mmu_set_spte_atomic(struct tdp_iter *iter,
@@ -865,6 +861,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
 
+		if (is_accessed_spte(iter.old_spte))
+			kvm_set_pfn_accessed(spte_to_pfn(iter.old_spte));
+
 		/*
 		 * Zappings SPTEs in invalid roots doesn't require a TLB flush,
 		 * see kvm_tdp_mmu_zap_invalidated_roots() for details.
-- 
2.46.0.rc1.232.g9752f9e123-goog
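
The lost Dirty bit described in the locking.rst table touched by this patch
can be replayed deterministically in userspace. What follows is a minimal
C11 sketch and not KVM code: the bit positions, the helper names
(cpu1_fast_fault_and_write, update_nonatomic_sees_dirty,
update_xchg_sees_dirty) and the single-threaded replay of the CPU 0 / CPU 1
interleaving are all assumptions made purely for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for the sketch; not KVM's SPTE layout. */
#define SPTE_W        (UINT64_C(1) << 1)
#define SPTE_ACCESSED (UINT64_C(1) << 5)
#define SPTE_DIRTY    (UINT64_C(1) << 6)

/*
 * Stand-in for CPU 1 in the table: the fast page fault path makes the SPTE
 * writable, then a guest memory write sets the Dirty bit.
 */
static void cpu1_fast_fault_and_write(_Atomic uint64_t *sptep)
{
	atomic_fetch_or(sptep, SPTE_W);
	atomic_fetch_or(sptep, SPTE_DIRTY);
}

/*
 * Non-atomic update: CPU 0 takes a snapshot, CPU 1 runs in the window, and
 * CPU 0 then overwrites the SPTE based on the stale snapshot.
 * Returns true if CPU 0 observed the Dirty bit.
 */
static bool update_nonatomic_sees_dirty(_Atomic uint64_t *sptep,
					uint64_t new_spte)
{
	uint64_t old_spte = atomic_load(sptep);	/* old_spte = *spte */

	cpu1_fast_fault_and_write(sptep);	/* the race window */

	atomic_store(sptep, new_spte);		/* spte = new_spte */
	return (old_spte & SPTE_DIRTY) != 0;
}

/*
 * Atomic update: whatever CPU 1 set before the exchange is returned by the
 * exchange itself, so the Dirty bit is visible to CPU 0.
 */
static bool update_xchg_sees_dirty(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	uint64_t old_spte;

	cpu1_fast_fault_and_write(sptep);	/* CPU 1 runs before the xchg */

	old_spte = atomic_exchange(sptep, new_spte);
	return (old_spte & SPTE_DIRTY) != 0;
}
```

Starting from an SPTE with only the Accessed bit set, the non-atomic helper
reports that no Dirty bit was seen even though CPU 1 set one, while the
exchange-based helper reports it. In both cases the SPTE ends up holding
new_spte; the only difference is whether the old value that CPU 0 acts on
includes the concurrent update.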