From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:23 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-2-jthoughton@google.com>
Subject: [PATCH v8 01/11] KVM: Remove kvm_handle_hva_range helper functions
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

kvm_handle_hva_range() is only used by the young notifiers, and a later
patch will tie it to them even more tightly. Instead of renaming
kvm_handle_hva_range() to something like kvm_handle_hva_range_young(),
simply remove it. This reads slightly better, at the cost of a little
code duplication.

Finally, rename __kvm_handle_hva_range() to kvm_handle_hva_range(), now
that the name is available.

Suggested-by: David Matlack
Signed-off-by: James Houghton
---
 virt/kvm/kvm_main.c | 74 +++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 27186b06518a..8b234a9acdb3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -551,8 +551,8 @@ static void kvm_null_fn(void)
 	     node;							\
 	     node = interval_tree_iter_next(node, start, last))	\
 
-static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
-		const struct kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
+		const struct kvm_mmu_notifier_range *range)
 {
 	struct kvm_mmu_notifier_return r = {
 		.ret = false,
@@ -628,33 +628,6 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 	return r;
 }
 
-static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
-						unsigned long start,
-						unsigned long end,
-						gfn_handler_t handler,
-						bool flush_on_ret)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	const struct kvm_mmu_notifier_range range = {
-		.start = start,
-		.end = end,
-		.handler = handler,
-		.on_lock = (void *)kvm_null_fn,
-		.flush_on_ret = flush_on_ret,
-		.may_block = false,
-	};
-
-	return __kvm_handle_hva_range(kvm, &range).ret;
-}
-
-static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
-							 unsigned long start,
-							 unsigned long end,
-							 gfn_handler_t handler)
-{
-	return kvm_handle_hva_range(mn, start, end, handler, false);
-}
-
 void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
@@ -747,7 +720,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	 * that guest memory has been reclaimed. This needs to be done *after*
 	 * dropping mmu_lock, as x86's reclaim path is slooooow.
 	 */
-	if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
+	if (kvm_handle_hva_range(kvm, &hva_range).found_memslot)
 		kvm_arch_guest_memory_reclaimed(kvm);
 
 	return 0;
@@ -793,7 +766,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	};
 	bool wake;
 
-	__kvm_handle_hva_range(kvm, &hva_range);
+	kvm_handle_hva_range(kvm, &hva_range);
 
 	/* Pairs with the increment in range_start(). */
 	spin_lock(&kvm->mn_invalidate_lock);
@@ -815,10 +788,20 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start = start,
+		.end = end,
+		.handler = kvm_age_gfn,
+		.on_lock = (void *)kvm_null_fn,
+		.flush_on_ret =
+			!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
+		.may_block = false,
+	};
+
 	trace_kvm_age_hva(start, end);
 
-	return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
-				    !IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG));
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
@@ -826,6 +809,16 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 					unsigned long start,
 					unsigned long end)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start = start,
+		.end = end,
+		.handler = kvm_age_gfn,
+		.on_lock = (void *)kvm_null_fn,
+		.flush_on_ret = false,
+		.may_block = false,
+	};
+
 	trace_kvm_age_hva(start, end);
 
 	/*
@@ -841,17 +834,26 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 	 * cadence. If we find this inaccurate, we might come up with a
 	 * more sophisticated heuristic later.
 	 */
-	return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long address)
 {
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	const struct kvm_mmu_notifier_range range = {
+		.start = address,
+		.end = address + 1,
+		.handler = kvm_test_age_gfn,
+		.on_lock = (void *)kvm_null_fn,
+		.flush_on_ret = false,
+		.may_block = false,
+	};
+
 	trace_kvm_test_age_hva(address);
 
-	return kvm_handle_hva_range_no_flush(mn, address, address + 1,
-					     kvm_test_age_gfn);
+	return kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
-- 
2.47.0.199.ga7371fff76-goog
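The conversion above is mechanical, and after this patch every young
notifier follows the same shape. As a rough sketch (the range struct and
walker are those shown in the diff; the function name here is a
placeholder, and the handler, flush policy, and tracing vary per call
site), each notifier now open-codes its range descriptor and calls the
renamed walker directly:

	/* Illustrative sketch only, condensed from the diff above. */
	static int example_young_notifier(struct mmu_notifier *mn,
					  unsigned long start,
					  unsigned long end)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);
		const struct kvm_mmu_notifier_range range = {
			.start = start,
			.end = end,
			.handler = kvm_age_gfn,
			.on_lock = (void *)kvm_null_fn,
			.flush_on_ret = false,
			.may_block = false,
		};

		return kvm_handle_hva_range(kvm, &range).ret;
	}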
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:24 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-3-jthoughton@google.com>
Subject: [PATCH v8 02/11] KVM: Add lockless memslot walk to KVM
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Give architectures the flexibility to synchronize as optimally as they
can, instead of always taking the MMU lock for writing. Architectures
that do their own locking must select
CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS.

The immediate application is to allow architectures to implement the
test/clear_young MMU notifiers more cheaply.
Suggested-by: Yu Zhao
Signed-off-by: James Houghton
Reviewed-by: David Matlack
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/Kconfig         |  2 ++
 virt/kvm/kvm_main.c      | 28 +++++++++++++++++++++-------
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 18a1672ffcbf..ab0318dbb8bd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,6 +260,7 @@ struct kvm_gfn_range {
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
 	bool may_block;
+	bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 54e959e7d68f..b50e4e629ac9 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -102,6 +102,8 @@ config KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
 	depends on KVM_GENERIC_MMU_NOTIFIER
+
+config KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	bool
 
 config KVM_GENERIC_MEMORY_ATTRIBUTES
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b234a9acdb3..218edf037917 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -517,6 +517,7 @@ struct kvm_mmu_notifier_range {
 	on_lock_fn_t on_lock;
 	bool flush_on_ret;
 	bool may_block;
+	bool lockless;
 };
 
 /*
@@ -571,6 +572,10 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		    IS_KVM_NULL_FN(range->handler)))
 		return r;
 
+	/* on_lock will never be called for lockless walks */
+	if (WARN_ON_ONCE(range->lockless && !IS_KVM_NULL_FN(range->on_lock)))
+		return r;
+
 	idx = srcu_read_lock(&kvm->srcu);
 
 	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
@@ -602,15 +607,18 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.lockless = range->lockless;
 
 			if (!r.found_memslot) {
 				r.found_memslot = true;
-				KVM_MMU_LOCK(kvm);
-				if (!IS_KVM_NULL_FN(range->on_lock))
-					range->on_lock(kvm);
-
-				if (IS_KVM_NULL_FN(range->handler))
-					goto mmu_unlock;
+				if (!range->lockless) {
+					KVM_MMU_LOCK(kvm);
+					if (!IS_KVM_NULL_FN(range->on_lock))
+						range->on_lock(kvm);
+
+					if (IS_KVM_NULL_FN(range->handler))
+						goto mmu_unlock;
+				}
 			}
 			r.ret |= range->handler(kvm, &gfn_range);
 		}
@@ -620,7 +628,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 		kvm_flush_remote_tlbs(kvm);
 
 mmu_unlock:
-	if (r.found_memslot)
+	if (r.found_memslot && !range->lockless)
 		KVM_MMU_UNLOCK(kvm);
 
 	srcu_read_unlock(&kvm->srcu, idx);
@@ -797,6 +805,8 @@ static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
 		.flush_on_ret =
 			!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG),
 		.may_block = false,
+		.lockless =
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_age_hva(start, end);
@@ -817,6 +827,8 @@ static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
 		.on_lock = (void *)kvm_null_fn,
 		.flush_on_ret = false,
 		.may_block = false,
+		.lockless =
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_age_hva(start, end);
@@ -849,6 +861,8 @@ static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
 		.on_lock = (void *)kvm_null_fn,
 		.flush_on_ret = false,
 		.may_block = false,
+		.lockless =
+			IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_YOUNG_LOCKLESS),
 	};
 
 	trace_kvm_test_age_hva(address);
-- 
2.47.0.199.ga7371fff76-goog
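To make the opt-in concrete: an architecture that does its own
synchronization selects the new Kconfig symbol, and its gfn handlers
must then tolerate being invoked without mmu_lock whenever
range->lockless is set. A hedged sketch follows; the Kconfig symbol and
the lockless field are from this patch, while the arch_* helpers are
hypothetical stand-ins for arch-specific code:

	/* In the architecture's KVM Kconfig entry: */
	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS

	/* In the architecture's handler: */
	bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		/*
		 * When range->lockless is set, this runs under SRCU only:
		 * the page-table walk must be RCU-safe and any PTE
		 * updates must be atomic.
		 */
		if (range->lockless)
			return arch_age_gfn_lockless(kvm, range); /* hypothetical */

		return arch_age_gfn_locked(kvm, range); /* hypothetical */
	}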
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:25 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-4-jthoughton@google.com>
Subject: [PATCH v8 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

This new function, tdp_mmu_clear_spte_bits_atomic(), will be used in a
follow-up patch to enable lockless Accessed and R/W/X bit clearing.

Signed-off-by: James Houghton
Acked-by: Yu Zhao
---
 arch/x86/kvm/mmu/tdp_iter.h | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 2880fd392e0c..a24fca3f9e7f 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -25,6 +25,13 @@ static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
 	return xchg(rcu_dereference(sptep), new_spte);
 }
 
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+	atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+	return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
 static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 {
 	KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
@@ -63,12 +70,8 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
 					  u64 mask, int level)
 {
-	atomic64_t *sptep_atomic;
-
-	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level)) {
-		sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
-		return (u64)atomic64_fetch_and(~mask, sptep_atomic);
-	}
+	if (kvm_tdp_mmu_spte_need_atomic_write(old_spte, level))
+		return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
 
 	__kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
 	return old_spte;
-- 
2.47.0.199.ga7371fff76-goog
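The helper's value is that atomic64_fetch_and() returns the pre-clear
value, so a caller can clear the Accessed bit and learn whether it was
set in a single atomic read-modify-write. A minimal usage sketch, in the
spirit of how the next patch uses it (the wrapper function name is a
placeholder; shadow_accessed_mask is KVM's existing mask):

	/* Clear the Accessed bit and report whether it was set. */
	static bool example_clear_and_test_young(tdp_ptep_t sptep)
	{
		u64 old_spte = tdp_mmu_clear_spte_bits_atomic(sptep,
						shadow_accessed_mask);

		return old_spte & shadow_accessed_mask;
	}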
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:26 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-5-jthoughton@google.com>
Subject: [PATCH v8 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn and kvm_age_gfn
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Walk the TDP MMU in an RCU read-side critical section, without holding
mmu_lock, when harvesting and potentially updating age information on
SPTEs. This requires a way to do RCU-safe walking of the tdp_mmu_roots;
do this with a new macro.

The PTE modifications are now done atomically, and
kvm_tdp_mmu_spte_need_atomic_write() has been updated to account for the
fact that kvm_age_gfn() can now locklessly update the Accessed bit and
the W/R/X bits. If the cmpxchg for marking the SPTE for access tracking
fails, leave the SPTE as-is and treat it as young: if the SPTE is being
actively modified, it is most likely young anyway.

Harvesting age information from the shadow MMU is still done while
holding the MMU write lock.

Suggested-by: Yu Zhao
Signed-off-by: James Houghton
Reviewed-by: David Matlack
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig            |  1 +
 arch/x86/kvm/mmu/mmu.c          | 10 ++++++++--
 arch/x86/kvm/mmu/tdp_iter.h     | 12 ++++++------
 arch/x86/kvm/mmu/tdp_mmu.c      | 23 ++++++++++++++++-------
 5 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 70c7ed0ef184..84ee08078686 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1455,6 +1455,7 @@ struct kvm_arch {
 	 * tdp_mmu_page set.
 	 *
 	 * For reads, this list is protected by:
+	 *	RCU alone or
 	 *	the MMU lock in read mode + RCU or
 	 *	the MMU lock in write mode
 	 *
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1ed1e4f5d51c..97f747d60fe9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -23,6 +23,7 @@ config KVM_X86
 	select KVM_COMMON
 	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+	select KVM_MMU_NOTIFIER_YOUNG_LOCKLESS
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_DIRTY_RING_TSO
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 443845bb2e01..26797ccd34d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1586,8 +1586,11 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1599,8 +1602,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm))
+	if (kvm_memslots_have_rmaps(kvm)) {
+		write_lock(&kvm->mmu_lock);
 		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		write_unlock(&kvm->mmu_lock);
+	}
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index a24fca3f9e7f..f26d0b60d2dd 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -39,10 +39,11 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 }
 
 /*
- * SPTEs must be modified atomically if they are shadow-present, leaf
- * SPTEs, and have volatile bits, i.e. has bits that can be set outside
- * of mmu_lock. The Writable bit can be set by KVM's fast page fault
- * handler, and Accessed and Dirty bits can be set by the CPU.
+ * SPTEs must be modified atomically if they have bits that can be set outside
+ * of the mmu_lock. This can happen for any shadow-present leaf SPTEs, as the
+ * Writable bit can be set by KVM's fast page fault handler, the Accessed and
+ * Dirty bits can be set by the CPU, and the Accessed and W/R/X bits can be
+ * cleared by age_gfn_range().
  *
 * Note, non-leaf SPTEs do have Accessed bits and those bits are
 * technically volatile, but KVM doesn't consume the Accessed bit of
@@ -53,8 +54,7 @@ static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
 static inline bool kvm_tdp_mmu_spte_need_atomic_write(u64 old_spte, int level)
 {
 	return is_shadow_present_pte(old_spte) &&
-	       is_last_spte(old_spte, level) &&
-	       spte_has_volatile_bits(old_spte);
+	       is_last_spte(old_spte, level);
 }
 
 static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4508d868f1cd..f5b4f1060fff 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -178,6 +178,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		     ((_only_valid) && (_root)->role.invalid))) {	\
 		} else
 
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ */
+#define for_each_valid_tdp_mmu_root_rcu(_kvm, _root, _as_id)		\
+	list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link)	\
+		if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
+		    (_root)->role.invalid) {				\
+		} else
+
 #define for_each_tdp_mmu_root(_kvm, _root, _as_id)			\
 	__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
 
@@ -1168,16 +1177,16 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
 	u64 new_spte;
 
 	if (spte_ad_enabled(iter->old_spte)) {
-		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
-							 iter->old_spte,
-							 shadow_accessed_mask,
-							 iter->level);
+		iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
						shadow_accessed_mask);
 		new_spte = iter->old_spte & ~shadow_accessed_mask;
 	} else {
 		new_spte = mark_spte_for_access_track(iter->old_spte);
-		iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
-							iter->old_spte, new_spte,
-							iter->level);
+		/*
+		 * It is safe for the following cmpxchg to fail. Leave the
+		 * Accessed bit set, as the spte is most likely young anyway.
+		 */
+		(void)__tdp_mmu_set_spte_atomic(iter, new_spte);
 	}
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
-- 
2.47.0.199.ga7371fff76-goog
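Putting the pieces together, the lockless aging path now amounts to an
RCU-protected walk over the roots with atomic SPTE updates. A simplified
sketch of the control flow (the macro is from this patch;
age_sptes_in_root() is a hypothetical stand-in for the real per-root
walk, which differs in detail):

	bool young = false;
	struct kvm_mmu_page *root;

	rcu_read_lock();
	for_each_valid_tdp_mmu_root_rcu(kvm, root, range->slot->as_id)
		/* SPTE updates are atomic; a lost cmpxchg is treated as young. */
		young |= age_sptes_in_root(kvm, root, range); /* hypothetical */
	rcu_read_unlock();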
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:27 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-6-jthoughton@google.com>
Subject: [PATCH v8 05/11] KVM: x86/mmu: Rearrange kvm_{test_,}age_gfn
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Reorder the TDP MMU check to be first for both kvm_test_age_gfn and
kvm_age_gfn. For kvm_test_age_gfn, this allows us to completely avoid
grabbing the MMU lock when the TDP MMU reports that the page is young.
Do the same for kvm_age_gfn merely for consistency.

Signed-off-by: James Houghton
Acked-by: Yu Zhao
---
 arch/x86/kvm/mmu/mmu.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 26797ccd34d8..793565a3a573 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1586,15 +1586,15 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
+
 	if (kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, false);
+		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
-
 	return young;
 }
 
@@ -1602,15 +1602,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (tdp_mmu_enabled)
+		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
+
+	if (!young && kvm_memslots_have_rmaps(kvm)) {
 		write_lock(&kvm->mmu_lock);
-		young = kvm_rmap_age_gfn_range(kvm, range, true);
+		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
 	}
 
-	if (tdp_mmu_enabled)
-		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-
 	return young;
 }
-- 
2.47.0.199.ga7371fff76-goog
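Why the order matters is easiest to see in kvm_test_age_gfn(): because
the TDP MMU check now runs first and is lockless, the common case
returns without ever touching mmu_lock, and the write-locked rmap walk
only runs as a fallback. Condensed from the hunk above:

	if (tdp_mmu_enabled)
		young = kvm_tdp_mmu_test_age_gfn(kvm, range);	/* lockless fast path */

	if (!young && kvm_memslots_have_rmaps(kvm)) {		/* slow fallback only if needed */
		write_lock(&kvm->mmu_lock);
		young |= kvm_rmap_age_gfn_range(kvm, range, true);
		write_unlock(&kvm->mmu_lock);
	}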
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:28 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-7-jthoughton@google.com>
Subject: [PATCH v8 06/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

Optimize kvm_age_gfn()'s and kvm_test_age_gfn()'s interaction with the
shadow MMU: rather than checking whether the memslot has rmaps, check
whether there are any indirect_shadow_pages at all.

Signed-off-by: James Houghton
Acked-by: Yu Zhao
---
 arch/x86/kvm/mmu/mmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 793565a3a573..125d4c3ccceb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1582,6 +1582,11 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
 	return young;
 }
 
+static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
+{
+	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
@@ -1589,7 +1594,7 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_age_gfn_range(kvm, range);
 
-	if (kvm_memslots_have_rmaps(kvm)) {
+	if (kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, false);
 		write_unlock(&kvm->mmu_lock);
@@ -1605,7 +1610,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (tdp_mmu_enabled)
 		young = kvm_tdp_mmu_test_age_gfn(kvm, range);
 
-	if (!young && kvm_memslots_have_rmaps(kvm)) {
+	if (!young && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young |= kvm_rmap_age_gfn_range(kvm, range, true);
 		write_unlock(&kvm->mmu_lock);
-- 
2.47.0.199.ga7371fff76-goog
From nobody Sun Nov 24 11:28:49 2024
Date: Tue, 5 Nov 2024 18:43:29 +0000
In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com>
Message-ID: <20241105184333.2305744-8-jthoughton@google.com>
Subject: [PATCH v8 07/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
From: James Houghton
To: Sean Christopherson, Paolo Bonzini
Cc: David Matlack, David Rientjes, James Houghton, Marc Zyngier, Oliver Upton, Wei Xu, Yu Zhao, Axel Rasmussen, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

From: Sean Christopherson

Refactor the pte_list and rmap code to always read and write
rmap_head->val exactly once, e.g. by collecting changes in a local
variable and then propagating those changes back to rmap_head->val as
appropriate. This will allow implementing a per-rmap rwlock (of sorts)
by adding a LOCKED bit into the rmap value alongside the MANY bit.

Signed-off-by: Sean Christopherson
Signed-off-by: James Houghton
---
 arch/x86/kvm/mmu/mmu.c | 83 +++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 125d4c3ccceb..145ea180963e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -858,21 +858,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	old_val = rmap_head->val;
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -881,21 +884,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	rmap_head->val = new_val;
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -922,9 +929,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -932,24 +939,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = rmap_head->val;
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -957,6 +966,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	rmap_head->val = rmap_val;
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -971,17 +983,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = rmap_head->val;
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -997,14 +1011,15 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = rmap_head->val;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & KVM_RMAP_MANY))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1047,6 +1062,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 */
 struct rmap_iterator {
 	/* private fields */
+	struct rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1061,18 +1077,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = rmap_head->val;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-- 
2.47.0.199.ga7371fff76-goog
TmHEhkWfuoUx3swT8fCs+LNwl9FoEJgp9NyeFLf0XQY7sbTIpxVHk6UTcaDlwdJIjnbgW6ZuRv7 bb21z8yyjTGP8czZ38A== X-Google-Smtp-Source: AGHT+IGik+DLwPnAj/ZpDJXHf21XyU3PCl7bJk+sYnaFfsbhLbfn2IFNWxplY+ZU6Y8yYe4uBthq2BbYiZhjOB24 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:13d:fb22:ac12:a84b]) (user=jthoughton job=sendgmr) by 2002:a25:7449:0:b0:e2e:330b:faab with SMTP id 3f1490d57ef6-e30e59194c3mr15938276.0.1730832231123; Tue, 05 Nov 2024 10:43:51 -0800 (PST) Date: Tue, 5 Nov 2024 18:43:30 +0000 In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241105184333.2305744-1-jthoughton@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241105184333.2305744-9-jthoughton@google.com> Subject: [PATCH v8 08/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock From: James Houghton To: Sean Christopherson , Paolo Bonzini Cc: David Matlack , David Rientjes , James Houghton , Marc Zyngier , Oliver Upton , Wei Xu , Yu Zhao , Axel Rasmussen , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Sean Christopherson Steal another bit from rmap entries (which are word aligned pointers, i.e. have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use the bit to implement a *very* rudimentary per-rmap spinlock. The only anticipated usage of the lock outside of mmu_lock is for aging gfns, and collisions between aging and other MMU rmap operations are quite rare, e.g. unless userspace is being silly and aging a tiny range over and over in a tight loop, time between contention when aging an actively running VM is O(seconds). In short, a more sophisticated locking scheme shouldn't be necessary. Note, the lock only protects the rmap structure itself; SPTEs that are pointed at by a locked rmap can still be modified and zapped by another task (KVM drops/zaps SPTEs before deleting the rmap entries). Signed-off-by: Sean Christopherson Co-developed-by: James Houghton Signed-off-by: James Houghton --- arch/x86/include/asm/kvm_host.h | 3 +- arch/x86/kvm/mmu/mmu.c | 129 +++++++++++++++++++++++++++++--- 2 files changed, 120 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_hos= t.h index 84ee08078686..378b87ff5b1f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -26,6 +26,7 @@ #include #include #include +#include =20 #include #include @@ -402,7 +403,7 @@ union kvm_cpu_role { }; =20 struct kvm_rmap_head { - unsigned long val; + atomic_long_t val; }; =20 struct kvm_pio_request { diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 145ea180963e..1cdb77df0a4d 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -847,11 +847,117 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_= bitmap(struct kvm_vcpu *vcpu * About rmap_head encoding: * * If the bit zero of rmap_head->val is clear, then it points to the only = spte - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct * pte_list_desc containing more mappings.
*/ #define KVM_RMAP_MANY BIT(0) =20 +/* + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU al= ways + * operates with mmu_lock held for write), but rmaps can be walked without + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap c= hain + * being zapped/dropped _while the rmap is locked_. + * + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must= be + * done while holding mmu_lock for write. This allows a task walking rmaps + * without holding mmu_lock to concurrently walk the same entries as a task + * that is holding mmu_lock but _not_ the rmap lock. Neither task will mo= dify + * the rmaps, thus the walks are stable. + * + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOC= KED, + * only the rmap chains themselves are protected. E.g. holding an rmap's = lock + * ensures all "struct pte_list_desc" fields are stable. + */ +#define KVM_RMAP_LOCKED BIT(1) + +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) +{ + unsigned long old_val, new_val; + + /* + * Elide the lock if the rmap is empty, as lockless walkers (read-only + * mode) don't need to (and can't) walk an empty rmap, nor can they add + * entries to the rmap. I.e. the only paths that process empty rmaps + * do so while holding mmu_lock for write, and are mutually exclusive. + */ + old_val =3D atomic_long_read(&rmap_head->val); + if (!old_val) + return 0; + + do { + /* + * If the rmap is locked, wait for it to be unlocked before + * trying to acquire the lock, e.g. to avoid bouncing the cache line. + */ + while (old_val & KVM_RMAP_LOCKED) { + old_val =3D atomic_long_read(&rmap_head->val); + cpu_relax(); + } + + /* + * Recheck for an empty rmap; it may have been purged by the + * task that held the lock. + */ + if (!old_val) + return 0; + + new_val =3D old_val | KVM_RMAP_LOCKED; + /* + * Use try_cmpxchg_acquire to prevent reads and writes to the rmap + * from being reordered outside of the critical section created by + * __kvm_rmap_lock. + * + * Pairs with smp_store_release in kvm_rmap_unlock. + * + * For the !old_val case, no ordering is needed, as there is no rmap + * to walk. + */ + } while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_= val)); + + /* Return the old value, i.e. _without_ the LOCKED bit set. */ + return old_val; +} + +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, + unsigned long new_val) +{ + WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); + /* + * Ensure that all accesses to the rmap have completed + * before we actually unlock the rmap. + * + * Pairs with the atomic_long_try_cmpxchg_acquire in __kvm_rmap_lock. + */ + atomic_long_set_release(&rmap_head->val, new_val); +} + +static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head) +{ + return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED; +} + +/* + * If mmu_lock isn't held, rmaps can only be locked in read-only mode. The a= ctual + * locking is the same, but the caller is disallowed from modifying the rm= ap, + * and so the unlock flow is a nop if the rmap is/was empty.
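 + *
 + * A read-only walker is expected to pair the calls like so (sketch only;
 + * this is the pattern the aging path uses later in this series):
 + *
 + *	rmap_val =3D kvm_rmap_lock_readonly(rmap_head);
 + *	<walk the pte_list_desc chain encoded by rmap_val>
 + *	kvm_rmap_unlock_readonly(rmap_head, rmap_val);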
+ */ +__maybe_unused +static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_hea= d) +{ + return __kvm_rmap_lock(rmap_head); +} + +__maybe_unused +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head, + unsigned long old_val) +{ + if (!old_val) + return; + + KVM_MMU_WARN_ON(old_val !=3D kvm_rmap_get(rmap_head)); + atomic_long_set(&rmap_head->val, old_val); +} + /* * Returns the number of pointers in the rmap chain, not counting the new = one. */ @@ -862,7 +968,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *ca= che, u64 *spte, struct pte_list_desc *desc; int count =3D 0; =20 - old_val =3D rmap_head->val; + old_val =3D kvm_rmap_lock(rmap_head); =20 if (!old_val) { new_val =3D (unsigned long)spte; @@ -894,7 +1000,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *c= ache, u64 *spte, desc->sptes[desc->spte_count++] =3D spte; } =20 - rmap_head->val =3D new_val; + kvm_rmap_unlock(rmap_head, new_val); =20 return count; } @@ -942,7 +1048,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, unsigned long rmap_val; int i; =20 - rmap_val =3D rmap_head->val; + rmap_val =3D kvm_rmap_lock(rmap_head); if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm)) goto out; =20 @@ -968,7 +1074,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte, } =20 out: - rmap_head->val =3D rmap_val; + kvm_rmap_unlock(rmap_head, rmap_val); } =20 static void kvm_zap_one_rmap_spte(struct kvm *kvm, @@ -986,7 +1092,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, unsigned long rmap_val; int i; =20 - rmap_val =3D rmap_head->val; + rmap_val =3D kvm_rmap_lock(rmap_head); if (!rmap_val) return false; =20 @@ -1005,13 +1111,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, } out: /* rmap_head is meaningless now, remember to reset it */ - rmap_head->val =3D 0; + kvm_rmap_unlock(rmap_head, 0); return true; } =20 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head) { - unsigned long rmap_val =3D rmap_head->val; + unsigned long rmap_val =3D kvm_rmap_get(rmap_head); struct pte_list_desc *desc; =20 if (!rmap_val) @@ -1077,7 +1183,7 @@ struct rmap_iterator { static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head, struct rmap_iterator *iter) { - unsigned long rmap_val =3D rmap_head->val; + unsigned long rmap_val =3D kvm_rmap_get(rmap_head); u64 *sptep; =20 if (!rmap_val) @@ -1412,7 +1518,7 @@ static void slot_rmap_walk_next(struct slot_rmap_walk= _iterator *iterator) while (++iterator->rmap <=3D iterator->end_rmap) { iterator->gfn +=3D KVM_PAGES_PER_HPAGE(iterator->level); =20 - if (iterator->rmap->val) + if (atomic_long_read(&iterator->rmap->val)) return; } =20 @@ -2450,7 +2556,8 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct k= vm_mmu_page *sp, * avoids retaining a large number of stale nested SPs. 
*/ if (tdp_enabled && invalid_list && - child->role.guest_mode && !child->parent_ptes.val) + child->role.guest_mode && + !atomic_long_read(&child->parent_ptes.val)) return kvm_mmu_prepare_zap_page(kvm, child, invalid_list); } --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 11:28:49 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 06A9C1F706F for ; Tue, 5 Nov 2024 18:43:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832234; cv=none; b=NgUhATtqW5T7BftNNfgFT/lGD3aXMtZ/BrJHb+sx2S/uJSe6NU/VhE7klW09FHTPR4wXqzzD4cOaIxreF6uhpFMG+2sIEtHAmj3b60Kh1qaNdCW5jbWbK1XP7g+gePKkKrlkvC1QoXF1bl0sTZb/p0daes4qr26zzRFs+TToOZE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832234; c=relaxed/simple; bh=jDUR0fdf4pyrS1yb6Btzmah8mkW5NwE4XOL0ct91PhM=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=krTAD1d7IzeDG8m/JZZuup1V362+QmdUHyiiiw9PYfNN21ViDFTZVHQ35YMwIySNUbof/k5UgPfIOO3Uu9jfWvWb1knVmJ+D071MpZYp7XF7pQTofsyby2schdUCwzf6sXkkDpMzIg7Qw4jKorVLbzKVPLZBUUgHRBi2qBoMHDc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=huBeBKNP; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="huBeBKNP" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6e38fabff35so102508197b3.0 for ; Tue, 05 Nov 2024 10:43:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730832232; x=1731437032; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=1iBajr2tvBcG6zm00BdC5Rt5Uo5VkPVaG6KrnUwYASM=; b=huBeBKNPPZNkN4sodNz4itHauoF/PQnTwiysZ7d7fRxrjl+UuOI9/yvK05quanVlhS zMjSmsBWaPb+8algPi/2mSu2fmrxtorWiftgiOK1v/9rzni/7uBvxJcQh37WaZ32ixwf z3mAGyQtO+FxEj4GsrOEJJoYgUT463x/26ehbWyxq6+lfPNY9DjGFhXW4qbAM2lXRb1t 8PPbU26VnMCzndvqducwtqPm6B1DGApdWTaTkSM0jjoL1j7EzLCF9Ebp5TebkV4iavnG hZg5SJg1XMeNzjvJK8Qtdv1pkwQ9a9EilDbC2g0kPqP0RUn+6vO54iofc9myyxsN8szW 9lMQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730832232; x=1731437032; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1iBajr2tvBcG6zm00BdC5Rt5Uo5VkPVaG6KrnUwYASM=; b=QuHn7q9MRAV1svDR42JbOZLXU7cpcIgyVQ1fAi9IFCp0jZCYE5Mav2WW+j3muQOQvz ea23YLuQkcMoWq9KxrPvi0gHZaYXaw+KhbkWHz5JJUizGrNCJCaeSHGiBBGX2Y7xnQaT 3gp/LG8w/R7KZVxWuTdX7vBi9+hR5Ehey0c7ETQUa+fC+A/paVxRe8lEd/9uPdZLGohx +zCE9nnZR+u6CKJdF04BSufb4h4DG4jOYkKSjOH+u77b+00ESOBFHMNhHb6cKz1jNwyr Ep98ZlsKFM3PGD/gznfQyopan5fAqOdkKEIUJ19c0hUISyJCdbQG8oQu7VXmN6AAoshG idKw== X-Forwarded-Encrypted: i=1; 
AJvYcCU2s7m3rlwzRQtwNPn38lVrsAuOhN0bI92mPjAzuuRnXyI8JQzg/hWHKDi1q2d4IhNCjvmUmevrBSzl/iA=@vger.kernel.org X-Gm-Message-State: AOJu0YxtW7Z4LKrz3ECTIwNcC25ij8swtgHOquIhJy3j0+ZiG3umn/QX lUDJzkqtZ7LjqUohMXFSPX3/eUqwVfAI0nOnxeSaJ4URUhONfb4LRN5BvTV5j5keoPIrIIJUL6F Esd8Fh8KouJEM5R0joQ== X-Google-Smtp-Source: AGHT+IEQCTYc5Qjt+z/Hb6r7J4If+mjo7L20fqiQ08ClH2OP7hcg4Onkoc+dim/QgXCvgp8uIzQynwLDxF2eMifJ X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:13d:fb22:ac12:a84b]) (user=jthoughton job=sendgmr) by 2002:a05:690c:7091:b0:6ea:3c62:17c1 with SMTP id 00721157ae682-6ea3c621d20mr5673717b3.1.1730832232160; Tue, 05 Nov 2024 10:43:52 -0800 (PST) Date: Tue, 5 Nov 2024 18:43:31 +0000 In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241105184333.2305744-1-jthoughton@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241105184333.2305744-10-jthoughton@google.com> Subject: [PATCH v8 09/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs From: James Houghton To: Sean Christopherson , Paolo Bonzini Cc: David Matlack , David Rientjes , James Houghton , Marc Zyngier , Oliver Upton , Wei Xu , Yu Zhao , Axel Rasmussen , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Sean Christopherson Add a lockless version of for_each_rmap_spte(), which is pretty much the same as the normal version, except that it doesn't BUG() the host if a non-present SPTE is encountered. When mmu_lock is held, it should be impossible for a different task to zap a SPTE, _and_ zapped SPTEs must be removed from their rmap chain prior to dropping mmu_lock. Thus, the normal walker BUG()s if a non-present SPTE is encountered as something is wildly broken. When walking rmaps without holding mmu_lock, the SPTEs pointed at by the rmap chain can be zapped/dropped, and so a lockless walk can observe a non-present SPTE if it runs concurrently with a different operation that is zapping SPTEs. Signed-off-by: Sean Christopherson Signed-off-by: James Houghton --- arch/x86/kvm/mmu/mmu.c | 75 +++++++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1cdb77df0a4d..71019762a28a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -870,7 +870,7 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bit= map(struct kvm_vcpu *vcpu */ #define KVM_RMAP_LOCKED BIT(1) =20 -static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) +static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head) { unsigned long old_val, new_val; =20 @@ -914,14 +914,25 @@ static unsigned long kvm_rmap_lock(struct kvm_rmap_he= ad *rmap_head) */ } while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_= val)); =20 - /* Return the old value, i.e. _without_ the LOCKED bit set. */ + /* + * Return the old value, i.e. _without_ the LOCKED bit set. It's + * impossible for the return value to be 0 (see above), i.e. the read- + * only unlock flow can't get a false positive and fail to unlock. 
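 + *
 + * Callers pair this with kvm_rmap_unlock(), passing back the rmap value
 + * to be installed when dropping the lock; e.g. pte_list_add() below does:
 + *
 + *	old_val =3D kvm_rmap_lock(kvm, rmap_head);
 + *	<compute new_val from old_val>
 + *	kvm_rmap_unlock(rmap_head, new_val);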
+ */ return old_val; } =20 +static unsigned long kvm_rmap_lock(struct kvm *kvm, + struct kvm_rmap_head *rmap_head) +{ + lockdep_assert_held_write(&kvm->mmu_lock); + return __kvm_rmap_lock(rmap_head); +} + static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head, unsigned long new_val) { - WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); + KVM_MMU_WARN_ON(new_val & KVM_RMAP_LOCKED); /* * Ensure that all accesses to the rmap have completed * before we actually unlock the rmap. @@ -961,14 +972,14 @@ static void kvm_rmap_unlock_readonly(struct kvm_rmap_= head *rmap_head, /* * Returns the number of pointers in the rmap chain, not counting the new = one. */ -static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, - struct kvm_rmap_head *rmap_head) +static int pte_list_add(struct kvm *kvm, struct kvm_mmu_memory_cache *cach= e, + u64 *spte, struct kvm_rmap_head *rmap_head) { unsigned long old_val, new_val; struct pte_list_desc *desc; int count =3D 0; =20 - old_val =3D kvm_rmap_lock(rmap_head); + old_val =3D kvm_rmap_lock(kvm, rmap_head); =20 if (!old_val) { new_val =3D (unsigned long)spte; @@ -1048,7 +1059,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spt= e, unsigned long rmap_val; int i; =20 - rmap_val =3D kvm_rmap_lock(rmap_head); + rmap_val =3D kvm_rmap_lock(kvm, rmap_head); if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm)) goto out; =20 @@ -1092,7 +1103,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm, unsigned long rmap_val; int i; =20 - rmap_val =3D kvm_rmap_lock(rmap_head); + rmap_val =3D kvm_rmap_lock(kvm, rmap_head); if (!rmap_val) return false; =20 @@ -1184,23 +1195,18 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rm= ap_head, struct rmap_iterator *iter) { unsigned long rmap_val =3D kvm_rmap_get(rmap_head); - u64 *sptep; =20 if (!rmap_val) return NULL; =20 if (!(rmap_val & KVM_RMAP_MANY)) { iter->desc =3D NULL; - sptep =3D (u64 *)rmap_val; - goto out; + return (u64 *)rmap_val; } =20 iter->desc =3D (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY); iter->pos =3D 0; - sptep =3D iter->desc->sptes[iter->pos]; -out: - BUG_ON(!is_shadow_present_pte(*sptep)); - return sptep; + return iter->desc->sptes[iter->pos]; } =20 /* @@ -1210,14 +1216,11 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rm= ap_head, */ static u64 *rmap_get_next(struct rmap_iterator *iter) { - u64 *sptep; - if (iter->desc) { if (iter->pos < PTE_LIST_EXT - 1) { ++iter->pos; - sptep =3D iter->desc->sptes[iter->pos]; - if (sptep) - goto out; + if (iter->desc->sptes[iter->pos]) + return iter->desc->sptes[iter->pos]; } =20 iter->desc =3D iter->desc->more; @@ -1225,20 +1228,24 @@ static u64 *rmap_get_next(struct rmap_iterator *ite= r) if (iter->desc) { iter->pos =3D 0; /* desc->sptes[0] cannot be NULL */ - sptep =3D iter->desc->sptes[iter->pos]; - goto out; + return iter->desc->sptes[iter->pos]; } } =20 return NULL; -out: - BUG_ON(!is_shadow_present_pte(*sptep)); - return sptep; } =20 -#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_) \ - for (_spte_ =3D rmap_get_first(_rmap_head_, _iter_); \ - _spte_; _spte_ =3D rmap_get_next(_iter_)) +#define __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \ + for (_sptep_ =3D rmap_get_first(_rmap_head_, _iter_); \ + _sptep_; _sptep_ =3D rmap_get_next(_iter_)) + +#define for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \ + __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \ + if (!WARN_ON_ONCE(!is_shadow_present_pte(*(_sptep_)))) \ + +#define for_each_rmap_spte_lockless(_rmap_head_, _iter_, _sptep_, _spte_) \ + __for_each_rmap_spte(_rmap_head_, 
_iter_, _sptep_) \ + if (is_shadow_present_pte(_spte_ =3D mmu_spte_get_lockless(_sptep_))) =20 static void drop_spte(struct kvm *kvm, u64 *sptep) { @@ -1324,12 +1331,13 @@ static bool __rmap_clear_dirty(struct kvm *kvm, str= uct kvm_rmap_head *rmap_head, struct rmap_iterator iter; bool flush =3D false; =20 - for_each_rmap_spte(rmap_head, &iter, sptep) + for_each_rmap_spte(rmap_head, &iter, sptep) { if (spte_ad_need_write_protect(*sptep)) flush |=3D test_and_clear_bit(PT_WRITABLE_SHIFT, (unsigned long *)sptep); else flush |=3D spte_clear_dirty(sptep); + } =20 return flush; } @@ -1650,7 +1658,7 @@ static void __rmap_add(struct kvm *kvm, kvm_update_page_stats(kvm, sp->role.level, 1); =20 rmap_head =3D gfn_to_rmap(gfn, sp->role.level, slot); - rmap_count =3D pte_list_add(cache, spte, rmap_head); + rmap_count =3D pte_list_add(kvm, cache, spte, rmap_head); =20 if (rmap_count > kvm->stat.max_mmu_rmap_size) kvm->stat.max_mmu_rmap_size =3D rmap_count; @@ -1796,13 +1804,14 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn) return hash_64(gfn, KVM_MMU_HASH_SHIFT); } =20 -static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache, +static void mmu_page_add_parent_pte(struct kvm *kvm, + struct kvm_mmu_memory_cache *cache, struct kvm_mmu_page *sp, u64 *parent_pte) { if (!parent_pte) return; =20 - pte_list_add(cache, parent_pte, &sp->parent_ptes); + pte_list_add(kvm, cache, parent_pte, &sp->parent_ptes); } =20 static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_pag= e *sp, @@ -2492,7 +2501,7 @@ static void __link_shadow_page(struct kvm *kvm, =20 mmu_spte_set(sptep, spte); =20 - mmu_page_add_parent_pte(cache, sp, sptep); + mmu_page_add_parent_pte(kvm, cache, sp, sptep); =20 /* * The non-direct sub-pagetable must be updated before linking.
For --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 11:28:49 2024 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7D6C91F7578 for ; Tue, 5 Nov 2024 18:43:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832236; cv=none; b=t0tBV08b9rodssqpkRPcGBqfFdSY2QRfL8brMPo+oekeLWTBFrU+BABA3kjp2lqpfoD+h714vV2vfDotEgumUkaBZcQqP9TgrqeoJI+n8CkFYu74E1/AfV6PKOBa24KCTHNl9+w5dbKU6c8fqhASHQt+ssXEIZrgkB9a0utyWk0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832236; c=relaxed/simple; bh=NUxSfRX4cW7cHK1sEEySqBH76Z0YY3GolZI8i3B0KXc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=e8ZIW0rnhizDzRRKGadxev7hE5dKTTJNLrMvVlQy0pnhw9EHBxrAUaoRe8tCntAfsyK94PHValWF+iFi4No0srVGZ0zBs4Lx6dj2Djtav1hS7c6qANjYUGhyoYjN902xIaSQela9GsEYjP5RQ3iF1Zzpwo/OD1yiK4TvMqwlUjI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=J5kHK/OJ; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="J5kHK/OJ" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6e9d6636498so111694847b3.2 for ; Tue, 05 Nov 2024 10:43:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730832233; x=1731437033; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=878SGuKh3XmWcBJqcVpzoEFvF+zSdyFMaOS0SizaVl0=; b=J5kHK/OJQotjkpAFnmz0SpIlBwugvNynd8cVpQG9dHbrDnigzyhY0TqZ/jdVcG7CpR iaZht1Apr689NOIId/9y2JFAAs72kRlra1UpWdD3pLgmhajLJ8GJEVJPWRzQ53eZ6TyT Xx/iMZpOqjYZzwV2/J0dWkPYAKnVTFjrOpRYOnPDGfuggSzFY/rUdCpy6NXEK5rnc0yF 5y4CHRIRQFOsJpcKmGgvqWlJR54QVGpVJKWSMmTReq6ylA8eLfK6eMQ3BTrcUk93Os+K fSx6HeGsI0bpp9VgDQHH942yK7Bn2MM21ZNYRQGcKy+k0p476cxZJY72GCSESrI/4+9U txYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730832233; x=1731437033; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=878SGuKh3XmWcBJqcVpzoEFvF+zSdyFMaOS0SizaVl0=; b=NBqXfgXhQgOflKVqS4x7V+aOTt2m6oBRvH/4uVmHweXrWRqAOSKf3j3/Hzo7DiuHaA Qz+fTzIMtiCYDmCOmLAatSTcJcbyPNuBNqOt+PP0l8FObov6FLhqDyhDOrJGlTlYMIDH 9UVwb3uxk3vAexdKIpE4z8/KfzShc80hIkZZe5dAUjxrnX0Fkpm7LCrBewv8CpcmCBVL iONIwwbMtsyQlECz9KMHGM5f9UU9RteO03m3KwQOTctBkfz6E5+RVg0M0MquoUFhHcag tZ4yzPQLpXXsf/GHkHt99bj0eJiKvx1zhoe3htQoAL9Ih/1voWPAF71QPr44o2OItth1 m0Yw== X-Forwarded-Encrypted: i=1; AJvYcCXN/X8MZ3eXGETx5E6NB0x+YYOPdxgE3fGMDBWlYFZmkMCZ9KSg2CpbFSnosjJxqEAKHKniVu2cOmHkLJA=@vger.kernel.org X-Gm-Message-State: AOJu0Yyziz1oStZpFXLn61eM4/4TzGAqZvwPW06C7jC+04W069+ee5xP 51WPMsEBKPc3KOEq1RoVP65OxLlH6wfFxBNdV0H8Jpf33OkF9CwgpzDwUAjdVd+GSUSZvpACB0v 5yGBx719FC8hTTnKwEQ== 
X-Google-Smtp-Source: AGHT+IH7ISi4GnII+0BnZH0fUyJfZEHfRMZQzMU+f5JniaDp4cZHgc7uAg+CKsWk5yAjOsMg6WNz5CqdG6JaBDbt X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:13d:fb22:ac12:a84b]) (user=jthoughton job=sendgmr) by 2002:a05:6902:1d1:b0:e2e:3031:3f0c with SMTP id 3f1490d57ef6-e30e5b0ee45mr14173276.7.1730832233341; Tue, 05 Nov 2024 10:43:53 -0800 (PST) Date: Tue, 5 Nov 2024 18:43:32 +0000 In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241105184333.2305744-1-jthoughton@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241105184333.2305744-11-jthoughton@google.com> Subject: [PATCH v8 10/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns From: James Houghton To: Sean Christopherson , Paolo Bonzini Cc: David Matlack , David Rientjes , James Houghton , Marc Zyngier , Oliver Upton , Wei Xu , Yu Zhao , Axel Rasmussen , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Sean Christopherson When A/D bits are supported on sptes, it is safe to simply clear the Accessed bits. The less obvious case is marking sptes for access tracking in the non-A/D case (for EPT only). In this case, we have to be sure that it is okay for TLB entries to exist for non-present sptes. For example, when doing dirty tracking, if we come across a non-present SPTE, we need to know that we need to do a TLB invalidation. This case is already supported today (as we already support *not* doing TLBIs for clear_young(); there is a separate notifier for clearing *and* flushing, clear_flush_young()). This works today because GET_DIRTY_LOG flushes the TLB before returning to userspace. Signed-off-by: Sean Christopherson Co-developed-by: James Houghton Signed-off-by: James Houghton --- arch/x86/kvm/mmu/mmu.c | 72 +++++++++++++++++++++++------------------- 1 file changed, 39 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 71019762a28a..bdd6abf9b44e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -952,13 +952,11 @@ static unsigned long kvm_rmap_get(struct kvm_rmap_hea= d *rmap_head) * locking is the same, but the caller is disallowed from modifying the rm= ap, * and so the unlock flow is a nop if the rmap is/was empty. 
*/ -__maybe_unused static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_hea= d) { return __kvm_rmap_lock(rmap_head); } =20 -__maybe_unused static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head, unsigned long old_val) { @@ -1677,37 +1675,48 @@ static void rmap_add(struct kvm_vcpu *vcpu, const s= truct kvm_memory_slot *slot, } =20 static bool kvm_rmap_age_gfn_range(struct kvm *kvm, - struct kvm_gfn_range *range, bool test_only) + struct kvm_gfn_range *range, + bool test_only) { - struct slot_rmap_walk_iterator iterator; + struct kvm_rmap_head *rmap_head; struct rmap_iterator iter; + unsigned long rmap_val; bool young =3D false; u64 *sptep; + gfn_t gfn; + int level; + u64 spte; =20 - for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL, - range->start, range->end - 1, &iterator) { - for_each_rmap_spte(iterator.rmap, &iter, sptep) { - u64 spte =3D *sptep; + for (level =3D PG_LEVEL_4K; level <=3D KVM_MAX_HUGEPAGE_LEVEL; level++) { + for (gfn =3D range->start; gfn < range->end; + gfn +=3D KVM_PAGES_PER_HPAGE(level)) { + rmap_head =3D gfn_to_rmap(gfn, level, range->slot); + rmap_val =3D kvm_rmap_lock_readonly(rmap_head); =20 - if (!is_accessed_spte(spte)) - continue; + for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) { + if (!is_accessed_spte(spte)) + continue; + + if (test_only) { + kvm_rmap_unlock_readonly(rmap_head, rmap_val); + return true; + } =20 - if (test_only) - return true; - - if (spte_ad_enabled(spte)) { - clear_bit((ffs(shadow_accessed_mask) - 1), - (unsigned long *)sptep); - } else { - /* - * WARN if mmu_spte_update() signals the need - * for a TLB flush, as Access tracking a SPTE - * should never trigger an _immediate_ flush. - */ - spte =3D mark_spte_for_access_track(spte); - WARN_ON_ONCE(mmu_spte_update(sptep, spte)); + if (spte_ad_enabled(spte)) + clear_bit((ffs(shadow_accessed_mask) - 1), + (unsigned long *)sptep); + else + /* + * If the following cmpxchg fails, the + * spte is being concurrently modified + * and should most likely stay young. 
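 + * (cmpxchg64() rather than a plain
 + * write guarantees the access-tracked
 + * value is installed only if the SPTE
 + * still matches the value read above,
 + * so a concurrent update is never
 + * clobbered.)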
+ */ + cmpxchg64(sptep, spte, + mark_spte_for_access_track(spte)); + young =3D true; } - young =3D true; + + kvm_rmap_unlock_readonly(rmap_head, rmap_val); } } return young; @@ -1725,11 +1734,8 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_ran= ge *range) if (tdp_mmu_enabled) young =3D kvm_tdp_mmu_age_gfn_range(kvm, range); =20 - if (kvm_has_shadow_mmu_sptes(kvm)) { - write_lock(&kvm->mmu_lock); + if (kvm_has_shadow_mmu_sptes(kvm)) young |=3D kvm_rmap_age_gfn_range(kvm, range, false); - write_unlock(&kvm->mmu_lock); - } =20 return young; } @@ -1741,11 +1747,11 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_g= fn_range *range) if (tdp_mmu_enabled) young =3D kvm_tdp_mmu_test_age_gfn(kvm, range); =20 - if (!young && kvm_has_shadow_mmu_sptes(kvm)) { - write_lock(&kvm->mmu_lock); + if (young) + return young; + + if (kvm_has_shadow_mmu_sptes(kvm)) young |=3D kvm_rmap_age_gfn_range(kvm, range, true); - write_unlock(&kvm->mmu_lock); - } =20 return young; } --=20 2.47.0.199.ga7371fff76-goog From nobody Sun Nov 24 11:28:49 2024 Received: from mail-ua1-f74.google.com (mail-ua1-f74.google.com [209.85.222.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 375AA1F76D6 for ; Tue, 5 Nov 2024 18:43:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.222.74 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832238; cv=none; b=CnbfpSVlcbw9zp/i/v83PSLYmVteuGjwL1jvOX9bkiKvYIhXWnXDSGLIcDPxg54eBqU5Kz9kIssZSXXOF1evB/UrIsDzndwE2LN1snRPek2R2O3dYVPD35sQRhYP7AWhST+jpYuJf5nOFyxPtHaFG+3FZezfll+eIo/+1v6DpVQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1730832238; c=relaxed/simple; bh=MYnTUaBrkltgTXFGcXYC/R6U7Z9vRbK4TbNxRrSo8fk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=kO/hAzdUUcNDuslzLBGm2uGE3MmxflRvu53WKkSV1JeoOCFH8OjeLN4ck1I+b/bph4BzIifWyMnppdUG18VrSnCIZX4DweFqWH2oqzt5YzMRvkRyYPgtWDbmPV+43+APImlQzlu+24TtS8zBpeHrPJsH53vlNE72fs7TRx2s7F8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=caq7j35e; arc=none smtp.client-ip=209.85.222.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--jthoughton.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="caq7j35e" Received: by mail-ua1-f74.google.com with SMTP id a1e0cc1a2514c-85636e5c41cso511816241.0 for ; Tue, 05 Nov 2024 10:43:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1730832235; x=1731437035; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=kXJlhKVm3rXgc0FRAdpIfxhCDUnEwEWLVRju28ArHBQ=; b=caq7j35eFUq74wv0kNJJM3r24bG42u9FCiu7vEPC773Mdgmldtne+BlXy7hYrPaxL9 CBNiFuKLnaABn639O6Gn20rsOE94JUpCRakGrB4m7EmIYFtJqqIzXWd/SRiRCbRQ43t+ yhEZ6pGY+IWnH7SNmsu9iz6pizITmdFNbmX6e/PQF/WvwSc5kN3ou++V5BieM7KwX59c d//r4pG8ekqTJNHsgs5+r3MyG4sSNiJhlYF8kgab8Bo2M+Q5lOBCETstM3AO740hJOtF 3XKB8AC7KEn8/bIRR1Rp2D2VGxpR9UP7qIpdwriQe7hUKmkwEY6FvUFMk0d6CWedKcWb +HQg== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1730832235; x=1731437035; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=kXJlhKVm3rXgc0FRAdpIfxhCDUnEwEWLVRju28ArHBQ=; b=j0Y1O/sFctkklmPYtniqqWVrcouCniv0hlY6EukkJQWhP8Aa8zzQ0RD9Hv7WfjGGdm 61FpMOzdrkcPb1C4DOxE7KFkty6fIAHuG71mr3emQR3/MvLmP4hEjW/XiJMlOE6suhI3 FH7I1q2mY5sKuf8r0tQauXdkQOQJBOngiH7AoQdi0YJLObqb/aULj7hBofwFfSW17jds 6Czsqnx86ooON7mgPjH3M79fMwHQU1lYow/KbRzrkyNQ1DspX0KBsujaO/s+xgU1Dl0h 88PMV7HSfFGGlCyPNVASHUs8iHh0unU99mP4Rb2wKzvymtCYyRZt8DRZu36AEjZS/Y2E MC2A== X-Forwarded-Encrypted: i=1; AJvYcCU0ISRuLFf2t5D7BHZIARyXnqXmMPlf6oLvn2a3wcBfTpmrbkRX57Pc2i56GIXVVNPIDhXXO30uzewk/8Y=@vger.kernel.org X-Gm-Message-State: AOJu0YylKQ1LSdpJDEb+AvuHd1gC14Hl1ysFju6NEbeSklh4R09a3l3w GIqrN2xTKLxk5zf3PRp1EJ3NJSaeeLjvMGA+OnVYMD6e22OoDhEvu3i5sQ/KPP5mfGF0+FdWr1q SdYiMQ15xmm38Rw80YQ== X-Google-Smtp-Source: AGHT+IFiHiu6xIq/4+Yres2NWywIy+kT74SNVwkFIU9yFN3Y09U5PCDTy/b3EiIVdpijr7vUvnHESDmbhcXTq8a6 X-Received: from jthoughton.c.googlers.com ([fda3:e722:ac3:cc00:13d:fb22:ac12:a84b]) (user=jthoughton job=sendgmr) by 2002:a67:fd56:0:b0:4a8:f47c:5b84 with SMTP id ada2fe7eead31-4a8f47c6b85mr84606137.6.1730832234885; Tue, 05 Nov 2024 10:43:54 -0800 (PST) Date: Tue, 5 Nov 2024 18:43:33 +0000 In-Reply-To: <20241105184333.2305744-1-jthoughton@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241105184333.2305744-1-jthoughton@google.com> X-Mailer: git-send-email 2.47.0.199.ga7371fff76-goog Message-ID: <20241105184333.2305744-12-jthoughton@google.com> Subject: [PATCH v8 11/11] KVM: selftests: Add multi-gen LRU aging to access_tracking_perf_test From: James Houghton To: Sean Christopherson , Paolo Bonzini Cc: David Matlack , David Rientjes , James Houghton , Marc Zyngier , Oliver Upton , Wei Xu , Yu Zhao , Axel Rasmussen , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" This test now has two modes of operation: 1. (default) To check how much vCPU performance was affected by access tracking (previously existed, now supports MGLRU aging). 2. (-p) To also benchmark how fast MGLRU can do aging while vCPUs are faulting in memory. Mode (1) also serves as a way to verify that aging is working properly for pages only accessed by KVM. It will fail if one does not have the 0x8 lru_gen feature bit. To support MGLRU, the test creates a memory cgroup, moves itself into it, then uses the lru_gen debugfs output to track memory in that cgroup. The logic to parse the lru_gen debugfs output has been put into selftests/kvm/lib/lru_gen_util.c. 
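For reference, the lru_gen debugfs interface the test drives is the one described in Documentation/admin-guide/mm/multigen_lru.rst: reading /sys/kernel/debug/lru_gen dumps per-memcg, per-node generation stats in the format parsed by lru_gen_util.c, and writing a "+ memcg_id node_id max_gen [can_swap [force_scan]]" command creates a new generation, i.e. performs one aging pass. As a rough sketch of what requesting an aging pass boils down to (illustration only; the helper name below is made up, it assumes <stdio.h> plus the selftests' TEST_ASSERT(), and the test's real implementation lives in lru_gen_util.c):

	/* Illustrative sketch: request one lru_gen aging pass for a memcg/node. */
	static void request_aging_pass(unsigned long memcg_id, int node_id,
				       int max_gen)
	{
		FILE *f =3D fopen("/sys/kernel/debug/lru_gen", "w");

		TEST_ASSERT(f, "failed to open lru_gen debugfs file");
		/* "+ memcg_id node_id max_gen can_swap force_scan" */
		TEST_ASSERT(fprintf(f, "+ %lu %d %d 1 1\n",
				    memcg_id, node_id, max_gen) > 0,
			    "failed to write lru_gen aging command");
		TEST_ASSERT(!fclose(f), "failed to close lru_gen debugfs file");
	}
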
Co-developed-by: Axel Rasmussen Signed-off-by: Axel Rasmussen Signed-off-by: James Houghton --- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/access_tracking_perf_test.c | 366 ++++++++++++++-- .../selftests/kvm/include/lru_gen_util.h | 55 +++ .../testing/selftests/kvm/lib/lru_gen_util.c | 391 ++++++++++++++++++ 4 files changed, 783 insertions(+), 30 deletions(-) create mode 100644 tools/testing/selftests/kvm/include/lru_gen_util.h create mode 100644 tools/testing/selftests/kvm/lib/lru_gen_util.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests= /kvm/Makefile index f186888f0e00..542548e6e8ba 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -22,6 +22,7 @@ LIBKVM +=3D lib/elf.c LIBKVM +=3D lib/guest_modes.c LIBKVM +=3D lib/io.c LIBKVM +=3D lib/kvm_util.c +LIBKVM +=3D lib/lru_gen_util.c LIBKVM +=3D lib/memstress.c LIBKVM +=3D lib/guest_sprintf.c LIBKVM +=3D lib/rbtree.c diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tool= s/testing/selftests/kvm/access_tracking_perf_test.c index 3c7defd34f56..8d6c2ce4b98a 100644 --- a/tools/testing/selftests/kvm/access_tracking_perf_test.c +++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -47,6 +48,19 @@ #include "memstress.h" #include "guest_modes.h" #include "processor.h" +#include "lru_gen_util.h" + +static const char *TEST_MEMCG_NAME =3D "access_tracking_perf_test"; +static const int LRU_GEN_ENABLED =3D 0x1; +static const int LRU_GEN_MM_WALK =3D 0x2; +static const char *CGROUP_PROCS =3D "cgroup.procs"; +/* + * If using MGLRU, this test assumes a cgroup v2 or cgroup v1 memory hiera= rchy + * is mounted at cgroup_root. + * + * Can be changed with -r. + */ +static const char *cgroup_root =3D "/sys/fs/cgroup"; =20 /* Global variable used to synchronize all of the vCPU threads. */ static int iteration; @@ -62,6 +76,9 @@ static enum { /* The iteration that was last completed by each vCPU. */ static int vcpu_last_completed_iteration[KVM_MAX_VCPUS]; =20 +/* The time at which the last iteration was completed */ +static struct timespec vcpu_last_completed_time[KVM_MAX_VCPUS]; + /* Whether to overlap the regions of memory vCPUs access. */ static bool overlap_memory_access; =20 @@ -74,6 +91,12 @@ struct test_params { =20 /* The number of vCPUs to create in the VM. */ int nr_vcpus; + + /* Whether to use lru_gen aging instead of idle page tracking. */ + bool lru_gen; + + /* Whether to test the performance of aging itself. */ + bool benchmark_lru_gen; }; =20 static uint64_t pread_uint64(int fd, const char *filename, uint64_t index) @@ -89,6 +112,50 @@ static uint64_t pread_uint64(int fd, const char *filena= me, uint64_t index) =20 } =20 +static void write_file_long(const char *path, long v) +{ + FILE *f; + + f =3D fopen(path, "w"); + TEST_ASSERT(f, "fopen(%s) failed", path); + TEST_ASSERT(fprintf(f, "%ld\n", v) > 0, + "fprintf to %s failed", path); + TEST_ASSERT(!fclose(f), "fclose(%s) failed", path); +} + +static char *path_join(const char *parent, const char *child) +{ + char *out =3D NULL; + + return asprintf(&out, "%s/%s", parent, child) >=3D 0 ? 
out : NULL; +} + +static char *memcg_path(const char *memcg) +{ + return path_join(cgroup_root, memcg); +} + +static char *memcg_file_path(const char *memcg, const char *file) +{ + char *mp =3D memcg_path(memcg); + char *fp; + + if (!mp) + return NULL; + fp =3D path_join(mp, file); + free(mp); + return fp; +} + +static void move_to_memcg(const char *memcg, pid_t pid) +{ + char *procs =3D memcg_file_path(memcg, CGROUP_PROCS); + + TEST_ASSERT(procs, "Failed to construct cgroup.procs path"); + write_file_long(procs, pid); + free(procs); +} + #define PAGEMAP_PRESENT (1ULL << 63) #define PAGEMAP_PFN_MASK ((1ULL << 55) - 1) =20 @@ -242,6 +309,8 @@ static void vcpu_thread_main(struct memstress_vcpu_args= *vcpu_args) }; =20 vcpu_last_completed_iteration[vcpu_idx] =3D current_iteration; + clock_gettime(CLOCK_MONOTONIC, + &vcpu_last_completed_time[vcpu_idx]); } } =20 @@ -253,38 +322,68 @@ static void spin_wait_for_vcpu(int vcpu_idx, int targ= et_iteration) } } =20 +static bool all_vcpus_done(int target_iteration, int nr_vcpus) +{ + for (int i =3D 0; i < nr_vcpus; ++i) + if (READ_ONCE(vcpu_last_completed_iteration[i]) !=3D + target_iteration) + return false; + + return true; +} + /* The type of memory accesses to perform in the VM. */ enum access_type { ACCESS_READ, ACCESS_WRITE, }; =20 -static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *des= cription) +static void run_iteration(struct kvm_vm *vm, int nr_vcpus, const char *des= cription, + bool wait) { - struct timespec ts_start; - struct timespec ts_elapsed; int next_iteration, i; =20 /* Kick off the vCPUs by incrementing iteration. */ next_iteration =3D ++iteration; =20 - clock_gettime(CLOCK_MONOTONIC, &ts_start); - /* Wait for all vCPUs to finish the iteration. */ - for (i =3D 0; i < nr_vcpus; i++) - spin_wait_for_vcpu(i, next_iteration); + if (wait) { + struct timespec ts_start; + struct timespec ts_elapsed; + + clock_gettime(CLOCK_MONOTONIC, &ts_start); =20 - ts_elapsed =3D timespec_elapsed(ts_start); - pr_info("%-30s: %ld.%09lds\n", - description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec); + for (i =3D 0; i < nr_vcpus; i++) + spin_wait_for_vcpu(i, next_iteration); + + ts_elapsed =3D timespec_elapsed(ts_start); + + pr_info("%-30s: %ld.%09lds\n", + description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec); + } else + pr_info("%-30s\n", description); } =20 -static void access_memory(struct kvm_vm *vm, int nr_vcpus, - enum access_type access, const char *description) +static void _access_memory(struct kvm_vm *vm, int nr_vcpus, + enum access_type access, const char *description, + bool wait) { memstress_set_write_percent(vm, (access =3D=3D ACCESS_READ) ? 
0 : 100); iteration_work =3D ITERATION_ACCESS_MEMORY; - run_iteration(vm, nr_vcpus, description); + run_iteration(vm, nr_vcpus, description, wait); +} + +static void access_memory(struct kvm_vm *vm, int nr_vcpus, + enum access_type access, const char *description) +{ + return _access_memory(vm, nr_vcpus, access, description, true); +} + +static void access_memory_async(struct kvm_vm *vm, int nr_vcpus, + enum access_type access, + const char *description) +{ + return _access_memory(vm, nr_vcpus, access, description, false); } =20 static void mark_memory_idle(struct kvm_vm *vm, int nr_vcpus) @@ -297,19 +396,115 @@ static void mark_memory_idle(struct kvm_vm *vm, int = nr_vcpus) */ pr_debug("Marking VM memory idle (slow)...\n"); iteration_work =3D ITERATION_MARK_IDLE; - run_iteration(vm, nr_vcpus, "Mark memory idle"); + run_iteration(vm, nr_vcpus, "Mark memory idle", true); } =20 -static void run_test(enum vm_guest_mode mode, void *arg) +static void create_memcg(const char *memcg) +{ + const char *full_memcg_path =3D memcg_path(memcg); + int ret; + + TEST_ASSERT(full_memcg_path, "Failed to construct full memcg path"); +retry: + ret =3D mkdir(full_memcg_path, 0755); + if (ret && errno =3D=3D EEXIST) { + TEST_ASSERT(!rmdir(full_memcg_path), + "Found existing memcg at %s, but rmdir failed", + full_memcg_path); + goto retry; + } + TEST_ASSERT(!ret, "Creating the memcg failed: mkdir(%s) failed", + full_memcg_path); + + pr_info("Created memcg at %s\n", full_memcg_path); +} + +/* + * Test lru_gen aging speed while vCPUs are faulting memory in. + * + * This test will run lru_gen aging until the vCPUs have finished all of + * the faulting work, reporting: + * - vcpu wall time (wall time for slowest vCPU) + * - average aging pass duration + * - total number of aging passes + * - total time spent aging + * + * This test produces the most useful results when the vcpu wall time and = the + * total time spent aging are similar (i.e., we want to avoid timing aging + * while the vCPUs aren't doing any work). + */ +static void run_benchmark(enum vm_guest_mode mode, struct kvm_vm *vm, + struct test_params *params) { - struct test_params *params =3D arg; - struct kvm_vm *vm; int nr_vcpus =3D params->nr_vcpus; + struct memcg_stats stats; + struct timespec ts_start, ts_max, ts_vcpus_elapsed, + ts_aging_elapsed, ts_aging_elapsed_avg; + int num_passes =3D 0; =20 - vm =3D memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, - params->backing_src, !overlap_memory_access); + printf("Running lru_gen benchmark...\n"); =20 - memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main); + clock_gettime(CLOCK_MONOTONIC, &ts_start); + access_memory_async(vm, nr_vcpus, ACCESS_WRITE, + "Populating memory (async)"); + while (!all_vcpus_done(iteration, nr_vcpus)) { + lru_gen_do_aging_quiet(&stats, TEST_MEMCG_NAME); + ++num_passes; + } + + ts_aging_elapsed =3D timespec_elapsed(ts_start); + ts_aging_elapsed_avg =3D timespec_div(ts_aging_elapsed, num_passes); + + /* Find out when the slowest vCPU finished. 
*/ + ts_max =3D ts_start; + for (int i =3D 0; i < nr_vcpus; ++i) { + struct timespec *vcpu_ts =3D &vcpu_last_completed_time[i]; + + if (ts_max.tv_sec < vcpu_ts->tv_sec || + (ts_max.tv_sec =3D=3D vcpu_ts->tv_sec && + ts_max.tv_nsec < vcpu_ts->tv_nsec)) + ts_max =3D *vcpu_ts; + } + + ts_vcpus_elapsed =3D timespec_sub(ts_max, ts_start); + + pr_info("%-30s: %ld.%09lds\n", "vcpu wall time", + ts_vcpus_elapsed.tv_sec, ts_vcpus_elapsed.tv_nsec); + + pr_info("%-30s: %ld.%09lds, (passes:%d, total:%ld.%09lds)\n", + "lru_gen avg pass duration", + ts_aging_elapsed_avg.tv_sec, + ts_aging_elapsed_avg.tv_nsec, + num_passes, + ts_aging_elapsed.tv_sec, + ts_aging_elapsed.tv_nsec); +} + +/* + * Test how much access tracking affects vCPU performance. + * + * Supports two modes of access tracking: + * - idle page tracking + * - lru_gen aging + * + * When using lru_gen, this test additionally verifies that the pages are = in + * fact getting younger and older, otherwise the performance data would be + * invalid. + * + * The forced lru_gen aging can race with aging that occurs naturally. + */ +static void run_test(enum vm_guest_mode mode, struct kvm_vm *vm, + struct test_params *params) +{ + int nr_vcpus =3D params->nr_vcpus; + bool lru_gen =3D params->lru_gen; + struct memcg_stats stats; + // If guest_page_size is larger than the host's page size, the + // guest (memstress) will only fault in a subset of the host's pages. + long total_pages =3D nr_vcpus * params->vcpu_memory_bytes / + max(memstress_args.guest_page_size, + (uint64_t)getpagesize()); + int found_gens[5]; =20 pr_info("\n"); access_memory(vm, nr_vcpus, ACCESS_WRITE, "Populating memory"); @@ -319,11 +514,78 @@ static void run_test(enum vm_guest_mode mode, void *a= rg) access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from populated memory"); =20 /* Repeat on memory that has been marked as idle. */ - mark_memory_idle(vm, nr_vcpus); + if (lru_gen) { + /* Do an initial page table scan */ + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + TEST_ASSERT(sum_memcg_stats(&stats) >=3D total_pages, + "Not all pages tracked in lru_gen stats.\n" + "Is lru_gen enabled? Did the memcg get created properly?"); + + /* Find the generation we're currently in (probably youngest) */ + found_gens[0] =3D lru_gen_find_generation(&stats, total_pages); + + /* Do an aging pass now */ + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + + /* Same generation, but a newer generation has been made */ + found_gens[1] =3D lru_gen_find_generation(&stats, total_pages); + TEST_ASSERT(found_gens[1] =3D=3D found_gens[0], + "unexpected gen change: %d vs. %d", + found_gens[1], found_gens[0]); + } else + mark_memory_idle(vm, nr_vcpus); + access_memory(vm, nr_vcpus, ACCESS_WRITE, "Writing to idle memory"); - mark_memory_idle(vm, nr_vcpus); + + if (lru_gen) { + /* Scan the page tables again */ + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + + /* The pages should now be young again, so in a newer generation */ + found_gens[2] =3D lru_gen_find_generation(&stats, total_pages); + TEST_ASSERT(found_gens[2] > found_gens[1], + "pages did not get younger"); + + /* Do another aging pass */ + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + + /* Same generation; new generation has been made */ + found_gens[3] =3D lru_gen_find_generation(&stats, total_pages); + TEST_ASSERT(found_gens[3] =3D=3D found_gens[2], + "unexpected gen change: %d vs. 
%d", + found_gens[3], found_gens[2]); + } else + mark_memory_idle(vm, nr_vcpus); + access_memory(vm, nr_vcpus, ACCESS_READ, "Reading from idle memory"); =20 + if (lru_gen) { + /* Scan the pages tables again */ + lru_gen_do_aging(&stats, TEST_MEMCG_NAME); + + /* The pages should now be young again, so in a newer generation */ + found_gens[4] =3D lru_gen_find_generation(&stats, total_pages); + TEST_ASSERT(found_gens[4] > found_gens[3], + "pages did not get younger"); + } +} + +static void setup_vm_and_run(enum vm_guest_mode mode, void *arg) +{ + struct test_params *params =3D arg; + int nr_vcpus =3D params->nr_vcpus; + struct kvm_vm *vm; + + vm =3D memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, + params->backing_src, !overlap_memory_access); + + memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main); + + if (params->benchmark_lru_gen) + run_benchmark(mode, vm, params); + else + run_test(mode, vm, params); + memstress_join_vcpu_threads(nr_vcpus); memstress_destroy_vm(vm); } @@ -331,8 +593,8 @@ static void run_test(enum vm_guest_mode mode, void *arg) static void help(char *name) { puts(""); - printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o] [-s mem= _type]\n", - name); + printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]" + " [-s mem_type] [-l] [-r memcg_root]\n", name); puts(""); printf(" -h: Display this help message."); guest_modes_help(); @@ -342,6 +604,9 @@ static void help(char *name) printf(" -v: specify the number of vCPUs to run.\n"); printf(" -o: Overlap guest memory accesses instead of partitioning\n" " them into a separate region of memory for each vCPU.\n"); + printf(" -l: Use MGLRU aging instead of idle page tracking\n"); + printf(" -p: Benchmark MGLRU aging while faulting memory in\n"); + printf(" -r: The memory cgroup hierarchy root to use (when -l is given)\n= "); backing_src_help("-s"); puts(""); exit(0); @@ -353,13 +618,15 @@ int main(int argc, char *argv[]) .backing_src =3D DEFAULT_VM_MEM_SRC, .vcpu_memory_bytes =3D DEFAULT_PER_VCPU_MEM_SIZE, .nr_vcpus =3D 1, + .lru_gen =3D false, + .benchmark_lru_gen =3D false, }; int page_idle_fd; int opt; =20 guest_modes_append_default(); =20 - while ((opt =3D getopt(argc, argv, "hm:b:v:os:")) !=3D -1) { + while ((opt =3D getopt(argc, argv, "hm:b:v:os:lr:p")) !=3D -1) { switch (opt) { case 'm': guest_modes_cmdline(optarg); @@ -376,6 +643,15 @@ int main(int argc, char *argv[]) case 's': params.backing_src =3D parse_backing_src_type(optarg); break; + case 'l': + params.lru_gen =3D true; + break; + case 'p': + params.benchmark_lru_gen =3D true; + break; + case 'r': + cgroup_root =3D strdup(optarg); + break; case 'h': default: help(argv[0]); @@ -383,12 +659,42 @@ int main(int argc, char *argv[]) } } =20 - page_idle_fd =3D open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); - __TEST_REQUIRE(page_idle_fd >=3D 0, - "CONFIG_IDLE_PAGE_TRACKING is not enabled"); - close(page_idle_fd); + if (!params.lru_gen) { + page_idle_fd =3D open("/sys/kernel/mm/page_idle/bitmap", O_RDWR); + __TEST_REQUIRE(page_idle_fd >=3D 0, + "CONFIG_IDLE_PAGE_TRACKING is not enabled"); + close(page_idle_fd); + } else { + int lru_gen_fd, lru_gen_debug_fd; + long mglru_features; + char mglru_feature_str[8] =3D {}; + + lru_gen_fd =3D open("/sys/kernel/mm/lru_gen/enabled", O_RDONLY); + __TEST_REQUIRE(lru_gen_fd >=3D 0, + "CONFIG_LRU_GEN is not enabled"); + TEST_ASSERT(read(lru_gen_fd, &mglru_feature_str, 7) > 0, + "couldn't read lru_gen features"); + mglru_features =3D strtol(mglru_feature_str, NULL, 16); + 
__TEST_REQUIRE(mglru_features & LRU_GEN_ENABLED, + "lru_gen is not enabled"); + __TEST_REQUIRE(mglru_features & LRU_GEN_MM_WALK, + "lru_gen does not support MM_WALK"); + + lru_gen_debug_fd =3D open(DEBUGFS_LRU_GEN, O_RDWR); + __TEST_REQUIRE(lru_gen_debug_fd >=3D 0, + "Cannot access %s", DEBUGFS_LRU_GEN); + close(lru_gen_debug_fd); + } + + TEST_ASSERT(!params.benchmark_lru_gen || params.lru_gen, + "-p specified without -l"); + + if (params.lru_gen) { + create_memcg(TEST_MEMCG_NAME); + move_to_memcg(TEST_MEMCG_NAME, getpid()); + } =20 - for_each_guest_mode(run_test, ¶ms); + for_each_guest_mode(setup_vm_and_run, ¶ms); =20 return 0; } diff --git a/tools/testing/selftests/kvm/include/lru_gen_util.h b/tools/tes= ting/selftests/kvm/include/lru_gen_util.h new file mode 100644 index 000000000000..4eef8085a3cb --- /dev/null +++ b/tools/testing/selftests/kvm/include/lru_gen_util.h @@ -0,0 +1,55 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Tools for integrating with lru_gen, like parsing the lru_gen debugfs ou= tput. + * + * Copyright (C) 2024, Google LLC. + */ +#ifndef SELFTEST_KVM_LRU_GEN_UTIL_H +#define SELFTEST_KVM_LRU_GEN_UTIL_H + +#include +#include +#include + +#include "test_util.h" + +#define MAX_NR_GENS 16 /* MAX_NR_GENS in include/linux/mmzone.h */ +#define MAX_NR_NODES 4 /* Maximum number of nodes we support */ + +static const char *DEBUGFS_LRU_GEN =3D "/sys/kernel/debug/lru_gen"; + +struct generation_stats { + int gen; + long age_ms; + long nr_anon; + long nr_file; +}; + +struct node_stats { + int node; + int nr_gens; /* Number of populated gens entries. */ + struct generation_stats gens[MAX_NR_GENS]; +}; + +struct memcg_stats { + unsigned long memcg_id; + int nr_nodes; /* Number of populated nodes entries. */ + struct node_stats nodes[MAX_NR_NODES]; +}; + +void print_memcg_stats(const struct memcg_stats *stats, const char *name); + +void read_memcg_stats(struct memcg_stats *stats, const char *memcg); + +void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg); + +long sum_memcg_stats(const struct memcg_stats *stats); + +void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg); + +void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg); + +int lru_gen_find_generation(const struct memcg_stats *stats, + unsigned long total_pages); + +#endif /* SELFTEST_KVM_LRU_GEN_UTIL_H */ diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing= /selftests/kvm/lib/lru_gen_util.c new file mode 100644 index 000000000000..3c02a635a9f7 --- /dev/null +++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c @@ -0,0 +1,391 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2024, Google LLC. + */ + +#include + +#include "lru_gen_util.h" + +/* + * Tracks state while we parse memcg lru_gen stats. 
diff --git a/tools/testing/selftests/kvm/lib/lru_gen_util.c b/tools/testing/selftests/kvm/lib/lru_gen_util.c
new file mode 100644
index 000000000000..3c02a635a9f7
--- /dev/null
+++ b/tools/testing/selftests/kvm/lib/lru_gen_util.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024, Google LLC.
+ */
+
+#include <time.h>
+
+#include "lru_gen_util.h"
+
+/*
+ * Tracks state while we parse memcg lru_gen stats. The file we're parsing is
+ * structured like this (some extra whitespace elided):
+ *
+ * memcg (id) (path)
+ *   node (id)
+ *     (gen_nr) (age_in_ms) (nr_anon_pages) (nr_file_pages)
+ */
+struct memcg_stats_parse_context {
+	bool consumed; /* Whether or not this line was consumed */
+	/* Next parse handler to invoke */
+	void (*next_handler)(struct memcg_stats *,
+			     struct memcg_stats_parse_context *, char *);
+	int current_node_idx; /* Current index in nodes array */
+	const char *name; /* The name of the memcg we're looking for */
+};
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line);
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line);
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line);
+
+struct split_iterator {
+	char *str;
+	char *save;
+};
+
+static char *split_next(struct split_iterator *it)
+{
+	char *ret = strtok_r(it->str, " \t\n\r", &it->save);
+
+	it->str = NULL;
+	return ret;
+}
+
+static void memcg_stats_handle_searching(struct memcg_stats *stats,
+					 struct memcg_stats_parse_context *ctx,
+					 char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *memcg_id = split_next(&it);
+	char *memcg_name = split_next(&it);
+	char *end;
+
+	ctx->consumed = true;
+
+	if (!prefix || strcmp("memcg", prefix))
+		return; /* Not a memcg line (maybe empty), skip */
+
+	TEST_ASSERT(memcg_id && memcg_name,
+		    "malformed memcg line; no memcg id or memcg_name");
+
+	if (strcmp(memcg_name + 1, ctx->name))
+		return; /* Wrong memcg, skip */
+
+	/* Found it! */
+
+	stats->memcg_id = strtoul(memcg_id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed memcg id '%s'", memcg_id);
+	if (!stats->memcg_id)
+		return; /* Removed memcg? */
+
+	ctx->next_handler = memcg_stats_handle_in_memcg;
+}
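Concretely, the parser above walks debugfs text of this shape (illustrative
values; the real file lists one such block per memcg):

	memcg     3 /test
	 node     0
	          0    1258     15   38908
	          1     624      0   35380
	          2      96      0      24

The memcg path begins with '/', which is why memcg_stats_handle_searching()
compares memcg_name + 1 against the bare name it was given.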
+static void memcg_stats_handle_in_memcg(struct memcg_stats *stats,
+					struct memcg_stats_parse_context *ctx,
+					char *line)
+{
+	struct split_iterator it = { .str = line };
+	char *prefix = split_next(&it);
+	char *id = split_next(&it);
+	long found_node_id;
+	char *end;
+
+	ctx->consumed = true;
+	ctx->current_node_idx = -1;
+
+	if (!prefix)
+		return; /* Skip empty lines */
+
+	if (!strcmp("memcg", prefix)) {
+		/* Memcg done, found next one; stop. */
+		ctx->next_handler = NULL;
+		return;
+	} else if (strcmp("node", prefix))
+		TEST_ASSERT(false, "found malformed line after 'memcg ...', "
+			    "token: '%s'", prefix);
+
+	/* At this point we know we have a node line. Parse the ID. */
+
+	TEST_ASSERT(id, "malformed node line; no node id");
+
+	found_node_id = strtol(id, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed node id '%s'", id);
+
+	ctx->current_node_idx = stats->nr_nodes++;
+	TEST_ASSERT(ctx->current_node_idx < MAX_NR_NODES,
+		    "memcg has stats for too many nodes, max is %d",
+		    MAX_NR_NODES);
+	stats->nodes[ctx->current_node_idx].node = found_node_id;
+
+	ctx->next_handler = memcg_stats_handle_in_node;
+}
+
+static void memcg_stats_handle_in_node(struct memcg_stats *stats,
+				       struct memcg_stats_parse_context *ctx,
+				       char *line)
+{
+	/* Have to copy since we might not consume */
+	char *my_line = strdup(line);
+	struct split_iterator it = { .str = my_line };
+	char *gen, *age, *nr_anon, *nr_file;
+	struct node_stats *node_stats;
+	struct generation_stats *gen_stats;
+	char *end;
+
+	TEST_ASSERT(it.str, "failed to copy input line");
+
+	gen = split_next(&it);
+
+	/* Skip empty lines */
+	if (!gen)
+		goto out_consume;
+
+	if (!strcmp("memcg", gen) || !strcmp("node", gen)) {
+		/*
+		 * Reached next memcg or node section. Don't consume, let the
+		 * other handler deal with this.
+		 */
+		ctx->next_handler = memcg_stats_handle_in_memcg;
+		goto out;
+	}
+
+	node_stats = &stats->nodes[ctx->current_node_idx];
+	TEST_ASSERT(node_stats->nr_gens < MAX_NR_GENS,
+		    "found too many generation lines; max is %d",
+		    MAX_NR_GENS);
+	gen_stats = &node_stats->gens[node_stats->nr_gens++];
+
+	age = split_next(&it);
+	nr_anon = split_next(&it);
+	nr_file = split_next(&it);
+
+	TEST_ASSERT(age && nr_anon && nr_file,
+		    "malformed generation line; not enough tokens");
+
+	gen_stats->gen = (int)strtol(gen, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation number '%s'", gen);
+
+	gen_stats->age_ms = strtol(age, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed generation age '%s'", age);
+
+	gen_stats->nr_anon = strtol(nr_anon, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed anonymous page count '%s'",
+		    nr_anon);
+
+	gen_stats->nr_file = strtol(nr_file, &end, 10);
+	TEST_ASSERT(*end == '\0', "malformed file page count '%s'", nr_file);
+
+out_consume:
+	ctx->consumed = true;
+out:
+	free(my_line);
+}
+
+/* Pretty-print lru_gen @stats. */
+void print_memcg_stats(const struct memcg_stats *stats, const char *name)
+{
+	int node, gen;
+
+	fprintf(stderr, "stats for memcg %s (id %lu):\n",
+		name, stats->memcg_id);
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		fprintf(stderr, "\tnode %d\n", stats->nodes[node].node);
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			const struct generation_stats *gstats =
+				&stats->nodes[node].gens[gen];
+
+			fprintf(stderr,
+				"\t\tgen %d\tage_ms %ld"
+				"\tnr_anon %ld\tnr_file %ld\n",
+				gstats->gen, gstats->age_ms, gstats->nr_anon,
+				gstats->nr_file);
+		}
+	}
+}
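With the illustrative debugfs block shown earlier, print_memcg_stats()
would emit something like:

	stats for memcg test (id 3):
		node 0
			gen 0	age_ms 1258	nr_anon 15	nr_file 38908
			gen 1	age_ms 624	nr_anon 0	nr_file 35380
			gen 2	age_ms 96	nr_anon 0	nr_file 24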
+/* Re-read lru_gen debugfs information for @memcg into @stats. */
+void read_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	FILE *f;
+	ssize_t read = 0;
+	char *line = NULL;
+	size_t bufsz;
+	struct memcg_stats_parse_context ctx = {
+		.next_handler = memcg_stats_handle_searching,
+		.name = memcg,
+	};
+
+	memset(stats, 0, sizeof(struct memcg_stats));
+
+	f = fopen(DEBUGFS_LRU_GEN, "r");
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+
+	while (ctx.next_handler && (read = getline(&line, &bufsz, f)) > 0) {
+		ctx.consumed = false;
+
+		do {
+			ctx.next_handler(stats, &ctx, line);
+			if (!ctx.next_handler)
+				break;
+		} while (!ctx.consumed);
+	}
+
+	if (read < 0 && !feof(f))
+		TEST_ASSERT(false, "getline(%s) failed", DEBUGFS_LRU_GEN);
+
+	TEST_ASSERT(stats->memcg_id > 0, "Couldn't find memcg: %s\n"
+		    "Did the memcg get created in the proper mount?",
+		    memcg);
+	if (line)
+		free(line);
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
+
+/*
+ * Find all pages tracked by lru_gen for this memcg in generation @target_gen.
+ *
+ * If @target_gen is negative, look for all generations.
+ */
+static long sum_memcg_stats_for_gen(int target_gen,
+				    const struct memcg_stats *stats)
+{
+	int node, gen;
+	long total_nr = 0;
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		const struct node_stats *node_stats = &stats->nodes[node];
+
+		for (gen = 0; gen < node_stats->nr_gens; ++gen) {
+			const struct generation_stats *gen_stats =
+				&node_stats->gens[gen];
+
+			if (target_gen >= 0 && gen_stats->gen != target_gen)
+				continue;
+
+			total_nr += gen_stats->nr_anon + gen_stats->nr_file;
+		}
+	}
+
+	return total_nr;
+}
+
+/* Find all pages tracked by lru_gen for this memcg. */
+long sum_memcg_stats(const struct memcg_stats *stats)
+{
+	return sum_memcg_stats_for_gen(-1, stats);
+}
+
+/* Read the memcg stats and optionally print if this is a debug build. */
+void read_print_memcg_stats(struct memcg_stats *stats, const char *memcg)
+{
+	read_memcg_stats(stats, memcg);
+#ifdef DEBUG
+	print_memcg_stats(stats, memcg);
+#endif
+}
+
+/*
+ * Whether lru_gen aging should force page table scanning.
+ *
+ * If you want to set this to false, you will need to do eviction
+ * before doing extra aging passes.
+ */
+static const bool force_scan = true;
+
+static void run_aging_impl(unsigned long memcg_id, int node_id, int max_gen)
+{
+	FILE *f = fopen(DEBUGFS_LRU_GEN, "w");
+	char *command;
+	ssize_t sz;
+
+	TEST_ASSERT(f, "fopen(%s) failed", DEBUGFS_LRU_GEN);
+	sz = asprintf(&command, "+ %lu %d %d 1 %d\n",
+		      memcg_id, node_id, max_gen, force_scan);
+	TEST_ASSERT(sz > 0, "creating aging command failed");
+
+	pr_debug("Running aging command: %s", command);
+	if (fwrite(command, sizeof(char), sz, f) < (size_t)sz) {
+		TEST_ASSERT(false, "writing aging command %s to %s failed",
+			    command, DEBUGFS_LRU_GEN);
+	}
+
+	TEST_ASSERT(!fclose(f), "fclose(%s) failed", DEBUGFS_LRU_GEN);
+}
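The command string built above follows the aging interface documented in
Documentation/admin-guide/mm/multigen_lru.rst:

	+ memcg_id node_id max_gen [can_swap [force_scan]]

so with the illustrative stats above, run_aging_impl() would write
"+ 3 0 2 1 1\n"; the hard-coded 1 is can_swap, and the trailing field is
the force_scan constant defined just before the function.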
+static void _lru_gen_do_aging(struct memcg_stats *stats, const char *memcg,
+			      bool verbose)
+{
+	int node, gen;
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+
+	pr_debug("lru_gen: invoking aging...\n");
+
+	/* Must read memcg stats to construct the proper aging command. */
+	read_print_memcg_stats(stats, memcg);
+
+	if (verbose)
+		clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	for (node = 0; node < stats->nr_nodes; ++node) {
+		int max_gen = 0;
+
+		for (gen = 0; gen < stats->nodes[node].nr_gens; ++gen) {
+			int this_gen = stats->nodes[node].gens[gen].gen;
+
+			max_gen = max_gen > this_gen ? max_gen : this_gen;
+		}
+
+		run_aging_impl(stats->memcg_id, stats->nodes[node].node,
+			       max_gen);
+	}
+
+	if (verbose) {
+		ts_elapsed = timespec_elapsed(ts_start);
+		pr_info("%-30s: %ld.%09lds\n", "lru_gen: Aging",
+			ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+	}
+
+	/* Re-read so callers get updated information */
+	read_print_memcg_stats(stats, memcg);
+}
+
+/* Do aging, and print how long it took. */
+void lru_gen_do_aging(struct memcg_stats *stats, const char *memcg)
+{
+	_lru_gen_do_aging(stats, memcg, true);
+}
+
+/* Do aging, don't print anything. */
+void lru_gen_do_aging_quiet(struct memcg_stats *stats, const char *memcg)
+{
+	_lru_gen_do_aging(stats, memcg, false);
+}
+
+/*
+ * Find which generation contains more than half of @total_pages, assuming that
+ * such a generation exists.
+ */
+int lru_gen_find_generation(const struct memcg_stats *stats,
+			    unsigned long total_pages)
+{
+	int node, gen, gen_idx, min_gen = INT_MAX, max_gen = -1;
+
+	for (node = 0; node < stats->nr_nodes; ++node)
+		for (gen_idx = 0; gen_idx < stats->nodes[node].nr_gens;
+		     ++gen_idx) {
+			gen = stats->nodes[node].gens[gen_idx].gen;
+			max_gen = gen > max_gen ? gen : max_gen;
+			min_gen = gen < min_gen ? gen : min_gen;
+		}
+
+	for (gen = min_gen; gen < max_gen; ++gen)
+		/* See if the most pages are in this generation. */
+		if (sum_memcg_stats_for_gen(gen, stats) >
+		    total_pages / 2)
+			return gen;
+
+	TEST_ASSERT(false, "No generation includes majority of %lu pages.",
+		    total_pages);
+
+	/* unreachable, but make the compiler happy */
+	return -1;
+}
-- 
2.47.0.199.ga7371fff76-goog