From nobody Sun Feb  8 07:17:46 2026
Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com
 [209.85.128.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id DE3431E1C3D
	for <linux-kernel@vger.kernel.org>; Fri,  4 Oct 2024 19:55:46 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.128.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1728071749; cv=none;
 b=G52Vxs31ra5hefBXs8a505ZyUQtrgIcU2Idt0pAxHqjmUTv4rAkKlUriHuoOdmrmyOfuRHPa2MXXdgF9BlRUt3VofA6aavrG8yDtD9Zw+S6Eflg6JdD7ztvjPcOqGW7ojj/WWkuZTxrJlytx4/4bzeaYzMnYbC6mLfAU8bxCYeo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1728071749; c=relaxed/simple;
	bh=pxrOalPvNLDtJQpdCW/G4HAMuEe//eQGLVWKtOXkXEI=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=TLeJJxljO7sXFRPfnU6qvjqinILiDPXN7j+4kJLR3qpDSkvKEvaahUJYS8TVwywcJAmThxj3KLcBtvwime7oLrzbWrbDMwHp9FuxVWlT4gl1lfJmwX4hehGruoVTYtJlL51jzjJCcntOBpP1OdZH13JNqI/LbuCHhUXlCFZ26sQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=4yG4Ff0s; arc=none smtp.client-ip=209.85.128.202
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="4yG4Ff0s"
Received: by mail-yw1-f202.google.com with SMTP id
 00721157ae682-6e2baf2ff64so39435527b3.0
        for <linux-kernel@vger.kernel.org>;
 Fri, 04 Oct 2024 12:55:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1728071746; x=1728676546;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=hTtzmO1m4B2S+nZ7TcqJj3+RjDp8RGQCqP+9mB7XDF0=;
        b=4yG4Ff0s4xBv7dBQI/gP/p/8ZfX3okg+C7Wag62Jn+3fQ1RhDTh42BCbP20Q7bI2B6
         aDkTY+Zmia7NeRKArZHmLZeWk5MlUFX7Hb4u6EloWDUBZ7dOK8gvYQwwbciYGCGRdCMK
         7JI3k98zJ6TQ4rcW+dmHcgXv94V0YYsf6iMqW6glo7TYCDIEAsB28fm3pkdu/IFSmbbE
         m9XtUNCds6u/RtCdi+fNod76m0NA1VUvM6+YkhbvHKurxwLpXepFGsk2jKzqcTU+0fOA
         bWz8bmS7YUj39oSgsygSAsxoZ1vBW/IscDftGp0lCSnu19wEP0P4XplRR0hULk5mN+Vs
         nG0Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1728071746; x=1728676546;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=hTtzmO1m4B2S+nZ7TcqJj3+RjDp8RGQCqP+9mB7XDF0=;
        b=Pda1wbo8vMxyE7A0O1pWCdRNTN7jnTuB/gw7hk6q58kjDMJYZyBtYq+xI0h1pqXWdg
         Lidhi29gYFDEbUl+lZdsUVZ+3kHDbKSN9LC06pKSQD29HVMcmx08rK9s+VX7IAepkFMs
         7QJeisuQaYaIXKS/mGvS1CyQpxTHq75Lgstl1SlEaOrgYs5k2BPgN/qjSn18y9P/jg/A
         uIJZqwSMrCaQrrHLDycqZoZ49dot7qzYz2B0DNaOch4sp8ZNumYE2dXWc4cuSMUqc3pZ
         AatPs+1GRXgSWpMYpmv7XxBaIQiaKMdud48mV/b3m+my2j2+0NPcBCMuE8ZHt0W/7AOr
         QOSw==
X-Forwarded-Encrypted: i=1;
 AJvYcCX1FSjx/5VIqPczzBq1J+CpzXITinde3ODft5OsxM84YCtA74dMIJilvJFdCCJkqwajw4vtMrYLVmbdTuo=@vger.kernel.org
X-Gm-Message-State: AOJu0YzzCrhm08c0ZAifp51PlHQxYALtUxaB4c6h/CJ4+hfpOd3d1ist
	BLj8W/kGBEyBetDR0s85sCcaYsnFAxVuua6jGt1tvounSxMfCFQWULSBrP2opEQlWEeHi7k5Uby
	IPj/CCg==
X-Google-Smtp-Source: 
 AGHT+IEEsD5A0BwWtlIz55ngwUefLRO7WKaELLKBP0aGBOdeR0m0uLwRikmx58xSNYxITBmsz09MGuwHQ3J3
X-Received: from vipin.c.googlers.com ([35.247.89.60]) (user=vipinsh
 job=sendgmr) by 2002:a25:adc2:0:b0:e24:9584:52d3 with SMTP id
 3f1490d57ef6-e28936b99bamr6758276.2.1728071745774; Fri, 04 Oct 2024 12:55:45
 -0700 (PDT)
Date: Fri,  4 Oct 2024 12:55:38 -0700
In-Reply-To: <20241004195540.210396-1-vipinsh@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20241004195540.210396-1-vipinsh@google.com>
X-Mailer: git-send-email 2.47.0.rc0.187.ge670bccf7e-goog
Message-ID: <20241004195540.210396-2-vipinsh@google.com>
Subject: [PATCH v2 1/3] KVM: x86/mmu: Change KVM mmu shrinker to no-op
From: Vipin Sharma <vipinsh@google.com>
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: zhi.wang.linux@gmail.com, weijiang.yang@intel.com, mizhang@google.com,
	liangchen.linux@gmail.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Vipin Sharma <vipinsh@google.com>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Remove the global kvm_total_used_mmu_pages counter and the page zapping
flow from the MMU shrinker. Keep the shrinker infrastructure in place so
that future commits can reuse it to free KVM page caches. Remove the
zapped_obsolete_pages list from struct kvm_arch{} and use a local list in
kvm_zap_obsolete_pages(), since the MMU shrinker no longer uses it.

mmu_shrink_scan() is very disruptive to VMs. It picks the first VM in
the vm_list and zaps the oldest page, which is most likely an upper-level
SPTE and the most likely to be reused. Prior to the TDP MMU, this was even
more disruptive in the nested VM case, since L1 SPTEs will be the oldest
even though most of the entries are for L2 SPTEs.

As discussed in
https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com/ the shrinker
logic has not been very useful in actually keeping VMs performant and
reducing memory usage.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
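Note for readers: the effect of returning SHRINK_EMPTY from the count
callback and SHRINK_STOP from the scan callback can be pictured with a
small userspace model. This is only a sketch. The two constants mirror
include/linux/shrinker.h, but struct model_shrinker, noop_count(),
noop_scan() and model_shrink() are hypothetical names, and
model_shrink() is a simplified stand-in for the reclaim path, not the
kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror include/linux/shrinker.h; redefined here for userspace. */
#define SHRINK_STOP  (~0UL)
#define SHRINK_EMPTY (~0UL - 1)

/* Hypothetical stand-in for struct shrinker: only the two callbacks. */
struct model_shrinker {
	unsigned long (*count_objects)(void);
	unsigned long (*scan_objects)(unsigned long nr_to_scan);
};

/* No-op callbacks in the spirit of this patch. */
static unsigned long noop_count(void)
{
	return SHRINK_EMPTY;
}

static unsigned long noop_scan(unsigned long nr_to_scan)
{
	(void)nr_to_scan;
	return SHRINK_STOP;
}

/* Simplified model of reclaim driving one shrinker; returns objects freed. */
static unsigned long model_shrink(const struct model_shrinker *s,
				  unsigned long nr_to_scan)
{
	unsigned long freeable, ret;

	assert(s && s->count_objects && s->scan_objects);

	freeable = s->count_objects();
	if (freeable == 0 || freeable == SHRINK_EMPTY)
		return 0;	/* nothing to reclaim, scan is never called */

	ret = s->scan_objects(nr_to_scan);
	return ret == SHRINK_STOP ? 0 : ret;
}
```

With both callbacks wired to the no-op versions, model_shrink() reports
zero objects freed and never reaches the scan callback, which is the
behaviour this patch aims for until the next patch gives the shrinker
real work.
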
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/mmu/mmu.c          | 92 +++------------------------------
 2 files changed, 8 insertions(+), 85 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_hos=
t.h
index b0c0bc0ed813f..cbfe31bac6cf6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1309,7 +1309,6 @@ struct kvm_arch {
 	bool pre_fault_allowed;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
-	struct list_head zapped_obsolete_pages;
 	/*
 	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
 	 * replaced by an NX huge page.  A shadow page is on this list if its
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d25c2b395116e..213e46b55dda2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -179,7 +179,6 @@ struct kvm_shadow_walk_iterator {
=20
 static struct kmem_cache *pte_list_desc_cache;
 struct kmem_cache *mmu_page_header_cache;
-static struct percpu_counter kvm_total_used_mmu_pages;
=20
 static void mmu_spte_set(u64 *sptep, u64 spte);
=20
@@ -1651,27 +1650,15 @@ static void kvm_mmu_check_sptes_at_free(struct kvm_=
mmu_page *sp)
 #endif
 }
=20
-/*
- * This value is the sum of all of the kvm instances's
- * kvm->arch.n_used_mmu_pages values.  We need a global,
- * aggregate version in order to make the slab shrinker
- * faster
- */
-static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
-{
-	kvm->arch.n_used_mmu_pages +=3D nr;
-	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
-}
-
 static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	kvm_mod_used_mmu_pages(kvm, +1);
+	kvm->arch.n_used_mmu_pages++;
 	kvm_account_pgtable_pages((void *)sp->spt, +1);
 }
=20
 static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *s=
p)
 {
-	kvm_mod_used_mmu_pages(kvm, -1);
+	kvm->arch.n_used_mmu_pages--;
 	kvm_account_pgtable_pages((void *)sp->spt, -1);
 }
=20
@@ -6338,6 +6325,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
 	struct kvm_mmu_page *sp, *node;
 	int nr_zapped, batch =3D 0;
+	LIST_HEAD(invalid_list);
 	bool unstable;
=20
 restart:
@@ -6371,7 +6359,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 		}
=20
 		unstable =3D __kvm_mmu_prepare_zap_page(kvm, sp,
-				&kvm->arch.zapped_obsolete_pages, &nr_zapped);
+				&invalid_list, &nr_zapped);
 		batch +=3D nr_zapped;
=20
 		if (unstable)
@@ -6387,7 +6375,7 @@ static void kvm_zap_obsolete_pages(struct kvm *kvm)
 	 * kvm_mmu_load()), and the reload in the caller ensure no vCPUs are
 	 * running with an obsolete MMU.
 	 */
-	kvm_mmu_commit_zap_page(kvm, &kvm->arch.zapped_obsolete_pages);
+	kvm_mmu_commit_zap_page(kvm, &invalid_list);
 }
=20
 /*
@@ -6450,16 +6438,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
 }
=20
-static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
-{
-	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
-}
-
 void kvm_mmu_init_vm(struct kvm *kvm)
 {
 	kvm->arch.shadow_mmio_value =3D shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
-	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
=20
@@ -7015,65 +6997,13 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm,=
 u64 gen)
 static unsigned long mmu_shrink_scan(struct shrinker *shrink,
 				     struct shrink_control *sc)
 {
-	struct kvm *kvm;
-	int nr_to_scan =3D sc->nr_to_scan;
-	unsigned long freed =3D 0;
-
-	mutex_lock(&kvm_lock);
-
-	list_for_each_entry(kvm, &vm_list, vm_list) {
-		int idx;
-
-		/*
-		 * Never scan more than sc->nr_to_scan VM instances.
-		 * Will not hit this condition practically since we do not try
-		 * to shrink more than one VM and it is very unlikely to see
-		 * !n_used_mmu_pages so many times.
-		 */
-		if (!nr_to_scan--)
-			break;
-		/*
-		 * n_used_mmu_pages is accessed without holding kvm->mmu_lock
-		 * here. We may skip a VM instance errorneosly, but we do not
-		 * want to shrink a VM that only started to populate its MMU
-		 * anyway.
-		 */
-		if (!kvm->arch.n_used_mmu_pages &&
-		    !kvm_has_zapped_obsolete_pages(kvm))
-			continue;
-
-		idx =3D srcu_read_lock(&kvm->srcu);
-		write_lock(&kvm->mmu_lock);
-
-		if (kvm_has_zapped_obsolete_pages(kvm)) {
-			kvm_mmu_commit_zap_page(kvm,
-			      &kvm->arch.zapped_obsolete_pages);
-			goto unlock;
-		}
-
-		freed =3D kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan);
-
-unlock:
-		write_unlock(&kvm->mmu_lock);
-		srcu_read_unlock(&kvm->srcu, idx);
-
-		/*
-		 * unfair on small ones
-		 * per-vm shrinkers cry out
-		 * sadness comes quickly
-		 */
-		list_move_tail(&kvm->vm_list, &vm_list);
-		break;
-	}
-
-	mutex_unlock(&kvm_lock);
-	return freed;
+	return SHRINK_STOP;
 }
=20
 static unsigned long mmu_shrink_count(struct shrinker *shrink,
 				      struct shrink_control *sc)
 {
-	return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
+	return SHRINK_EMPTY;
 }
=20
 static struct shrinker *mmu_shrinker;
@@ -7204,12 +7134,9 @@ int kvm_mmu_vendor_module_init(void)
 	if (!mmu_page_header_cache)
 		goto out;
=20
-	if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL))
-		goto out;
-
 	mmu_shrinker =3D shrinker_alloc(0, "x86-mmu");
 	if (!mmu_shrinker)
-		goto out_shrinker;
+		goto out;
=20
 	mmu_shrinker->count_objects =3D mmu_shrink_count;
 	mmu_shrinker->scan_objects =3D mmu_shrink_scan;
@@ -7219,8 +7146,6 @@ int kvm_mmu_vendor_module_init(void)
=20
 	return 0;
=20
-out_shrinker:
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
 out:
 	mmu_destroy_caches();
 	return ret;
@@ -7237,7 +7162,6 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 void kvm_mmu_vendor_module_exit(void)
 {
 	mmu_destroy_caches();
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
 	shrinker_free(mmu_shrinker);
 }
=20
--=20
2.47.0.rc0.187.ge670bccf7e-goog
From nobody Sun Feb  8 07:17:46 2026
Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com
 [209.85.216.73])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F62C1E1C3E
	for <linux-kernel@vger.kernel.org>; Fri,  4 Oct 2024 19:55:48 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.216.73
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1728071750; cv=none;
 b=n3Bgm8K35TWdZCVQ3GU91tJD17H5PjgHT+euggQ8RVlMsX2si8eCSoDWE1YQ/7iqUU9aUiVfMFOxe7uNYyUmh3yl46drsg08SSmQsrYRJRE8Sx4UZHwEBjrrrwt2jKb45ZRIHlNThBwNak2iZFlKRpxqM56YSUZgk+Sbi+8Ileg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1728071750; c=relaxed/simple;
	bh=DYdRMJkwfHq1EeSmBOjN2a31jLCtFMQDVwgOph06U6s=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=njrvDhAWkwMI+UBGzm1dmA9ZUHJfDxsJy4ROcDPoD3rFq6avNeROQApAsXtlYUtzn4EU02PDvgge5JAKD0N8uT8fLIYHtrzKSwhg4wzM/S+D8H5jWHFAum/uYXt8+4w7/AsPotuUoOYfOeX87oci1F4UKwZbERHmBG/82Fzm4TY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=l5+RxsgN; arc=none smtp.client-ip=209.85.216.73
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="l5+RxsgN"
Received: by mail-pj1-f73.google.com with SMTP id
 98e67ed59e1d1-2e06f5ec69cso3266652a91.1
        for <linux-kernel@vger.kernel.org>;
 Fri, 04 Oct 2024 12:55:48 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1728071748; x=1728676548;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=wOQiKDWxKwvHSTq+X53UtqBcRoF0/VfzwQic5qnhwMQ=;
        b=l5+RxsgNwm/YWWSXNflzk3F+davH+NLAexwiJr3kPOVpZgX1gDusJkLMnZpLS9GMXS
         n1v6P0Y+OscPL3mm2hrXL/y10h+uQd3DesVzbBBbYY6IZSQ7PmFX3rMs1hM1ovDAZQkI
         iM6vxloUG2zhC/GmP0v1IJVaWhHPhgF8iVx/bGjp0vIJ6pmUd0KgujNlnCDHPiSkocbm
         otxUybCt3+9QyFRoxgDp46o7TPzIQ3UGPfAzcBVZyrRAcHVBAjto3Oret4lbTrwRUZiU
         A3WdRCy+L704pzjg07433Bu8iC4QkEfBqXpcdDNST8nWC/UL+GZQyUcqpmyVc40NQ5nj
         gmdA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1728071748; x=1728676548;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=wOQiKDWxKwvHSTq+X53UtqBcRoF0/VfzwQic5qnhwMQ=;
        b=bmgTBToRf7bv6kuBXRzdiQaOdgVLE1S6F4YYxbpRkNmDKnRo6dYnjmEKjcslhY4e7m
         sTtq2eDn6U1/TWNAMCM5I6EzsRPfgOD1sAfTyQdH4ipSMbOKLTw0U/LBniL1B0lJ/VYT
         P7IjN2Z+gP8Cvu4TRA50JpWP7H4zBlCw7lZu6qsRVh8GAvlmM38vWjfuxnEtLtjgZ9TJ
         N4Hz4cZe7uam+EFZi5Q0c/wgfQ4eoL9WFaJ48IAZJLqXo4+BxPknzLUISUNQCcdim5Tm
         DcSgjLn9p6MDTG652gUH/cpvtPOUC2bjddsy2LBsDaXtAPrs53q1/x6syufejSwzOm5a
         xyTA==
X-Forwarded-Encrypted: i=1;
 AJvYcCUk/p6vMuDGJ7CE1IHAIRgFcLTbp6CZL/lmIrRgrmGOgcUYptjrxkSb7nfDW/IGhAaTQohrAYgP8O3lnhU=@vger.kernel.org
X-Gm-Message-State: AOJu0Yxahkniv/0rcXkZp0tiO/QaBM5oXJ9dDmFVC7PIob0n8wo8o66C
	RKUuKVzNx0ezBTC3QfS3UqfbwsestLufA5AqmuUo7P7BYPJohvpORLVDmvMsqpZ57SBzLPEbrUM
	iLYzZxQ==
X-Google-Smtp-Source: 
 AGHT+IHPE5k5SFpXLK6Z34zLxu8yWs3OwTfU0LRLmjyctWLUAce3eV/NaZe8zBYNUajJi9VaxaEPwfPjjn77
X-Received: from vipin.c.googlers.com ([35.247.89.60]) (user=vipinsh
 job=sendgmr) by 2002:a17:90a:ec08:b0:2d8:8c74:7088 with SMTP id
 98e67ed59e1d1-2e1e501d358mr37694a91.0.1728071747652; Fri, 04 Oct 2024
 12:55:47 -0700 (PDT)
Date: Fri,  4 Oct 2024 12:55:39 -0700
In-Reply-To: <20241004195540.210396-1-vipinsh@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20241004195540.210396-1-vipinsh@google.com>
X-Mailer: git-send-email 2.47.0.rc0.187.ge670bccf7e-goog
Message-ID: <20241004195540.210396-3-vipinsh@google.com>
Subject: [PATCH v2 2/3] KVM: x86/mmu: Use MMU shrinker to shrink KVM MMU
 memory caches
From: Vipin Sharma <vipinsh@google.com>
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: zhi.wang.linux@gmail.com, weijiang.yang@intel.com, mizhang@google.com,
	liangchen.linux@gmail.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Vipin Sharma <vipinsh@google.com>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Use the MMU shrinker to iterate through all the vCPUs of all the VMs and
free pages allocated in MMU memory caches. Protect cache allocation in
the page fault and MMU load paths from the MMU shrinker with a per-vCPU
mutex. In the MMU shrinker, move each iterated VM to the end of the VM
list so that the pain of emptying caches is spread across other VMs too.

The specific caches to empty are mmu_shadow_page_cache and
mmu_shadowed_info_cache, as these caches store whole pages. Emptying them
frees more memory for the shrinker than emptying other caches such as
mmu_pte_list_desc_cache{} and mmu_page_header_cache{}.

Holding the per-vCPU mutex ensures that a vCPU is not surprised to find
its caches emptied after filling them for page table allocations during
page fault handling and MMU load. Because the mutex is per vCPU, the only
contention is between the MMU shrinker and the vCPU that owns the lock,
never between vCPUs. This should result in very little contention.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
---
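Note for readers: the locking scheme described above (the vCPU takes its
mutex around top-up and use, the shrinker only ever trylocks and skips
busy vCPUs) can be modelled in userspace with pthreads. This is a sketch
under stated assumptions, not the kernel code: struct model_cache,
struct model_vcpu, CACHE_CAPACITY, model_topup(), model_empty() and
model_shrink_scan() are all hypothetical names, and malloc()/free()
stand in for the page allocator.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define CACHE_CAPACITY 40	/* arbitrary size for this sketch */

/* Hypothetical stand-in for struct kvm_mmu_memory_cache. */
struct model_cache {
	int nobjs;
	void *objects[CACHE_CAPACITY];
};

/* Hypothetical stand-in for the per-vCPU state touched by this patch. */
struct model_vcpu {
	pthread_mutex_t cache_lock;
	struct model_cache shadow_page_cache;
};

/* Fill the cache to capacity; caller holds cache_lock. Returns 0 or -1. */
static int model_topup(struct model_cache *mc)
{
	while (mc->nobjs < CACHE_CAPACITY) {
		void *obj = malloc(64);

		if (!obj)
			return -1;
		mc->objects[mc->nobjs++] = obj;
	}
	return 0;
}

/* Free every cached object and report how many were freed. */
static int model_empty(struct model_cache *mc)
{
	int freed = mc->nobjs;

	assert(mc->nobjs >= 0 && mc->nobjs <= CACHE_CAPACITY);
	while (mc->nobjs)
		free(mc->objects[--mc->nobjs]);
	return freed;
}

/* Shrinker side: skip any vCPU whose lock is held instead of waiting. */
static unsigned long model_shrink_scan(struct model_vcpu *vcpus, int nr_vcpus,
				       unsigned long nr_to_scan)
{
	unsigned long freed = 0;

	for (int i = 0; i < nr_vcpus; i++) {
		if (pthread_mutex_trylock(&vcpus[i].cache_lock) != 0)
			continue;	/* vCPU is busy with a fault or load */
		freed += model_empty(&vcpus[i].shadow_page_cache);
		pthread_mutex_unlock(&vcpus[i].cache_lock);
		if (freed >= nr_to_scan)
			break;
	}
	return freed;
}
```

A vCPU that holds its lock while handling a fault keeps its filled cache
intact; the scan simply moves on to the next vCPU, so the shrinker never
blocks behind a fault and a fault never sees its cache drained midway.
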
 arch/x86/include/asm/kvm_host.h |  6 +++
 arch/x86/kvm/mmu/mmu.c          | 69 +++++++++++++++++++++++++++------
 arch/x86/kvm/mmu/paging_tmpl.h  | 14 ++++---
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  8 +++-
 5 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_hos=
t.h
index cbfe31bac6cf6..63eaf03111ebb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -811,6 +811,12 @@ struct kvm_vcpu_arch {
 	 */
 	struct kvm_mmu *walk_mmu;
=20
+	/*
+	 * Protect caches from being emptied by the MMU shrinker while the vCPU
+	 * might use them for fault handling or loading the MMU.  As this is a
+	 * per-vCPU lock, contention can happen only when the MMU shrinker runs.
+	 */
+	struct mutex mmu_memory_cache_lock;
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 213e46b55dda2..8e2935347615d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4524,29 +4524,33 @@ static int direct_page_fault(struct kvm_vcpu *vcpu,=
 struct kvm_page_fault *fault
 	if (r !=3D RET_PF_INVALID)
 		return r;
=20
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r =3D mmu_topup_memory_caches(vcpu, false);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	r =3D kvm_faultin_pfn(vcpu, fault, ACC_ALL);
 	if (r !=3D RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	r =3D RET_PF_RETRY;
 	write_lock(&vcpu->kvm->mmu_lock);
=20
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
=20
 	r =3D make_mmu_pages_available(vcpu);
 	if (r)
-		goto out_unlock;
+		goto out_mmu_unlock;
=20
 	r =3D direct_map(vcpu, fault);
=20
-out_unlock:
+out_mmu_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+
 	return r;
 }
=20
@@ -4617,25 +4621,28 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *=
vcpu,
 	if (r !=3D RET_PF_INVALID)
 		return r;
=20
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r =3D mmu_topup_memory_caches(vcpu, false);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	r =3D kvm_faultin_pfn(vcpu, fault, ACC_ALL);
 	if (r !=3D RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	r =3D RET_PF_RETRY;
 	read_lock(&vcpu->kvm->mmu_lock);
=20
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
=20
 	r =3D kvm_tdp_mmu_map(vcpu, fault);
=20
-out_unlock:
+out_mmu_unlock:
 	read_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
 	return r;
 }
 #endif
@@ -5691,6 +5698,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 {
 	int r;
=20
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r =3D mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
 	if (r)
 		goto out;
@@ -5717,6 +5725,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 	 */
 	kvm_x86_call(flush_tlb_current)(vcpu);
 out:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
 	return r;
 }
=20
@@ -6303,6 +6312,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.mmu_shadow_page_cache.init_value)
 		vcpu->arch.mmu_shadow_page_cache.gfp_zero =3D __GFP_ZERO;
=20
+	mutex_init(&vcpu->arch.mmu_memory_cache_lock);
 	vcpu->arch.mmu =3D &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu =3D &vcpu->arch.root_mmu;
=20
@@ -6997,13 +7007,50 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm,=
 u64 gen)
 static unsigned long mmu_shrink_scan(struct shrinker *shrink,
 				     struct shrink_control *sc)
 {
-	return SHRINK_STOP;
+	struct kvm *kvm, *next_kvm, *first_kvm =3D NULL;
+	unsigned long i, freed =3D 0;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm_lock);
+	list_for_each_entry_safe(kvm, next_kvm, &vm_list, vm_list) {
+		if (!first_kvm)
+			first_kvm =3D kvm;
+		else if (first_kvm =3D=3D kvm)
+			break;
+
+		list_move_tail(&kvm->vm_list, &vm_list);
+
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			if (!mutex_trylock(&vcpu->arch.mmu_memory_cache_lock))
+				continue;
+			freed +=3D kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadow_page_cache=
);
+			freed +=3D kvm_mmu_empty_memory_cache(&vcpu->arch.mmu_shadowed_info_cac=
he);
+			mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+			if (freed >=3D sc->nr_to_scan)
+				goto out;
+		}
+	}
+out:
+	mutex_unlock(&kvm_lock);
+	return freed;
 }
=20
 static unsigned long mmu_shrink_count(struct shrinker *shrink,
 				      struct shrink_control *sc)
 {
-	return SHRINK_EMPTY;
+	unsigned long i, count =3D 0;
+	struct kvm_vcpu *vcpu;
+	struct kvm *kvm;
+
+	mutex_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			count +=3D READ_ONCE(vcpu->arch.mmu_shadow_page_cache.nobjs);
+			count +=3D READ_ONCE(vcpu->arch.mmu_shadowed_info_cache.nobjs);
+		}
+	}
+	mutex_unlock(&kvm_lock);
+	return !count ? SHRINK_EMPTY : count;
 }
=20
 static struct shrinker *mmu_shrinker;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 405bd7ceee2a3..084a5c532078f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -809,13 +809,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, s=
truct kvm_page_fault *fault
 		return RET_PF_EMULATE;
 	}
=20
+	mutex_lock(&vcpu->arch.mmu_memory_cache_lock);
 	r =3D mmu_topup_memory_caches(vcpu, true);
 	if (r)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	r =3D kvm_faultin_pfn(vcpu, fault, walker.pte_access);
 	if (r !=3D RET_PF_CONTINUE)
-		return r;
+		goto out_mmu_memory_cache_unlock;
=20
 	/*
 	 * Do not change pte_access if the pfn is a mmio page, otherwise
@@ -840,16 +841,19 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, s=
truct kvm_page_fault *fault
 	write_lock(&vcpu->kvm->mmu_lock);
=20
 	if (is_page_fault_stale(vcpu, fault))
-		goto out_unlock;
+		goto out_mmu_unlock;
=20
 	r =3D make_mmu_pages_available(vcpu);
 	if (r)
-		goto out_unlock;
+		goto out_mmu_unlock;
 	r =3D FNAME(fetch)(vcpu, fault, &walker);
=20
-out_unlock:
+out_mmu_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
 	kvm_release_pfn_clean(fault->pfn);
+out_mmu_memory_cache_unlock:
+	mutex_unlock(&vcpu->arch.mmu_memory_cache_lock);
+
 	return r;
 }
=20
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b23c6d48392f7..288e503f14a0b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1446,6 +1446,7 @@ void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capa=
city, int min);
 int kvm_mmu_memory_cache_nr_free_objects(struct kvm_mmu_memory_cache *mc);
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc);
 void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb2b78e92910f..5d89ca218791b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -451,15 +451,21 @@ int kvm_mmu_memory_cache_nr_free_objects(struct kvm_m=
mu_memory_cache *mc)
 	return mc->nobjs;
 }
=20
-void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+int kvm_mmu_empty_memory_cache(struct kvm_mmu_memory_cache *mc)
 {
+	int freed =3D mc->nobjs;
 	while (mc->nobjs) {
 		if (mc->kmem_cache)
 			kmem_cache_free(mc->kmem_cache, mc->objects[--mc->nobjs]);
 		else
 			free_page((unsigned long)mc->objects[--mc->nobjs]);
 	}
+	return freed;
+}
=20
+void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
+{
+	kvm_mmu_empty_memory_cache(mc);
 	kvfree(mc->objects);
=20
 	mc->objects =3D NULL;
--=20
2.47.0.rc0.187.ge670bccf7e-goog
From nobody Sun Feb  8 07:17:46 2026
Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com
 [209.85.210.202])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CC561E2306
	for <linux-kernel@vger.kernel.org>; Fri,  4 Oct 2024 19:55:50 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.210.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1728071751; cv=none;
 b=qJ/GVHclIYexc+vHBK1Wq/lZTIkGMVptluJrUe+SP1gnmptoKoU+Xunwldpc9TEGvWHaZn7UnWvGkzh5kzDOicgy5exOA3hbn1K/eG3aP/ZCwtsgixTi1BTKh7Wn/QBsegASO9N2Ai7Kh3SISU9HPNEqve28sRlthHMiKq7AOUE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1728071751; c=relaxed/simple;
	bh=s/TK3RBzGYb24+3inpEnxECyw5ok/NiB3oxf3jwxHGE=;
	h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From:
	 To:Cc:Content-Type;
 b=JPRTp4BFaLUTcbZChVzZzNy5xCjyFzzf51eU661+Xj0B44jB+p4dDTO7nO0b631dqyqtxeZaGS5VWdBiggl+5CZr2yRPfab4K/HMAxczuJECnhGM9brmVZ2U7wOVkYNAQmA+75Pl66dkZs9kh+MSVcppGBe+tbjR+LS1ejZJG/w=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com;
 dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b=2CUIwt4T; arc=none smtp.client-ip=209.85.210.202
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--vipinsh.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="2CUIwt4T"
Received: by mail-pf1-f202.google.com with SMTP id
 d2e1a72fcca58-71da742892aso3057801b3a.1
        for <linux-kernel@vger.kernel.org>;
 Fri, 04 Oct 2024 12:55:50 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1728071749; x=1728676549;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=AyDR+uj3DUNXBrtUhe7IJycvNyVdH4HzZ+qoUtsI/QE=;
        b=2CUIwt4T32J964o9lYs0aKviHCPV6F16jb766DLuDO4aFmsszZm5dYAPl8dP8wRCvd
         9NtbIut3VQpgSc/ag0eMgR2Fzm33d3brTjGxGfglwP1baObLoUG5gOk8dUnIuKgrl9O0
         eYqFBG0jywvfZk7pU3KnP0FGrP5bi/06W7xAiI0T5wTJZC2bBWl0+1YhSwn4CWenKe7I
         LgeRcU0idsDlnJelHhukBFAV5dA2MUuFpxmYKj938+ydiDDJD4Dk1swCTAb7/rD6/Eye
         /jl1iikNgpd5SsRS/TGX7IJ/RHBX7MEkDRAVRioTAQjoXGyI57uqh/DRSwbou5Zx4Cfd
         8veA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1728071749; x=1728676549;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=AyDR+uj3DUNXBrtUhe7IJycvNyVdH4HzZ+qoUtsI/QE=;
        b=Kp+W3tJ39BIiGV1ZWE1qIGStRk1nUHu3AmjN3sosXtCAMQ7de7cgNHlO1jf5O2P2VU
         mXSqhatkeo0+lwNOPLi1qR6u7gn/a1KK/V8LC5Fsk4snp/Y4YR0mW/iYxqhU1hq8Uypw
         UMyJ2q/72OTe2oV739Ba4oCoE9FbM3NEBcaH/OUtt0JE72v1nEcNtpF41MOKpOeRfC/m
         Bb9wpRlF4rfgUhHamkIMwK0HAawsJCs05CwOQUNCE7+4hHd+Fmi8auP1DGneBImVUgv6
         hKV+TEaMaHgjZpo7l0mtsNS6ZWIVIitB8nyNJah2d+IhhMCUoHubWsbwRa1URSQAyFH/
         a/fA==
X-Forwarded-Encrypted: i=1;
 AJvYcCUqJa+46+EK6Pcm2ws5lWYQ3TY2+e1DgcRIDnR65YzvT0c7OZNfcddbb2LxKxDcOEYwWZ+1tzqqLMGXCSg=@vger.kernel.org
X-Gm-Message-State: AOJu0YzXp9m/NF7Iv2+mk4MRy8sbYW6Pbcfq47aa8k/HCyhB3Ugryzt+
	XNROTRX+K33YXgi+9nXRSAJBsLCSxe1zC6wWKRHsPB/xihlBKUoJwRqyIIChY+s/KQHJQw7WexY
	sJM1WbA==
X-Google-Smtp-Source: 
 AGHT+IEPqABrZbInB9iF+3YhRj7tIobE3sERHXQs5uQYKTkhSk5zgLJdd7oOaPB+U/5x7r0jRQxnAieri8NH
X-Received: from vipin.c.googlers.com ([34.105.13.176]) (user=vipinsh
 job=sendgmr) by 2002:a05:6a00:3199:b0:71d:ec11:1214 with SMTP id
 d2e1a72fcca58-71dec11129amr5983b3a.0.1728071749432; Fri, 04 Oct 2024 12:55:49
 -0700 (PDT)
Date: Fri,  4 Oct 2024 12:55:40 -0700
In-Reply-To: <20241004195540.210396-1-vipinsh@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20241004195540.210396-1-vipinsh@google.com>
X-Mailer: git-send-email 2.47.0.rc0.187.ge670bccf7e-goog
Message-ID: <20241004195540.210396-4-vipinsh@google.com>
Subject: [PATCH v2 3/3] KVM: selftests: Add a test to invoke MMU shrinker on
 KVM VMs
From: Vipin Sharma <vipinsh@google.com>
To: seanjc@google.com, pbonzini@redhat.com, dmatlack@google.com
Cc: zhi.wang.linux@gmail.com, weijiang.yang@intel.com, mizhang@google.com,
	liangchen.linux@gmail.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Vipin Sharma <vipinsh@google.com>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Add a test which invokes the KVM MMU shrinker to free caches used by
vCPUs while a VM is running. Find the KVM MMU shrinker's location in
shrinker_debugfs and write to its scan file to fire the shrinker. Provide
options to specify the number of vCPUs, the memory accessed by each vCPU,
the number of shrinker scan iterations, and the delay in milliseconds
between two consecutive shrinker calls.

If shrinker_debugfs is not mounted, exit with a soft skip.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
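Note for readers: one way to locate a debugfs mount from userspace is to
walk the mount table with getmntent(). The sketch below illustrates that
general technique only. model_find_debugfs_root() is a hypothetical
helper; it takes an already opened mount table stream so it can be
exercised without a real debugfs mount, and it is not claimed to match
the find_debugfs_root() helper added by this patch.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan a mount table stream (same format as /proc/mounts) for the first
 * debugfs entry and copy its mount point into path. Returns 0 on success,
 * -1 if no debugfs entry exists or the mount point does not fit in path.
 */
static int model_find_debugfs_root(FILE *mounts, char *path, size_t size)
{
	struct mntent *ent;

	assert(mounts && path && size > 0);

	while ((ent = getmntent(mounts)) != NULL) {
		if (strcmp(ent->mnt_type, "debugfs") != 0)
			continue;
		if (strlen(ent->mnt_dir) >= size)
			return -1;
		strcpy(path, ent->mnt_dir);
		return 0;
	}
	return -1;
}
```

A caller would typically open the table with setmntent("/proc/mounts",
"r"), pass the stream to the helper, and close it with endmntent(). A
return of -1 is the case where the test would skip rather than fail.
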
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/include/test_util.h |   5 +
 tools/testing/selftests/kvm/lib/test_util.c   |  51 ++++
 .../selftests/kvm/x86_64/mmu_shrinker_test.c  | 269 ++++++++++++++++++
 4 files changed, 326 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/mmu_shrinker_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests=
/kvm/Makefile
index 45cb70c048bb7..a0119e44a37f6 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 +=3D x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 +=3D x86_64/hyperv_tlb_flush
 TEST_GEN_PROGS_x86_64 +=3D x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 +=3D x86_64/kvm_pv_test
+TEST_GEN_PROGS_x86_64 +=3D x86_64/mmu_shrinker_test
 TEST_GEN_PROGS_x86_64 +=3D x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 +=3D x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 +=3D x86_64/platform_info_test
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testin=
g/selftests/kvm/include/test_util.h
index 3e473058849ff..7fb530fbc9ce3 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -218,4 +218,9 @@ char *strdup_printf(const char *fmt, ...) __attribute__=
((format(printf, 1, 2), n
=20
 char *sys_get_cur_clocksource(void);
=20
+int find_debugfs_root(char *debugfs_path, size_t size);
+int find_debugfs_subsystem_path(const char *subsystem,
+				char *debugfs_subsystem_path,
+				size_t max_path_size);
+
 #endif /* SELFTEST_KVM_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/se=
lftests/kvm/lib/test_util.c
index 8ed0b74ae8373..abd2837b5d69d 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -4,6 +4,7 @@
  *
  * Copyright (C) 2020, Google LLC.
  */
+#include <linux/limits.h>
 #include <stdio.h>
 #include <stdarg.h>
 #include <assert.h>
@@ -15,6 +16,8 @@
 #include <sys/syscall.h>
 #include <linux/mman.h>
 #include "linux/kernel.h"
+#include <mntent.h>
+#include <string.h>
=20
 #include "test_util.h"
=20
@@ -415,3 +418,51 @@ char *sys_get_cur_clocksource(void)
=20
 	return clk_name;
 }
+
+int find_debugfs_root(char *debugfs_path, size_t size)
+{
+	FILE *mtab =3D setmntent("/etc/mtab", "r");
+	struct mntent *mntent;
+	int r =3D -ENOENT;
+
+	if (!mtab)
+		return r;
+
+	while ((mntent =3D getmntent(mtab))) {
+		if (strcmp("debugfs", mntent->mnt_type))
+			continue;
+
+		if (strlen(mntent->mnt_dir) >=3D size) {
+			r =3D -EOVERFLOW;
+		} else {
+			strcpy(debugfs_path, mntent->mnt_dir);
+			r =3D 0;
+		}
+		break;
+	}
+
+	endmntent(mtab);
+	return r;
+}
+
+int find_debugfs_subsystem_path(const char *subsystem,
+				char *debugfs_subsystem_path,
+				size_t max_path_size)
+{
+	char debugfs_path[PATH_MAX];
+	int ret;
+
+	ret =3D find_debugfs_root(debugfs_path, PATH_MAX);
+	if (ret)
+		return ret;
+
+	/* Add extra 1 for separator "/". */
+	if (strlen(debugfs_path) + 1 + strlen(subsystem) >=3D max_path_size)
+		return -EOVERFLOW;
+
+	strcpy(debugfs_subsystem_path, debugfs_path);
+	strcat(debugfs_subsystem_path, "/");
+	strcat(debugfs_subsystem_path, subsystem);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/kvm/x86_64/mmu_shrinker_test.c b/tools=
/testing/selftests/kvm/x86_64/mmu_shrinker_test.c
new file mode 100644
index 0000000000000..81dd745bcebdb
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/mmu_shrinker_test.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU shrinker test
+ *
+ * Test MMU shrinker invocation on VMs. This test needs a kernel built
+ * with shrinker debugfs support and debugfs mounted. Generally the
+ * shrinker directory is /sys/kernel/debug/shrinker.
+ *
+ * The test keeps adding and removing memslots while the guest accesses
+ * memory, so that vCPUs keep taking page faults and refilling the caches
+ * used to handle them. It also invokes the shrinker after each memslot
+ * change, which races with the vCPUs to empty those caches.
+ * Copyright 2010 Google LLC
+ *
+ */
+
+#include "guest_modes.h"
+#include "kvm_util.h"
+#include "memstress.h"
+#include "test_util.h"
+#include "ucall_common.h"
+
+#include <dirent.h>
+#include <error.h>
+#include <fnmatch.h>
+#include <kselftest.h>
+#include <linux/limits.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#define SHRINKER_DIR "shrinker"
+#define KVM_MMU_SHRINKER_PREFIX "x86-mmu-*"
+#define SHRINKER_SCAN_FILE "scan"
+#define DUMMY_MEMSLOT_INDEX 10
+#define DEFAULT_MMU_SHRINKER_ITERATIONS 5
+#define DEFAULT_MMU_SHRINKER_VCPUS 2
+#define DEFAULT_MMU_SHRINKER_DELAY_MS 100
+
+struct test_params {
+	uint64_t iterations;
+	uint64_t guest_percpu_mem_size;
+	int delay_ms;
+	int nr_vcpus;
+	char kvm_shrink_scan_file[PATH_MAX];
+};
+
+static int filter(const struct dirent *dir)
+{
+	return !fnmatch(KVM_MMU_SHRINKER_PREFIX, dir->d_name, 0);
+}
+
+static int find_kvm_shrink_scan_path(const char *shrinker_path,
+				     char *kvm_shrinker_path, size_t size)
+{
+	struct dirent **dirs =3D NULL;
+	int ret =3D 0;
+	size_t len;
+	int n;
+
+	n =3D scandir(shrinker_path, &dirs, filter, NULL);
+	if (n =3D=3D -1) {
+		return -errno;
+	} else if (n !=3D 1) {
+		pr_info("Expected one x86-mmu shrinker but found %d\n", n);
+		ret =3D -ENOTSUP;
+		goto out;
+	}
+
+	len =3D strnlen(shrinker_path, PATH_MAX) +
+	      1 + /* For path separator '/' */
+	      strnlen(dirs[0]->d_name, PATH_MAX) +
+	      1 + /* For path separator '/' */
+	      strnlen(SHRINKER_SCAN_FILE, PATH_MAX);
+
+	if (len >=3D size) {
+		ret =3D -EOVERFLOW;
+		goto out;
+	}
+
+	strcpy(kvm_shrinker_path, shrinker_path);
+	strcat(kvm_shrinker_path, "/");
+	strcat(kvm_shrinker_path, dirs[0]->d_name);
+	strcat(kvm_shrinker_path, "/");
+	strcat(kvm_shrinker_path, SHRINKER_SCAN_FILE);
+
+out:
+	while (n > 0)
+		free(dirs[--n]);
+	free(dirs);
+	return ret;
+}
+
+static void find_and_validate_kvm_shrink_scan_file(char *kvm_mmu_shrink_sc=
an_file, size_t size)
+{
+	char shrinker_path[PATH_MAX];
+	int ret;
+
+	ret =3D find_debugfs_subsystem_path(SHRINKER_DIR, shrinker_path, PATH_MAX=
);
+	if (ret =3D=3D -ENOENT) {
+		pr_info("Cannot find debugfs, error (%d - %s). Skipping the test.\n",
+			-ret, strerror(-ret));
+		exit(KSFT_SKIP);
+	} else if (ret) {
+		exit(-ret);
+	}
+
+	ret =3D find_kvm_shrink_scan_path(shrinker_path, kvm_mmu_shrink_scan_file=
, size);
+	if (ret =3D=3D -ENOENT) {
+		pr_info("Cannot find kvm shrinker debugfs path, error (%d - %s). Skippin=
g the test.\n",
+			-ret, strerror(-ret));
+		exit(KSFT_SKIP);
+	} else if (ret) {
+		exit(-ret);
+	}
+
+	if (access(kvm_mmu_shrink_scan_file, W_OK))
+		exit(errno);
+
+	pr_info("Got KVM MMU shrink scan file at: %s\n",
+		kvm_mmu_shrink_scan_file);
+}
+
+static int invoke_kvm_mmu_shrinker_scan(struct kvm_vm *vm,
+					const char *kvm_shrink_scan_file,
+					uint64_t iterations, int delay_ms)
+{
+	uint64_t pages =3D 1;
+	uint64_t gpa;
+	FILE *shrinker_scan_fp;
+	struct timespec ts;
+	int i =3D 1;
+
+	ts.tv_sec =3D delay_ms / 1000;
+	ts.tv_nsec =3D (delay_ms - (ts.tv_sec * 1000)) * 1000000;
+
+	gpa =3D memstress_args.gpa - pages * vm->page_size;
+
+	shrinker_scan_fp =3D fopen(kvm_shrink_scan_file, "w");
+	if (!shrinker_scan_fp) {
+		pr_info("Not able to open KVM shrink scan file for writing\n");
+		return -errno;
+	}
+
+	while (iterations--) {
+		/* Adding and deleting memslots rebuilds the page table */
+		vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, gpa,
+					    DUMMY_MEMSLOT_INDEX, pages, 0);
+		vm_mem_region_delete(vm, DUMMY_MEMSLOT_INDEX);
+
+		pr_info("Iteration %d: Invoking shrinker.\n", i++);
+		fprintf(shrinker_scan_fp, "0 0 1000\n");
+		rewind(shrinker_scan_fp);
+
+		nanosleep(&ts, NULL);
+	}
+
+	fclose(shrinker_scan_fp);
+	return 0;
+}
+
+static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
+{
+	struct kvm_vcpu *vcpu =3D vcpu_args->vcpu;
+	struct kvm_run *run;
+	int ret;
+
+	run =3D vcpu->run;
+
+	/* Let the guest access its memory until a stop signal is received */
+	while (!READ_ONCE(memstress_args.stop_vcpus)) {
+		ret =3D _vcpu_run(vcpu);
+		TEST_ASSERT(ret =3D=3D 0, "vcpu_run failed: %d", ret);
+
+		if (get_ucall(vcpu, NULL) =3D=3D UCALL_SYNC)
+			continue;
+
+		TEST_ASSERT(false,
+			    "Invalid guest sync status: exit_reason=3D%s\n",
+			    exit_reason_str(run->exit_reason));
+	}
+}
+
+static void run_test(struct test_params *p)
+{
+	struct kvm_vm *vm;
+	int nr_vcpus =3D p->nr_vcpus;
+
+	pr_info("Creating the VM.\n");
+	vm =3D memstress_create_vm(VM_MODE_DEFAULT, p->nr_vcpus,
+				 p->guest_percpu_mem_size,
+				 /*slots =3D*/1, DEFAULT_VM_MEM_SRC,
+				 /*partition_vcpu_memory_access=3D*/true);
+
+	memstress_start_vcpu_threads(p->nr_vcpus, vcpu_worker);
+
+	pr_info("Starting the test.\n");
+	invoke_kvm_mmu_shrinker_scan(vm, p->kvm_shrink_scan_file, p->iterations,
+				     p->delay_ms);
+
+	pr_info("Test completed.\nStopping the VM.\n");
+	memstress_join_vcpu_threads(nr_vcpus);
+	memstress_destroy_vm(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-b memory] [-d delay_ms] [-i iterations] [-h]\n"
+	       "       [-v vcpus]\n", name);
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     accessed by each vCPU. e.g. 10M or 3G. (Default: 1G)\n");
+	printf(" -d: add a delay between iterations of firing MMU shrinker\n"
+	       "     scan in milliseconds. (Default: %dms).\n",
+	       DEFAULT_MMU_SHRINKER_DELAY_MS);
+	printf(" -i: specify the number of iterations of firing MMU shrinker\n"
+	       "     scan. (Default: %d)\n",
+	       DEFAULT_MMU_SHRINKER_ITERATIONS);
+	printf(" -v: specify the number of vCPUs to run. (Default: %d)\n",
+	       DEFAULT_MMU_SHRINKER_VCPUS);
+	printf(" -h: Print the help message.\n");
+	puts("");
+}
+
+int main(int argc, char *argv[])
+{
+	int max_vcpus =3D kvm_check_cap(KVM_CAP_MAX_VCPUS);
+	struct test_params p =3D {
+		.iterations =3D DEFAULT_MMU_SHRINKER_ITERATIONS,
+		.guest_percpu_mem_size =3D DEFAULT_PER_VCPU_MEM_SIZE,
+		.nr_vcpus =3D DEFAULT_MMU_SHRINKER_VCPUS,
+		.delay_ms =3D DEFAULT_MMU_SHRINKER_DELAY_MS,
+	};
+	int opt;
+
+	while ((opt =3D getopt(argc, argv, "b:d:i:v:h")) !=3D -1) {
+		switch (opt) {
+		case 'b':
+			p.guest_percpu_mem_size =3D parse_size(optarg);
+			break;
+		case 'd':
+			p.delay_ms =3D atoi_non_negative("Time gap between two MMU shrinker inv=
ocations in milliseconds",
+						       optarg);
+			break;
+		case 'i':
+			p.iterations =3D atoi_positive("Number of iterations", optarg);
+			break;
+		case 'v':
+			p.nr_vcpus =3D atoi_positive("Number of vCPUs", optarg);
+			TEST_ASSERT(p.nr_vcpus <=3D max_vcpus,
+				    "Invalid number of vcpus, must be between 1 and %d",
+				    max_vcpus);
+			break;
+		case 'h':
+			help(argv[0]);
+			exit(EXIT_SUCCESS);
+		default:
+			help(argv[0]);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	find_and_validate_kvm_shrink_scan_file(p.kvm_shrink_scan_file, PATH_MAX);
+	run_test(&p);
+	return 0;
+}
--=20
2.47.0.rc0.187.ge670bccf7e-goog