From nobody Sun Feb  8 16:49:36 2026
Received: from mail-qt1-f175.google.com (mail-qt1-f175.google.com
 [209.85.160.175])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9B54C223DF1
	for <linux-kernel@vger.kernel.org>; Thu, 24 Apr 2025 20:28:14 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=209.85.160.175
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1745526496; cv=none;
 b=fKKIDC36xMZ7pwmarq6KXvIU6uf0y4Bc0SeMSuaSnRGQ/WW6o94yWiHQl304vVdz9n6m3a7Qb8Avck0HiOaSNzD6Yr2tU5Q39Z9Zlb/HZj4BIDw3u0eIo5gIsky0y2sQpvjm6zEiPzzAXwrXMEHMlcAmCOJHRD8Nk0sKIhCsWdc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1745526496; c=relaxed/simple;
	bh=zfoCZAi+qgWXFIaHeDNfmek6cFG6y/TXE1VwWMa2yYs=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=N4MdTI7xgspQBam5HsZgoAx8a3kVk6JhJ8oYTbUNYrkILZ0gVJFyrHvzCZNXDD/6PaybGRGFxoP5GXqaLETugady6LlUQjNgsTDWQ2+MUPNaDRNo8fvKToEXCelbJj+WJ0EZQNHi+0UR5Pobl4Ud09v+GuU50quwcLi8EAsA7zE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=gourry.net;
 spf=pass smtp.mailfrom=gourry.net;
 dkim=pass (2048-bit key) header.d=gourry.net header.i=@gourry.net
 header.b=AI7rOCnF; arc=none smtp.client-ip=209.85.160.175
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=gourry.net
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=gourry.net
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=gourry.net header.i=@gourry.net
 header.b="AI7rOCnF"
Received: by mail-qt1-f175.google.com with SMTP id
 d75a77b69052e-476ac73c76fso17352661cf.0
        for <linux-kernel@vger.kernel.org>;
 Thu, 24 Apr 2025 13:28:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gourry.net; s=google; t=1745526493; x=1746131293;
 darn=vger.kernel.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=CxWEnn5MTFE6+1tVKFs4oJ/SdesKMia6cc2bPiHfnm4=;
        b=AI7rOCnFe+64PBkGlQ4be3bDf6h5+hsJ2h27O6VWlAcz43ClTYv/o4OK3+f29ZcF2r
         oO+PjGNPp8tGRl7f/rwYy8ayfgt4+qURjib91VXPP9CUu2yp6PPkBsUobQEzT+JJM3hf
         nuFrRrf2DiyWy4+v0wF+5yal0OJzMt72ppyvuM3YhbuVs75dfCrsJhjGm6UZBsWnO5mb
         YLQ/vivH1FEUpmSohv5NgWo5MF7y7teC/D5/+NsezXbKtEMtlOUsfKJQchJEvbwR0aLW
         7iyQL7IjARwP8XyrsTQUyhZcDnJ0xtdf/lt9oMk0Brctg5XxKpjyEGZNAJ1dAgduiu2b
         5ckA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1745526493; x=1746131293;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=CxWEnn5MTFE6+1tVKFs4oJ/SdesKMia6cc2bPiHfnm4=;
        b=h/s10b/wmLfZmJUikoJafCy6d7rYa2tpp7wjz7I3J0jw7a+9ly8twjemgL5125zivg
         kDO7koJg0mUtR8y0YXKmVBVuIGs0RbKs3sENgB03wqTzFyjsOh1Z5vcLqYMunWvCPOmw
         tzsg5doxqLu8kiH5MQNh22StzJipPOGoElvUCrwdunDpO1P61rg0GIvdb18QDBEPeyVf
         2PwCVPJ5ocEISEo/SupQaTn5MK2xLU/tBTqfjfBaLXErxLndcXZm4iECzZgILtEk6AcA
         Dfu9iZ2yanpXDZ7OoevUQZIGUO86RSdU0tgZO3fhxEMNqqmjTnp9O2M8qV7sCSGyGDow
         Resw==
X-Forwarded-Encrypted: i=1;
 AJvYcCVVUOmCAxD8rOzxk8qczO0ZFcBtRtgjt9Z6rf0Mf7QAOwyt8YoTOvv/RIwd1ILn8akjNJ7f/xV9PhfS/8g=@vger.kernel.org
X-Gm-Message-State: AOJu0YxTQ866m3/m3PmcJ/98EdCRlBFK1d6Ztzrup+tJmiXyik+dWete
	J69Pe8DBqRTDu/aGHHWHjZVHylioYEmeubt+BfivnqpBGyQGYo2flCPNSpAScgo=
X-Gm-Gg: ASbGncupyd5j/+AqQazLVWkfNDeABVJOYQSF9XtbmOomayMryxwgLTum2NrNNbOEmrC
	lOXnr8Dg6VabA9VQMqmIlZjsNcTsjZEyRHySyorM3wpuS3IAw7npw7I+kNsP29vPxxHDvU0JWJ3
	Qb3frfwlRYdl4pqY8so7SONQq0AHOo+SL9C/Xzqoefk2VnT03DvezteAnCwEpo9t3wSPHRJSwCV
	lVJ/HyO2Zc4wqz70MxaSvGXR5/rlu9TyWMBLpuPv+nvCF7JqTiZM8gbD70eCFzeFpl67yHrO3Hb
	j39ou3pKljg0xDBkz6nrUO1jrTXZJUk9gTGlgPci9ATsSa+Y/cXjn9waCOiX4dtBJDhu2DEub+o
	znZKPeGlx/uW2NsZSd91kxdbnitIT34jhYBEAa9w=
X-Google-Smtp-Source: 
 AGHT+IHogOSfTg5LJdUBkMbpkPdMOCaNUDFPxTL0UeOXpsUsIFBmyFi+0UOFbb/jmohzrTYiFE7SSQ==
X-Received: by 2002:ac8:7d44:0:b0:47f:9f87:ce0a with SMTP id
 d75a77b69052e-47fb68d7bf9mr14855211cf.0.1745526493434;
        Thu, 24 Apr 2025 13:28:13 -0700 (PDT)
Received: from gourry-fedora-PF4VCD3F.lan
 (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208])
        by smtp.gmail.com with ESMTPSA id
 d75a77b69052e-47e9ebeb870sm16091691cf.5.2025.04.24.13.28.12
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 24 Apr 2025 13:28:13 -0700 (PDT)
From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org
Cc: cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	kernel-team@meta.com,
	longman@redhat.com,
	hannes@cmpxchg.org,
	mhocko@kernel.org,
	roman.gushchin@linux.dev,
	shakeel.butt@linux.dev,
	muchun.song@linux.dev,
	tj@kernel.org,
	mkoutny@suse.com,
	akpm@linux-foundation.org
Subject: [PATCH v5 2/2] vmscan,cgroup: apply mems_effective to reclaim
Date: Thu, 24 Apr 2025 16:28:06 -0400
Message-ID: <20250424202806.52632-3-gourry@gourry.net>
X-Mailer: git-send-email 2.49.0
In-Reply-To: <20250424202806.52632-1-gourry@gourry.net>
References: <20250424202806.52632-1-gourry@gourry.net>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

It is possible for a reclaimer to cause demotions of an lruvec belonging
to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply
this limitation based on the lruvec's memcg and prevent demotion.

Notably, this may still allow demotion of shared libraries or any memory
first instantiated in another cgroup. This means cpusets still cannot
cannot guarantee complete isolation when demotion is enabled, and the
docs have been updated to reflect this.

This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently - with the noted exceptions.

Note on locking:

The cgroup_get_e_css reference protects the css->effective_mems, and
calls of this interface would be subject to the same race conditions
associated with a non-atomic access to cs->effective_mems.

So while this interface cannot make strong guarantees of correctness,
it can therefore avoid taking a global or rcu_read_lock for performance.

Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../ABI/testing/sysfs-kernel-mm-numa          | 16 +++++---
 include/linux/cpuset.h                        |  5 +++
 include/linux/memcontrol.h                    |  6 +++
 kernel/cgroup/cpuset.c                        | 36 ++++++++++++++++
 mm/memcontrol.c                               |  6 +++
 mm/vmscan.c                                   | 41 +++++++++++--------
 6 files changed, 88 insertions(+), 22 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation=
/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..90e375ff54cb 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -16,9 +16,13 @@ Description:	Enable/disable demoting pages during reclaim
 		Allowing page migration during reclaim enables these
 		systems to migrate pages from fast tiers to slow tiers
 		when the fast tier is under pressure.  This migration
-		is performed before swap.  It may move data to a NUMA
-		node that does not fall into the cpuset of the
-		allocating process which might be construed to violate
-		the guarantees of cpusets.  This should not be enabled
-		on systems which need strict cpuset location
-		guarantees.
+		is performed before swap if an eligible numa node is
+		present in cpuset.mems for the cgroup (or if cpuset v1
+		is being used). If cpusets.mems changes at runtime, it
+		may move data to a NUMA node that does not fall into the
+		cpuset of the new cpusets.mems, which might be construed
+		to violate the guarantees of cpusets.  Shared memory,
+		such as libraries, owned by another cgroup may still be
+		demoted and result in memory use on a node not present
+		in cpusets.mem. This should not be enabled on systems
+		which need strict cpuset location guarantees.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 893a4c340d48..5255e3fdbf62 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
=20
+extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
 #else /* !CONFIG_CPUSETS */
=20
 static inline bool cpusets_enabled(void) { return false; }
@@ -282,6 +283,10 @@ static inline bool read_mems_allowed_retry(unsigned in=
t seq)
 	return false;
 }
=20
+static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+{
+	return true;
+}
 #endif /* !CONFIG_CPUSETS */
=20
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..a6c4e3faf721 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1736,6 +1736,8 @@ static inline void count_objcg_events(struct obj_cgro=
up *objcg,
 	rcu_read_unlock();
 }
=20
+bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+
 #else
 static inline bool mem_cgroup_kmem_disabled(void)
 {
@@ -1793,6 +1795,10 @@ static inline void count_objcg_events(struct obj_cgr=
oup *objcg,
 {
 }
=20
+static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int n=
id)
+{
+	return true;
+}
 #endif /* CONFIG_MEMCG */
=20
 #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f8e6a9b642cb..7eb71d411dc7 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4163,6 +4163,42 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp=
_mask)
 	return allowed;
 }
=20
+bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+	bool allowed;
+
+	/*
+	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
+	 * and mems_allowed is likely to be empty even if we could get to it,
+	 * so return true to avoid taking a global lock on the empty check.
+	 */
+	if (!cpuset_v2())
+		return true;
+
+	css =3D cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (!css)
+		return true;
+
+	/*
+	 * Normally, accessing effective_mems would require the cpuset_mutex
+	 * or callback_lock - but node_isset is atomic and the reference
+	 * taken via cgroup_get_e_css is sufficient to protect css.
+	 *
+	 * Since this interface is intended for use by migration paths, we
+	 * relax locking here to avoid taking global locks - while accepting
+	 * there may be rare scenarios where the result may be innaccurate.
+	 *
+	 * Reclaim and migration are subject to these same race conditions, and
+	 * cannot make strong isolation guarantees, so this is acceptable.
+	 */
+	cs =3D container_of(css, struct cpuset, css);
+	allowed =3D node_isset(nid, cs->effective_mems);
+	css_put(css);
+	return allowed;
+}
+
 /**
  * cpuset_spread_node() - On which node to begin search for a page
  * @rotor: round robin rotor
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40c07b8699ae..2f61d0060fd1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -29,6 +29,7 @@
 #include <linux/page_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
+#include <linux/cpuset.h>
 #include <linux/sched/mm.h>
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
@@ -5437,3 +5438,8 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
=20
 #endif /* CONFIG_SWAP */
+
+bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+{
+	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b2ab386cab5..32a7ce421e42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -342,16 +342,22 @@ static void flush_reclaim_state(struct scan_control *=
sc)
 	}
 }
=20
-static bool can_demote(int nid, struct scan_control *sc)
+static bool can_demote(int nid, struct scan_control *sc,
+		       struct mem_cgroup *memcg)
 {
+	int demotion_nid;
+
 	if (!numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
-	if (next_demotion_node(nid) =3D=3D NUMA_NO_NODE)
+
+	demotion_nid =3D next_demotion_node(nid);
+	if (demotion_nid =3D=3D NUMA_NO_NODE)
 		return false;
=20
-	return true;
+	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
+	return mem_cgroup_node_allowed(memcg, demotion_nid);
 }
=20
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -376,7 +382,7 @@ static inline bool can_reclaim_anon_pages(struct mem_cg=
roup *memcg,
 	 *
 	 * Can it be reclaimed from this node via demotion?
 	 */
-	return can_demote(nid, sc);
+	return can_demote(nid, sc, memcg);
 }
=20
 /*
@@ -1096,7 +1102,8 @@ static bool may_enter_fs(struct folio *folio, gfp_t g=
fp_mask)
  */
 static unsigned int shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
-		struct reclaim_stat *stat, bool ignore_references)
+		struct reclaim_stat *stat, bool ignore_references,
+		struct mem_cgroup *memcg)
 {
 	struct folio_batch free_folios;
 	LIST_HEAD(ret_folios);
@@ -1109,7 +1116,7 @@ static unsigned int shrink_folio_list(struct list_hea=
d *folio_list,
 	folio_batch_init(&free_folios);
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
-	do_demote_pass =3D can_demote(pgdat->node_id, sc);
+	do_demote_pass =3D can_demote(pgdat->node_id, sc, memcg);
=20
 retry:
 	while (!list_empty(folio_list)) {
@@ -1658,7 +1665,7 @@ unsigned int reclaim_clean_pages_from_list(struct zon=
e *zone,
 	 */
 	noreclaim_flag =3D memalloc_noreclaim_save();
 	nr_reclaimed =3D shrink_folio_list(&clean_folios, zone->zone_pgdat, &sc,
-					&stat, true);
+					&stat, true, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
=20
 	list_splice(&clean_folios, folio_list);
@@ -2031,7 +2038,8 @@ static unsigned long shrink_inactive_list(unsigned lo=
ng nr_to_scan,
 	if (nr_taken =3D=3D 0)
 		return 0;
=20
-	nr_reclaimed =3D shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
+	nr_reclaimed =3D shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
+					 lruvec_memcg(lruvec));
=20
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
@@ -2214,7 +2222,7 @@ static unsigned int reclaim_folio_list(struct list_he=
ad *folio_list,
 		.no_demotion =3D 1,
 	};
=20
-	nr_reclaimed =3D shrink_folio_list(folio_list, pgdat, &sc, &stat, true);
+	nr_reclaimed =3D shrink_folio_list(folio_list, pgdat, &sc, &stat, true, N=
ULL);
 	while (!list_empty(folio_list)) {
 		folio =3D lru_to_folio(folio_list);
 		list_del(&folio->lru);
@@ -2646,7 +2654,7 @@ static void get_scan_count(struct lruvec *lruvec, str=
uct scan_control *sc,
  * Anonymous LRU management is a waste if there is
  * ultimately no way to reclaim the memory.
  */
-static bool can_age_anon_pages(struct pglist_data *pgdat,
+static bool can_age_anon_pages(struct lruvec *lruvec,
 			       struct scan_control *sc)
 {
 	/* Aging the anon LRU is valuable if swap is present: */
@@ -2654,7 +2662,8 @@ static bool can_age_anon_pages(struct pglist_data *pg=
dat,
 		return true;
=20
 	/* Also valuable if anon pages can be demoted: */
-	return can_demote(pgdat->node_id, sc);
+	return can_demote(lruvec_pgdat(lruvec)->node_id, sc,
+			  lruvec_memcg(lruvec));
 }
=20
 #ifdef CONFIG_LRU_GEN
@@ -2732,7 +2741,7 @@ static int get_swappiness(struct lruvec *lruvec, stru=
ct scan_control *sc)
 	if (!sc->may_swap)
 		return 0;
=20
-	if (!can_demote(pgdat->node_id, sc) &&
+	if (!can_demote(pgdat->node_id, sc, memcg) &&
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
=20
@@ -4695,7 +4704,7 @@ static int evict_folios(struct lruvec *lruvec, struct=
 scan_control *sc, int swap
 	if (list_empty(&list))
 		return scanned;
 retry:
-	reclaimed =3D shrink_folio_list(&list, pgdat, sc, &stat, false);
+	reclaimed =3D shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
 	sc->nr.unqueued_dirty +=3D stat.nr_unqueued_dirty;
 	sc->nr_reclaimed +=3D reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
@@ -5850,7 +5859,7 @@ static void shrink_lruvec(struct lruvec *lruvec, stru=
ct scan_control *sc)
 	 * Even if we did not try to evict anon pages at all, we want to
 	 * rebalance the anon lru active/inactive ratio.
 	 */
-	if (can_age_anon_pages(lruvec_pgdat(lruvec), sc) &&
+	if (can_age_anon_pages(lruvec, sc) &&
 	    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
@@ -6681,10 +6690,10 @@ static void kswapd_age_node(struct pglist_data *pgd=
at, struct scan_control *sc)
 		return;
 	}
=20
-	if (!can_age_anon_pages(pgdat, sc))
+	lruvec =3D mem_cgroup_lruvec(NULL, pgdat);
+	if (!can_age_anon_pages(lruvec, sc))
 		return;
=20
-	lruvec =3D mem_cgroup_lruvec(NULL, pgdat);
 	if (!inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		return;
=20
--=20
2.49.0