From nobody Sun Feb 8 02:41:42 2026 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 24411283FDD for ; Sun, 21 Dec 2025 23:36:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766360218; cv=none; b=OAy5hnZDSYysKpVNoxFeLxdBJFWNWS/KWDhv5ZxnhfuINmVR+is1TzmzK7a+IOifqB265vXcKH0U7Eca4wjqM+SDcouJ97krfblsE+ktH9XkrcV1Dm7X+yTbxWQkBWXvVaw31m6jliS6MDf9vfhiw8f7m2GQ/0ZAV6yN9hBveEw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766360218; c=relaxed/simple; bh=TH8d2yyXKa4fMd0WI4nPJ0bzAgbIns0ypw8/ovvAOfM=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=ZmOr9wrMICM0hhDM5pDVScFTo4vE59LNiee7q0yz01YFNWVtQ+yJr/bhJMwHhh/ZFFevaQFzt1C7i9Hc3rKSBNUv/ykMypqiwqOZHtGAYuDqlOg9TVtuGv5AT61bbLVigdi26iaJdWL7FKfptpRRSY+K7fInUSET8TwwvA1IMCM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=aJaErCQG; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="aJaErCQG" Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-b99763210e5so6945449a12.3 for ; Sun, 21 Dec 2025 15:36:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1766360216; x=1766965016; 
darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=4GfhhMoWYFTh/aN/kwrLJXHldk5+UtwhObh4bR+INB4=; b=aJaErCQGfQUOk6wMvmLvIIMGxSp+3eaKlUbdKiv2Pe5fgHmS11NP+tGw/Wr7fY7pmJ UU3n3hPnXo4jL3dWwF821Fo99YmbuMQQUkcaoWD97ooXi7LIc586fWvrh5EqHyS/fH7Z uPqGHVjizbslTG85ZTQunVdgtFvWAZAZNBMkloGPCUULrjxHVG67XA9OetrsOJ1/XyT5 rDJfrnADGMX0zsfJBqu9atLaSpBc/AjF5V9yuQQhom51XM9pblp33tXKhbKHxFXjLEzC i92D60U+2K6De7hg1KRDxVfBERgkror0Dk7XEeD6QcWF/GER1DBHm7kV+obcJr/DPo+K Brxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1766360216; x=1766965016; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4GfhhMoWYFTh/aN/kwrLJXHldk5+UtwhObh4bR+INB4=; b=JkuPP9ceE1h7TeexJxo73ZaoLKmNf4y5WBqnR3gPOGiUOwOBRNcxb4uqOqco2rgxZo W/ZSEiRREXi3h6fJvZL8CWYoJAph4QfBY7nGb63SlZFh0EUSC7QaUoaP5F5O7Gq7H3KF Zjs33JvSicPrah+dHsOR5I5RTrRh09XazmNbMjYTO2PwWPdScJcltL2Z9vzPq+P9dVk+ UvG6UW3bf5hd9KNk+t+KHdWXzCrcbx+n8Q9Zyg68RBS3WFEmsniAJoUzDoFHAoUpgQ8e Wlb53cHiG9yugMH9smDas8wdrsVWObjoP0ulUuWX/KeX4a3T+cBbe9xMA+vTz6o0bvqZ So6Q== X-Gm-Message-State: AOJu0YxFr+kUv7IzTZQOPYss54pdwFoOtfSUZrYoUMfbAkul078R0Ec8 wluaWXYj9l0o1GfGPiuM2ngg04Z6Aqh6YLESsQIA5Qj0TKUBmLQYgDTIGcNjKfKuA1PPG7LA6sj rqojGK8uzRtHVwg== X-Google-Smtp-Source: AGHT+IEicLxazx7S6uu1698T5FhlbmHzMdzgeAbcuTNEEaAVkEM1v1fpFPWcA/uiQi7D96w3YBVKbgPaJ+Ktsg== X-Received: from dyboo7.prod.google.com ([2002:a05:7301:1e87:b0:2ac:36d5:fc65]) (user=bingjiao job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7301:d89:b0:2ac:2263:dcc0 with SMTP id 5a478bee46e88-2b05ec19181mr11168511eec.19.1766360216254; Sun, 21 Dec 2025 15:36:56 -0800 (PST) Date: Sun, 21 Dec 2025 23:36:34 +0000 In-Reply-To: <20251221233635.3761887-1-bingjiao@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: 
<20251220061022.2726028-1-bingjiao@google.com> <20251221233635.3761887-1-bingjiao@google.com>
X-Mailer: git-send-email 2.52.0.351.gbe84eed79e-goog
Message-ID: <20251221233635.3761887-2-bingjiao@google.com>
Subject: [PATCH v2 1/2] mm/vmscan: respect mems_effective in demote_folio_list()
From: Bing Jiao
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, gourry@gourry.net, longman@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, tj@kernel.org, mkoutny@suse.com, david@kernel.org, zhengqi.arch@bytedance.com, lorenzo.stoakes@oracle.com, axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com, cgroups@vger.kernel.org, Bing Jiao
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However, it does not apply this check in
demote_folio_list().

As a result, pages can be demoted to nodes that are explicitly
excluded from the task's cpuset.mems. The impact is two-fold:

1. Resource Isolation: This bug breaks the resource isolation provided
   by cpuset.mems. It allows pages to be demoted to nodes that are
   dedicated to other tasks or are intended for hot-unplugging.

2. Performance Issue: In multi-tier systems, users use cpuset.mems
   to bind tasks to tiers with different performance (e.g., avoiding
   the slowest tiers for latency-sensitive data). This bug can
   cause unexpected latency spikes if pages are demoted to the
   farthest nodes.

To address the bug, implement a new function
mem_cgroup_filter_mems_allowed() to filter out nodes that are not
set in mems_effective, and update demote_folio_list() to use
this filtering logic. This ensures that demotion targets respect the
task's memory placement constraints.

Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao
---
 include/linux/cpuset.h     |  6 ++++++
 include/linux/memcontrol.h |  7 +++++++
 kernel/cgroup/cpuset.c     | 18 ++++++++++++++++++
 mm/memcontrol.c            |  6 ++++++
 mm/vmscan.c                | 13 ++++++++++---
 5 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..0e94548e2d24 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -175,6 +175,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 }
 
 extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -305,6 +306,11 @@ static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 {
 	return true;
 }
+
+static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
+					      nodemask_t *mask)
+{
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..7cfd71c57caa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1742,6 +1742,8 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 
 bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
 static inline bool memcg_is_dying(struct mem_cgroup *memcg)
@@ -1816,6 +1818,11 @@ static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return true;
 }
 
+static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
+						  nodemask_t *mask)
+{
+}
+
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..2925bd6bca91 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4452,6 +4452,24 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	return allowed;
 }
 
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *cs;
+
+	if (!cpuset_v2())
+		return;
+
+	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
+	if (!css)
+		return;
+
+	/* Follows the same assumption in cpuset_node_allowed() */
+	cs = container_of(css, struct cpuset, css);
+	nodes_and(*mask, *mask, cs->effective_mems);
+	css_put(css);
+}
+
 /**
  * cpuset_spread_node() - On which node to begin search for a page
  * @rotor: round robin rotor
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..f414653867de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5602,6 +5602,12 @@ bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
 	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
 }
 
+void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
+{
+	if (memcg)
+		cpuset_node_filter_allowed(memcg->css.cgroup, mask);
+}
+
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled() || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 453d654727c1..4d23c491e914 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1018,7 +1018,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
@@ -1032,7 +1033,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
@@ -1044,6 +1044,13 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		return 0;
 
 	node_get_allowed_targets(pgdat, &allowed_mask);
+	/* Filter the allowed targets based on cpuset.mems.effective */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return 0;
+	if (!node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1565,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
-- 
2.52.0.351.gbe84eed79e-goog

From nobody Sun Feb 8 02:41:42 2026
Received: from mail-pj1-f73.google.com (mail-pj1-f73.google.com [209.85.216.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 61D91280A5C for ; Sun, 21 Dec 2025 23:36:59 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.73
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org;
s=arc-20240116; t=1766360222; cv=none; b=NZmG4sah3h4vVH1CmME4LQ9x9UiN6O/KsgC1OmEQzAmnKNjwc9z6qabcBICbBixh9UintVoJKejxsSTDZQ1f80poNuw2a9I4K2LX1Gq03LWVFYVUWtaQ5YVD1Ne3sLURDKt5AZDpW9KXzTW3dNnUIvdeDCpWxcz7nCfNk3Z+A1E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766360222; c=relaxed/simple; bh=1qck0wUG8zkmGtpgeRMZgcBVUsA+HQYW7SICvmzdXCE=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=pczAuQ0Aakm4Oa/2KqemampPhlv3/oAfySVl+sIDEjH1/98t0Izl/Lb0lIEqvBdqLB1SChO77NOMv5gNZ1vfXanoirFe33Tc+pOg6WvbhwJy/yTw6NjDTc4PTVbD4C2KxdTVK5UTw49kBIITJhiWQnxKgu0ABa05vcUT27X0iM0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=fbu3mm+W; arc=none smtp.client-ip=209.85.216.73 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="fbu3mm+W" Received: by mail-pj1-f73.google.com with SMTP id 98e67ed59e1d1-34c904a1168so7378609a91.1 for ; Sun, 21 Dec 2025 15:36:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1766360219; x=1766965019; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=TOdI/BEkQBQWXUHyCrchJg6ArfOE4J8nqe0nfugUeYI=; b=fbu3mm+WTV1jyn6nFbXAyS/6B7ZaB4zF8QOlFKZBkVSiIfxuhcXM1uamp7A+ikSMld 2AYaAjDP1rLhGL5XKRXFZPdYREg42T3vOyApj13W/SmA/CPflZq207q5lGm57nXVkY2v 1DQoM1XOp81HgcEO7mfIVhaMbVZowUkIAyeJAyr6RwLOGFbZm9axkMKQyF4o5M5rjEFy CJ6TeCEG4LbXzI0OqtYrQxcBS/nVzRk3szLYHidz9mwjEsmRMcEASfq/Z/nkysgq4UAl 
3Q6uDC3AllQKRrYVTWEzWt4gYdQ2iwHP6k5CQ5I0INEhDJ411OxMG+SUyJj4IBNUQYVf +t7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1766360219; x=1766965019; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=TOdI/BEkQBQWXUHyCrchJg6ArfOE4J8nqe0nfugUeYI=; b=h5tPhtz++e99R0FJmcIKhQrmDuRKtIKtCR62SF24ZWqPW/DiqrJmrB6xwPH9H0170A pUB6lG42Wr9+525rOi5jfXNjXc/ErhZB+e0NsmcsRZnpPe92j9dlsQqzZTrMXXV5Eb9n k21bOuwY9t/J5FXXYgSAd70x8Mnw8+CwnPVDtXxEwJFMiZsFm2aFfoo96J2oNh2WlonN sVEqcgAnafo8tZACkbeOwSFTlVUXASJtGLWTAgwLKvPcGXvLP0CtxTEYFIczJVjCNRU7 Tmiowq2vXHAN7kgIJ7A3X3x9BqE2eBEYz4wcdfgbsccj+/PVnrdTKGWNMux1QGz1uA5/ Myww== X-Gm-Message-State: AOJu0YyNtAiqyld3ZIpsG8qYep6V6JK/RI0rWtMe3MBztTWDxiBjU8x3 KGKAIgwzAw1QzPb3RVs6FR+s9+zaI/piA7lPwPuTdr+725e0oOnz4AA4+3QFgnC66JsLLx/60Xi 80iyxM/61N7/fyg== X-Google-Smtp-Source: AGHT+IELZQMOEJm89X4TKV9M7co9XdZ3YqAhkiRqGdD+0QsqFHoJCcLn4TvuboqvIUgaBTBeBhEGNHBt5S4NeA== X-Received: from dlbtk5.prod.google.com ([2002:a05:7022:fb05:b0:11e:3ea:a127]) (user=bingjiao job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:3c06:b0:119:e569:f258 with SMTP id a92af1059eb24-121721acc08mr8626739c88.1.1766360218726; Sun, 21 Dec 2025 15:36:58 -0800 (PST) Date: Sun, 21 Dec 2025 23:36:35 +0000 In-Reply-To: <20251221233635.3761887-1-bingjiao@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251220061022.2726028-1-bingjiao@google.com> <20251221233635.3761887-1-bingjiao@google.com> X-Mailer: git-send-email 2.52.0.351.gbe84eed79e-goog Message-ID: <20251221233635.3761887-3-bingjiao@google.com> Subject: [PATCH v2 2/2] mm/vmscan: check all allowed targets in can_demote() From: Bing Jiao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, stable@vger.kernel.org, akpm@linux-foundation.org, gourry@gourry.net, longman@redhat.com, hannes@cmpxchg.org, 
mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, tj@kernel.org, mkoutny@suse.com, david@kernel.org, zhengqi.arch@bytedance.com, lorenzo.stoakes@oracle.com, axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com, cgroups@vger.kernel.org, Bing Jiao
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However, it checks only the nodes in the immediate next
demotion hierarchy and does not check all allowed demotion targets.
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.

To address the bug, use mem_cgroup_filter_mems_allowed() to filter
the allowed targets obtained from node_get_allowed_targets() against
mems_effective. Also remove the now-unused cpuset_node_allowed() and
mem_cgroup_node_allowed().

Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Signed-off-by: Bing Jiao
---
 include/linux/cpuset.h     |  6 ------
 include/linux/memcontrol.h |  7 -------
 kernel/cgroup/cpuset.c     | 28 ++++------------------------
 mm/memcontrol.c            |  5 -----
 mm/vmscan.c                | 14 ++++++++------
 5 files changed, 12 insertions(+), 48 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 0e94548e2d24..ed7c27276e71 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,6 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
 extern void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask);
 #else /* !CONFIG_CPUSETS */
 
@@ -302,11 +301,6 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
-{
-	return true;
-}
-
 static inline void cpuset_node_filter_allowed(struct cgroup *cgroup,
 					      nodemask_t *mask)
 {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7cfd71c57caa..41aab33499b5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,8 +1740,6 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1813,11 +1811,6 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return true;
-}
-
 static inline void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg,
 						  nodemask_t *mask)
 {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2925bd6bca91..339779571508 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,11 +4416,10 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
 {
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
@@ -4428,15 +4427,15 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * so return true to avoid taking a global lock on the empty check.
 	 */
 	if (!cpuset_v2())
-		return true;
+		return;
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
 	if (!css)
-		return true;
+		return;
 
 	/*
 	 * Normally, accessing effective_mems would require the cpuset_mutex
-	 * or callback_lock - but node_isset is atomic and the reference
+	 * or callback_lock - but it is acceptable and the reference
 	 * taken via cgroup_get_e_css is sufficient to protect css.
 	 *
 	 * Since this interface is intended for use by migration paths, we
@@ -4447,25 +4446,6 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * cannot make strong isolation guarantees, so this is acceptable.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
-	css_put(css);
-	return allowed;
-}
-
-void cpuset_node_filter_allowed(struct cgroup *cgroup, nodemask_t *mask)
-{
-	struct cgroup_subsys_state *css;
-	struct cpuset *cs;
-
-	if (!cpuset_v2())
-		return;
-
-	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
-	if (!css)
-		return;
-
-	/* Follows the same assumption in cpuset_node_allowed() */
-	cs = container_of(css, struct cpuset, css);
 	nodes_and(*mask, *mask, cs->effective_mems);
 	css_put(css);
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f414653867de..ebf5df3c8ca1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,11 +5597,6 @@ subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
-{
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
-}
-
 void mem_cgroup_filter_mems_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
 {
 	if (memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4d23c491e914..fa4d51af7f44 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
 static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
-	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	/* Filter the allowed targets based on cpuset.mems.effective */
+	mem_cgroup_filter_mems_allowed(memcg, &allowed_mask);
+	return !nodes_empty(allowed_mask);
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.52.0.351.gbe84eed79e-goog

From nobody Sun Feb 8 02:41:42 2026
Received: from mail-pl1-f202.google.com (mail-pl1-f202.google.com [209.85.214.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D6AD328A701 for ; Tue, 23 Dec 2025 21:20:34 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.202
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1766524836; cv=none; b=JQtCHFPTHTy1eRGSE/6+W/h9+283l8d1dId8oHBtJHjq9A2XgkZz+qo/GEZhLmU/SENLfmiBwbEdvQ0PFynCAThmYg703roTpkEoJecPHqICojwudULiJAQmQvwrHkx7yyZ3N5IVhAKjPMQ7o3evweZjalZC3Kx/laZA/cYX254=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
s=arc-20240116; t=1766524836; c=relaxed/simple; bh=DoZhKRfKFmI1qo5NVDRpPpGkB69QPB99mmMeA/ldaCY=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Zp8aYK9Kgpll4PpAYtPoaXewpo1gWQSehsOieAQQLc28uA8IqV8eqAyaga33qtp0IJtBV93mEDZbGRxPC9hRMlq8yyZ9BG4/iWWZzycC7KoqHILZLnFlsJgSYJ7Kn0H6xJvy/Q16+hOX5PmIhkZOapTQ4rTWnjWxTFOm5Cqld2Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=ajLM4n7Q; arc=none smtp.client-ip=209.85.214.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--bingjiao.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="ajLM4n7Q" Received: by mail-pl1-f202.google.com with SMTP id d9443c01a7336-29f2381ea85so112362985ad.0 for ; Tue, 23 Dec 2025 13:20:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1766524834; x=1767129634; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=lYfxavFypXvQWKTZ6bRnNlNx9q/hbXcZzoHeTM07X9Y=; b=ajLM4n7QPdmbTAWy9psXjGwWv843+BHYM3clXEWINxLcZcQsnpyJyW4c1KKhxOcfMs YqpyOFcC4Ti254Uxjhtc4p+UKLOs5C/ksScc7l+Mh/0NdYj8NT1+z5VUH7xSv41qLR8k y1GdpAHtzSSsXRWiypF7Ls0Ccr3AlGC5foDy1CDMGOjBh9dF4Q7bsw2nkl6xelVK2NMw UuWFrQUb6Sa2Lj3EmhgEAyiIVQwq05dqCOq5biQbdCZ7udkTaOWpj8DMeJ1KsDmnhaD4 kPoVPxsVRrtuOrwfxlMy8HluyS+CjJynBQIZwokmsxELruy1JGxuCXfkkN6hTM8DL6T1 VDUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1766524834; x=1767129634; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to 
:date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=lYfxavFypXvQWKTZ6bRnNlNx9q/hbXcZzoHeTM07X9Y=; b=IEP2Wfg/+dBJI1uOBVboEbHX480DGVTr7YOkhedCEHoi785WNZjdYuVH3NaKq6hSkq esiScc1URIttI52135RPP2yx+dpFbjTocX4fWrxpQowQLhpx7ZO1zJc94lxi94GUtUEX XvYMuZfojtL+8zWu2xwa+goJ8DGIjQOt/yqSq6CxbsYlc4ur6EtrIQYqffvh/AczK9qE 9I9Gmh6XMmicxmsSa9rZvsuNQf+T4i2yJ2c2LtozN7RAdT3p6HADDfT+B6Y7Rbm286vR 8E3GuG9q8btyr6YsHR/FtPQN24KKTxdLZLVHjidytYy4NzEZJql+1WpZErect3awwsSP 3jwg== X-Gm-Message-State: AOJu0Yxv6iAhkPqBGTKWLNYVjyCQAfyhBzggJnXBos81p8vtzmptjgxd 061Bhs2TSq4yqdwB48/TbB0bSkWvxIN64eLmpMLzgcfCNf1JFWeh/HujqwCpzJ0vYQlJVUrDOCv euL0KICh3sJw2DQ== X-Google-Smtp-Source: AGHT+IGXR/gpiuAps+7fq4OauW9eemy4cSM5YIpNWsP7dLBZO0ns7cPZXMjmRfT48tgS6uiqxUjYwDaT0P2F6A== X-Received: from dlqq20.prod.google.com ([2002:a05:7022:6394:b0:11d:cf7a:5407]) (user=bingjiao job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7022:2520:b0:11d:c04a:dc5b with SMTP id a92af1059eb24-121722e03camr15593882c88.30.1766524833960; Tue, 23 Dec 2025 13:20:33 -0800 (PST) Date: Tue, 23 Dec 2025 21:19:59 +0000 In-Reply-To: <20251221233635.3761887-1-bingjiao@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251221233635.3761887-1-bingjiao@google.com> X-Mailer: git-send-email 2.52.0.358.g0dd7633a29-goog Message-ID: <20251223212032.665731-1-bingjiao@google.com> Subject: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion From: Bing Jiao To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org, gourry@gourry.net, longman@redhat.com, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, tj@kernel.org, mkoutny@suse.com, david@kernel.org, zhengqi.arch@bytedance.com, lorenzo.stoakes@oracle.com, axelrasmussen@google.com, chenridong@huaweicloud.com, yuanchu@google.com, weixugc@google.com, cgroups@vger.kernel.org 
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Fix two bugs in demote_folio_list() and can_demote() caused by incorrect
demotion target checks in reclaim/demotion.

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:

  1. It does not apply this check in demote_folio_list(), which leads
     to situations where pages are demoted to nodes that are
     explicitly excluded from the task's cpuset.mems.

  2. It checks only the nodes in the immediate next demotion hierarchy
     and does not check all allowed demotion targets in can_demote().
     This can cause pages to never be demoted if the nodes in the next
     demotion hierarchy are not set in mems_effective.

These bugs break the resource isolation provided by cpuset.mems.
This is visible from userspace because, in multi-tier memory systems,
pages can either fail to be demoted entirely or be demoted to nodes
that are not allowed.

To address these bugs, replace cpuset_node_allowed() and
mem_cgroup_node_allowed() with cpuset_node_get_allowed() and
mem_cgroup_node_get_allowed(), which return effective_mems so that
callers can AND it directly with the demotion targets. Also update
can_demote() and demote_folio_list() accordingly.

Reproduce Bug 1:
  Assume a system with 4 nodes, where nodes 0-1 are top-tier and
  nodes 2-3 are far-tier memory. All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Should respect node 0-2 limit.
    # Observation: Node 3 shows significant allocation (MemFree drops)
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Reproduce Bug 2:
  Assume a system with 6 nodes, where nodes 0-2 are top-tier,
  node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
  All nodes have equal capacity.

  Test script:
    echo 1 > /sys/kernel/mm/numa/demotion_enabled
    mkdir /sys/fs/cgroup/test
    echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
    echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    swapoff -a
    # Expectation: Pages are demoted to Nodes 4-5
    # Observation: No pages are demoted before oom.
    stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Cc:
Signed-off-by: Bing Jiao
---
 include/linux/cpuset.h     |  6 +++---
 include/linux/memcontrol.h |  6 +++---
 kernel/cgroup/cpuset.c     | 16 ++++++++--------
 mm/memcontrol.c            |  6 ++++--
 mm/vmscan.c                | 35 +++++++++++++++++++++++------------
 5 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..eb358c3aa9c0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
 {
-	return true;
+	return node_possible_map;
 }
 #endif /* !CONFIG_CPUSETS */
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fd400082313a..f9463d853bba 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1740,7 +1740,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg);
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
 
@@ -1811,9 +1811,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
 	return 0;
 }
 
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
 {
-	return true;
+	return node_possible_map;
 }
 
 static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6e6eb09b8db6..abb9afb64205 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4416,23 +4416,23 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
 	return allowed;
 }
 
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
 {
+	nodemask_t nodes = node_possible_map;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	bool allowed;
 
 	/*
 	 * In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
 	 * and mems_allowed is likely to be empty even if we could get to it,
-	 * so return true to avoid taking a global lock on the empty check.
+	 * so return directly to avoid taking a global lock on the empty check.
 	 */
-	if (!cpuset_v2())
-		return true;
+	if (!cgroup || !cpuset_v2())
+		return nodes;
 
 	css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
 	if (!css)
-		return true;
+		return nodes;
 
 	/*
 	 * Normally, accessing effective_mems would require the cpuset_mutex
@@ -4447,9 +4447,9 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
 	 * cannot make strong isolation guarantees, so this is acceptable.
 	 */
 	cs = container_of(css, struct cpuset, css);
-	allowed = node_isset(nid, cs->effective_mems);
+	nodes_copy(nodes, cs->effective_mems);
 	css_put(css);
-	return allowed;
+	return nodes;
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75fc22a33b28..c2f4ac50d5c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5597,9 +5597,11 @@ subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
 
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+nodemask_t mem_cgroup_node_get_allowed(struct mem_cgroup *memcg)
 {
-	return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+	if (memcg)
+		return cpuset_node_get_allowed(memcg->css.cgroup);
+	return node_possible_map;
 }
 
 void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a4b308a2f9ad..711a04baf258 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -345,18 +345,24 @@ static bool can_demote(int nid, struct scan_control *sc,
 		       struct mem_cgroup *memcg)
 {
 	int demotion_nid;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	nodemask_t allowed_mask, allowed_mems;
 
-	if (!numa_demotion_enabled)
+	if (!pgdat || !numa_demotion_enabled)
 		return false;
 	if (sc && sc->no_demotion)
 		return false;
 
-	demotion_nid = next_demotion_node(nid);
-	if (demotion_nid == NUMA_NO_NODE)
+	node_get_allowed_targets(pgdat, &allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return false;
+
+	allowed_mems = mem_cgroup_node_get_allowed(memcg);
+	nodes_and(allowed_mask, allowed_mask, allowed_mems);
+	if (nodes_empty(allowed_mask))
 		return false;
 
-	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+	for_each_node_mask(demotion_nid, allowed_mask) {
 		int z;
 		struct zone *zone;
 		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
@@ -1029,11 +1035,12 @@ static struct folio *alloc_demote_folio(struct folio *src,
  * Folios which are not demoted are left on @demote_folios.
  */
 static unsigned int demote_folio_list(struct list_head *demote_folios,
-				      struct pglist_data *pgdat)
+				      struct pglist_data *pgdat,
+				      struct mem_cgroup *memcg)
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
-	nodemask_t allowed_mask;
+	nodemask_t allowed_mask, allowed_mems;
 
 	struct migration_target_control mtc = {
 		/*
@@ -1043,7 +1050,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		 */
 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
 			__GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = target_nid,
 		.nmask = &allowed_mask,
 		.reason = MR_DEMOTION,
 	};
@@ -1051,10 +1057,15 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	if (list_empty(demote_folios))
 		return 0;
 
-	if (target_nid == NUMA_NO_NODE)
-		return 0;
-
 	node_get_allowed_targets(pgdat, &allowed_mask);
+	allowed_mems = mem_cgroup_node_get_allowed(memcg);
+	nodes_and(allowed_mask, allowed_mask, allowed_mems);
+	if (nodes_empty(allowed_mask))
+		return 0;
+
+	if (target_nid == NUMA_NO_NODE || !node_isset(target_nid, allowed_mask))
+		target_nid = node_random(&allowed_mask);
+	mtc.nid = target_nid;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1576,7 +1587,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
-	nr_demoted = demote_folio_list(&demote_folios, pgdat);
+	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
 	nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
-- 
2.52.0.358.g0dd7633a29-goog
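
Illustration only, not part of the patch: the target selection that the v3
demote_folio_list() change performs can be modelled in userspace with a
plain 64-bit bitmask. This is a rough sketch under stated assumptions.
Every toy_* name is hypothetical and stands in for the kernel's nodemask
helpers; toy_first_node() is a deterministic substitute for node_random(),
chosen so the result is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit n set means node n is present. */
typedef uint64_t toy_nodemask_t;

#define TOY_NO_NODE (-1)

static bool toy_node_isset(int nid, toy_nodemask_t mask)
{
	return nid >= 0 && nid < 64 && ((mask >> nid) & 1u);
}

/* Lowest set node; a deterministic substitute for node_random(). */
static int toy_first_node(toy_nodemask_t mask)
{
	for (int nid = 0; nid < 64; nid++)
		if (toy_node_isset(nid, mask))
			return nid;
	return TOY_NO_NODE;
}

/*
 * Model of the selection order in the patched demote_folio_list():
 *   1. restrict the demotion targets to the cgroup's effective mems;
 *   2. give up if nothing is left;
 *   3. keep the preferred node if it survived, otherwise fall back.
 */
static int toy_pick_demotion_target(int preferred_nid,
				    toy_nodemask_t demotion_targets,
				    toy_nodemask_t effective_mems)
{
	toy_nodemask_t allowed = demotion_targets & effective_mems;

	/* The toy mask only models nodes 0..63. */
	assert(preferred_nid < 64);

	if (!allowed)
		return TOY_NO_NODE;
	if (!toy_node_isset(preferred_nid, allowed))
		return toy_first_node(allowed);
	return preferred_nid;
}
```

With the "Bug 2" topology from the changelog (demotion targets 3-5,
cpuset.mems 0-2,4-5, preferred node 3) the model falls back to node 4
instead of refusing to demote. With the "Bug 1" topology (targets 2-3,
cpuset.mems 0-2) it never returns node 3.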