Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion tier
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the next-tier nodes
are not set in mems_effective (see the sketch below).
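For illustration, assume the demotion order 0-2 -> 3 -> 4-5 used in
the bug 2 reproduction below. The old check reduces to:

	demotion_nid = next_demotion_node(nid);	/* node 0 -> node 3 */
	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
	return mem_cgroup_node_allowed(memcg, demotion_nid);

With cpuset.mems = "0-2,4-5", node 3 fails this check, so can_demote()
returns false even though nodes 4-5 further down the chain are allowed
targets.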
These bugs break the resource isolation provided by cpuset.mems.
This is visible from userspace on multi-tier memory systems because
pages either fail to be demoted entirely or are demoted to nodes that
are not allowed.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return the effective_mems mask, allowing a
direct logical AND against the demotion targets. Also update
can_demote() and demote_folio_list() accordingly.
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
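# One way to observe the violation while the test runs (assuming the
# standard sysfs node layout):
watch -n1 'grep MemFree /sys/devices/system/node/node3/meminfo'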
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
Patch against the Linux mainline.
Tested on mainline and passed. Pages are demoted to the correct nodes
on a VM with three emulated far-memory tiers.
Tested on mm-everything, with Akinobu Mita's series "mm: fix oom-killer
not being invoked when demotion is enabled v2" applied, and passed. The
OOM killer is triggered properly when the far nodes are out of memory.
---
include/linux/cpuset.h | 6 +++---
include/linux/memcontrol.h | 6 +++---
kernel/cgroup/cpuset.c | 33 +++++++++++++++++++++++----------
mm/memcontrol.c | 10 ++++++++--
mm/vmscan.c | 30 ++++++++++++++++++------------
5 files changed, 55 insertions(+), 30 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..631577384677 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
- return true;
+ nodes_copy(*mask, node_states[N_MEMORY]);
}
#endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564..412db7663357 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1744,7 +1744,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1815,9 +1815,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+ nodemask_t *mask)
{
- return true;
}
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3e8cc34d8d50..5bbd1d2fe5f6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4427,27 +4427,41 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return mems_allowed mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns mems_allowed mask from a cgroup cpuset if it is cgroup v2 and
+ * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * Returned @mask may be empty, and nodes in @mask are not guaranteed
+ * to be online.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
* and mems_allowed is likely to be empty even if we could get to it,
- * so return true to avoid taking a global lock on the empty check.
+ * so return directly to avoid taking a global lock on the empty check.
*/
- if (!cpuset_v2())
- return true;
+ if (!cgroup || !cpuset_v2()) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
- if (!css)
- return true;
+ if (!css) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
/*
* Normally, accessing effective_mems would require the cpuset_mutex
- * or callback_lock - but node_isset is atomic and the reference
+ * or callback_lock - but not doing so is acceptable and the reference
* taken via cgroup_get_e_css is sufficient to protect css.
*
* Since this interface is intended for use by migration paths, we
@@ -4458,9 +4472,8 @@ bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
* cannot make strong isolation guarantees, so this is acceptable.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ nodes_copy(*mask, cs->effective_mems);
css_put(css);
- return allowed;
}
/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86f43b7e5f71..252cc456714a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5624,9 +5624,15 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ nodemask_t allowed;
+
+ if (!memcg)
+ return;
+
+ cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+ nodes_and(*mask, *mask, allowed);
}
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..eed1becfcb34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
- int demotion_nid;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ nodemask_t allowed_mask;
- if (!numa_demotion_enabled)
+ if (!pgdat || !numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ if (nodes_empty(allowed_mask))
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- return mem_cgroup_node_allowed(memcg, demotion_nid);
+ /* Filter out nodes that are not in cgroup's mems_allowed. */
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ return !nodes_empty(allowed_mask);
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1019,7 +1021,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
int target_nid = next_demotion_node(pgdat->node_id);
unsigned int nr_succeeded;
@@ -1033,7 +1036,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1041,10 +1043,14 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
if (list_empty(demote_folios))
return 0;
- if (target_nid == NUMA_NO_NODE)
- return 0;
-
node_get_allowed_targets(pgdat, &allowed_mask);
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+
+ if (!node_isset(target_nid, allowed_mask))
+ target_nid = node_random(&allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1566,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
--
2.52.0.358.g0dd7633a29-goog
On Mon, Jan 05, 2026 at 05:01:52AM +0000, Bing Jiao wrote:
... snip ...
> +/**
> + * cpuset_nodes_allowed - return mems_allowed mask from a cgroup cpuset.
> + * @cgroup: pointer to struct cgroup.
> + * @mask: pointer to struct nodemask_t to be returned.
> + *
> + * Returns mems_allowed mask from a cgroup cpuset if it is cgroup v2 and
> + * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
> + *
> + * Returned @mask may be empty, and nodes in @mask are not guaranteed
> + * to be online.
> + **/
> +void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
> {
... snip ...
> /*
> * Normally, accessing effective_mems would require the cpuset_mutex
> - * or callback_lock - but node_isset is atomic and the reference
> + * or callback_lock - but not doing so is acceptable and the reference
"node_isset is atomic" is an argument that not taking cpuset_mutex is
acceptable since it's a singular operation against a nodemask (one bit
it checked) - and therefore for a moment in time the node is either
allowed or not (and we make no absolute guarantee of corrected when this
race occurs, we just note that we're corrected).
nodes_copy is not atomic, and in fact this can result in returning an
empty nodemask if cs->effective_mems is being recalculated at the time
this copy occurs.
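(For reference, nodes_copy() boils down to a plain bitmap copy -
roughly the following, simplified from the bitmap helpers - so a
concurrent rebind can be observed mid-update:)

```
static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
			       unsigned int nbits)
{
	/* A word-at-a-time copy; nothing is atomic across the whole mask. */
	memcpy(dst, src, BITS_TO_LONGS(nbits) * sizeof(unsigned long));
}
```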
Rather than just saying "not doing so is acceptable" - can you please
change this comment to explain the implications of not acquiring the
mutex a little more clearly?
Example:
```
We do not acquire cpuset_mutex during this check because the correctness
of this information is stale immediately after the query anyway - this
saves lock contention in exchange for racing against mems_allowed rebinds.
As a result, @mask may be empty because cs->effective_mems can be rebound
during this call. Callers must check the mask for validity on return.
```
The rest of the comments in the function explain some of this, but I
think with this update the comments need a little more rework.
~Gregory
On Mon, Jan 05, 2026 at 10:54:05AM -0500, Gregory Price wrote:
> ... snip ...
>
> Rather than just saying "not doing so is acceptable" - can you please
> change this comment to explain the implications of not acquiring the
> mutex a little more clearly?
>
> ... snip ...
Thanks for the suggestions. I will reword the comment in V6.
Best,
Bing
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks in reclaim/demotion.
Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:
1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.
2. It checks only the nodes in the immediate next demotion tier
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the next-tier nodes
are not set in mems_effective.
These bugs break the resource isolation provided by cpuset.mems.
This is visible from userspace on multi-tier memory systems because
pages either fail to be demoted entirely or are demoted to nodes that
are not allowed.
To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return the effective_mems mask, allowing a
direct logical AND against the demotion targets (see the sketch below).
Also update can_demote() and demote_folio_list() accordingly.
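Sketch of the resulting check pattern in can_demote() (simplified from
the diff below):

	nodemask_t allowed_mask;

	node_get_allowed_targets(pgdat, &allowed_mask);
	/* Filter out nodes that are not in cgroup's mems_allowed. */
	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
	if (nodes_empty(allowed_mask))
		return false;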
Bug 1 reproduction:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1
Bug 2 reproduction:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.
Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2
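# One way to confirm that no demotion happens (assuming a kernel that
# exposes the pgdemote_* counters in /proc/vmstat):
grep pgdemote /proc/vmstat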
Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Cc: <stable@vger.kernel.org>
Signed-off-by: Bing Jiao <bingjiao@google.com>
---
Patch against the Linux mainline.
Tested on mainline and passed.
Tested on mm-everything, with Akinobu Mita's series "mm: fix oom-killer
not being invoked when demotion is enabled v2" applied, and passed.
v5 -> v6: update cpuset_nodes_allowed()'s comments; move some comments
from cpuset_nodes_allowed() to mem_cgroup_node_filter_allowed().
---
include/linux/cpuset.h | 6 ++---
include/linux/memcontrol.h | 6 ++---
kernel/cgroup/cpuset.c | 54 +++++++++++++++++++++++++-------------
mm/memcontrol.c | 16 +++++++++--
mm/vmscan.c | 30 ++++++++++++---------
5 files changed, 74 insertions(+), 38 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..631577384677 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask);
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
- return true;
+ nodes_copy(*mask, node_states[N_MEMORY]);
}
#endif /* !CONFIG_CPUSETS */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0651865a4564..412db7663357 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1744,7 +1744,7 @@ static inline void count_objcg_events(struct obj_cgroup *objcg,
rcu_read_unlock();
}
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask);
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg);
@@ -1815,9 +1815,9 @@ static inline ino_t page_cgroup_ino(struct page *page)
return 0;
}
-static inline bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+static inline void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg,
+ nodemask_t *mask)
{
- return true;
}
static inline void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3e8cc34d8d50..76d7d0fa8137 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4427,40 +4427,58 @@ bool cpuset_current_node_allowed(int node, gfp_t gfp_mask)
return allowed;
}
-bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+/**
+ * cpuset_nodes_allowed - return effective_mems mask from a cgroup cpuset.
+ * @cgroup: pointer to struct cgroup.
+ * @mask: pointer to struct nodemask_t to be returned.
+ *
+ * Returns effective_mems mask from a cgroup cpuset if it is cgroup v2 and
+ * has cpuset subsys. Otherwise, returns node_states[N_MEMORY].
+ *
+ * This function intentionally avoids taking the cpuset_mutex or callback_lock
+ * when accessing effective_mems. The obtained effective_mems is stale
+ * immediately after the query anyway (e.g., effective_mems may be updated
+ * right after the lock is released but before this function returns).
+ *
+ * As a result, the returned @mask may be empty because cs->effective_mems can
+ * be rebound during this call. In addition, nodes in @mask are not guaranteed
+ * to be online due to node hotplug. Callers should check the mask for
+ * validity on return, based on its subsequent use.
+ **/
+void cpuset_nodes_allowed(struct cgroup *cgroup, nodemask_t *mask)
{
struct cgroup_subsys_state *css;
struct cpuset *cs;
- bool allowed;
/*
* In v1, mem_cgroup and cpuset are unlikely in the same hierarchy
* and mems_allowed is likely to be empty even if we could get to it,
- * so return true to avoid taking a global lock on the empty check.
+ * so return directly to avoid taking a global lock on the empty check.
*/
- if (!cpuset_v2())
- return true;
+ if (!cgroup || !cpuset_v2()) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
css = cgroup_get_e_css(cgroup, &cpuset_cgrp_subsys);
- if (!css)
- return true;
+ if (!css) {
+ nodes_copy(*mask, node_states[N_MEMORY]);
+ return;
+ }
/*
- * Normally, accessing effective_mems would require the cpuset_mutex
- * or callback_lock - but node_isset is atomic and the reference
- * taken via cgroup_get_e_css is sufficient to protect css.
- *
- * Since this interface is intended for use by migration paths, we
- * relax locking here to avoid taking global locks - while accepting
- * there may be rare scenarios where the result may be innaccurate.
+ * The reference taken via cgroup_get_e_css is sufficient to
+ * protect css, but it does not imply safe accesses to effective_mems.
*
- * Reclaim and migration are subject to these same race conditions, and
- * cannot make strong isolation guarantees, so this is acceptable.
+ * Normally, accessing effective_mems would require the cpuset_mutex
+ * or callback_lock - but the information is stale immediately after
+ * the query anyway. We do not acquire the locks here in order to save
+ * lock contention, in exchange for racing against mems_allowed
+ * rebinds.
*/
cs = container_of(css, struct cpuset, css);
- allowed = node_isset(nid, cs->effective_mems);
+ nodes_copy(*mask, cs->effective_mems);
css_put(css);
- return allowed;
}
/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86f43b7e5f71..702c3db624a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5624,9 +5624,21 @@ subsys_initcall(mem_cgroup_swap_init);
#endif /* CONFIG_SWAP */
-bool mem_cgroup_node_allowed(struct mem_cgroup *memcg, int nid)
+void mem_cgroup_node_filter_allowed(struct mem_cgroup *memcg, nodemask_t *mask)
{
- return memcg ? cpuset_node_allowed(memcg->css.cgroup, nid) : true;
+ nodemask_t allowed;
+
+ if (!memcg)
+ return;
+
+ /*
+ * Since this interface is intended for use by migration paths, and
+ * reclaim and migration are subject to race conditions such as changes
+ * in effective_mems and hot-unplugging of nodes, an inaccurate allowed
+ * mask is acceptable.
+ */
+ cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+ nodes_and(*mask, *mask, allowed);
}
void mem_cgroup_show_protected_memory(struct mem_cgroup *memcg)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 670fe9fae5ba..eed1becfcb34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -344,19 +344,21 @@ static void flush_reclaim_state(struct scan_control *sc)
static bool can_demote(int nid, struct scan_control *sc,
struct mem_cgroup *memcg)
{
- int demotion_nid;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ nodemask_t allowed_mask;
- if (!numa_demotion_enabled)
+ if (!pgdat || !numa_demotion_enabled)
return false;
if (sc && sc->no_demotion)
return false;
- demotion_nid = next_demotion_node(nid);
- if (demotion_nid == NUMA_NO_NODE)
+ node_get_allowed_targets(pgdat, &allowed_mask);
+ if (nodes_empty(allowed_mask))
return false;
- /* If demotion node isn't in the cgroup's mems_allowed, fall back */
- return mem_cgroup_node_allowed(memcg, demotion_nid);
+ /* Filter out nodes that are not in cgroup's mems_allowed. */
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ return !nodes_empty(allowed_mask);
}
static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
@@ -1019,7 +1021,8 @@ static struct folio *alloc_demote_folio(struct folio *src,
* Folios which are not demoted are left on @demote_folios.
*/
static unsigned int demote_folio_list(struct list_head *demote_folios,
- struct pglist_data *pgdat)
+ struct pglist_data *pgdat,
+ struct mem_cgroup *memcg)
{
int target_nid = next_demotion_node(pgdat->node_id);
unsigned int nr_succeeded;
@@ -1033,7 +1036,6 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
*/
.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
__GFP_NOMEMALLOC | GFP_NOWAIT,
- .nid = target_nid,
.nmask = &allowed_mask,
.reason = MR_DEMOTION,
};
@@ -1041,10 +1043,14 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
if (list_empty(demote_folios))
return 0;
- if (target_nid == NUMA_NO_NODE)
- return 0;
-
node_get_allowed_targets(pgdat, &allowed_mask);
+ mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
+ if (nodes_empty(allowed_mask))
+ return 0;
+
+ if (!node_isset(target_nid, allowed_mask))
+ target_nid = node_random(&allowed_mask);
+ mtc.nid = target_nid;
/* Demotion ignores all cpuset and mempolicy settings */
migrate_pages(demote_folios, alloc_demote_folio, NULL,
@@ -1566,7 +1572,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
- nr_demoted = demote_folio_list(&demote_folios, pgdat);
+ nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
nr_reclaimed += nr_demoted;
stat->nr_demoted += nr_demoted;
/* Folios that could not be demoted are still in @demote_folios */
--
2.52.0.358.g0dd7633a29-goog
On Tue, 6 Jan 2026 07:56:54 +0000 Bing Jiao <bingjiao@google.com> wrote:
> Fix two bugs in demote_folio_list() and can_demote() due to incorrect
> demotion target checks in reclaim/demotion.
>
> ... snip ...
Thanks.
I'm not confident in my attempts to resolve Akinobu Mita's "mm/vmscan:
don't demote if there is not enough free memory in the lower memory
tier" against this patch's changes in can_demote(), so I'll drop
Akinobu's series, sorry.
Akinobu, can you please redo that series against tomorrow's linux-next?
It looks like it needs a resend anyway, to try to attract some reviewer
input.
On Tue, Jan 06, 2026 at 07:56:54AM +0000, Bing Jiao wrote:
> ... snip ...
>
> Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Bing Jiao <bingjiao@google.com>
This looks ok now, haven't tested it myself yet, but looks good.
Thank you for the fix.
Reviewed-by: Gregory Price <gourry@gourry.net>
~Gregory