When a memory cgroup exceeds its memory limit, the system reclaims
its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
set to 1, memory on fast memory nodes will also be demoted to slow
memory nodes.

This demotion contradicts the goal of reclaiming cold memory within
the memcg. At this point, demoting cold memory from fast to slow nodes
is pointless; it doesn't reduce the memcg's memory usage. Therefore,
we should set no_demotion when reclaiming memory in a memcg.
Signed-off-by: cuishiwei <cuishw@inspur.com>
---
mm/vmscan.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca9e1cd3cd68..1edf618a3604 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.no_demotion = 1,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
--
2.43.0
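The disagreement in this thread turns on how demotion interacts with memcg accounting. The following is a toy model, not kernel code: the struct and every `toy_*` name are hypothetical, and it assumes a memcg's charge is simply the sum of its pages across all nodes. Under that assumption it illustrates the patch's premise that moving pages from a fast node to a slow node leaves the charge unchanged, while only freeing pages lowers it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not kernel code): usage counts pages on
 * every node, regardless of memory tier. */
struct toy_memcg {
	unsigned long fast_pages;	/* pages on the fast node (e.g. DRAM) */
	unsigned long slow_pages;	/* pages on the slow node (e.g. CXL) */
	unsigned long limit;		/* limit in pages, analogous to memory.max */
};

unsigned long toy_usage(const struct toy_memcg *cg)
{
	assert(cg != NULL);
	return cg->fast_pages + cg->slow_pages;
}

int toy_over_limit(const struct toy_memcg *cg)
{
	return toy_usage(cg) > cg->limit;
}

/* Demotion: migrate up to nr pages from fast to slow. Usage is unchanged.
 * Returns the number of pages moved. */
unsigned long toy_demote(struct toy_memcg *cg, unsigned long nr)
{
	assert(cg != NULL);
	if (nr > cg->fast_pages)
		nr = cg->fast_pages;
	cg->fast_pages -= nr;
	cg->slow_pages += nr;
	return nr;
}

/* Reclaim: free up to nr pages, slow tier first in this model.
 * Returns the number of pages freed. */
unsigned long toy_reclaim(struct toy_memcg *cg, unsigned long nr)
{
	unsigned long freed, take;

	assert(cg != NULL);
	freed = nr < cg->slow_pages ? nr : cg->slow_pages;
	cg->slow_pages -= freed;

	take = nr - freed;
	if (take > cg->fast_pages)
		take = cg->fast_pages;
	cg->fast_pages -= take;

	return freed + take;
}
```

With 20 pages on the fast node and a limit of 15, `toy_demote(&cg, 10)` leaves `toy_usage(&cg)` at 20 and the group still over its limit; a following `toy_reclaim(&cg, 5)` brings usage down to 15. The model captures only the accounting side of the argument and says nothing about page aging, which is the reason the maintainers give for keeping demotion in limit reclaim.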
On Tue 09-09-25 09:21:41, cuishiwei wrote:
> When a memory cgroup exceeds its memory limit, the system reclaims
> its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> set to 1, memory on fast memory nodes will also be demoted to slow
> memory nodes.
>
> This demotion contradicts the goal of reclaiming cold memory within
> the memcg. At this point, demoting cold memory from fast to slow nodes
> is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> we should set no_demotion when reclaiming memory in a memcg.

We have discussed this in the past and it is my recollection that we
have concluded that demotion is a part of proper aging and therefore it
should be done during the limit reclaim. Proactive reclaim through
memcg.memory_reclaim has a slightly different semantic (see
6b426d071419a).

I can see you have replied with more details to Andrew but in general it
is always better to describe your use case and why the current behavior
is unexpected. Is the memory limit not being enforced? Do you see
unexpected memcg OOM situations? What is the actual problem?

> Signed-off-by: cuishiwei <cuishw@inspur.com>
> ---
>  mm/vmscan.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ca9e1cd3cd68..1edf618a3604 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.no_demotion = 1,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.43.0
>

--
Michal Hocko
SUSE Labs
On Tue, Sep 09, 2025 at 09:40:51AM +0200, Michal Hocko wrote:
> On Tue 09-09-25 09:21:41, cuishiwei wrote:
> > When a memory cgroup exceeds its memory limit, the system reclaims
> > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > set to 1, memory on fast memory nodes will also be demoted to slow
> > memory nodes.
> >
> > This demotion contradicts the goal of reclaiming cold memory within
> > the memcg. At this point, demoting cold memory from fast to slow nodes
> > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > we should set no_demotion when reclaiming memory in a memcg.
>
> We have discussed this in the past and it is my recollection that we
> have concluded that demotion is a part of proper aging and therefore it
> should be done during the limit reclaim.

Yes, thanks. This is intentional. Please see 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") for more details.
On Tue, 9 Sep 2025 15:45:31 +0100 Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Sep 09, 2025 at 09:40:51AM +0200, Michal Hocko wrote:
> > On Tue 09-09-25 09:21:41, cuishiwei wrote:
> > > When a memory cgroup exceeds its memory limit, the system reclaims
> > > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > > set to 1, memory on fast memory nodes will also be demoted to slow
> > > memory nodes.
> > >
> > > This demotion contradicts the goal of reclaiming cold memory within
> > > the memcg. At this point, demoting cold memory from fast to slow nodes
> > > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > > we should set no_demotion when reclaiming memory in a memcg.
> >
> > We have discussed this in the past and it is my recollection that we
> > have concluded that demotion is a part of proper aging and therefore it
> > should be done during the limit reclaim.
>
> Yes, thanks. This is intentional. Please see 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") for more details.

Thank you for the guidance. It seems the original processing logic was sound.

Sent using hkml (https://github.com/sjp38/hackermail)
On Tue, 9 Sep 2025 09:21:41 +0800 cuishiwei <cuishw@inspur.com> wrote:
> When a memory cgroup exceeds its memory limit, the system reclaims
> its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> set to 1, memory on fast memory nodes will also be demoted to slow
> memory nodes.
>
> This demotion contradicts the goal of reclaiming cold memory within
> the memcg. At this point, demoting cold memory from fast to slow nodes
> is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> we should set no_demotion when reclaiming memory in a memcg.

Is this from code inspection? Or is there some workload which benefits
from this change? If the latter, please tell us all about it.

>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.no_demotion = 1,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> --
> 2.43.0
On Mon, 8 Sep 2025 18:36:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 9 Sep 2025 09:21:41 +0800 cuishiwei <cuishw@inspur.com> wrote:
>
> > When a memory cgroup exceeds its memory limit, the system reclaims
> > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > set to 1, memory on fast memory nodes will also be demoted to slow
> > memory nodes.
> >
> > This demotion contradicts the goal of reclaiming cold memory within
> > the memcg. At this point, demoting cold memory from fast to slow nodes
> > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > we should set no_demotion when reclaiming memory in a memcg.
>
> Is this from code inspection? Or is there some workload which benefits
> from this change? If the latter, please tell us all about it.

Hello, I've found an issue while using CXL memory.

My machine has one DRAM NUMA node and one CXL NUMA node:

    node 1 cpus: 96 97 98 99...    - DRAM NUMA node
    node 1 size: 772048 MB
    node 1 free: 759737 MB
    node 3 cpus:                   - CXL memory NUMA node
    node 3 size: 524288 MB
    node 3 free: 524287 MB

1. Enable demotion:

    echo 1 > /sys/kernel/mm/numa/demotion_enabled

2. Run a memory allocation program in a memcg (allocates 20G of memory):

    cgexec -g memory:test numactl -N 1 ./allocate_memory 20

    numastat allocate_memory:
                      Node 0          Node 1          Node 3
             --------------- --------------- ---------------
    Huge                0.00            0.00            0.00
    Heap                0.00            0.00            0.00
    Stack               0.00            0.01            0.00
    Private             0.05        20481.56            0.01

3. Lower the memcg memory limit so that it is exceeded:

    echo 15G > /sys/fs/cgroup/test/memory.max

    numastat allocate_memory:
                      Node 0          Node 1          Node 3
             --------------- --------------- ---------------
    Huge                0.00            0.00            0.00
    Heap                0.00            0.00            0.00
    Stack               0.00            0.01            0.00
    Private             0.00         4011.54        10560.00

As the output shows, because demotion was enabled, exceeding the memcg's
memory limit first caused memory to be migrated from the DRAM NUMA node
to the CXL NUMA node. Memory reclaim was performed only after that, so
the demotion was unnecessary.

>
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >  		.may_unmap = 1,
> >  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> >  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> > +		.no_demotion = 1,
> >  	};
> >  	/*
> >  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> > --
> > 2.43.0

Sent using hkml (https://github.com/sjp38/hackermail)
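The numastat figures in the report above can be checked with a little arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, the "Private" rows are copied from the report, and it assumes the numastat values are in MB and that 15G written to memory.max means 15 * 1024 MB. Values are held in units of 0.01 MB so the sums stay exact.

```c
#include <stdio.h>

#define NODES 3	/* Node 0, Node 1 (DRAM), Node 3 (CXL) */

/* Sum one numastat row across all nodes (units of 0.01 MB). */
long row_total(const long row[NODES])
{
	long sum = 0;

	for (int i = 0; i < NODES; i++)
		sum += row[i];
	return sum;
}

/* "Private" rows from the report, before and after lowering memory.max. */
const long private_before[NODES] = { 5, 2048156, 1 };
const long private_after[NODES]  = { 0, 401154, 1056000 };

/* Assumed meaning of 15G, expressed in units of 0.01 MB. */
const long limit = 15L * 1024 * 100;

/* Print a value held in 0.01 MB units as MB with two decimals. */
void print_mb(const char *label, long v)
{
	printf("%-22s %ld.%02ld MB\n", label, v / 100, v % 100);
}

void print_summary(void)
{
	long before = row_total(private_before);
	long after = row_total(private_after);

	print_mb("resident before:", before);
	print_mb("resident after:", after);
	print_mb("limit:", limit);
	print_mb("no longer resident:", before - after);
	print_mb("moved to Node 3:", private_after[2] - private_before[2]);
}
```

Under these assumptions the process went from 20481.62 MB to 14571.54 MB of private memory, which is below the 15360 MB limit. About 5910.08 MB is no longer resident, and Node 3 gained about 10559.99 MB, so more memory was moved to the CXL node than was needed to get under the limit.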