[PATCH] disable demotion during memory reclamation

cuishiwei posted 1 patch 3 weeks, 2 days ago
mm/vmscan.c | 1 +
1 file changed, 1 insertion(+)
[PATCH] disable demotion during memory reclamation
Posted by cuishiwei 3 weeks, 2 days ago
When a memory cgroup exceeds its memory limit, the system reclaims
its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
set to 1, memory on fast memory nodes will also be demoted to slow
memory nodes.

This demotion contradicts the goal of reclaiming cold memory within
the memcg. At this point, demoting cold memory from fast to slow nodes
is pointless; it doesn't reduce the memcg's memory usage. Therefore,
we should set no_demotion when reclaiming memory in a memcg.

Signed-off-by: cuishiwei <cuishw@inspur.com>
---
 mm/vmscan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca9e1cd3cd68..1edf618a3604 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.no_demotion = 1,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
-- 
2.43.0
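
For readers unfamiliar with the flag, the following is a minimal userspace
sketch of the kind of gate that .no_demotion feeds into. It is a simplified
model, not the code in mm/vmscan.c: the struct scan_control_model, the
function can_demote_model() and the NO_NODE constant are hypothetical
stand-ins, and the real kernel check may consider more conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "no slower node is available". */
#define NO_NODE (-1)

/* Simplified model of the reclaim control structure: only the
 * field touched by the patch is represented here. */
struct scan_control_model {
	bool no_demotion;
};

/*
 * Decide whether reclaim may demote pages instead of freeing them.
 * demotion_enabled models /sys/kernel/mm/numa/demotion_enabled.
 * target_node is the slower node to demote to, or NO_NODE.
 */
static bool can_demote_model(bool demotion_enabled,
			     const struct scan_control_model *sc,
			     int target_node)
{
	if (!demotion_enabled)
		return false;	/* demotion switched off system-wide */
	if (sc && sc->no_demotion)
		return false;	/* this reclaim pass opted out */
	return target_node != NO_NODE;
}
```

In this model, setting no_demotion for every memcg limit reclaim (as the
patch does) makes the second check fail unconditionally, so cold pages on
the fast node can only be freed or swapped, never moved to the slow node.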
Re: [PATCH] disable demotion during memory reclamation
Posted by Michal Hocko 3 weeks, 2 days ago
On Tue 09-09-25 09:21:41, cuishiwei wrote:
> When a memory cgroup exceeds its memory limit, the system reclaims
> its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> set to 1, memory on fast memory nodes will also be demoted to slow
> memory nodes.
> 
> This demotion contradicts the goal of reclaiming cold memory within
> the memcg. At this point, demoting cold memory from fast to slow nodes
> is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> we should set no_demotion when reclaiming memory in a memcg.

We have discussed this in the past and it is my recollection that we
have concluded that demotion is a part of proper aging and therefore it
should be done during the limit reclaim. Proactive reclaim through
memory.reclaim has slightly different semantics (see
6b426d071419a).

I can see you have replied with more details to Andrew but in general it
is always better to describe your use case and why the current behavior
is unexpected. Is the memory limit not being enforced? Do you see
unexpected memcg OOM situations? What is the actual problem?

> Signed-off-by: cuishiwei <cuishw@inspur.com>
> ---
>  mm/vmscan.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ca9e1cd3cd68..1edf618a3604 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.no_demotion = 1,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> -- 
> 2.43.0
> 

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] disable demotion during memory reclamation
Posted by Johannes Weiner 3 weeks, 2 days ago
On Tue, Sep 09, 2025 at 09:40:51AM +0200, Michal Hocko wrote:
> On Tue 09-09-25 09:21:41, cuishiwei wrote:
> > When a memory cgroup exceeds its memory limit, the system reclaims
> > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > set to 1, memory on fast memory nodes will also be demoted to slow
> > memory nodes.
> > 
> > This demotion contradicts the goal of reclaiming cold memory within
> > the memcg. At this point, demoting cold memory from fast to slow nodes
> > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > we should set no_demotion when reclaiming memory in a memcg.
> 
> We have discussed this in the past and it is my recollection that we
> have concluded that demotion is a part of proper aging and therefore it
> should be done during the limit reclaim.

Yes, thanks. This is intentional. Please see 3f1509c57b1b ("Revert
"mm/vmscan: never demote for memcg reclaim"") for more details.
Re: [PATCH] disable demotion during memory reclamation
Posted by cuishiwei 3 weeks, 1 day ago
On Tue, 9 Sep 2025 15:45:31 +0100 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Sep 09, 2025 at 09:40:51AM +0200, Michal Hocko wrote:
> > On Tue 09-09-25 09:21:41, cuishiwei wrote:
> > > When a memory cgroup exceeds its memory limit, the system reclaims
> > > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > > set to 1, memory on fast memory nodes will also be demoted to slow
> > > memory nodes.
> > > 
> > > This demotion contradicts the goal of reclaiming cold memory within
> > > the memcg. At this point, demoting cold memory from fast to slow nodes
> > > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > > we should set no_demotion when reclaiming memory in a memcg.
> > 
> > We have discussed this in the past and it is my recollection that we
> > have concluded that demotion is a part of proper aging and therefore it
> > should be done during the limit reclaim.
> 
> Yes, thanks. This is intentional. Please see 3f1509c57b1b ("Revert
> "mm/vmscan: never demote for memcg reclaim"") for more details.

Thank you for the guidance. It seems the original processing logic was sound.

Sent using hkml (https://github.com/sjp38/hackermail)
Re: [PATCH] disable demotion during memory reclamation
Posted by Andrew Morton 3 weeks, 2 days ago
On Tue, 9 Sep 2025 09:21:41 +0800 cuishiwei <cuishw@inspur.com> wrote:

> When a memory cgroup exceeds its memory limit, the system reclaims
> its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> set to 1, memory on fast memory nodes will also be demoted to slow
> memory nodes.
> 
> This demotion contradicts the goal of reclaiming cold memory within
> the memcg. At this point, demoting cold memory from fast to slow nodes
> is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> we should set no_demotion when reclaiming memory in a memcg.

Is this from code inspection?  Or is there some workload which benefits
from this change?  If the latter, please tell us all about it.

>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  		.may_unmap = 1,
>  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
>  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> +		.no_demotion = 1,
>  	};
>  	/*
>  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> -- 
> 2.43.0
Re: [PATCH] disable demotion during memory reclamation
Posted by cuishiwei 3 weeks, 2 days ago
On Mon, 8 Sep 2025 18:36:49 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 9 Sep 2025 09:21:41 +0800 cuishiwei <cuishw@inspur.com> wrote:
> 
> > When a memory cgroup exceeds its memory limit, the system reclaims
> > its cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is
> > set to 1, memory on fast memory nodes will also be demoted to slow
> > memory nodes.
> > 
> > This demotion contradicts the goal of reclaiming cold memory within
> > the memcg. At this point, demoting cold memory from fast to slow nodes
> > is pointless; it doesn't reduce the memcg's memory usage. Therefore,
> > we should set no_demotion when reclaiming memory in a memcg.
> 
> Is this from code inspection?  Or is there some workload which benefits
> from this change?  If the latter, please tell us all about it.

Hello, I found this issue while using CXL memory. My machine has one DRAM
NUMA node and one CXL NUMA node:

node 1 cpus: 96 97 98 99...	- DRAM NUMA node
node 1 size: 772048 MB
node 1 free: 759737 MB
node 3 cpus:			- CXL memory NUMA node
node 3 size: 524288 MB
node 3 free: 524287 MB

1. Enable demotion:
echo 1 > /sys/kernel/mm/numa/demotion_enabled

2. Run a memory allocation program in a memcg:
cgexec -g memory:test numactl -N 1 ./allocate_memory 20	- allocate 20G of memory
numastat allocate_memory:
                           Node 0          Node 1          Node 3
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.01            0.00
Private                      0.05        20481.56            0.01

3. Set the memcg memory limit below the current usage:
echo 15G > /sys/fs/cgroup/test/memory.max
numastat allocate_memory:
                           Node 0          Node 1          Node 3
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.00            0.00            0.00
Stack                        0.00            0.01            0.00
Private                      0.00         4011.54        10560.00

As the output above shows, because demotion was enabled, exceeding
the memcg's memory limit first caused memory on the DRAM NUMA node
to be migrated to the CXL NUMA node. Only after that was memory
actually reclaimed, so the migration was unnecessary.
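
The split between demoted and reclaimed memory can be read off the two
numastat snapshots. The sketch below redoes that arithmetic on the
"Private" rows only, assuming numastat's default unit of MB; the struct
node_usage and the helper names are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Per-node "Private" figure from numastat, assumed to be in MB. */
struct node_usage {
	int node;
	double mb;
};

/* Snapshot after step 2 (before the limit was lowered). */
static const struct node_usage usage_before[] = {
	{ 0, 0.05 }, { 1, 20481.56 }, { 3, 0.01 },
};

/* Snapshot after step 3 (memory.max set to 15G). */
static const struct node_usage usage_after[] = {
	{ 0, 0.00 }, { 1, 4011.54 }, { 3, 10560.00 },
};

/* Sum the usage across all nodes in a snapshot. */
static double total_mb(const struct node_usage *u, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += u[i].mb;
	return sum;
}

/* Memory that left the process entirely (freed or swapped out). */
static double reclaimed_mb(void)
{
	return total_mb(usage_before, 3) - total_mb(usage_after, 3);
}

/* Memory that appeared on the CXL node (node 3), i.e. was demoted. */
static double demoted_mb(void)
{
	return usage_after[2].mb - usage_before[2].mb;
}
```

With these figures, roughly 10.3 GB moved to node 3 while only about
5.8 GB actually left the process, and the remaining total sits just
under the 15G limit.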
> 
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6706,6 +6706,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >  		.may_unmap = 1,
> >  		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
> >  		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
> > +		.no_demotion = 1,
> >  	};
> >  	/*
> >  	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
> > -- 
> > 2.43.0

Sent using hkml (https://github.com/sjp38/hackermail)