[PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier

Akinobu Mita posted 3 patches 3 weeks, 4 days ago
On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

The memory usage is the number of workers specified with the --memrate
option multiplied by the buffer size specified with the --memrate-bytes
option, so please adjust it so that it exceeds the total size of the
installed DRAM and CXL memory.

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.

However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
/sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
invoked and the system will become inoperable, regardless of whether MGLRU
is enabled or not.

This issue can be reproduced using NUMA emulation even on systems with
only DRAM.  You can create two fake memory tiers by booting a single-node
system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.

The reason for this issue is that memory allocations do not directly
trigger the oom-killer: reclaim assumes that if the target node has an
underlying memory tier, memory can always be reclaimed from it by
demotion.

This change avoids the issue by not attempting to demote when the
underlying nodes have less free memory than their minimum watermarks, so
that the oom-killer is triggered directly from memory allocations.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
v4:
- add a code comment in can_demote()

v3:
- rebase to linux-next (next-20260108), where demotion target has changed
  from node id to node mask.

v2:
- describe reproducibility with !mglru in the commit log
- removed unnecessary consideration for scan control when checking demotion_nid watermarks

 mm/vmscan.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f35afc5093dc..f980c533c778 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -358,7 +358,22 @@ static bool can_demote(int nid, struct scan_control *sc,
 
 	/* Filter out nodes that are not in cgroup's mems_allowed. */
 	mem_cgroup_node_filter_allowed(memcg, &allowed_mask);
-	return !nodes_empty(allowed_mask);
+	if (nodes_empty(allowed_mask))
+		return false;
+
+	/* Check if there is enough free memory in the demotion target */
+	for_each_node_mask(nid, allowed_mask) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(nid);
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
+			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
+						ZONE_MOVABLE, 0))
+				return true;
+		}
+	}
+	return false;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.43.0
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Michal Hocko 3 weeks, 4 days ago
On Tue 13-01-26 17:14:53, Akinobu Mita wrote:
> On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> the OOM killer is not invoked properly.
> 
> Here's the command to reproduce:
> 
> $ sudo swapoff -a
> $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
>     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> 
> The memory usage is the number of workers specified with the --memrate
> option multiplied by the buffer size specified with the --memrate-bytes
> option, so please adjust it so that it exceeds the total size of the
> installed DRAM and CXL memory.
> 
> If swap is disabled, you can usually expect the OOM killer to terminate
> the stress-ng process when memory usage approaches the installed memory
> size.
> 
> However, if multiple memory-tiers exist (multiple
> /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> invoked and the system will become inoperable, regardless of whether MGLRU
> is enabled or not.
> 
> This issue can be reproduced using NUMA emulation even on systems with
> only DRAM.  You can create two-fake memory-tiers by booting a single-node
> system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> parameters.
> 
> The reason for this issue is that memory allocations do not directly
> trigger the oom-killer, assuming that if the target node has an underlying
> memory tier, it can always be reclaimed by demotion.

Why don't we fall back to no demotion mode in this case? I mean we have 
shrink_folio_list:
        if (!list_empty(&demote_folios)) {
                /* Folios which weren't demoted go back on @folio_list */
                list_splice_init(&demote_folios, folio_list);

                /*
                 * goto retry to reclaim the undemoted folios in folio_list if
                 * desired.
                 *
                 * Reclaiming directly from top tier nodes is not often desired
                 * due to it breaking the LRU ordering: in general memory
                 * should be reclaimed from lower tier nodes and demoted from
                 * top tier nodes.
                 *
                 * However, disabling reclaim from top tier nodes entirely
                 * would cause ooms in edge scenarios where lower tier memory
                 * is unreclaimable for whatever reason, eg memory being
                 * mlocked or too hot to reclaim. We can disable reclaim
                 * from top tier nodes in proactive reclaim though as that is
                 * not real memory pressure.
                 */
                if (!sc->proactive) {
                        do_demote_pass = false;
                        goto retry;
                }
        }

to handle this situation no?

-- 
Michal Hocko
SUSE Labs
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 3 weeks, 3 days ago
On Tue, Jan 13, 2026 at 22:40 Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 13-01-26 17:14:53, Akinobu Mita wrote:
> > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > the OOM killer is not invoked properly.
> >
> > Here's the command to reproduce:
> >
> > $ sudo swapoff -a
> > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> >
> > The memory usage is the number of workers specified with the --memrate
> > option multiplied by the buffer size specified with the --memrate-bytes
> > option, so please adjust it so that it exceeds the total size of the
> > installed DRAM and CXL memory.
> >
> > If swap is disabled, you can usually expect the OOM killer to terminate
> > the stress-ng process when memory usage approaches the installed memory
> > size.
> >
> > However, if multiple memory-tiers exist (multiple
> > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> > /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> > invoked and the system will become inoperable, regardless of whether MGLRU
> > is enabled or not.
> >
> > This issue can be reproduced using NUMA emulation even on systems with
> > only DRAM.  You can create two-fake memory-tiers by booting a single-node
> > system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> > parameters.
> >
> > The reason for this issue is that memory allocations do not directly
> > trigger the oom-killer, assuming that if the target node has an underlying
> > memory tier, it can always be reclaimed by demotion.
>
> Why don't we fall back to no demotion mode in this case? I mean we have
> shrink_folio_list:
>         if (!list_empty(&demote_folios)) {
>                 /* Folios which weren't demoted go back on @folio_list */
>                 list_splice_init(&demote_folios, folio_list);
>
>                 /*
>                  * goto retry to reclaim the undemoted folios in folio_list if
>                  * desired.
>                  *
>                  * Reclaiming directly from top tier nodes is not often desired
>                  * due to it breaking the LRU ordering: in general memory
>                  * should be reclaimed from lower tier nodes and demoted from
>                  * top tier nodes.
>                  *
>                  * However, disabling reclaim from top tier nodes entirely
>                  * would cause ooms in edge scenarios where lower tier memory
>                  * is unreclaimable for whatever reason, eg memory being
>                  * mlocked or too hot to reclaim. We can disable reclaim
>                  * from top tier nodes in proactive reclaim though as that is
>                  * not real memory pressure.
>                  */
>                 if (!sc->proactive) {
>                         do_demote_pass = false;
>                         goto retry;
>                 }
>         }
>
> to handle this situation no?

can_demote() is called from four places.
I tried modifying the patch to change the behavior only when can_demote()
is called from shrink_folio_list(), but the problem was not fixed
(oom did not occur).

Similarly, changing the behavior of can_demote() when called from
can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
but not when called from get_swappiness(), did not fix the problem either
(oom did not occur).

Conversely, changing the behavior only when called from get_swappiness(),
but not changing the behavior of can_reclaim_anon_pages(),
shrink_folio_list(), and can_age_anon_pages(), fixed the problem
(oom did occur).

Therefore, it appears that the behavior of get_swappiness() is important
in this issue.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Joshua Hahn 2 weeks, 2 days ago
Hello Akinobu,

I hope you are doing well! First of all, sorry for the late review on the
series. I have a few questions about the problem itself, and how it is being
triggered.

> > > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > > the OOM killer is not invoked properly.
> > >
> > > Here's the command to reproduce:
> > >
> > > $ sudo swapoff -a
> > > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> > >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> > >
> > > The memory usage is the number of workers specified with the --memrate
> > > option multiplied by the buffer size specified with the --memrate-bytes
> > > option, so please adjust it so that it exceeds the total size of the
> > > installed DRAM and CXL memory.
> > >
> > > If swap is disabled, you can usually expect the OOM killer to terminate
> > > the stress-ng process when memory usage approaches the installed memory
> > > size.
> > >
> > > However, if multiple memory-tiers exist (multiple
> > > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> > > /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> > > invoked and the system will become inoperable, regardless of whether MGLRU
> > > is enabled or not.
> > >
> > > This issue can be reproduced using NUMA emulation even on systems with
> > > only DRAM.  You can create two-fake memory-tiers by booting a single-node
> > > system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> > > parameters.

[...snip...]
 
> can_demote() is called from four places.
> I tried modifying the patch to change the behavior only when can_demote()
> is called from shrink_folio_list(), but the problem was not fixed
> (oom did not occur).
> 
> Similarly, changing the behavior of can_demote() when called from
> can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> but not when called from get_swappiness(), did not fix the problem either
> (oom did not occur).
> 
> Conversely, changing the behavior only when called from get_swappiness(),
> but not changing the behavior of can_reclaim_anon_pages(),
> shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> (oom did occur).
> 
> Therefore, it appears that the behavior of get_swappiness() is important
> in this issue.

This is quite mysterious.

Especially because get_swappiness() is an MGLRU exclusive function, I find
it quite strange that the issue you mention above occurs regardless of whether
MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
as before? Were these hangs similarly fixed by modifying the callsite in
get_swappiness?

On a separate note, I feel a bit uncomfortable about making this the default
setting, regardless of whether there is swap space or not. Just as it is
easy to create a degenerate scenario where all memory is unreclaimable
and the system starts going into (wasteful) reclaim on the lower tiers,
it is equally easy to create a scenario where all memory is very easily
reclaimable (say, clean pagecache) and we OOM without making any attempt to
free up memory on the lower tiers.

Reality is likely somewhere in between. And from my perspective, as long as
we have some amount of easily reclaimable memory, I don't think immediately
OOMing will be helpful for the system (and even if none of the memory is
easily reclaimable, we should still try doing something before killing).

> > > The reason for this issue is that memory allocations do not directly
> > > trigger the oom-killer, assuming that if the target node has an underlying
> > > memory tier, it can always be reclaimed by demotion.

This patch enforces that the opposite of this assumption is true; that even
if a target node has an underlying memory tier, it can never be reclaimed by
demotion.

Certainly for systems with swap and some compression methods (z{ram, swap}),
this new enforcement could be harmful to the system. What do you think?

Again, sorry for the late review. I hope you have a great day!
Joshua
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 1 week, 6 days ago
On Fri, Jan 23, 2026 at 3:34 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> Hello Akinobu,
>
> I hope you are doing well! First of all, sorry for the late review on the
> series. I have a few questions about the problem itself, and how it is being
> triggered.
>
> > > > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > > > the OOM killer is not invoked properly.
> > > >
> > > > Here's the command to reproduce:
> > > >
> > > > $ sudo swapoff -a
> > > > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> > > >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> > > >
> > > > The memory usage is the number of workers specified with the --memrate
> > > > option multiplied by the buffer size specified with the --memrate-bytes
> > > > option, so please adjust it so that it exceeds the total size of the
> > > > installed DRAM and CXL memory.
> > > >
> > > > If swap is disabled, you can usually expect the OOM killer to terminate
> > > > the stress-ng process when memory usage approaches the installed memory
> > > > size.
> > > >
> > > > However, if multiple memory-tiers exist (multiple
> > > > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> > > > /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> > > > invoked and the system will become inoperable, regardless of whether MGLRU
> > > > is enabled or not.
> > > >
> > > > This issue can be reproduced using NUMA emulation even on systems with
> > > > only DRAM.  You can create two-fake memory-tiers by booting a single-node
> > > > system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> > > > parameters.
>
> [...snip...]
>
> > can_demote() is called from four places.
> > I tried modifying the patch to change the behavior only when can_demote()
> > is called from shrink_folio_list(), but the problem was not fixed
> > (oom did not occur).
> >
> > Similarly, changing the behavior of can_demote() when called from
> > can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> > but not when called from get_swappiness(), did not fix the problem either
> > (oom did not occur).
> >
> > Conversely, changing the behavior only when called from get_swappiness(),
> > but not changing the behavior of can_reclaim_anon_pages(),
> > shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> > (oom did occur).
> >
> > Therefore, it appears that the behavior of get_swappiness() is important
> > in this issue.
>
> This is quite mysterious.
>
> Especially because get_swappiness() is an MGLRU exclusive function, I find
> it quite strange that the issue you mention above occurs regardless of whether
> MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> as before? Were these hangs similarly fixed by modifying the callsite in
> get_swappiness?

Good point.
When MGLRU is disabled, changing only the behavior of can_demote()
called by get_swappiness() did not solve the problem.

Instead, the problem was avoided by changing only the behavior of
can_demote() called by can_reclaim_anon_pages(), without changing the
behavior of can_demote() called from other places.

> On a separate note, I feel a bit uncomfortable for making this the default
> setting, regardless of whether there is swap space or not. Just as it is
> easy to create a degenerate scenario where all memory is unreclaimable
> and the system starts going into (wasteful) reclaim on the lower tiers,
> it is equally easy to create a scenario where all memory is very easily
> reclaimable (say, clean pagecache) and we OOM without making any attempt to
> free up memory on the lower tiers.
>
> Reality is likely somewhere in between. And from my perspective, as long as
> we have some amount of easily reclaimable memory, I don't think immediately
> OOMing will be helpful for the system (and even if none of the memory is
> easily reclaimable, we should still try doing something before killing).
>
> > > > The reason for this issue is that memory allocations do not directly
> > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > memory tier, it can always be reclaimed by demotion.
>
> This patch enforces that the opposite of this assumption is true; that even
> if a target node has an underlying memory tier, it can never be reclaimed by
> demotion.
>
> Certainly for systems with swap and some compression methods (z{ram, swap}),
> this new enforcement could be harmful to the system. What do you think?

Thank you for the detailed explanation.

I understand the concern regarding the current patch, which only
checks the free memory of the demotion target node.
I will explore a solution.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Joshua Hahn 1 week, 4 days ago
> > > Therefore, it appears that the behavior of get_swappiness() is important
> > > in this issue.
> >
> > This is quite mysterious.
> >
> > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > it quite strange that the issue you mention above occurs regardless of whether
> > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > as before? Were these hangs similarly fixed by modifying the callsite in
> > get_swappiness?
> 
> Good point.
> When MGLRU is disabled, changing only the behavior of can_demote()
> called by get_swappiness() did not solve the problem.
> 
> Instead, the problem was avoided by changing only the behavior of
> can_demote() called by can_reclaim_anon_page(), without changing the
> behavior of can_demote() called from other places.
> 
> > On a separate note, I feel a bit uncomfortable for making this the default
> > setting, regardless of whether there is swap space or not. Just as it is
> > easy to create a degenerate scenario where all memory is unreclaimable
> > and the system starts going into (wasteful) reclaim on the lower tiers,
> > it is equally easy to create a scenario where all memory is very easily
> > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > free up memory on the lower tiers.
> >
> > Reality is likely somewhere in between. And from my perspective, as long as
> > we have some amount of easily reclaimable memory, I don't think immediately
> > OOMing will be helpful for the system (and even if none of the memory is
> > easily reclaimable, we should still try doing something before killing).
> >
> > > > > The reason for this issue is that memory allocations do not directly
> > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > memory tier, it can always be reclaimed by demotion.
> >
> > This patch enforces that the opposite of this assumption is true; that even
> > if a target node has an underlying memory tier, it can never be reclaimed by
> > demotion.
> >
> > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > this new enforcement could be harmful to the system. What do you think?
> 
> Thank you for the detailed explanation.
> 
> I understand the concern regarding the current patch, which only
> checks the free memory of the demotion target node.
> I will explore a solution.

Hello Akinobu, I hope you had a great weekend!

I noticed something that I thought was worth flagging. It seems like the
primary addition of this patch, which is to check for zone_watermark_ok
across the zones, is already a part of should_reclaim_retry():

    /*
     * Keep reclaiming pages while there is a chance this will lead
     * somewhere.  If none of the target zones can satisfy our allocation
     * request even if all reclaimable pages are considered then we are
     * screwed and have to go OOM.
     */
    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                ac->highest_zoneidx, ac->nodemask) {

	[...snip...]

        /*
         * Would the allocation succeed if we reclaimed all
         * reclaimable pages?
         */
        wmark = __zone_watermark_ok(zone, order, min_wmark,
                ac->highest_zoneidx, alloc_flags, available);

        if (wmark) {
            ret = true;
            break;
        }
    }

... which is called in __alloc_pages_slowpath. I wonder why we don't already
hit this. It seems to do the same thing your patch is doing?

What do you think? I hope you have a great day!
Joshua
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 1 week, 3 days ago
On Wed, Jan 28, 2026 at 7:00 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > in this issue.
> > >
> > > This is quite mysterious.
> > >
> > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > it quite strange that the issue you mention above occurs regardless of whether
> > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > get_swappiness?
> >
> > Good point.
> > When MGLRU is disabled, changing only the behavior of can_demote()
> > called by get_swappiness() did not solve the problem.
> >
> > Instead, the problem was avoided by changing only the behavior of
> > can_demote() called by can_reclaim_anon_page(), without changing the
> > behavior of can_demote() called from other places.
> >
> > > On a separate note, I feel a bit uncomfortable for making this the default
> > > setting, regardless of whether there is swap space or not. Just as it is
> > > easy to create a degenerate scenario where all memory is unreclaimable
> > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > it is equally easy to create a scenario where all memory is very easily
> > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > free up memory on the lower tiers.
> > >
> > > Reality is likely somewhere in between. And from my perspective, as long as
> > > we have some amount of easily reclaimable memory, I don't think immediately
> > > OOMing will be helpful for the system (and even if none of the memory is
> > > easily reclaimable, we should still try doing something before killing).
> > >
> > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > memory tier, it can always be reclaimed by demotion.
> > >
> > > This patch enforces that the opposite of this assumption is true; that even
> > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > demotion.
> > >
> > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > this new enforcement could be harmful to the system. What do you think?
> >
> > Thank you for the detailed explanation.
> >
> > I understand the concern regarding the current patch, which only
> > checks the free memory of the demotion target node.
> > I will explore a solution.
>
> Hello Akinobu, I hope you had a great weekend!
>
> I noticed something that I thought was worth flagging. It seems like the
> primary addition of this patch, which is to check for zone_watermark_ok
> across the zones, is already a part of should_reclaim_retry():
>
>     /*
>      * Keep reclaiming pages while there is a chance this will lead
>      * somewhere.  If none of the target zones can satisfy our allocation
>      * request even if all reclaimable pages are considered then we are
>      * screwed and have to go OOM.
>      */
>     for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
>                 ac->highest_zoneidx, ac->nodemask) {
>
>         [...snip...]
>
>         /*
>          * Would the allocation succeed if we reclaimed all
>          * reclaimable pages?
>          */
>         wmark = __zone_watermark_ok(zone, order, min_wmark,
>                 ac->highest_zoneidx, alloc_flags, available);
>
>         if (wmark) {
>             ret = true;
>             break;
>         }
>     }
>
> ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> hit this. It seems to do the same thing your patch is doing?

I checked the number of calls and the time spent for several functions
called by __alloc_pages_slowpath(), and found that time is spent in
__alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().

After a few minutes, once the debug code that automatically resets
numa_demotion_enabled to false has been executed,
__alloc_pages_direct_reclaim() appears to exit immediately.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Michal Hocko 5 days, 15 hours ago
On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> On Wed, Jan 28, 2026 at 7:00 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > in this issue.
> > > >
> > > > This is quite mysterious.
> > > >
> > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > get_swappiness?
> > >
> > > Good point.
> > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > called by get_swappiness() did not solve the problem.
> > >
> > > Instead, the problem was avoided by changing only the behavior of
> > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > behavior of can_demote() called from other places.
> > >
> > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > it is equally easy to create a scenario where all memory is very easily
> > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > free up memory on the lower tiers.
> > > >
> > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > easily reclaimable, we should still try doing something before killing).
> > > >
> > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > memory tier, it can always be reclaimed by demotion.
> > > >
> > > > This patch enforces that the opposite of this assumption is true; that even
> > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > demotion.
> > > >
> > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > this new enforcement could be harmful to the system. What do you think?
> > >
> > > Thank you for the detailed explanation.
> > >
> > > I understand the concern regarding the current patch, which only
> > > checks the free memory of the demotion target node.
> > > I will explore a solution.
> >
> > Hello Akinobu, I hope you had a great weekend!
> >
> > I noticed something that I thought was worth flagging. It seems like the
> > primary addition of this patch, which is to check for zone_watermark_ok
> > across the zones, is already a part of should_reclaim_retry():
> >
> >     /*
> >      * Keep reclaiming pages while there is a chance this will lead
> >      * somewhere.  If none of the target zones can satisfy our allocation
> >      * request even if all reclaimable pages are considered then we are
> >      * screwed and have to go OOM.
> >      */
> >     for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> >                 ac->highest_zoneidx, ac->nodemask) {
> >
> >         [...snip...]
> >
> >         /*
> >          * Would the allocation succeed if we reclaimed all
> >          * reclaimable pages?
> >          */
> >         wmark = __zone_watermark_ok(zone, order, min_wmark,
> >                 ac->highest_zoneidx, alloc_flags, available);
> >
> >         if (wmark) {
> >             ret = true;
> >             break;
> >         }
> >     }
> >
> > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > hit this. It seems to do the same thing your patch is doing?
> 
> I checked the number of calls and the time spent for several functions
> called by __alloc_pages_slowpath(), and found that time is spent in
> __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
> 
> After a few minutes have passed and the debug code that automatically
> resets numa_demotion_enabled to false is executed, it appears that
> __alloc_pages_direct_reclaim() immediately exits.

First of all is this MGLRU or traditional reclaim? Or both?

Then another thing I've noticed only now. There seems to be a layering
discrepancy (for traditional LRU reclaim): get_scan_count, which
controls the to-be-reclaimed LRUs, always relies on can_reclaim_anon_pages,
while further down the reclaim path shrink_folio_list tries to be more
clever and avoids demotion if it turns out to be inefficient.

I wouldn't be surprised if get_scan_count predominantly (or even
exclusively) scanned anon LRUs while increasing the reclaim priority
(so essentially just checked all anon pages on the LRU list) before
concluding that it makes no sense. This can take quite some time, and
in the worst case you could be recycling a couple of page cache pages
remaining on the list to make small but sufficient progress to loop
around.

So I think the first step is to make the demotion behavior consistent.
If demotion fails then it would probably makes sense to set sc->no_demotion
so that get_scan_count can learn from the reclaim feedback that
anonymous pages are not a good reclaim target in this situation. But the
whole reclaim path needs a careful review I am afraid.
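
The idea can be sketched in userspace terms. In the model below,
struct scan_control, can_demote(), can_reclaim_anon_pages() and the
demotion attempt are all simplified stand-ins for the kernel's
functions of the same names, not actual kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of struct scan_control: only the field
 * relevant to the demotion-feedback idea. */
struct scan_control {
	bool no_demotion;	/* set once demotion proves futile */
};

/* Model of can_demote(): demotion is allowed unless reclaim
 * feedback has already disabled it for this scan. */
static bool can_demote(const struct scan_control *sc)
{
	return !sc->no_demotion;
}

/* Model of can_reclaim_anon_pages() on a swapless system:
 * anon pages are reclaimable only if they can be swapped or
 * demoted. */
static bool can_reclaim_anon_pages(const struct scan_control *sc,
				   bool swap_available)
{
	return swap_available || can_demote(sc);
}

/* Model of a demotion attempt: if the lower tier has no free
 * pages, record the failure so get_scan_count stops targeting
 * anon LRUs on subsequent passes. */
static bool try_demote(struct scan_control *sc, long lower_tier_free)
{
	if (lower_tier_free > 0)
		return true;
	sc->no_demotion = true;	/* feed the failure back */
	return false;
}
```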
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 4 days, 3 hours ago
On Mon, Feb 2, 2026 at 22:11, Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> > On Wed, Jan 28, 2026 at 7:00, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > >
> > > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > > in this issue.
> > > > >
> > > > > This is quite mysterious.
> > > > >
> > > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > > get_swappiness?
> > > >
> > > > Good point.
> > > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > > called by get_swappiness() did not solve the problem.
> > > >
> > > > Instead, the problem was avoided by changing only the behavior of
> > > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > > behavior of can_demote() called from other places.
> > > >
> > > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > > it is equally easy to create a scenario where all memory is very easily
> > > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > > free up memory on the lower tiers.
> > > > >
> > > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > > easily reclaimable, we should still try doing something before killing).
> > > > >
> > > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > > memory tier, it can always be reclaimed by demotion.
> > > > >
> > > > > This patch enforces that the opposite of this assumption is true; that even
> > > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > > demotion.
> > > > >
> > > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > > this new enforcement could be harmful to the system. What do you think?
> > > >
> > > > Thank you for the detailed explanation.
> > > >
> > > > I understand the concern regarding the current patch, which only
> > > > checks the free memory of the demotion target node.
> > > > I will explore a solution.
> > >
> > > Hello Akinobu, I hope you had a great weekend!
> > >
> > > I noticed something that I thought was worth flagging. It seems like the
> > > primary addition of this patch, which is to check for zone_watermark_ok
> > > across the zones, is already a part of should_reclaim_retry():
> > >
> > >     /*
> > >      * Keep reclaiming pages while there is a chance this will lead
> > >      * somewhere.  If none of the target zones can satisfy our allocation
> > >      * request even if all reclaimable pages are considered then we are
> > >      * screwed and have to go OOM.
> > >      */
> > >     for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> > >                 ac->highest_zoneidx, ac->nodemask) {
> > >
> > >         [...snip...]
> > >
> > >         /*
> > >          * Would the allocation succeed if we reclaimed all
> > >          * reclaimable pages?
> > >          */
> > >         wmark = __zone_watermark_ok(zone, order, min_wmark,
> > >                 ac->highest_zoneidx, alloc_flags, available);
> > >
> > >         if (wmark) {
> > >             ret = true;
> > >             break;
> > >         }
> > >     }
> > >
> > > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > > hit this. It seems to do the same thing your patch is doing?
> >
> > I checked the number of calls and the time spent for several functions
> > called by __alloc_pages_slowpath(), and found that time is spent in
> > __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
> >
> > After a few minutes have passed and the debug code that automatically
> > resets numa_demotion_enabled to false is executed, it appears that
> > __alloc_pages_direct_reclaim() immediately exits.
>
> First of all is this MGLRU or traditional reclaim? Or both?

The behavior is almost the same whether MGLRU is enabled or not.
However, one difference is that __alloc_pages_direct_reclaim() may be
called multiple times when __alloc_pages_slowpath() is called, and
should_reclaim_retry() also returns true several times.

This is probably because the watermark check in should_reclaim_retry()
considers not only NR_FREE_PAGES but also NR_ZONE_INACTIVE_ANON and
NR_ZONE_ACTIVE_ANON as potential free memory (via zone_reclaimable_pages()).
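
The effect can be modelled in a few lines of userspace C. The
zone_model struct and its numbers are made up, and
zone_reclaimable_pages() / should_reclaim_retry() below are simplified
stand-ins for the kernel helpers, just to show why the retry check
stays optimistic as long as anon pages count as reclaimable:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-zone page counters; a toy stand-in for vmstat state. */
struct zone_model {
	long nr_free;
	long nr_inactive_anon;
	long nr_active_anon;
	long nr_file;
};

/* Stand-in for zone_reclaimable_pages(): anon pages count as
 * reclaimable whenever reclaim still believes it can swap or
 * demote them, even if every actual attempt will fail. */
static long zone_reclaimable_pages(const struct zone_model *z,
				   bool can_reclaim_anon)
{
	long n = z->nr_file;

	if (can_reclaim_anon)
		n += z->nr_inactive_anon + z->nr_active_anon;
	return n;
}

/* Stand-in for the __zone_watermark_ok() check inside
 * should_reclaim_retry(): would the allocation succeed if all
 * "reclaimable" pages were actually freed? */
static bool should_reclaim_retry(const struct zone_model *z,
				 long min_wmark, bool can_reclaim_anon)
{
	long available = z->nr_free +
			 zone_reclaimable_pages(z, can_reclaim_anon);

	return available > min_wmark;
}
```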

The following are the increments of the /proc/vmstat counters from the
start of the reproduction test until the problem occurred,
numa_demotion_enabled was automatically reset by the debug code, and
the OOM occurred a few minutes later:

workingset_nodes 578
workingset_refault_anon 5054381
workingset_refault_file 41502
workingset_activate_anon 3003283
workingset_activate_file 33232
workingset_restore_anon 2556549
workingset_restore_file 27139
workingset_nodereclaim 3472
pgdemote_kswapd 121684
pgdemote_direct 23977
pgdemote_khugepaged 0
pgdemote_proactive 0
pgsteal_kswapd 3480404
pgsteal_direct 2602011
pgsteal_khugepaged 74
pgsteal_proactive 0
pgscan_kswapd 93334262
pgscan_direct 227649302
pgscan_khugepaged 1232161
pgscan_proactive 0
pgscan_direct_throttle 18
pgscan_anon 320480379
pgscan_file 1735346
pgsteal_anon 5828270
pgsteal_file 254219

> Then another thing I've noticed only now. There seems to be a layering
> discrepancy (for traditional LRU reclaim) when get_scan_count which
> controls the to-be-reclaimed lrus always relies on can_reclaim_anon_pages
> while down the reclaim path shrink_folio_list tries to be more clever
> and avoid demotion if it turns out to be inefficient.
>
> I wouldn't be surprised if get_scan_count predominantly (or even
> exclusively) scanned anon LRUs only while increasing the reclaim
> priority  (so essentially just checked all anon pages on the LRU list)
> before concluding that it makes no sense. This can take quite some time
> and in the worst case you could be recycling couple of page cache pages
> remaining on the list to make small but sufficient progress to loop
> around.
>
> So I think the first step is to make the demotion behavior consistent.
> If demotion fails then it would probably makes sense to set sc->no_demotion
> so that get_scan_count can learn from the reclaim feedback that
> anonymous pages are not a good reclaim target in this situation. But the
> whole reclaim path needs a careful review I am afraid.

If migrate_pages() in demote_folio_list() detects that it cannot
migrate any folios and all calls to alloc_demote_folio() also fail
(this is made possible by adding a few counter fields to
migration_target_control), setting sc->no_demotion to true also
resolves the issue:

        migrate_pages(demote_folios, alloc_demote_folio, NULL,
                      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
                      &nr_succeeded);
        /*
         * Nothing was demoted and every allocation on the demotion
         * target failed: the lower tier is out of memory, so stop
         * considering demotion for the rest of this reclaim.
         */
        if (!nr_succeeded && mtc.nr_alloc_tried > 0 &&
                        mtc.nr_alloc_tried == mtc.nr_alloc_failed)
                sc->no_demotion = 1;
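
In userspace terms, the backoff condition boils down to the following.
The mtc_model struct and its counter fields stand in for the proposed
migration_target_control extension; this is a sketch of the check, not
the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the proposed migration_target_control extension:
 * counters for demote-target allocation attempts. */
struct mtc_model {
	unsigned long nr_alloc_tried;
	unsigned long nr_alloc_failed;
};

/* Decide whether to set sc->no_demotion: back off only when
 * nothing was demoted AND at least one allocation on the target
 * node was attempted AND every one of those attempts failed. */
static bool demotion_exhausted(unsigned long nr_succeeded,
			       const struct mtc_model *mtc)
{
	return !nr_succeeded &&
	       mtc->nr_alloc_tried > 0 &&
	       mtc->nr_alloc_tried == mtc->nr_alloc_failed;
}
```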
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Michal Hocko 3 days, 19 hours ago
On Wed 04-02-26 11:07:03, Akinobu Mita wrote:
> On Mon, Feb 2, 2026 at 22:11, Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> > > On Wed, Jan 28, 2026 at 7:00, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > > >
> > > > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > > > in this issue.
> > > > > >
> > > > > > This is quite mysterious.
> > > > > >
> > > > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > > > get_swappiness?
> > > > >
> > > > > Good point.
> > > > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > > > called by get_swappiness() did not solve the problem.
> > > > >
> > > > > Instead, the problem was avoided by changing only the behavior of
> > > > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > > > behavior of can_demote() called from other places.
> > > > >
> > > > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > > > it is equally easy to create a scenario where all memory is very easily
> > > > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > > > free up memory on the lower tiers.
> > > > > >
> > > > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > > > easily reclaimable, we should still try doing something before killing).
> > > > > >
> > > > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > > > memory tier, it can always be reclaimed by demotion.
> > > > > >
> > > > > > This patch enforces that the opposite of this assumption is true; that even
> > > > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > > > demotion.
> > > > > >
> > > > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > > > this new enforcement could be harmful to the system. What do you think?
> > > > >
> > > > > Thank you for the detailed explanation.
> > > > >
> > > > > I understand the concern regarding the current patch, which only
> > > > > checks the free memory of the demotion target node.
> > > > > I will explore a solution.
> > > >
> > > > Hello Akinobu, I hope you had a great weekend!
> > > >
> > > > I noticed something that I thought was worth flagging. It seems like the
> > > > primary addition of this patch, which is to check for zone_watermark_ok
> > > > across the zones, is already a part of should_reclaim_retry():
> > > >
> > > >     /*
> > > >      * Keep reclaiming pages while there is a chance this will lead
> > > >      * somewhere.  If none of the target zones can satisfy our allocation
> > > >      * request even if all reclaimable pages are considered then we are
> > > >      * screwed and have to go OOM.
> > > >      */
> > > >     for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> > > >                 ac->highest_zoneidx, ac->nodemask) {
> > > >
> > > >         [...snip...]
> > > >
> > > >         /*
> > > >          * Would the allocation succeed if we reclaimed all
> > > >          * reclaimable pages?
> > > >          */
> > > >         wmark = __zone_watermark_ok(zone, order, min_wmark,
> > > >                 ac->highest_zoneidx, alloc_flags, available);
> > > >
> > > >         if (wmark) {
> > > >             ret = true;
> > > >             break;
> > > >         }
> > > >     }
> > > >
> > > > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > > > hit this. It seems to do the same thing your patch is doing?
> > >
> > > I checked the number of calls and the time spent for several functions
> > > called by __alloc_pages_slowpath(), and found that time is spent in
> > > __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
> > >
> > > After a few minutes have passed and the debug code that automatically
> > > resets numa_demotion_enabled to false is executed, it appears that
> > > __alloc_pages_direct_reclaim() immediately exits.
> >
> > First of all is this MGLRU or traditional reclaim? Or both?
> 
> The behavior is almost the same whether MGLRU is enabled or not.
> However, one difference is that __alloc_pages_direct_reclaim() may be
> called multiple times when __alloc_pages_slowpath() is called, and
> should_reclaim_retry() also returns true several times.
> 
> This is probably because the watermark check in should_reclaim_retry()
> considers not only NR_FREE_PAGES but also NR_ZONE_INACTIVE_ANON and
> NR_ZONE_ACTIVE_ANON as potential free memory. (zone_reclaimable_pages())

Yes, seems like the same problem as with get_scan_count.

> The following is the increment of stats in /proc/vmstat from the start
> of the reproduction test until the problem occurred and
> numa_demotion_enabled was automatically reset by the debug code and
> OOM occurred a few minutes later:
> 
> workingset_nodes 578
> workingset_refault_anon 5054381
> workingset_refault_file 41502
> workingset_activate_anon 3003283
> workingset_activate_file 33232
> workingset_restore_anon 2556549
> workingset_restore_file 27139
> workingset_nodereclaim 3472
> pgdemote_kswapd 121684
> pgdemote_direct 23977
> pgdemote_khugepaged 0
> pgdemote_proactive 0
> pgsteal_kswapd 3480404
> pgsteal_direct 2602011
> pgsteal_khugepaged 74
> pgsteal_proactive 0
> pgscan_kswapd 93334262
> pgscan_direct 227649302
> pgscan_khugepaged 1232161
> pgscan_proactive 0
> pgscan_direct_throttle 18
> pgscan_anon 320480379
> pgscan_file 1735346
> pgsteal_anon 5828270
> pgsteal_file 254219

You can clearly see that the number of pages scanned is orders of
magnitude larger than the number of pages actually reclaimed. So there
is a lot of scanning without any progress at all.

> > Then another thing I've noticed only now. There seems to be a layering
> > discrepancy (for traditional LRU reclaim) when get_scan_count which
> > controls the to-be-reclaimed lrus always relies on can_reclaim_anon_pages
> > while down the reclaim path shrink_folio_list tries to be more clever
> > and avoid demotion if it turns out to be inefficient.
> >
> > I wouldn't be surprised if get_scan_count predominantly (or even
> > exclusively) scanned anon LRUs only while increasing the reclaim
> > priority  (so essentially just checked all anon pages on the LRU list)
> > before concluding that it makes no sense. This can take quite some time
> > and in the worst case you could be recycling couple of page cache pages
> > remaining on the list to make small but sufficient progress to loop
> > around.
> >
> > So I think the first step is to make the demotion behavior consistent.
> > If demotion fails then it would probably makes sense to set sc->no_demotion
> > so that get_scan_count can learn from the reclaim feedback that
> > anonymous pages are not a good reclaim target in this situation. But the
> > whole reclaim path needs a careful review I am afraid.
> 
> If migrate_pages() in demote_folio_list() detects that it cannot
> migrate any folios and all calls to alloc_demote_folio() also fail
> (this is made possible by adding a few fields to migration_target_control),
> it sets sc->no_demotion to true, which also resolves the issue.
> 
>         migrate_pages(demote_folios, alloc_demote_folio, NULL,
>                       (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>                       &nr_succeeded);
>         if (!nr_succeeded && mtc.nr_alloc_tried > 0 &&
>                         (mtc.nr_alloc_tried == mtc.nr_alloc_failed)) {
>                 sc->no_demotion = 1;
>         }

This seems too low-level a place to make such a decision. Keep in mind
that shrink_list operates on SWAP_CLUSTER_MAX batches, so the backoff
could be premature. shrink_lruvec seems like a better place to make
such a decision, but this really requires a deeper evaluation.

Anyway, it is good that we have a better understanding of what is
going on. Thanks for confirming the theory.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Michal Hocko 5 days, 15 hours ago
Btw. collecting /proc/vmstat during this test could give you a better
insight into what is going on. Especially reclaim stats (pgscan*,
pgsteal*, pgdemote*, workingset*).
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Gregory Price 3 weeks, 3 days ago
On Wed, Jan 14, 2026 at 09:51:28PM +0900, Akinobu Mita wrote:
> can_demote() is called from four places.
> I tried modifying the patch to change the behavior only when can_demote()
> is called from shrink_folio_list(), but the problem was not fixed
> (oom did not occur).
> 
> Similarly, changing the behavior of can_demote() when called from
> can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> but not when called from get_swappiness(), did not fix the problem either
> (oom did not occur).
> 
> Conversely, changing the behavior only when called from get_swappiness(),
> but not changing the behavior of can_reclaim_anon_pages(),
> shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> (oom did occur).
> 
> Therefore, it appears that the behavior of get_swappiness() is important
> in this issue.

"It appears that..." and the process of twiddling bits and observing
behavior does not strike confidence in this solution.

Can you take another go at trying to define the bad interaction more
explicitly? I worry that we're modifying vmscan.c behavior to induce an
OOM for a corner case - but it will also cause another regression.

(I am guilty of doing this)

~Gregory
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 3 weeks, 3 days ago
On Thu, Jan 15, 2026 at 2:49, Gregory Price <gourry@gourry.net> wrote:
>
> On Wed, Jan 14, 2026 at 09:51:28PM +0900, Akinobu Mita wrote:
> > can_demote() is called from four places.
> > I tried modifying the patch to change the behavior only when can_demote()
> > is called from shrink_folio_list(), but the problem was not fixed
> > (oom did not occur).
> >
> > Similarly, changing the behavior of can_demote() when called from
> > can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> > but not when called from get_swappiness(), did not fix the problem either
> > (oom did not occur).
> >
> > Conversely, changing the behavior only when called from get_swappiness(),
> > but not changing the behavior of can_reclaim_anon_pages(),
> > shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> > (oom did occur).
> >
> > Therefore, it appears that the behavior of get_swappiness() is important
> > in this issue.
>
> "It appears that..." and the process of twiddling bits and observing
> behavior does not strike confidence in this solution.
>
> Can you take another go at trying to define the bad interaction more
> explicitly? I worry that we're modifying vmscan.c behavior to induce an
> OOM for a corner case - but it will also cause another regression.

I agree.
It surprised me that the behavior of get_swappiness() had an impact on the
issue, so I'll clarify its relationship to this issue.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 2 weeks, 3 days ago
On Thu, Jan 15, 2026 at 9:40, Akinobu Mita <akinobu.mita@gmail.com> wrote:
>
> On Thu, Jan 15, 2026 at 2:49, Gregory Price <gourry@gourry.net> wrote:
> >
> > On Wed, Jan 14, 2026 at 09:51:28PM +0900, Akinobu Mita wrote:
> > > can_demote() is called from four places.
> > > I tried modifying the patch to change the behavior only when can_demote()
> > > is called from shrink_folio_list(), but the problem was not fixed
> > > (oom did not occur).
> > >
> > > Similarly, changing the behavior of can_demote() when called from
> > > can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> > > but not when called from get_swappiness(), did not fix the problem either
> > > (oom did not occur).
> > >
> > > Conversely, changing the behavior only when called from get_swappiness(),
> > > but not changing the behavior of can_reclaim_anon_pages(),
> > > shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> > > (oom did occur).
> > >
> > > Therefore, it appears that the behavior of get_swappiness() is important
> > > in this issue.
> >
> > "It appears that..." and the process of twiddling bits and observing
> > behavior does not strike confidence in this solution.
> >
> > Can you take another go at trying to define the bad interaction more
> > explicitly? I worry that we're modifying vmscan.c behavior to induce an
> > OOM for a corner case - but it will also cause another regression.
>
> I agree.
> It surprised me that the behavior of get_swappiness() had an impact on the
> issue, so I'll clarify its relationship to this issue.

To investigate what was happening while the system was inoperable due
to this issue, I applied a patch that automatically resets
demotion_enabled to false after a certain period of time has passed
since demotion_enabled was set to true.

This allowed me to investigate what was happening during this time,
and it showed that the system was not in a permanently inoperable
state such as a deadlock, but was just wasting time while
demotion_enabled was true.

Several times, I measured the elapsed time of __alloc_pages_slowpath()
calls that ended in out_of_memory(), along with the number of folios
scanned during their execution, i.e., the total increase in
scan_control.nr_to_scan per execution of shrink_zones().

When demotion_enabled was initially false, the longest
__alloc_pages_slowpath() execution time was 185 ms, with 18 calls
to try_to_free_pages() and 3095 folio scans.

On the other hand, when demotion_enabled was true,
__alloc_pages_slowpath() took 144692 ms, try_to_free_pages() was
called once, and 5811414 folio scans were performed.

However, as mentioned above, in this case, demotion_enabled
automatically returns to false during execution, limiting the number
of folios that can be scanned and speeding up completion; otherwise,
it would have taken longer and required more folio scans.

Almost all of the execution time is consumed by folio_alloc_swap(),
and analysis using Flame Graph reveals that spinlock contention is
occurring in the call path __mem_cgroup_try_charge_swap ->
__memcg_memory_event -> cgroup_file_notify.

In this reproduction procedure, no swap is configured, and calls to
folio_alloc_swap() always fail. To avoid spinlock contention, I tried
modifying the source code to return -ENOMEM without calling
folio_alloc_swap(), but this caused other lock contention
(lruvec->lru_lock in evict_folios()) in several other places, so it
did not work around the problem.

When demotion_enabled is true, demotion may be able to move anonymous
pages to a lower node and free up memory even without a swap device,
so more anonymous pages become candidates for eviction.
However, once the free memory on the demotion target node runs out,
various processes all perform similar operations in search of free
memory, wasting time on lock contention.

Reducing lock contention or changing the eviction process is also an
interesting solution, but at present I have not come up with any workaround
other than disabling demotion when free memory on lower-level nodes is
exhausted.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Gregory Price 2 weeks, 2 days ago
On Thu, Jan 22, 2026 at 09:32:51AM +0900, Akinobu Mita wrote:
> Almost all of the execution time is consumed by folio_alloc_swap(),
> and analysis using Flame Graph reveals that spinlock contention is
> occurring in the call path __mem_cgroup_try_charge_swap ->
> __memcg_memory_event -> cgroup_file_notify.
> 
> In this reproduction procedure, no swap is configured, and calls to
> folio_alloc_swap() always fail. To avoid spinlock contention, I tried
> modifying the source code to return -ENOMEM without calling
> folio_alloc_swap(), but this caused other lock contention
> (lruvec->lru_lock in evict_folios()) in several other places, so it
> did not work around the problem.

Doesn't this suggest what I mentioned earlier?  If you don't demote when
the target node is full, then you're removing a memory pressure signal
from the lower node and reclaim won't ever clean up the lower node to
make room for future demotions.

I might be missing something here, though, is your system completely out
of memory at this point?

Presumably you're hitting direct reclaim and not just waking up kswapd
because things are locking up.

If there's no swap and no where to demote, then this all sounds like
normal OOM behavior.

Does this whole thing go away if you configure some swap space?

> 
> When demotion_enabled is true, if there is no free memory on the target
> node during memory allocation, even if there is no swap device, demotion
> may be able to move anonymous pages to a lower node and free up memory,
> so more anonymous pages become candidates for eviction.
> However, if free memory on the target node for demotion runs out,
> various processes will perform similar operations in search of free
> memory, wasting time on lock contention.
> 
> Reducing lock contention or changing the eviction process is also an
> interesting solution, but at present I have not come up with any workaround
> other than disabling demotion when free memory on lower-level nodes is
> exhausted.

The lock contention seems like a symptom, not the cause.  The cause
appears to be that you're out of memory with no swap configured.

~Gregory
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 1 week, 6 days ago
On Fri, Jan 23, 2026 at 1:39, Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Jan 22, 2026 at 09:32:51AM +0900, Akinobu Mita wrote:
> > Almost all of the execution time is consumed by folio_alloc_swap(),
> > and analysis using Flame Graph reveals that spinlock contention is
> > occurring in the call path __mem_cgroup_try_charge_swap ->
> > __memcg_memory_event -> cgroup_file_notify.
> >
> > In this reproduction procedure, no swap is configured, and calls to
> > folio_alloc_swap() always fail. To avoid spinlock contention, I tried
> > modifying the source code to return -ENOMEM without calling
> > folio_alloc_swap(), but this caused other lock contention
> > (lruvec->lru_lock in evict_folios()) in several other places, so it
> > did not work around the problem.
>
> Doesn't this suggest what I mentioned earlier?  If you don't demote when
> the target node is full, then you're removing a memory pressure signal
> from the lower node and reclaim won't ever clean up the lower node to
> make room for future demotions.

Thank you for your analysis.
Now I finally understand the concerns (though I'll need to learn more
to find a solution...)

> I might be missing something here, though, is your system completely out
> of memory at this point?
>
> Presumably you're hitting direct reclaim and not just waking up kswapd
> because things are locking up.
>
> If there's no swap and no where to demote, then this all sounds like
> normal OOM behavior.
>
> Does this whole thing go away if you configure some swap space?

I tried it and found that the same issue occurred when I ran a
stress-ng-memrate workload that exceeded the combined memory and swap
capacity using the same repro steps.

To be more precise, for over an hour, I was unable to do anything and
of course I couldn't manually terminate the workload, so the only
option was to reset the power.

It may be a bit inconvenient that, if something similar to this
workload happened due to a mistake or a runaway program, the system
would be inoperable for hours without ever triggering an OOM.

> > When demotion_enabled is true, if there is no free memory on the target
> > node during memory allocation, even if there is no swap device, demotion
> > may be able to move anonymous pages to a lower node and free up memory,
> > so more anonymous pages become candidates for eviction.
> > However, if free memory on the target node for demotion runs out,
> > various processes will perform similar operations in search of free
> > memory, wasting time on lock contention.
> >
> > Reducing lock contention or changing the eviction process is also an
> > interesting solution, but at present I have not come up with any workaround
> > other than disabling demotion when free memory on lower-level nodes is
> > exhausted.
>
> The lock contention seems like a symptom, not the cause.  The cause
> appears to be that you're out of memory with no swap configured.

I understand that there are issues with the current patch, but I would
like to resolve the above issues even when swap is configured with
a better solution.
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Gregory Price 1 week, 4 days ago
On Mon, Jan 26, 2026 at 10:57:11AM +0900, Akinobu Mita wrote:
> >
> > Doesn't this suggest what I mentioned earlier?  If you don't demote when
> > the target node is full, then you're removing a memory pressure signal
> > from the lower node and reclaim won't ever clean up the lower node to
> > make room for future demotions.
> 
> Thank you for your analysis.
> Now I finally understand the concerns (though I'll need to learn more
> to find a solution...)
>

Apologies - sorry for the multiple threads, I accidentally replied on v3.

It's taken me a while to detangle this, but what looks like it might
be happening is that demote_folios is actually stealing all the
potential swap candidates, leaving reclaim with no forward progress
and no OOM signal.

1) demotion is already not a reclaim signal, so forgive my prior
   comments, i missed the masking of ~__GFP_RECLAIM

2) it appears we spend most of the time building the demotion list, but
   then just abandon the list without having made progress later when
   the demotion allocation target fails (w/ __THISNODE you don't get
   OOM on allocation failure, we just continue)

3) I don't see hugetlb pages causing the GFP_RECLAIM override bug being
   an issue in reclaim, because page->lru is used for something else
   in hugetlb pages (i.e. we shouldn't see hugetlb pages here)

4) skipping the entire demotion pass will shunt all this pressure to
   swap instead (do_demote_pass = false -> so we swap instead).


The risk here is that the OOM situation is temporary and some amount of
memory from the top tier gets shunted to swap while kswapd on other
tiers makes progress.  This is effectively LRU inversion.

Swappiness likely affects behavior because it changes how aggressively
your lower tier gets reclaimed, and therefore reduces upper-tier
demotion failures until swap is already pressured.

I'm not sure there's a best option here; we may need additional input to
determine what the least-worst option is.  Causing LRU inversion when
all the nodes are pressured but swap is available is not preferable.

~Gregory
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Akinobu Mita 1 week, 3 days ago
On Wed, Jan 28, 2026 at 6:21 Gregory Price <gourry@gourry.net> wrote:
>
> On Mon, Jan 26, 2026 at 10:57:11AM +0900, Akinobu Mita wrote:
> > >
> > > Doesn't this suggest what I mentioned earlier?  If you don't demote when
> > > the target node is full, then you're removing a memory pressure signal
> > > from the lower node and reclaim won't ever clean up the lower node to
> > > make room for future demotions.
> >
> > Thank you for your analysis.
> > Now I finally understand the concerns (though I'll need to learn more
> > to find a solution...)
> >
>
> Apologies for the multiple threads - I accidentally replied on v3.
>
> It's taken me a while to untangle this, but it looks like what might be
> happening is that demote_folios is actually stealing all the potential
> candidates for swap, leaving reclaim with no forward progress and no
> OOM signal.
>
> 1) demotion is already not a reclaim signal, so forgive my prior
>    comments, I missed the masking of ~__GFP_RECLAIM
>
> 2) it appears we spend most of the time building the demotion list, but
>    then just abandon the list without having made progress later when
>    the demotion allocation target fails (w/ __GFP_THISNODE you don't get
>    OOM on allocation failure, we just continue)
>
> 3) I don't see hugetlb pages causing the GFP_RECLAIM override bug being
>    an issue in reclaim, because page->lru is used for something else
>    in hugetlb pages (i.e. we shouldn't see hugetlb pages here)
>
> 4) skipping the entire demotion pass will shunt all this pressure to
>    swap instead (do_demote_pass = false -> so we swap instead).
>
>
> The risk here is that the OOM situation is temporary and some amount of
> memory from the top tier gets shunted to swap while kswapd on other
> tiers makes progress.  This is effectively LRU inversion.
>
> Swappiness likely affects behavior because it changes how aggressively
> your lower tier gets reclaimed, and therefore reduces upper-tier
> demotion failures until swap is already pressured.
>
> I'm not sure there's a best option here; we may need additional input to
> determine what the least-worst option is.  Causing LRU inversion when
> all the nodes are pressured but swap is available is not preferable.

Would it be better if can_demote() returned false after checking that
there is no free swap space at all and that there is not enough free space
on the demote target node or its lower nodes?

can_demote()
{
        ...
        /* If demotion node isn't in the cgroup's mems_allowed, fall back */
        if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
                if (get_nr_swap_pages() > 0)
                        return true;
                do {
                        int z;
                        struct zone *zone;
                        struct pglist_data *pgdat = NODE_DATA(demotion_nid);

                        for_each_managed_zone_pgdat(zone, pgdat, z, MAX_NR_ZONES - 1) {
                                if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
                                                      ZONE_MOVABLE, 0))
                                        return true;
                        }
                        demotion_nid = next_demotion_node(demotion_nid);
                } while (demotion_nid != NUMA_NO_NODE);
        }
        return false;
}
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Gregory Price 1 week, 3 days ago
On Thu, Jan 29, 2026 at 09:51:44AM +0900, Akinobu Mita wrote:
> > I'm not sure there's a best-option here, we may need additional input to
> > determine what the least-worst option is.  Causing LRU inversion when
> > all the nodes are pressured but swap is available is not preferable.
> 
> Would it be better if can_demote() returned false after checking that
> there is no free swap space at all and that there is not enough free space
> on the demote target node or its lower nodes?
>

I need some time to think on this.

If we take your patch, I think we essentially default to the same
behavior as if demotion were wholesale disabled in the first place -
toptier nodes would reclaim space directly into swap.  zswap would
probably get skipped if we're already in direct reclaim (if we can't
allocate a page, neither can zswap).

The alternative is that reclaim makes absolutely no progress even if
there is (z)swap space available - which is essentially what you're
experiencing.

Maybe there is an argument for simply falling back to swap if there's no
room on any node further away.

Will ponder this for a bit and get back to you.

~Gregory
Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
Posted by Michal Hocko 3 weeks, 3 days ago
On Wed 14-01-26 21:51:28, Akinobu Mita wrote:
> On Tue, Jan 13, 2026 at 22:40 Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-01-26 17:14:53, Akinobu Mita wrote:
> > > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > > the OOM killer is not invoked properly.
> > >
> > > Here's the command to reproduce:
> > >
> > > $ sudo swapoff -a
> > > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> > >     --memrate-rd-mbs 1 --memrate-wr-mbs 1
> > >
> > > The memory usage is the number of workers specified with the --memrate
> > > option multiplied by the buffer size specified with the --memrate-bytes
> > > option, so please adjust it so that it exceeds the total size of the
> > > installed DRAM and CXL memory.
> > >
> > > If swap is disabled, you can usually expect the OOM killer to terminate
> > > the stress-ng process when memory usage approaches the installed memory
> > > size.
> > >
> > > However, if multiple memory-tiers exist (multiple
> > > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist) and
> > > /sys/kernel/mm/numa/demotion_enabled is true, the OOM killer will not be
> > > invoked and the system will become inoperable, regardless of whether MGLRU
> > > is enabled or not.
> > >
> > > This issue can be reproduced using NUMA emulation even on systems with
> > > only DRAM.  You can create two-fake memory-tiers by booting a single-node
> > > system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> > > parameters.
> > >
> > > The reason for this issue is that memory allocations do not directly
> > > trigger the oom-killer, assuming that if the target node has an underlying
> > > memory tier, it can always be reclaimed by demotion.
> >
> > Why don't we fall back to no demotion mode in this case? I mean we have
> > shrink_folio_list:
> >         if (!list_empty(&demote_folios)) {
> >                 /* Folios which weren't demoted go back on @folio_list */
> >                 list_splice_init(&demote_folios, folio_list);
> >
> >                 /*
> >                  * goto retry to reclaim the undemoted folios in folio_list if
> >                  * desired.
> >                  *
> >                  * Reclaiming directly from top tier nodes is not often desired
> >                  * due to it breaking the LRU ordering: in general memory
> >                  * should be reclaimed from lower tier nodes and demoted from
> >                  * top tier nodes.
> >                  *
> >                  * However, disabling reclaim from top tier nodes entirely
> >                  * would cause ooms in edge scenarios where lower tier memory
> >                  * is unreclaimable for whatever reason, eg memory being
> >                  * mlocked or too hot to reclaim. We can disable reclaim
> >                  * from top tier nodes in proactive reclaim though as that is
> >                  * not real memory pressure.
> >                  */
> >                 if (!sc->proactive) {
> >                         do_demote_pass = false;
> >                         goto retry;
> >                 }
> >         }
> >
> > to handle this situation no?
> 
> can_demote() is called from four places.
> I tried modifying the patch to change the behavior only when can_demote()
> is called from shrink_folio_list(), but the problem was not fixed
> (oom did not occur).
> 
> Similarly, changing the behavior of can_demote() when called from
> can_reclaim_anon_pages(), shrink_folio_list(), and can_age_anon_pages(),
> but not when called from get_swappiness(), did not fix the problem either
> (oom did not occur).
> 
> Conversely, changing the behavior only when called from get_swappiness(),
> but not changing the behavior of can_reclaim_anon_pages(),
> shrink_folio_list(), and can_age_anon_pages(), fixed the problem
> (oom did occur).
> 
> Therefore, it appears that the behavior of get_swappiness() is important
> in this issue.

You have said that there is no swap configured in the system, right?
That would imply that anonymous pages are not reclaimable at all (see
can_reclaim_anon_pages)?

-- 
Michal Hocko
SUSE Labs