On systems with multiple memory-tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

The memory usage is the number of workers specified with the --memrate
option multiplied by the buffer size specified with the --memrate-bytes
option, so please adjust it so that it exceeds the total size of the
installed DRAM and CXL memory.
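
As a rough aid for picking the --memrate value, the arithmetic above can
be sketched in C. This is only an illustration: min_workers_to_exceed()
is a hypothetical helper, not part of stress-ng or the kernel, and the
192 GiB figure in the note below is an assumed example rather than a
measured configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest worker count whose combined buffers strictly exceed the
 * installed memory. Both arguments are in bytes; bytes_per_worker
 * corresponds to the value given to --memrate-bytes.
 */
uint64_t min_workers_to_exceed(uint64_t installed_bytes,
			       uint64_t bytes_per_worker)
{
	assert(bytes_per_worker > 0);

	/* workers * bytes_per_worker must be greater than installed_bytes. */
	return installed_bytes / bytes_per_worker + 1;
}
```

Assuming 192 GiB of DRAM plus CXL memory and --memrate-bytes 10G,
min_workers_to_exceed(192ULL << 30, 10ULL << 30) yields 20, which would
match the --memrate 20 used in the command above.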

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.

However, if multiple memory-tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
and /sys/kernel/mm/numa/demotion_enabled is true and
/sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked
and the system will become inoperable.
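
The three conditions can be written down as a small predicate. The
sketch below is a user-space illustration only: struct tiering_state and
hang_conditions_met() are hypothetical names, the caller is assumed to
have already read the sysfs files into strings, and demotion_enabled is
assumed to read back as "true" or "false". It models only the conditions
as stated in this description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Raw values as read from sysfs by the caller (hypothetical container). */
struct tiering_state {
	int nr_memory_tiers;		/* count of memory_tier<N> directories */
	const char *demotion_enabled;	/* contents of numa/demotion_enabled */
	const char *min_ttl_ms;		/* contents of lru_gen/min_ttl_ms */
};

/* True when all three conditions listed in the description hold. */
bool hang_conditions_met(const struct tiering_state *s)
{
	char *end;
	unsigned long ttl;

	assert(s && s->demotion_enabled && s->min_ttl_ms);

	if (s->nr_memory_tiers < 2)
		return false;

	/* Assumption: the file reads back as "true" or "false". */
	if (strncmp(s->demotion_enabled, "true", 4) != 0)
		return false;

	ttl = strtoul(s->min_ttl_ms, &end, 10);
	if (end == s->min_ttl_ms)
		return false;	/* not a number: treat as unknown */

	return ttl == 0;
}
```

A caller would fill in struct tiering_state from the paths named above
and treat a true result as "this setup is expected to reproduce the
problem".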

This issue can be reproduced using NUMA emulation even on systems with
only DRAM. You can create two fake memory tiers by booting a single-node
system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.

The reason for this issue is that if the target node for allocation has
an underlying memory tier, it is always assumed that it can be reclaimed
via demotion.

So this change avoids this issue by not attempting to demote if the
demoting node has less free memory than the minimum watermark.
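
The idea of the check can be illustrated with a simplified user-space
model. This is a sketch under stated assumptions, not the kernel code:
struct zone_model, demotion_target_has_room() and can_demote_model() are
hypothetical names, only order-0 allocations are modeled, and "above the
minimum watermark" is reduced to a plain free-pages comparison without
the extra reserves the real watermark check accounts for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a zone; not the kernel's struct zone. */
struct zone_model {
	bool managed;			/* zone has managed pages */
	unsigned long free_pages;	/* current free pages */
	unsigned long min_wmark;	/* minimum watermark, in pages */
};

/*
 * True if any managed zone with index <= highest_zoneidx has more free
 * pages than its minimum watermark.
 */
bool demotion_target_has_room(const struct zone_model *zones, size_t nr_zones,
			      size_t highest_zoneidx)
{
	assert(zones || nr_zones == 0);

	for (size_t i = 0; i < nr_zones && i <= highest_zoneidx; i++) {
		if (!zones[i].managed)
			continue;
		if (zones[i].free_pages > zones[i].min_wmark)
			return true;
	}
	return false;
}

/* Demotion is attempted only if the target is allowed and has room. */
bool can_demote_model(bool target_allowed, const struct zone_model *zones,
		      size_t nr_zones, size_t highest_zoneidx)
{
	return target_allowed &&
	       demotion_target_has_room(zones, nr_zones, highest_zoneidx);
}
```

In this model, a demotion target whose zones are all at or below their
minimum watermark makes can_demote_model() return false, so reclaim no
longer assumes that demotion can free memory.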

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
 mm/vmscan.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..f4748f258294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
 		return false;

 	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
+		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
+		int order = sc ? sc->order : 0;
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
+			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      highest_zoneidx, 0))
+				return true;
+		}
+	}
+	return false;
 }

 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
--
2.43.0

On 12/8/25 10:40, Akinobu Mita wrote:
> On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> the OOM killer is not invoked properly.
>
> Here's the command to reproduce:
>
> $ sudo swapoff -a
> $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> --memrate-rd-mbs 1 --memrate-wr-mbs 1
>
> The memory usage is the number of workers specified with the --memrate
> option multiplied by the buffer size specified with the --memrate-bytes
> option, so please adjust it so that it exceeds the total size of the
> installed DRAM and CXL memory.
>
> If swap is disabled, you can usually expect the OOM killer to terminate
> the stress-ng process when memory usage approaches the installed memory
> size.
>
> However, if multiple memory-tiers exist (multiple
> /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
> and /sys/kernel/mm/numa/demotion_enabled is true and
> /sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked

Does this mean only mglru has the problem, or !mglru too?
Also is min_ttl_ms = 0 a sensible setting? What happens without it?
If !mglru doesn't have this problem, how does the fix affect it?

> and the system will become inoperable.
>
> This issue can be reproduced using NUMA emulation even on systems with
> only DRAM. You can create two fake memory tiers by booting a single-node
> system with "numa=fake=2 numa_emulation.adistance=576,704" kernel
> parameters.
>
> The reason for this issue is that if the target node for allocation has
> an underlying memory tier, it is always assumed that it can be reclaimed
> via demotion.
>
> So this change avoids this issue by not attempting to demote if the
> demoting node has less free memory than the minimum watermark.
>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
>  mm/vmscan.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fddd168a9737..f4748f258294 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
>  		return false;
>
>  	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
> -	return mem_cgroup_node_allowed(memcg, demotion_nid);
> +	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
> +		int z;
> +		struct zone *zone;
> +		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
> +		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
> +		int order = sc ? sc->order : 0;
> +
> +		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
> +			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +					      highest_zoneidx, 0))
> +				return true;
> +		}
> +	}
> +	return false;
>  }
>
>  static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,

On Wed, Dec 17, 2025 at 21:55, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/8/25 10:40, Akinobu Mita wrote:
> > On systems with multiple memory-tiers consisting of DRAM and CXL memory,
> > the OOM killer is not invoked properly.
> >
> > Here's the command to reproduce:
> >
> > $ sudo swapoff -a
> > $ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
> > --memrate-rd-mbs 1 --memrate-wr-mbs 1
> >
> > The memory usage is the number of workers specified with the --memrate
> > option multiplied by the buffer size specified with the --memrate-bytes
> > option, so please adjust it so that it exceeds the total size of the
> > installed DRAM and CXL memory.
> >
> > If swap is disabled, you can usually expect the OOM killer to terminate
> > the stress-ng process when memory usage approaches the installed memory
> > size.
> >
> > However, if multiple memory-tiers exist (multiple
> > /sys/devices/virtual/memory_tiering/memory_tier<N> directories exist),
> > and /sys/kernel/mm/numa/demotion_enabled is true and
> > /sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be invoked
>
> Does this mean only mglru has the problem, or !mglru too?

!mglru has the problem, too.

> Also is min_ttl_ms = 0 a sensible setting? What happens without it?

Setting min_ttl_ms = 1 or a longer value will cause kswapd to trigger
the oom-killer. However, when the stress-ng devshm test was run in
addition to the above test to increase the load, the system remained
inoperable when the oom-killer was triggered only by kswapd.

> If !mglru doesn't have this problem, how does the fix affect it?

With this patch, the oom-killer will be triggered directly from memory
allocations, regardless of whether mglru is enabled.