[PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Kairui Song via B4 Relay 4 days, 15 hours ago
This series is based on mm-new to avoid conflict with Baolin's Cgroup V1
MGLRU fix.

This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we see an up to ~30% throughput
increase in some workloads (e.g. MongoDB with YCSB) and a large decrease
in file refaults, with no swap involved. Other common benchmarks show no
regression, LOC is reduced, and unexpected OOMs are less frequent.

Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSF/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
closely related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.

This series cleans up and improves these paths by introducing a scan
budget: the number of folios to scan is calculated once at the beginning
of the loop, and aging is decoupled from the reclaim calculation
helpers. The dirty flush logic is then moved inside the reclaim loop so
it can kick in more effectively.
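As a purely illustrative sketch (not the actual kernel code), the
scan-budget idea boils down to computing the number of folios to scan
once, up front, and only consuming that budget inside the loop; the 1/4
scan ratio below is a made-up number for illustration:

```shell
#!/bin/sh
# Toy model of a budgeted scan loop. The budget is derived once from
# the evictable size; the loop then only consumes it, instead of
# re-deriving how much to scan on every iteration.
scan_with_budget() {
    evictable=$1                  # folios on the list
    budget=$(( evictable / 4 ))   # budget computed once, up front
    scanned=0
    while [ "$scanned" -lt "$budget" ] && [ "$evictable" -gt 0 ]; do
        scanned=$(( scanned + 1 ))
        evictable=$(( evictable - 1 ))
    done
    echo "$scanned"
}

scan_with_budget 1024   # prints 256
```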

Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVMe as storage.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed reads
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
the WiredTiger cache size is set to 4.5G, using NVMe as storage.
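For reference, a setup along these lines can be sketched as below; the
container name and volume path are assumptions, only the 10G limit and
the 4.5G WiredTiger cache come from the description above:

```shell
# Hypothetical sketch: MongoDB in a 10G memory-limited container,
# WiredTiger cache capped at 4.5G, data on NVMe-backed storage.
docker run -d --name ycsb-mongo \
    --memory 10g \
    -v /mnt/nvme/mongo:/data/db \
    mongo \
    --wiredTigerCacheSizeGB 4.5
```

(Setup fragment only; requires a running Docker daemon.)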

No swap is used.

Median of 3 test runs; results are stable.

Before:
Throughput(ops/sec): 63050.37725142389
AverageLatency(us): 497.0088950307069
pgpgin 164636727
pgpgout 5551216
workingset_refault_anon 0
workingset_refault_file 34365441

After:
Throughput(ops/sec): 79937.11613530689 (+26.7%, higher is better)
AverageLatency(us): 390.1616943501661  (-21.5%, lower is better)
pgpgin 108820685                       (-33.9%, lower is better)
pgpgout 5406292
workingset_refault_anon 0
workingset_refault_file 18934364       (-44.9%, lower is better)

We can see a significant performance improvement after this series.
The test was done on NVMe; the performance gap would be even larger
for slow devices such as HDD or network storage. We observed over
100% gain for some workloads with slow IO.

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with
2 nodes and 128G memory, using 256G ZRAM as swap, spawning 32 memcgs
and 64 workers:

Before:
Total requests:            81832
Per-worker 95% CI (mean):  [1248.8, 1308.4]
Per-worker stdev:          119.1
Jain's fairness:           0.991530 (1.0 = perfectly fair)
Latency:
[0,1)s     27951   34.16%   34.16%
[1,2)s      7495    9.16%   43.32%
[2,4)s      8140    9.95%   53.26%
[4,8)s     38246   46.74%  100.00%

After:
Total requests:            82413
Per-worker 95% CI (mean):  [1241.4, 1334.0]
Per-worker stdev:          185.3
Jain's fairness:           0.980016 (1.0 = perfectly fair)
Latency:
[0,1)s     27940   33.90%   33.90%
[1,2)s      8772   10.64%   44.55%
[2,4)s      6827    8.28%   52.83%
[4,8)s     38874   47.17%  100.00%

Results look identical: reclaim is still fair and effective, and the
total request count is slightly better.

OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a cgroup
limit, as demonstrated in patch 12, and is fixed by this series.
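A rough sketch of that kind of reproduction (the paths and sizes here
are hypothetical, not taken from patch 12) is a small memcg plus a dd
writer that dirties far more data than the limit:

```shell
# Hypothetical sketch: force dirty writeback pressure inside a small
# memcg by writing well past its memory limit (needs root, cgroup v2).
mkdir -p /sys/fs/cgroup/dd-demo
echo 256M > /sys/fs/cgroup/dd-demo/memory.max
echo $$ > /sys/fs/cgroup/dd-demo/cgroup.procs
dd if=/dev/zero of=/mnt/test/dirty.img bs=1M count=4096
```

(Setup fragment only; requires root and a mounted cgroup v2 hierarchy.)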

The aging OOM is a bit tricky; a specific reproducer can be used to
simulate what we encountered in the production environment [4]: it
spawns multiple workers that keep reading a given file using mmap,
pausing for 120ms after each file read batch, and another set of workers
that keep allocating and freeing a given amount of anonymous memory.
The total memory size exceeds the memory limit (e.g. 44G anon + 8G file,
i.e. 52G vs a 48G memcg limit).
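The overcommit that drives the reproducer can be sanity-checked with
trivial arithmetic (sizes from the example above):

```shell
# The reproducer relies on the working set exceeding the memcg limit.
anon=44; file=8; limit=48        # sizes in GiB
total=$(( anon + file ))
echo "${total}G vs ${limit}G"    # prints 52G vs 48G
if [ "$total" -gt "$limit" ]; then
    echo overcommitted
fi
```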

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with the following info after ~10-20 iterations:
    [  154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [  154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
    [  154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [  154.379408] Memory cgroup stats for /demo:
    [  154.379544] anon 44352327680
    [  154.380079] file 7187271680

  OOM occurs despite there being still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

Worth noting, there is another OOM-related issue reported against v1 of
this series, which has been retested and looks OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

Before:            17237.570000 tps
After patch 5:     17259.975714 tps
After patch 6:     17230.475714 tps
After patch 7:     17250.316667 tps
After patch 8:     17278.933333 tps
After this series: 17265.361667 tps (+0.2%, higher is better)

MySQL is anon-folio heavy and also involves file and writeback IO, and
still looks good. Only noise-level changes are seen, with no regression
at any step.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, 6 test runs each in a
12G memcg:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
       --name=cached --numjobs=16 --buffered=1 --ioengine=mmap \
       --rw=randread --random_distribution=zipf:1.2 --norandommap \
       --time_based --ramp_time=1m --runtime=5m --group_reporting

Before:            75912.75 MB/s
After this series: 75907.46 MB/s

Again, only noise-level changes and no regression.

Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 12 test runs each.

Before:            2604.29s
After this series: 2538.90s

Again, only noise-level changes: no regression, or very slightly better.

Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
  [ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
  review suggested that we shouldn't leave the counter and reclaim
  feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX limit in patch
  "restructure the reclaim loop", the change is trivial but might
  help avoid livelock for tiny cgroups.
- Redo the tests. Most results are basically identical to before, but
  redone just in case, since the patch also solves the throttling issue
  now, as discussed in reports from CachyOS.
- Add a separate patch for variable renaming as suggested by [ Barry
  Song ]. No feature change.
- Improve several comments and code issues [ Axel Rasmussen ].
- Remove a no longer needed variable [ Axel Rasmussen ].
- Collect Reviewed-by tags.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com

---
Kairui Song (12):
      mm/mglru: consolidate common code for retrieving evictable size
      mm/mglru: rename variables related to aging and rotation
      mm/mglru: relocate the LRU scan batch limit to callers
      mm/mglru: restructure the reclaim loop
      mm/mglru: scan and count the exact number of folios
      mm/mglru: use a smaller batch for reclaim
      mm/mglru: don't abort scan immediately right after aging
      mm/mglru: simplify and improve dirty writeback handling
      mm/mglru: remove no longer used reclaim argument for folio protection
      mm/vmscan: remove sc->file_taken
      mm/vmscan: remove sc->unqueued_dirty
      mm/vmscan: unify writeback reclaim statistic and throttling

 mm/vmscan.c | 308 ++++++++++++++++++++++++++----------------------------------
 1 file changed, 132 insertions(+), 176 deletions(-)
---
base-commit: e4b3c4494ae831396aded19f30132826a0d63031
change-id: 20260314-mglru-reclaim-1c9d45ac57f6

Best regards,
--  
Kairui Song <kasong@tencent.com>
Re: [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Leno Hou 1 day, 5 hours ago
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]: Spawns
> multiple workers that keep reading the given file using mmap, and pauses
> for 120ms after one file read batch. It also spawns another set of
> workers that keep allocating and freeing a given size of anonymous memory.
> The total memory size exceeds the memory limit (eg. 44G anon + 8G file,
> which is 52G vs 48G memcg limit).
> 
> - MGLRU disabled:
>    Finished 128 iterations.
> 
> - MGLRU enabled:
>    OOM with following info after about ~10-20 iterations:
>      [  154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>      [  154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
>      [  154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>      [  154.379408] Memory cgroup stats for /demo:
>      [  154.379544] anon 44352327680
>      [  154.380079] file 7187271680
> 
>    OOM occurs despite there being still evictable file folios.
> 
> - MGLRU enabled after this series:
>    Finished 128 iterations.

Hi Kairui,

I've tested on v6.1.163 and was unable to reproduce the OOM issue with
your test script [4]. Could you point out the kernel version in your
environment, or share more information?


Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]


--
Best Regards,
Leno Hou
Re: [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Kairui Song 1 day, 3 hours ago
On Wed, Apr 01, 2026 at 01:18:16PM +0800, Leno Hou wrote:
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > The aging OOM is a bit tricky, a specific reproducer can be used to
> > simulate what we encountered in production environment [4]: Spawns
> > multiple workers that keep reading the given file using mmap, and pauses
> > for 120ms after one file read batch. It also spawns another set of
> > workers that keep allocating and freeing a given size of anonymous memory.
> > The total memory size exceeds the memory limit (eg. 44G anon + 8G file,
> > which is 52G vs 48G memcg limit).
> > 
> > - MGLRU disabled:
> >    Finished 128 iterations.
> > 
> > - MGLRU enabled:
> >    OOM with following info after about ~10-20 iterations:
> >      [  154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> >      [  154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
> >      [  154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> >      [  154.379408] Memory cgroup stats for /demo:
> >      [  154.379544] anon 44352327680
> >      [  154.380079] file 7187271680
> > 
> >    OOM occurs despite there being still evictable file folios.
> > 
> > - MGLRU enabled after this series:
> >    Finished 128 iterations.
> 
> Hi Kairui,

Hi Leno,

> 
> I've tested on v6.1.163 and unable to reproduce the OOM issue by your test
> script [4], Could you point the kernel version in your environment or more
> information?
> 

Thanks for testing!

Right, this one is very tricky to trigger; I struggled a lot with it,
and it took many attempts to construct a reproducer. I later changed the
setup to a 16G memcg for easier reproduction; the idea is still the same:

- Mount a ramdisk (/dev/pmem0) at /mnt/ramdisk:
  mkfs.xfs -f /dev/pmem0; mount /dev/pmem0 /mnt/ramdisk/
- Setup a 16g memcg
  mkdir -p /sys/fs/cgroup/demo
  echo 16G > /sys/fs/cgroup/demo/memory.max
  echo $$ > /sys/fs/cgroup/demo/cgroup.procs
  echo $PPID > /sys/fs/cgroup/demo/cgroup.procs
  echo $BASHPID > /sys/fs/cgroup/demo/cgroup.procs
- Then run the reproducer:
  file_anon_mix_pressure /mnt/ramdisk/test.img 14g 8g 96 96 120000

The parameters depend on your system config. My system is a 48c96t
machine:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3100.021
CPU max MHz:         2501.0000
CPU min MHz:         1000.0000
BogoMIPS:            5000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           62132        9553       49022          18        4172       52579
Swap:              0           0           0

And gets the OOM without this series:
[   17.537545] XFS (pmem0): Ending clean mount
[   49.329042] hrtimer: interrupt took 13930 ns
[   49.823993] file_anon_mix_p (3832): drop_caches: 3
[   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[   62.624892] CPU: 95 UID: 0 PID: 4875 Comm: file_anon_mix_p Kdump: loaded Not tainted 7.0.0-rc5.orig-gb822cd37c749 #292 PREEMPT(full)
[   62.624897] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[   62.624899] Call Trace:
[   62.624902]  <TASK>
[   62.624905]  dump_stack_lvl+0x4a/0x70
[   62.624912]  dump_header+0x43/0x1b3
[   62.624918]  oom_kill_process.cold+0x8/0x85
[   62.624922]  out_of_memory+0xee/0x280
[   62.624927]  mem_cgroup_out_of_memory+0xbc/0xd0
[   62.624933]  try_charge_memcg+0x3c1/0x5d0
[   62.624936]  charge_memcg+0x4a/0xb0
[   62.624939]  __mem_cgroup_charge+0x28/0x80
[   62.624942]  alloc_anon_folio+0x1d1/0x3d0
[   62.624947]  do_anonymous_page+0x19d/0x550
[   62.624950]  ? pte_offset_map_rw_nolock+0x1b/0x80
[   62.624954]  __handle_mm_fault+0x346/0x6d0
[   62.624956]  ? __schedule+0x29c/0x5b0
[   62.624968]  handle_mm_fault+0xe8/0x2d0
[   62.624971]  do_user_addr_fault+0x204/0x660
[   62.624977]  exc_page_fault+0x67/0x170
[   62.624979]  asm_exc_page_fault+0x22/0x30
[   62.624982] RIP: 0033:0x401451
[   62.624985] Code: 00 00 00 c3 0f 1f 44 00 00 48 83 7f 10 00 74 23 31 c0 0f 1f 80 00 00 00 00 48 8b 57 18 48 01 c2 48 03 57 08 48 05 00 10 00 00 <c6> 02 00 48 3b 47 10 72 e6 c7 47 20 01 00 00 00 31 c0 c3 90 66 66
[   62.624987] RSP: 002b:00007f3ec53a5e68 EFLAGS: 00010206
[   62.624989] RAX: 000000000731d000 RBX: 00007f3ec53a66c0 RCX: 00007f4271ca02d6
[   62.624991] RDX: 00007f425cefd000 RSI: 00007f3ec53a6fb0 RDI: 000000000a2f1c28
[   62.624992] RBP: 00007f3ec53a5f30 R08: 0000000000000000 R09: 0000000000000021
[   62.624993] R10: 0000000000000008 R11: 0000000000000246 R12: 00007f3ec53a66c0
[   62.624995] R13: 00007ffe83436d80 R14: 00007f3ec53a6ce4 R15: 00007ffe83436e87
[   62.624998]  </TASK>
[   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
[   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[   62.640823] Memory cgroup stats for /demo:
[   62.641017] anon 10604879872
[   62.641941] file 6574858240
[   62.642259] kernel 0
[   62.642443] kernel_stack 0
[   62.642674] pagetables 0
[   62.642889] sec_pagetables 0
[   62.643318] percpu 0
[   62.643545] sock 0
[   62.643782] vmalloc 0
[   62.643987] shmem 0
[   62.644244] zswap 0
[   62.644425] zswapped 0
[   62.644666] zswap_incomp 0
[   62.644917] file_mapped 6574698496
[   62.645344] file_dirty 0
[   62.645835] file_writeback 0
[   62.646707] swapcached 0
[   62.647430] anon_thp 0
[   62.648204] file_thp 0
[   62.648895] shmem_thp 0
[   62.649737] inactive_anon 10597609472
[   62.650675] active_anon 7270400
[   62.651549] inactive_file 6367440896
[   62.652430] active_file 179376128
[   62.653318] unevictable 0
[   62.653976] slab_reclaimable 0
[   62.654664] slab_unreclaimable 0
[   62.655625] slab 0
[   62.656418] workingset_refault_anon 0
[   62.656816] workingset_refault_file 1120215
[   62.657293] workingset_activate_anon 0
[   62.657667] workingset_activate_file 45850
[   62.658167] workingset_restore_anon 0
[   62.658562] workingset_restore_file 45850
[   62.658981] workingset_nodereclaim 0
[   62.659417] pgdemote_kswapd 0
[   62.659715] pgdemote_direct 0
[   62.660102] pgdemote_khugepaged 0
[   62.660434] pgdemote_proactive 0
[   62.660730] pgsteal_kswapd 0
[   62.661015] pgsteal_direct 1612151
[   62.662669] pgscan_khugepaged 0
[   62.662990] pgscan_proactive 0
[   62.663393] pgrefill 4536757
[   62.663706] pgpromote_success 0
[   62.664115] pgscan 3867681
[   62.664397] pgsteal 1612151
[   62.664691] pswpin 0
[   62.664925] pswpout 0
[   62.665266] pgfault 35906959
[   62.665564] pgmajfault 95947
[   62.665867] pgactivate 3693439
[   62.666261] pgdeactivate 0
[   62.666492] pglazyfree 34
[   62.666728] pglazyfreed 0
[   62.666990] swpin_zero 0
[   62.667365] swpout_zero 0
[   62.667664] zswpin 0
[   62.667910] zswpout 0
[   62.668235] zswpwb 0
[   62.668472] thp_fault_alloc 0
[   62.668790] thp_collapse_alloc 0
[   62.669211] thp_swpout 0
[   62.669469] thp_swpout_fallback 0
[   62.669762] numa_pages_migrated 0
[   62.670177] numa_pte_updates 0
[   62.670470] numa_hint_faults 0
[   62.670774] Memory cgroup min protection 0kB -- low protection 0kB
[   62.670776] Tasks state (memory values in pages):
[   62.672213] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[   62.673379] [   3364]     0  3364     1794      900       73      827         0    57344        0             0 spawn-cgroup.sh
[   62.674519] [   3266]     0  3266    72782     2891      576     2315         0   110592        0             0 fish
[   62.675663] [   3591]     0  3591    55883     2979      625     2354         0   110592        0             0 fish
[   62.676546] [   3832]     0  3832  3867588  2588259  2587769      490         0 21630976        0             0 file_anon_mix_p
[   62.677691] [   3962]     0  3962  2098020  1237009      281  1236728         0 16855040        0             0 file_anon_mix_p
[   62.678950] [   3963]     0  3963  2098020  1236990      281  1236709         0 16855040        0             0 file_anon_mix_p
[   62.680233] [   3964]     0  3964  2098020  1236985      281  1236704         0 16855040        0             0 file_anon_mix_p
[   62.681374] [   3965]     0  3965  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.682501] [   3966]     0  3966  2098020  1237015      281  1236734         0 16855040        0             0 file_anon_mix_p
[   62.683637] [   3967]     0  3967  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.684812] [   3968]     0  3968  2098020  1237015      281  1236734         0 16855040        0             0 file_anon_mix_p
[   62.685883] [   3969]     0  3969  2098020  1236967      281  1236686         0 16855040        0             0 file_anon_mix_p
[   62.686988] [   3970]     0  3970  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.688168] [   3971]     0  3971  2098020  1236993      281  1236712         0 16855040        0             0 file_anon_mix_p
[   62.689402] [   3972]     0  3972  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.690621] [   3973]     0  3973  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.691839] [   3974]     0  3974  2098020  1237011      281  1236730         0 16855040        0             0 file_anon_mix_p
[   62.693550] [   3975]     0  3975  2098020  1237016      281  1236735         0 16855040        0             0 file_anon_mix_p
[   62.695292] [   3976]     0  3976  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.696997] [   3977]     0  3977  2098020  1237014      281  1236733         0 16855040        0             0 file_anon_mix_p
[   62.698734] [   3978]     0  3978  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.700415] [   3979]     0  3979  2098020  1236992      281  1236711         0 16855040        0             0 file_anon_mix_p
[   62.702153] [   3980]     0  3980  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.703859] [   3981]     0  3981  2098020  1236919      281  1236638         0 16855040        0             0 file_anon_mix_p
[   62.705597] [   3982]     0  3982  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.707329] [   3983]     0  3983  2098020  1236277      281  1235996         0 16855040        0             0 file_anon_mix_p
[   62.709056] [   3984]     0  3984  2098020  1236952      281  1236671         0 16855040        0             0 file_anon_mix_p
[   62.710732] [   3985]     0  3985  2098020  1236948      281  1236667         0 16855040        0             0 file_anon_mix_p
[   62.712482] [   3986]     0  3986  2098020  1237014      281  1236733         0 16855040        0             0 file_anon_mix_p
[   62.714184] [   3987]     0  3987  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.715930] [   3988]     0  3988  2098020  1237015      281  1236734         0 16855040        0             0 file_anon_mix_p
[   62.717543] [   3989]     0  3989  2098020  1237015      281  1236734         0 16855040        0             0 file_anon_mix_p
[   62.719129] [   3990]     0  3990  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.720723] [   3991]     0  3991  2098020  1237011      281  1236730         0 16855040        0             0 file_anon_mix_p
[   62.722356] [   3992]     0  3992  2098020  1236945      281  1236664         0 16855040        0             0 file_anon_mix_p
[   62.723893] [   3993]     0  3993  2098020  1237017      281  1236736         0 16855040        0             0 file_anon_mix_p
[   62.725413] [   3994]     0  3994  2098020  1236982      281  1236701         0 16855040        0             0 file_anon_mix_p
[   62.727108] [   3995]     0  3995  2098020  1237012      281  1236731         0 16855040        0             0 file_anon_mix_p
[   62.728701] [   3996]     0  3996  2098020  1236990      281  1236709         0 16855040        0             0 file_anon_mix_p

.. snip ..

The testing kernel commit is latest mm-new:

$ git log --oneline
b822cd37c749 (HEAD) mm/mglru: improve reclaim loop and dirty folio handling
  # This is an empty commit, to hold my cover letter.
54c9d0359b18 selftests-mm-add-merge-test-for-partial-msealed-range-fix
  # This is mm-new, see below.
fc127b77592e selftests/mm: add merge test for partial msealed range
ff02b14f414c mm/vmalloc: use dedicated unbound workqueue for vmap purge/drain

$ git log 54c9d0359b18
commit 54c9d0359b180b34070aa7ff8d9428fa3db8acbb (akpm/mm-new)
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Mon Mar 30 17:12:35 2026 -0700

    selftests-mm-add-merge-test-for-partial-msealed-range-fix

Note you may see fish in the OOM task list: I use that shell, and the
memcg spawn is wrapped by spawn-cgroup.sh (bash). Irrelevant, but noted
just to avoid confusion.

Reproducer log:
.. snip ..
[phase3] Starting 96 anonymous pressure threads (14336 MB x 128 rounds)...
[pressure] Round 1/128: faulting 14336 MB across 96 threads...
[pressure] Round 1/128 complete.
[pressure] Round 2/128: faulting 14336 MB across 96 threads...
[pressure] Round 2/128 complete.
[pressure] Round 3/128: faulting 14336 MB across 96 threads...
[pressure] Round 3/128 complete.

.. snip ...

[pressure] Round 17/128 complete.
[pressure] Round 18/128: faulting 14336 MB across 96 threads...
[pressure] Round 18/128 complete.
[pressure] Round 19/128: faulting 14336 MB across 96 threads...
fish: Job 1, './file_anon_mix_pressure /mnt/r…' terminated by signal SIGKILL (Forced quit)

OOM doesn't occur with MGLRU disabled or after this series,
128 rounds finishes just fine.

Unfortunately, I haven't found an easy and generic way to reproduce
this, as the time window is extremely short: if one reclaim thread keeps
getting rejected because should_run_aging() returns true, while a racing
thread is doing the aging but hasn't finished, MGLRU might OOM when it
shouldn't.

This series largely avoids that, but in very rare cases, and in theory,
we may still see an OOM due to the forced protection of MIN_NR_GENS.
That can be fixed later.

We have seen some very rare OOM issues with several services. It took
me a long time to figure out what was actually wrong, since the race
window is extremely tiny and hard to trigger. This reproducer is
currently the best I can provide to simulate it. It's not 100% accurate
or stable, but close enough.

Maybe you can try adjusting the parameters to reproduce it. Note that
the storage has to be fast for the reproducer.