This series cleans up and slightly improves MGLRU's reclaim loop and
dirty flush logic. As a result, we see up to a ~50% reduction in file
faults and a 30% increase in MongoDB throughput with YCSB and no swap
involved; other common benchmarks show no regression, LOC is reduced,
and there are fewer unexpected OOMs in our production environment.
Some of the problems were found in our production environment, and
others are mostly exposed while stress testing the LFU-like design as
proposed in the LSM/MM/BPF topic this year [1]. This series has no
direct relationship to that topic, but it cleans up the code base and
fixes several strange behaviors that make the test result of the
LFU-like design not as good as expected.
MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic
is quite different, making the reclaim loop hard to follow and the
dirty flush ineffective.
This series slightly cleans up and improves the reclaim loop by
introducing a scan budget: the number of folios to scan is calculated
at the beginning of the loop, and aging is decoupled from the reclaim
calculation helpers. It then moves the dirty flush logic inside the
reclaim loop so it can kick in more effectively. These issues are
somewhat related, and this series handles them together, improving
MGLRU reclaim in several ways.
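The scan-budget idea can be sketched as a toy model (illustrative names
and numbers only; this is not the actual mm/vmscan.c logic):

```python
# Toy model of the scan-budget approach described above: compute the
# number of folios to scan once, up front, then consume that budget in
# fixed-size batches inside the loop.

SWAP_CLUSTER_MAX = 32  # the kernel's usual reclaim batch size


def reclaim_with_budget(nr_to_scan, evictable_ratio=0.5):
    """Scan in batches until the up-front budget is consumed."""
    # Clamp to a sane lower bound so small machines still make progress.
    nr_to_scan = max(nr_to_scan, SWAP_CLUSTER_MAX)
    nr_reclaimed = 0
    while nr_to_scan > 0:
        batch = min(nr_to_scan, SWAP_CLUSTER_MAX)
        # Pretend a fraction of each scanned batch turns out evictable.
        nr_reclaimed += int(batch * evictable_ratio)
        nr_to_scan -= batch
    return nr_reclaimed


# A budget of 200 folios: six full batches of 32, then one of 8.
print(reclaim_with_budget(200))
```

Computing the budget once keeps the loop a simple consumer of that
budget, instead of re-deriving scan targets inside each iteration.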
Test results: all tests were done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, the
WiredTiger cache size is set to 4.5G, and NVMe is used as storage.
No swap is used.
Median of 3 test runs; results are stable.
Before:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us): 507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988
After:
Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227 (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186 (-52.9%, lower is better)
We can see a significant performance improvement after this series for
file-cache-heavy workloads like this one. The test was done on NVMe;
the performance gap would be even larger for slow devices, and we
observed >100% gains for some other workloads running on HDD devices.
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with
2 nodes and 128G of memory, using 256G of ZRAM as swap, and spawning 32
memcgs and 64 workers:
Before:
Total requests: 77920
Per-worker 95% CI (mean): [1199.9, 1235.1]
Per-worker stdev: 70.5
Jain's fairness: 0.996706 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 25649 32.92% 32.92%
[1,2)s 7759 9.96% 42.87%
[2,4)s 5156 6.62% 49.49%
[4,8)s 39356 50.51% 100.00%
After:
Total requests: 79564
Per-worker 95% CI (mean): [1224.2, 1262.2]
Per-worker stdev: 76.1
Jain's fairness: 0.996328 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 25485 32.03% 32.03%
[1,2)s 8661 10.89% 42.92%
[2,4)s 6268 7.88% 50.79%
[4,8)s 39150 49.21% 100.00%
The results look identical: reclaim is still fair and effective, and
the total request count is slightly better.
OOM issue [4]
=============
Testing with a specific reproducer [4] to simulate what we encountered
in our production environment. Still using the same test machine, but
one node is used as a pmem ramdisk following the steps in the
reproducer; no swap is used.
This reproducer spawns multiple workers that keep reading the given
file using mmap, pausing for 120ms after each file read batch. It also
spawns another set of workers that keep allocating and freeing a given
amount of anonymous memory. The total memory size exceeds the memory
limit (e.g. 44G anon + 8G file, i.e. 52G vs. a 48G memcg limit). But by
evicting the file cache, the workload should hold up just fine,
especially given that the file workers pause after every batch,
allowing other workers to catch up.
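A tiny, scaled-down sketch of the reproducer's structure (the real one
is at [4]; sizes and iteration counts here are toy values so the sketch
runs anywhere, while the real test uses ~44G anon + ~8G file against a
48G memcg limit, with many parallel workers):

```python
# Toy sketch: mmap-read a file in batches with a 120ms pause between
# batches, while churning anonymous memory in between.  This only
# models the access pattern; it does not create real memory pressure.
import mmap
import os
import tempfile
import time

PAUSE_S = 0.12        # 120ms pause after each file read batch
FILE_SIZE = 1 << 20   # toy value; the real file is gigabytes
ANON_SIZE = 1 << 20   # toy value; real anon churn is gigabytes
PAGE = 4096


def file_read_batch(path):
    """One batch: touch every page of the file through a read-only mmap."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return sum(m[off] for off in range(0, len(m), PAGE))


def anon_churn():
    """Allocate, dirty, and free a chunk of anonymous memory."""
    buf = bytearray(ANON_SIZE)
    for off in range(0, ANON_SIZE, PAGE):
        buf[off] = 1  # dirty each page so it is really backed
    del buf


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

iterations_done = 0
for _ in range(3):    # the real test runs 128 iterations
    file_read_batch(path)
    anon_churn()
    time.sleep(PAUSE_S)
    iterations_done += 1
os.unlink(path)
print(f"Finished {iterations_done} iterations.")
```

The point of the pause is that the file workers yield regularly, so a
well-behaved reclaimer should be able to keep evicting file cache
without triggering OOM.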
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
Hung or OOMed with the following info after about 10-20 iterations:
[ 357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
... <snip> ...
[ 357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
[ 357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 357.348192] Memory cgroup stats for /demo:
[ 357.348314] anon 46724382720
[ 357.348963] file 4160753664
OOM occurs despite there still being evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
With aging blocking reclaim, the OOM is much more likely to occur. This
issue is mostly fixed by patch 6 and the result is much better, but
this series is still only the first step toward improving file folio
reclaim in MGLRU, as there are still cases where file folios can't be
effectively reclaimed.
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=96 --time=600 run
Before: 22343.701667 tps
After patch 4: 22327.325000 tps
After patch 5: 22373.224000 tps
After patch 6: 22321.174000 tps
After patch 7: 22625.961667 tps (+1.26%, higher is better)
MySQL is anon-folio heavy but still looks good. There seem to be only
noise-level changes and no regression.
FIO:
====
Testing with the following command, where /mnt is an EXT4 ramdisk, 6
test runs each in a 10G memcg:
fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
--ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
--iodepth_batch_complete=32 --rw=randread \
--random_distribution=zipf:1.2 --norandommap --time_based \
--ramp_time=1m --runtime=10m --group_reporting
Before: 32039.56 MB/s
After patch 3: 32751.50 MB/s
After patch 4: 32703.03 MB/s
After patch 5: 33395.52 MB/s
After patch 6: 32031.51 MB/s
After patch 7: 32534.29 MB/s
Also only noise-level changes and no regression.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 8 test runs each.
Before: 2881.41s
After patch 3: 2894.09s
After patch 4: 2846.73s
After patch 5: 2847.91s
After patch 6: 2835.17s
After patch 7: 2842.90s
Also only noise-level changes; no regression, or very slightly better.
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (8):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: simplify and improve dirty writeback handling
mm/vmscan: remove sc->file_taken
mm/vmscan.c | 191 ++++++++++++++++++++++++++----------------------------------
1 file changed, 81 insertions(+), 110 deletions(-)
---
base-commit: dffde584d8054e88e597e3f28de04c7f5d191a67
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
Hi Kairui,

On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty flush logic. As a result, we can see an up to ~50% reduce of file
> faults and 30% increase in MongoDB throughput with YCSB and no swap
> involved [...]

I applied this patch set to 7.0-rc5 and noticed the system locking up
when performing the below test:

fallocate -l 5G 5G
while true; do tail /dev/zero; done
while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done

After reading [1], I suspect that this was because the system was using
zram as swap, and indeed, if zram is disabled the lockup does not occur.

Is there anything that I (CachyOS) can do to help debug this
regression, if it is to be considered one? According to [1], zram as
swap seems to be unsupported by upstream. (The user that tested this
wasn't able to get a good kernel trace; the only thing left was a trace
of the OOM killer firing.)

[1] https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html

--
Regards,
Eric
On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> I applied this patch set to 7.0-rc5 and noticed the system locking up
> when performing the below test.
>
> fallocate -l 5G 5G
> while true; do tail /dev/zero; done
> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
>
> After reading [1], I suspect that this was because the system was
> using zram as swap, and yes if zram is disabled then the lock up does
> not occur.

Hi Eric,

Thanks for the report. I was about to send V2, but after seeing your
report I'll try to reproduce your issue first.

So far I haven't noticed any regression. Is this an issue caused by
this series or an existing one? I don't have any context about how you
are running the test. BTW, the calculation in patch "mm/mglru:
restructure the reclaim loop" needs a lower bound of
"max(nr_to_scan, SWAP_CLUSTER_MAX)" for small machines; not sure if
that is related, but I will add it in V2.

And about the test you posted:

while true; do tail /dev/zero; done

I believe this will just consume all memory with zero pages and then
get OOM killed; that's exactly what the test is meant to do. I'm not
sure what you mean by lockup, since you mentioned an OOM kill. Did the
system actually hang, or is the desktop dead?

I just ran that with and without ZRAM on two machines and my laptop,
and everything looks good here with this series.

> zram as swap seems to be unsupported by upstream.

That's simply not true; other distros like Fedora even enable ZRAM as
swap by default:
https://fedoraproject.org/wiki/Changes/SwapOnZRAM

And systemd has widely used ZRAM swap support:
https://github.com/systemd/zram-generator

Android also uses it, and we are using ZRAM by default in our fleet,
which runs fine.

> the user that tested this wasn't able to get a
> good kernel trace, the only thing left was
> a trace of the OOM killer firing.

No worry, that's fine. Just send me the OOM trace or log; the more
detailed context I get, the better.
On 3/25/26 1:47 PM, Kairui Song wrote:
> So far I haven't noticed any regression. Is this an issue caused by
> this series or an existing one? I don't have any context about how you
> are running the test. [...]

As of writing this, I got some new information that makes this a bit
more confusing. The kernel that doesn't have the issue was patched with
[1] as a means of protecting the working set (similar to
lru_gen_min_ttl_ms).

So this time, on an unpatched kernel, the system still freezes but
quickly recovers itself after about 2 seconds. With this patchset
applied, the system freezes but doesn't quickly recover (if at all).

Curiously, I had the user test again, but this time with
lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at
all, with or without this patchset.

> I'm not sure what you mean by lockup, since you mentioned an OOM
> kill. Did the system actually hang, or is the desktop dead?

The system actually hung. They needed a hard reset to recover it.
(Pure speculation: given a few minutes, the system would likely have
recovered itself, as this seems to be a common scenario.)

Here's the call trace that was recovered:

Mar 25 08:24:22 osiris kernel: Call Trace:
Mar 25 08:24:22 osiris kernel: <TASK>
Mar 25 08:24:22 osiris kernel: dump_stack_lvl+0x61/0x80
Mar 25 08:24:22 osiris kernel: dump_header+0x4a/0x160
Mar 25 08:24:22 osiris kernel: oom_kill_process+0x18f/0x1f0
Mar 25 08:24:22 osiris kernel: out_of_memory+0x4ab/0x5c0
Mar 25 08:24:22 osiris kernel: __alloc_pages_slowpath+0x9ac/0x1060
Mar 25 08:24:22 osiris kernel: __alloc_frozen_pages_noprof+0x29a/0x320
Mar 25 08:24:22 osiris kernel: alloc_pages_mpol+0x107/0x1b0
Mar 25 08:24:22 osiris kernel: folio_alloc_noprof+0x85/0xb0
Mar 25 08:24:22 osiris kernel: __filemap_get_folio_mpol+0x1ff/0x4c0
Mar 25 08:24:22 osiris kernel: filemap_fault+0x3e3/0x6e0
Mar 25 08:24:22 osiris kernel: __do_fault+0x46/0x140
Mar 25 08:24:22 osiris kernel: do_pte_missing+0x154/0xea0
Mar 25 08:24:22 osiris kernel: ? __pte_offset_map+0x1d/0xd0
Mar 25 08:24:22 osiris kernel: handle_mm_fault+0x89c/0x1280
Mar 25 08:24:22 osiris kernel: do_user_addr_fault+0x23b/0x720
Mar 25 08:24:22 osiris kernel: exc_page_fault+0x75/0xe0
Mar 25 08:24:22 osiris kernel: asm_exc_page_fault+0x26/0x30
Mar 25 08:24:22 osiris kernel: RIP: 0033:0x7fec4beb43c0
Mar 25 08:24:22 osiris kernel: Code: Unable to access opcode bytes at 0x7fec4beb4396.
Mar 25 08:24:22 osiris kernel: RSP: 002b:00007ffcb348d698 EFLAGS: 00010293
Mar 25 08:24:22 osiris kernel: RAX: 00000000c70f6907 RBX: 00007ffcb348d8d0 RCX: 00007fec4bb1604d
Mar 25 08:24:22 osiris kernel: RDX: c6a4a7935bd1e995 RSI: 4fb7dae88ad99bfb RDI: 000055ee77cc8150
Mar 25 08:24:22 osiris kernel: RBP: 00007ffcb348dd60 R08: 000055ee77cc8158 R09: 000000000000000c
Mar 25 08:24:22 osiris kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
Mar 25 08:24:22 osiris kernel: R13: 000055ee77cc8150 R14: 0000000000000064 R15: 431bde82d7b634db
Mar 25 08:24:22 osiris kernel: </TASK>

Some mm-related settings that we set in our kernel, in case they are
useful:

vm.compact_unevictable_allowed = 0
vm.compaction_proactiveness = 0
vm.page-cluster = 0
vm.swappiness = 150
vm.vfs_cache_pressure = 50
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 67108864
vm.dirty_writeback_centisecs = 1500
vm.watermark_boost_factor = 0
/sys/kernel/mm/transparent_hugepage/defrag = defer+madvise

[1] https://github.com/firelzrd/le9uo/

--
Regards,
Eric
On Wed, Mar 25, 2026 at 5:27 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> Curiously, I had the user test again, but this time with
> lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at
> all, with or without this patchset.

Ah thanks, that makes sense now. The downstream patch you mentioned
limits the reclaim of file pages to avoid thrashing, and your test
cases exhaust memory on purpose, which forces the kernel to reclaim all
reclaimable folios, including page cache.

A thrashing page cache easily causes desktop hangs; using the TTL is an
effective way to avoid thrashing and trigger OOM early. That's why the
problem is gone with lru_gen_min_ttl_ms = 100 or le9.

> The system actually hung. They needed a hard reset to recover the
> system. (pure speculation: given a few minutes the system would likely
> recover itself as this seems to be a common scenario)

Yeah, I believe so.

Thrashing prevention is why MGLRU's TTL was introduced, so I do suggest
using it. It can be further improved too.

I'll keep that in mind, try to build some test cases covering your
scenario, and make some adjustments.

BTW, how does the kernel behave with MGLRU disabled in your case?
On Wed, Mar 25, 2026 at 05:47:41PM +0800, Kairui Song wrote:
> BTW, how does the kernel behave with MGLRU disabled in your case?

Hi all,

I tested it multiple times on my Fedora machine, comparing MGLRU to the
classic LRU (using v2 of this series, which also includes some minor
improvements). I modified the reproducer a bit just to test the OOM
behavior:

- Running the following command in console A:

fallocate -l 5G 5G
while true; do time cat 5G > /dev/null; done

- Then running the following command in console B:

while true; do tail /dev/zero; done

The console A output is below.

With MGLRU disabled:
...
real 0m4.925s  user 0m0.016s  sys 0m4.904s   # Under pressure
real 0m5.544s  user 0m0.015s  sys 0m5.521s
real 0m5.444s  user 0m0.012s  sys 0m5.425s
real 0m7.607s  user 0m0.016s  sys 0m7.561s
real 0m7.268s  user 0m0.017s  sys 0m7.240s
real 0m6.686s  user 0m0.016s  sys 0m6.656s
real 0m9.919s  user 0m0.014s  sys 0m9.831s   # <- OOM in B triggers
real 0m4.559s  user 0m0.012s  sys 0m4.539s
real 0m1.381s  user 0m0.009s  sys 0m1.362s
real 0m11.816s user 0m0.010s  sys 0m11.795s
real 0m6.797s  user 0m0.021s  sys 0m6.753s
real 0m0.944s  user 0m0.013s  sys 0m0.931s   # <- OOM kill in B ends
real 0m0.285s  user 0m0.013s  sys 0m0.272s

MGLRU enabled, before this series:
...
real 0m0.355s  user 0m0.009s  sys 0m0.346s   # Under pressure
real 0m0.352s  user 0m0.008s  sys 0m0.344s
real 0m0.549s  user 0m0.014s  sys 0m0.535s
real 0m0.628s  user 0m0.009s  sys 0m0.619s
real 0m0.651s  user 0m0.009s  sys 0m0.642s
real 0m5.294s  user 0m0.010s  sys 0m5.280s   # <- OOM in B triggers
real 0m1.041s  user 0m0.014s  sys 0m1.026s
real 0m0.837s  user 0m0.011s  sys 0m0.826s
real 0m2.450s  user 0m0.013s  sys 0m2.435s
real 0m2.499s  user 0m0.012s  sys 0m2.485s
real 0m1.857s  user 0m0.015s  sys 0m1.841s
real 0m0.512s  user 0m0.015s  sys 0m0.497s
real 0m0.418s  user 0m0.011s  sys 0m0.407s   # <- OOM kill in B ends
real 0m0.282s  user 0m0.010s  sys 0m0.272s

MGLRU enabled, after this series:
...
real 0m0.280s  user 0m0.015s  sys 0m0.265s   # Under pressure
real 0m0.283s  user 0m0.010s  sys 0m0.273s
real 0m0.278s  user 0m0.012s  sys 0m0.266s
real 0m0.315s  user 0m0.018s  sys 0m0.297s
real 0m0.679s  user 0m0.014s  sys 0m0.663s
real 0m0.716s  user 0m0.011s  sys 0m0.705s
real 0m0.657s  user 0m0.009s  sys 0m0.648s
real 0m6.615s  user 0m0.007s  sys 0m6.453s   # <- OOM in B triggers
real 0m1.244s  user 0m0.018s  sys 0m1.226s
real 0m1.290s  user 0m0.014s  sys 0m1.276s
real 0m1.119s  user 0m0.011s  sys 0m1.108s
real 0m0.882s  user 0m0.010s  sys 0m0.872s
real 0m0.855s  user 0m0.007s  sys 0m0.848s
real 0m0.933s  user 0m0.005s  sys 0m0.928s
real 0m0.833s  user 0m0.009s  sys 0m0.823s
real 0m0.279s  user 0m0.012s  sys 0m0.267s   # <- OOM kill in B ends
real 0m0.273s  user 0m0.010s  sys 0m0.263s

It seems that with MGLRU enabled, both performance and OOM jitter are
better. As for this series, it now has no significant effect, or it
slightly changed the jitter pattern, which I can't say is better or
worse: the peak latency seems slightly higher, but the system seems to
recover faster. Or maybe that's just noise.

The OOM behavior is not really perfect in any case, but with MGLRU's
TTL enabled, I got confirmation that the jitter is gone completely
(only a few frames).