This series cleans up and slightly improves MGLRU's reclaim loop and
dirty flush logic. As a result, we see up to a ~50% reduction in file
faults and a 30% increase in MongoDB throughput with YCSB and no swap
involved; other common benchmarks show no regression, LOC is reduced,
and there are fewer unexpected OOMs in our production environment.
Some of the problems were found in our production environment, and
others are mostly exposed while stress testing the LFU-like design as
proposed in the LSM/MM/BPF topic this year [1]. This series has no
direct relationship to that topic, but it cleans up the code base and
fixes several strange behaviors that make the test result of the
LFU-like design not as good as expected.
MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic
is quite different, making the reclaim loop hard to follow and the
dirty flush ineffective.
This series slightly cleans up and improves the reclaim loop by
introducing a scan budget: the number of folios to scan is calculated
at the beginning of the loop, and aging is decoupled from the reclaim
calculation helpers. It then moves the dirty flush logic inside the
reclaim loop so it can kick in more effectively. These issues are
somewhat related, and this series handles them together, improving
MGLRU reclaim in several ways.
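The scan-budget idea can be sketched as a toy model (illustrative names
and numbers only; this is not the actual mm/vmscan.c logic):

```python
# Toy model of the scan-budget approach described above: compute the
# number of folios to scan once, up front, then consume that budget in
# fixed-size batches inside the loop.

SWAP_CLUSTER_MAX = 32  # the kernel's usual reclaim batch size


def reclaim_with_budget(nr_to_scan, evictable_ratio=0.5):
    """Scan in batches until the up-front budget is consumed."""
    # Clamp to a sane lower bound so small machines still make progress.
    nr_to_scan = max(nr_to_scan, SWAP_CLUSTER_MAX)
    nr_reclaimed = 0
    while nr_to_scan > 0:
        batch = min(nr_to_scan, SWAP_CLUSTER_MAX)
        # Pretend a fraction of each scanned batch turns out evictable.
        nr_reclaimed += int(batch * evictable_ratio)
        nr_to_scan -= batch
    return nr_reclaimed


# A budget of 200 folios: six full batches of 32, then one of 8.
print(reclaim_with_budget(200))
```

Computing the budget once keeps the loop a simple consumer of that
budget, instead of re-deriving scan targets inside each iteration.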
Test results: all tests were done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, the
WiredTiger cache size is set to 4.5G, and NVMe is used as storage.
No swap is used.
Median of 3 test runs; results are stable.
Before:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us): 507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988
After:
Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227 (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186 (-52.9%, lower is better)
We can see a significant performance improvement after this series for
file-cache-heavy workloads like this one. The test was done on NVMe;
the performance gap would be even larger for slow devices, and we
observed >100% gains for some other workloads running on HDD devices.
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with
2 nodes and 128G of memory, using 256G of ZRAM as swap, and spawning 32
memcgs and 64 workers:
Before:
Total requests: 77920
Per-worker 95% CI (mean): [1199.9, 1235.1]
Per-worker stdev: 70.5
Jain's fairness: 0.996706 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 25649 32.92% 32.92%
[1,2)s 7759 9.96% 42.87%
[2,4)s 5156 6.62% 49.49%
[4,8)s 39356 50.51% 100.00%
After:
Total requests: 79564
Per-worker 95% CI (mean): [1224.2, 1262.2]
Per-worker stdev: 76.1
Jain's fairness: 0.996328 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 25485 32.03% 32.03%
[1,2)s 8661 10.89% 42.92%
[2,4)s 6268 7.88% 50.79%
[4,8)s 39150 49.21% 100.00%
The results look identical: reclaim is still fair and effective, and
the total request count is slightly better.
OOM issue [4]
=============
Testing with a specific reproducer [4] to simulate what we encountered
in our production environment. Still using the same test machine, but
one node is used as a pmem ramdisk following the steps in the
reproducer; no swap is used.
This reproducer spawns multiple workers that keep reading the given
file using mmap, pausing for 120ms after each file read batch. It also
spawns another set of workers that keep allocating and freeing a given
amount of anonymous memory. The total memory size exceeds the memory
limit (e.g. 44G anon + 8G file, i.e. 52G vs. a 48G memcg limit). But by
evicting the file cache, the workload should hold up just fine,
especially given that the file workers pause after every batch,
allowing other workers to catch up.
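A tiny, scaled-down sketch of the reproducer's structure (the real one
is at [4]; sizes and iteration counts here are toy values so the sketch
runs anywhere, while the real test uses ~44G anon + ~8G file against a
48G memcg limit, with many parallel workers):

```python
# Toy sketch: mmap-read a file in batches with a 120ms pause between
# batches, while churning anonymous memory in between.  This only
# models the access pattern; it does not create real memory pressure.
import mmap
import os
import tempfile
import time

PAUSE_S = 0.12        # 120ms pause after each file read batch
FILE_SIZE = 1 << 20   # toy value; the real file is gigabytes
ANON_SIZE = 1 << 20   # toy value; real anon churn is gigabytes
PAGE = 4096


def file_read_batch(path):
    """One batch: touch every page of the file through a read-only mmap."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return sum(m[off] for off in range(0, len(m), PAGE))


def anon_churn():
    """Allocate, dirty, and free a chunk of anonymous memory."""
    buf = bytearray(ANON_SIZE)
    for off in range(0, ANON_SIZE, PAGE):
        buf[off] = 1  # dirty each page so it is really backed
    del buf


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

iterations_done = 0
for _ in range(3):    # the real test runs 128 iterations
    file_read_batch(path)
    anon_churn()
    time.sleep(PAUSE_S)
    iterations_done += 1
os.unlink(path)
print(f"Finished {iterations_done} iterations.")
```

The point of the pause is that the file workers yield regularly, so a
well-behaved reclaimer should be able to keep evicting file cache
without triggering OOM.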
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
Hung or OOMed with the following info after about 10-20 iterations:
[ 357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
... <snip> ...
[ 357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
[ 357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 357.348192] Memory cgroup stats for /demo:
[ 357.348314] anon 46724382720
[ 357.348963] file 4160753664
OOM occurs despite there still being evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
With aging blocking reclaim, the OOM is much more likely to occur. This
issue is mostly fixed by patch 6 and the result is much better, but
this series is still only the first step toward improving file folio
reclaim in MGLRU, as there are still cases where file folios can't be
effectively reclaimed.
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=96 --time=600 run
Before: 22343.701667 tps
After patch 4: 22327.325000 tps
After patch 5: 22373.224000 tps
After patch 6: 22321.174000 tps
After patch 7: 22625.961667 tps (+1.26%, higher is better)
MySQL is anon-folio heavy but still looks good. There seem to be only
noise-level changes and no regression.
FIO:
====
Testing with the following command, where /mnt is an EXT4 ramdisk, 6
test runs each in a 10G memcg:
fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
--ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
--iodepth_batch_complete=32 --rw=randread \
--random_distribution=zipf:1.2 --norandommap --time_based \
--ramp_time=1m --runtime=10m --group_reporting
Before: 32039.56 MB/s
After patch 3: 32751.50 MB/s
After patch 4: 32703.03 MB/s
After patch 5: 33395.52 MB/s
After patch 6: 32031.51 MB/s
After patch 7: 32534.29 MB/s
Also only noise-level changes and no regression.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 8 test runs each.
Before: 2881.41s
After patch 3: 2894.09s
After patch 4: 2846.73s
After patch 5: 2847.91s
After patch 6: 2835.17s
After patch 7: 2842.90s
Also only noise-level changes; no regression, or very slightly better.
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (8):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: simplify and improve dirty writeback handling
mm/vmscan: remove sc->file_taken
mm/vmscan.c | 191 ++++++++++++++++++++++++++----------------------------------
1 file changed, 81 insertions(+), 110 deletions(-)
---
base-commit: dffde584d8054e88e597e3f28de04c7f5d191a67
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
Hi Kairui,

On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty flush logic. As a result, we can see an up to ~50% reduce of file
> faults and 30% increase in MongoDB throughput with YCSB and no swap
> involved [...]

I applied this patch set to 7.0-rc5 and noticed the system locking up
when performing the below test:

fallocate -l 5G 5G
while true; do tail /dev/zero; done
while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done

After reading [1], I suspect that this was because the system was using
zram as swap, and indeed, if zram is disabled the lockup does not occur.

Is there anything that I (CachyOS) can do to help debug this
regression, if it is to be considered one? According to [1], zram as
swap seems to be unsupported by upstream. (The user that tested this
wasn't able to get a good kernel trace; the only thing left was a trace
of the OOM killer firing.)

[1] https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html

--
Regards,
Eric
On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> I applied this patch set to 7.0-rc5 and noticed the system locking up
> when performing the below test.
>
> fallocate -l 5G 5G
> while true; do tail /dev/zero; done
> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
>
> After reading [1], I suspect that this was because the system was
> using zram as swap, and yes if zram is disabled then the lock up does
> not occur.

Hi Eric,

Thanks for the report. I was about to send V2, but after seeing your
report I'll try to reproduce your issue first.

So far I haven't noticed any regression. Is this an issue caused by
this series or an existing one? I don't have any context about how you
are running the test. BTW, the calculation in patch "mm/mglru:
restructure the reclaim loop" needs a lower bound of
"max(nr_to_scan, SWAP_CLUSTER_MAX)" for small machines; not sure if
that is related, but I will add it in V2.

And about the test you posted:

while true; do tail /dev/zero; done

I believe this will just consume all memory with zero pages and then
get OOM killed; that's exactly what the test is meant to do. I'm not
sure what you mean by lockup, since you mentioned an OOM kill. Did the
system actually hang, or is the desktop dead?

I just ran that with and without ZRAM on two machines and my laptop,
and everything looks good here with this series.

> zram as swap seems to be unsupported by upstream.

That's simply not true; other distros like Fedora even enable ZRAM as
swap by default:
https://fedoraproject.org/wiki/Changes/SwapOnZRAM

And systemd has widely used ZRAM swap support:
https://github.com/systemd/zram-generator

Android also uses it, and we are using ZRAM by default in our fleet,
which runs fine.

> the user that tested this wasn't able to get a
> good kernel trace, the only thing left was
> a trace of the OOM killer firing.

No worry, that's fine. Just send me the OOM trace or log; the more
detailed context I get, the better.
On 3/25/26 1:47 PM, Kairui Song wrote:
> So far I haven't noticed any regression. Is this an issue caused by
> this series or an existing one? I don't have any context about how you
> are running the test. [...]

As of writing this, I got some new information that makes this a bit
more confusing. The kernel that doesn't have the issue was patched with
[1] as a means of protecting the working set (similar to
lru_gen_min_ttl_ms).

So this time, on an unpatched kernel, the system still freezes but
quickly recovers itself after about 2 seconds. With this patchset
applied, the system freezes but doesn't quickly recover (if at all).

Curiously, I had the user test again, but this time with
lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at
all, with or without this patchset.

> I'm not sure what you mean by lockup, since you mentioned an OOM
> kill. Did the system actually hang, or is the desktop dead?

The system actually hung. They needed a hard reset to recover it.
(Pure speculation: given a few minutes, the system would likely have
recovered itself, as this seems to be a common scenario.)

Here's the call trace that was recovered:

Mar 25 08:24:22 osiris kernel: Call Trace:
Mar 25 08:24:22 osiris kernel: <TASK>
Mar 25 08:24:22 osiris kernel: dump_stack_lvl+0x61/0x80
Mar 25 08:24:22 osiris kernel: dump_header+0x4a/0x160
Mar 25 08:24:22 osiris kernel: oom_kill_process+0x18f/0x1f0
Mar 25 08:24:22 osiris kernel: out_of_memory+0x4ab/0x5c0
Mar 25 08:24:22 osiris kernel: __alloc_pages_slowpath+0x9ac/0x1060
Mar 25 08:24:22 osiris kernel: __alloc_frozen_pages_noprof+0x29a/0x320
Mar 25 08:24:22 osiris kernel: alloc_pages_mpol+0x107/0x1b0
Mar 25 08:24:22 osiris kernel: folio_alloc_noprof+0x85/0xb0
Mar 25 08:24:22 osiris kernel: __filemap_get_folio_mpol+0x1ff/0x4c0
Mar 25 08:24:22 osiris kernel: filemap_fault+0x3e3/0x6e0
Mar 25 08:24:22 osiris kernel: __do_fault+0x46/0x140
Mar 25 08:24:22 osiris kernel: do_pte_missing+0x154/0xea0
Mar 25 08:24:22 osiris kernel: ? __pte_offset_map+0x1d/0xd0
Mar 25 08:24:22 osiris kernel: handle_mm_fault+0x89c/0x1280
Mar 25 08:24:22 osiris kernel: do_user_addr_fault+0x23b/0x720
Mar 25 08:24:22 osiris kernel: exc_page_fault+0x75/0xe0
Mar 25 08:24:22 osiris kernel: asm_exc_page_fault+0x26/0x30
Mar 25 08:24:22 osiris kernel: RIP: 0033:0x7fec4beb43c0
Mar 25 08:24:22 osiris kernel: Code: Unable to access opcode bytes at 0x7fec4beb4396.
Mar 25 08:24:22 osiris kernel: RSP: 002b:00007ffcb348d698 EFLAGS: 00010293
Mar 25 08:24:22 osiris kernel: RAX: 00000000c70f6907 RBX: 00007ffcb348d8d0 RCX: 00007fec4bb1604d
Mar 25 08:24:22 osiris kernel: RDX: c6a4a7935bd1e995 RSI: 4fb7dae88ad99bfb RDI: 000055ee77cc8150
Mar 25 08:24:22 osiris kernel: RBP: 00007ffcb348dd60 R08: 000055ee77cc8158 R09: 000000000000000c
Mar 25 08:24:22 osiris kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
Mar 25 08:24:22 osiris kernel: R13: 000055ee77cc8150 R14: 0000000000000064 R15: 431bde82d7b634db
Mar 25 08:24:22 osiris kernel: </TASK>

Some mm-related settings that we set in our kernel, in case they are
useful:

vm.compact_unevictable_allowed = 0
vm.compaction_proactiveness = 0
vm.page-cluster = 0
vm.swappiness = 150
vm.vfs_cache_pressure = 50
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 67108864
vm.dirty_writeback_centisecs = 1500
vm.watermark_boost_factor = 0
/sys/kernel/mm/transparent_hugepage/defrag = defer+madvise

[1] https://github.com/firelzrd/le9uo/

--
Regards,
Eric
On Wed, Mar 25, 2026 at 5:27 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> Curiously, I had the user test again, but this time with
> lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at
> all, with or without this patchset.

Ah thanks, that makes sense now. The downstream patch you mentioned
limits the reclaim of file pages to avoid thrashing, and your test
cases exhaust memory on purpose, which forces the kernel to reclaim all
reclaimable folios, including page cache.

A thrashing page cache easily causes desktop hangs; using the TTL is an
effective way to avoid thrashing and trigger OOM early. That's why the
problem is gone with lru_gen_min_ttl_ms = 100 or le9.

> The system actually hung. They needed a hard reset to recover the
> system. (pure speculation: given a few minutes the system would likely
> recover itself as this seems to be a common scenario)

Yeah, I believe so.

Thrashing prevention is why MGLRU's TTL was introduced, so I do suggest
using it. It can be further improved too.

I'll keep that in mind, try to build some test cases covering your
scenario, and make some adjustments.

BTW, how does the kernel behave with MGLRU disabled in your case?
On Wed, Mar 25, 2026 at 05:47:41PM +0800, Kairui Song wrote:
> BTW, how does the kernel behave with MGLRU disabled in your case?

Hi all,

I tested it multiple times on my Fedora machine, comparing MGLRU to the
classic LRU (using v2 of this series, which also includes some minor
improvements). I modified the reproducer a bit just to test the OOM
behavior:

- Running the following command in console A:

fallocate -l 5G 5G
while true; do time cat 5G > /dev/null; done

- Then running the following command in console B:

while true; do tail /dev/zero; done

The console A output is below.

With MGLRU disabled:
...
real 0m4.925s  user 0m0.016s  sys 0m4.904s   # Under pressure
real 0m5.544s  user 0m0.015s  sys 0m5.521s
real 0m5.444s  user 0m0.012s  sys 0m5.425s
real 0m7.607s  user 0m0.016s  sys 0m7.561s
real 0m7.268s  user 0m0.017s  sys 0m7.240s
real 0m6.686s  user 0m0.016s  sys 0m6.656s
real 0m9.919s  user 0m0.014s  sys 0m9.831s   # <- OOM in B triggers
real 0m4.559s  user 0m0.012s  sys 0m4.539s
real 0m1.381s  user 0m0.009s  sys 0m1.362s
real 0m11.816s user 0m0.010s  sys 0m11.795s
real 0m6.797s  user 0m0.021s  sys 0m6.753s
real 0m0.944s  user 0m0.013s  sys 0m0.931s   # <- OOM kill in B ends
real 0m0.285s  user 0m0.013s  sys 0m0.272s

MGLRU enabled, before this series:
...
real 0m0.355s  user 0m0.009s  sys 0m0.346s   # Under pressure
real 0m0.352s  user 0m0.008s  sys 0m0.344s
real 0m0.549s  user 0m0.014s  sys 0m0.535s
real 0m0.628s  user 0m0.009s  sys 0m0.619s
real 0m0.651s  user 0m0.009s  sys 0m0.642s
real 0m5.294s  user 0m0.010s  sys 0m5.280s   # <- OOM in B triggers
real 0m1.041s  user 0m0.014s  sys 0m1.026s
real 0m0.837s  user 0m0.011s  sys 0m0.826s
real 0m2.450s  user 0m0.013s  sys 0m2.435s
real 0m2.499s  user 0m0.012s  sys 0m2.485s
real 0m1.857s  user 0m0.015s  sys 0m1.841s
real 0m0.512s  user 0m0.015s  sys 0m0.497s
real 0m0.418s  user 0m0.011s  sys 0m0.407s   # <- OOM kill in B ends
real 0m0.282s  user 0m0.010s  sys 0m0.272s

MGLRU enabled, after this series:
...
real 0m0.280s  user 0m0.015s  sys 0m0.265s   # Under pressure
real 0m0.283s  user 0m0.010s  sys 0m0.273s
real 0m0.278s  user 0m0.012s  sys 0m0.266s
real 0m0.315s  user 0m0.018s  sys 0m0.297s
real 0m0.679s  user 0m0.014s  sys 0m0.663s
real 0m0.716s  user 0m0.011s  sys 0m0.705s
real 0m0.657s  user 0m0.009s  sys 0m0.648s
real 0m6.615s  user 0m0.007s  sys 0m6.453s   # <- OOM in B triggers
real 0m1.244s  user 0m0.018s  sys 0m1.226s
real 0m1.290s  user 0m0.014s  sys 0m1.276s
real 0m1.119s  user 0m0.011s  sys 0m1.108s
real 0m0.882s  user 0m0.010s  sys 0m0.872s
real 0m0.855s  user 0m0.007s  sys 0m0.848s
real 0m0.933s  user 0m0.005s  sys 0m0.928s
real 0m0.833s  user 0m0.009s  sys 0m0.823s
real 0m0.279s  user 0m0.012s  sys 0m0.267s   # <- OOM kill in B ends
real 0m0.273s  user 0m0.010s  sys 0m0.263s

It seems that with MGLRU enabled, both performance and OOM jitter are
better. As for this series, it now has no significant effect, or it
slightly changed the jitter pattern, which I can't say is better or
worse: the peak latency seems slightly higher, but the system seems to
recover faster. Or maybe that's just noise.

The OOM behavior is not really perfect in any case, but with MGLRU's
TTL enabled, I got confirmation that the jitter is gone completely
(only a few frames).