From: Kairui Song <kasong@tencent.com>
This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we can see an up to ~30% increase
in some workloads like MongoDB with YCSB and a huge decrease in file
refault, no swap involved. Other common benchmarks have no regression,
and LOC is reduced, with less unexpected OOM, too.
Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.
MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.
This series addresses these issues by using a scan budget: the number of
folios to scan is calculated once at the beginning of the loop, and
aging is decoupled from the reclaim calculation helpers. The dirty flush
logic is then moved inside the reclaim loop so it can kick in more
effectively. Since these issues are related, this series handles them
together and improves MGLRU reclaim in several ways.
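To illustrate the scan budget idea in isolation, here is a toy Python
model. It is only a sketch: the function names, the budget formula
(evictable >> priority with a small floor), the batch size, and the
fixed reclaim ratio are all assumptions made for illustration, not the
code in mm/vmscan.c.

```python
# Toy model only. Names, sizes and the budget formula are assumptions
# for illustration; this is not the logic in mm/vmscan.c.

def scan_budget(evictable, priority, min_batch=64):
    """Compute the scan budget once, before the reclaim loop starts."""
    # Assumed formula: scale by priority, but never go below a small
    # floor (capped by the LRU size so tiny LRUs are not over-scanned).
    return max(evictable >> priority, min(evictable, min_batch))


def reclaim(evictable, priority, nr_to_reclaim, batch=32, hit_rate=0.5):
    """Scan in batches until the budget or the reclaim target is reached."""
    budget = scan_budget(evictable, priority)
    scanned = reclaimed = 0
    while scanned < budget and reclaimed < nr_to_reclaim:
        nr = min(batch, budget - scanned)   # never exceed the budget
        scanned += nr
        reclaimed += int(nr * hit_rate)     # pretend a fixed share is reclaimable
    return scanned, reclaimed
```

The point of the sketch is that the loop bound is fixed up front, so
nothing inside the loop (aging in particular) needs to feed back into
the scan number calculation.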
Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
the WiredTiger cache size is set to 4.5G, using NVME as storage.
Not using SWAP.
Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071
After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923 (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366 (-43.3%, lower is better)
We can see a significant performance improvement after this series.
The test is done on NVME and the performance gap would be even larger
for slow devices, such as HDD or network storage. We observed over
100% gain for some workloads with slow IO.
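The percentages in the "After" block are relative changes against the
"Before" run. A small Python sketch to recompute them from the raw
numbers above (the helper name pct_change is made up for this example):

```python
def pct_change(before, after):
    """Relative change of `after` against `before`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the MongoDB results above.
results = {
    "Throughput(ops/sec)": (62485.02962831822, 79760.71784646061),
    "AverageLatency(us)": (500.9746963330107, 391.25169970043726),
    "pgpgin": (159347462, 111093923),
    "workingset_refault_file": (34522071, 19566366),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```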
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with 2
nodes and 128G memory, using 256G ZRAM as swap and spawning 32 memcgs
and 64 workers:
Before:
Total requests: 79915
Per-worker 95% CI (mean): [1233.9, 1263.5]
Per-worker stdev: 59.2
Jain's fairness: 0.997795 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26859 33.61% 33.61%
[1,2)s 7818 9.78% 43.39%
[2,4)s 5532 6.92% 50.31%
[4,8)s 39706 49.69% 100.00%
After:
Total requests: 81382
Per-worker 95% CI (mean): [1241.9, 1301.3]
Per-worker stdev: 118.8
Jain's fairness: 0.991480 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26696 32.80% 32.80%
[1,2)s 8745 10.75% 43.55%
[2,4)s 6865 8.44% 51.98%
[4,8)s 39076 48.02% 100.00%
Reclaim is still fair and effective, total requests number seems
slightly better.
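For reference, Jain's fairness index over n per-worker values x_i is
(sum x_i)^2 / (n * sum x_i^2). A minimal Python sketch follows; the
per-worker request counts are not part of this cover letter, so the
function is shown on its own and any input fed to it here would be
hypothetical:

```python
def jain_fairness(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all values are equal and 1/n when a single
    worker gets everything.
    """
    n = len(values)
    sum_sq = sum(x * x for x in values)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * sum_sq)
```

An index close to 1.0, as in both runs above, means the workers
completed nearly the same number of requests.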
OOM issue with aging and throttling
===================================
For the throttling OOM issue, it can be easily reproduced using dd and
cgroup limit as demonstrated in patch 14, and fixed by this series.
The aging OOM is a bit trickier. A specific reproducer can be used to
simulate what we encountered in our production environment [4]:
It spawns multiple workers that keep reading the given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given size of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
OOM with the following info after about 10-20 iterations:
[ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
[ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 62.640823] Memory cgroup stats for /demo:
[ 62.641017] anon 10604879872
[ 62.641941] file 6574858240
OOM occurs despite there being still evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
Worth noting there is another OOM related issue reported in V1 of
this series, which is tested and looking OK now [5].
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=48 --time=600 run
Before: 17260.781429 tps
After this series: 17266.842857 tps
MySQL is anon-folio heavy and also involves writeback and file folios,
and it still looks good. Only noise-level changes, no regression.
FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
6 test runs each:
fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
--name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
--rw=randread --norandommap --time_based \
--ramp_time=1m --runtime=5m --group_reporting
Before: 9196.481429 MB/s
After this series: 9256.105000 MB/s
Again only noise-level changes: no regression, or slightly better.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 12 test runs each.
Before: 2589.63s
After this series: 2543.58s
Again only noise-level changes: no regression, or very slightly better.
Android:
========
Xinyu reported a performance gain on Android, too, with this series. The
test consisted of cold-starting multiple applications sequentially under
moderate system load. [6]
Before:
Launch Time Summary (all apps, all runs)
Mean 868.0ms
P50 888.0ms
P90 1274.2ms
P95 1399.0ms
After:
Launch Time Summary (all apps, all runs)
Mean 850.5ms (-2.07%)
P50 861.5ms (-3.04%)
P90 1179.0ms (-8.05%)
P95 1228.0ms (-12.2%)
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v6:
- Avoid potential over rotation of tiny cgroup (<16M):
https://lore.kernel.org/linux-mm/CAMgjq7ArnmmoHOGRt6Wc8hu7tjx_t583-UVzJK+HOHgjjetQ9g@mail.gmail.com/
- Avoid potentially skewed stat counter:
https://lore.kernel.org/linux-mm/CAMgjq7DCn8p_yMMhiejFjX6sdybZKYOw8qJbq=+OCsZ=AfJnFA@mail.gmail.com/
- Update a few comments and variable names as suggested by Barry Song.
- Tested over days, also tested on my Android phone, everything still
matches the cover letter description. And add test result from Xinyu.
- Link to v5: https://patch.msgid.link/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com
Changes in v5:
- Add back a more moderate minimal batch limit for each reclaim loop:
https://lore.kernel.org/linux-mm/adYP81AhpNf0znp3@KASONG-MC4/
- Collect review-by.
- Link to v4: https://patch.msgid.link/20260407-mglru-reclaim-v4-0-98cf3dc69519@tencent.com
Changes in v4:
- Remove the minimal scan batch limit, and add rotate for
unevictable memcg as reported by sashiko:
https://lore.kernel.org/linux-mm/ac8xVN82LBLDZpIO@KASONG-MC4/
- Slightly improve a few commit messages.
- Reran the test and seems identical with before so data are unchanged.
- Collect review-by.
- Link to v3: https://patch.msgid.link/20260403-mglru-reclaim-v3-0-a285efd6ff91@tencent.com
Changes in v3:
- Don't force scan at least SWAP_CLUSTER_MAX pages for each reclaim
loop. If the LRU is too small, adjust it accordingly. Now the
multi-cgroup scan balance looked even better for tiny cgroups:
https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap constraint check in isolate_folio. In
theory, it's fine, and both stress test and performance test didn't
show any issue:
https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests, all seem identical, so most data is kept.
Intermediate test results are dropped. I ran tests on most patches
individually, and there is no problem, but the series is getting too
long, and posting them makes it harder to read and unnecessary.
- Split the previous patch 8 into two patches as suggested [ Shakeel Butt ],
also some test results are collected to support the design:
https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
I kept Axel's review-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid stale empty gen and drop
its return argument [ Baolin Wang ]
- Move a few lines of code between patches to where they fit better,
the final result is identical [ Baolin Wang ].
- Collect tested by and update test setup [ Leno Hou ]
- Collect review by.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com
Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
[ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
review suggested that we shouldn't leave the counter and reclaim
feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX limit in patch
"restructure the reclaim loop", the change is trivial but might
help avoid livelock for tiny cgroups.
- Redo the tests. Most results are basically identical to before, but
they were rerun just in case, since the series now also solves the
throttling issue, as discussed with the reports from CachyOS.
- Add a separate patch for variable renaming as suggested by [ Barry
Song ]. No feature change.
- Improve several comments and code issues [ Axel Rasmussen ].
- Remove no longer needed variable [ Axel Rasmussen ].
- Collect review by.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com
---
Kairui Song (14):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: rename variables related to aging and rotation
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: remove redundant swap constrained check upon isolation
mm/mglru: use the common routine for dirty/writeback reactivation
mm/mglru: simplify and improve dirty writeback handling
mm/mglru: remove no longer used reclaim argument for folio protection
mm/vmscan: remove sc->file_taken
mm/vmscan: remove sc->unqueued_dirty
mm/vmscan: unify writeback reclaim statistic and throttling
mm/vmscan.c | 330 ++++++++++++++++++++++++++----------------------------------
1 file changed, 143 insertions(+), 187 deletions(-)
---
base-commit: 2bcc13c29c711381d815c1ba5d5b25737400c71a
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:

> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.

Thanks, I queued this up in mm-new.

And thanks to all the testers and reviewers - it's so good to see MGLRU
getting some attention.

On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:

> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.

Sashiko:
https://sashiko.dev/#/patchset/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com

On Fri, Apr 24, 2026 at 9:42 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:
>
> > This series cleans up and slightly improves MGLRU's reclaim loop and
> > dirty writeback handling. As a result, we can see an up to ~30% increase
> > in some workloads like MongoDB with YCSB and a huge decrease in file
> > refault, no swap involved. Other common benchmarks have no regression,
> > and LOC is reduced, with less unexpected OOM, too.
>
> Sashiko:
> https://sashiko.dev/#/patchset/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com
>

Thanks. It's interesting that I followed sashiko's previous suggestion,
since it's a trivial part, and now sashiko is suggesting changing it
back to what I was doing :)

Nothing critical, and I just saw a false positive, but anyway, I'll
forward the review to the corresponding patch. I do prefer to change it
back to the previous design, and maybe make some other slight
improvements too.

Let me send an update in a few days after double checking the swappiness
issue reported by Barry.

On Fri, Apr 24, 2026 at 1:43 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> [...]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Hi Kairui,

I haven't identified the exact commit, but this patchset seems to
make MGLRU's swappiness behavior more erratic.

In mainline, MGLRU does not show as strong an effect as the
active/inactive LRU, but it still behaves roughly linearly: higher
swappiness leads to more swap activity and fewer file refaults.

With this patchset, however, the behavior becomes non-monotonic as
swappiness increases. I observed clear up-and-down fluctuations.

I reproduced this by running a kernel build in a memcg limited to
1GB, with swappiness set to 35, 70, 105, 140, and 175.

this is mainline using MGLRU:

*** Executing round 1 ***
set swappiness to 35

real 1m49.247s
user 25m30.484s
sys 3m37.203s
pswpin: 933731
pswpout: 3365968
pgpgin: 5649320
pgpgout: 13786572
swpout_zero: 794960
swpin_zero: 10594
refault_file: 354998
refault_anon: 944323

*** Executing round 2 ***
set swappiness to 70

real 1m49.313s
user 25m31.643s
sys 3m40.661s
pswpin: 1049052
pswpout: 3565887
pgpgin: 5694288
pgpgout: 14582200
swpout_zero: 840947
swpin_zero: 12029
refault_file: 242973
refault_anon: 1061033

*** Executing round 3 ***
set swappiness to 105

real 1m48.611s
user 25m32.198s
sys 3m37.210s
pswpin: 981095
pswpout: 3396069
pgpgin: 5283940
pgpgout: 13898988
swpout_zero: 795932
swpin_zero: 11249
refault_file: 202432
refault_anon: 992295

*** Executing round 4 ***
set swappiness to 140

real 1m49.398s
user 25m35.650s
sys 3m50.656s
pswpin: 1222881
pswpout: 3935186
pgpgin: 6165024
pgpgout: 16056664
swpout_zero: 913808
swpin_zero: 13251
refault_file: 191564
refault_anon: 1236083

*** Executing round 5 ***
set swappiness to 175

real 1m49.513s
user 25m35.442s
sys 3m55.869s
pswpin: 1343139
pswpout: 4256014
pgpgin: 6557152
pgpgout: 17341452
swpout_zero: 998107
swpin_zero: 15692
refault_file: 175795
refault_anon: 1358782

this is mm-new using MGLRU:

*** Executing round 1 ***
set swappiness to 35

real 1m51.804s
user 25m38.070s
sys 4m16.301s
pswpin: 1587728
pswpout: 4932011
pgpgin: 8788688
pgpgout: 20062761
swpout_zero: 1129975
swpin_zero: 17944
refault_file: 487923
refault_anon: 1605670

*** Executing round 2 ***
set swappiness to 70

real 1m51.503s
user 25m37.581s
sys 4m18.161s
pswpin: 1743890
pswpout: 5214587
pgpgin: 8676728
pgpgout: 21178716
swpout_zero: 1185453
swpin_zero: 20016
refault_file: 317993
refault_anon: 1763904

*** Executing round 3 ***
set swappiness to 105

real 1m51.154s
user 25m37.956s
sys 4m15.017s
pswpin: 1687517
pswpout: 5073825
pgpgin: 8173036
pgpgout: 20608932
swpout_zero: 1161806
swpin_zero: 20069
refault_file: 249769
refault_anon: 1707538

*** Executing round 4 ***
set swappiness to 140

real 1m50.732s
user 25m37.686s
sys 4m16.066s
pswpin: 1671678
pswpout: 5118895
pgpgin: 7929960
pgpgout: 20790468
swpout_zero: 1171029
swpin_zero: 19596
refault_file: 193421
refault_anon: 1691228

*** Executing round 5 ***
set swappiness to 175

real 1m49.518s
user 25m37.653s
sys 4m12.619s
pswpin: 1506888
pswpout: 4789793
pgpgin: 7270448
pgpgout: 19479188
swpout_zero: 1119251
swpin_zero: 16699
refault_file: 187304
refault_anon: 1523585

The final one is classic active/inactive LRU:

*** Executing round 1 ***
set swappiness to 35

real 1m50.038s
user 25m21.911s
sys 3m42.798s
pswpin: 476994
pswpout: 2258185
pgpgin: 5247280
pgpgout: 9354640
swpout_zero: 684759
swpin_zero: 6387
refault_file: 750021
refault_anon: 483334

*** Executing round 2 ***
set swappiness to 70

real 1m48.781s
user 25m25.682s
sys 3m37.854s
pswpin: 515470
pswpout: 2306901
pgpgin: 4265500
pgpgout: 9547436
swpout_zero: 706437
swpin_zero: 6960
refault_file: 459740
refault_anon: 522381

*** Executing round 3 ***
set swappiness to 105

real 1m48.233s
user 25m26.623s
sys 3m38.843s
pswpin: 519540
pswpout: 2343897
pgpgin: 3628788
pgpgout: 9696500
swpout_zero: 743576
swpin_zero: 7782
refault_file: 303701
refault_anon: 527273

*** Executing round 4 ***
set swappiness to 140

real 1m48.800s
user 25m32.067s
sys 3m50.751s
pswpin: 605537
pswpout: 2615227
pgpgin: 3470540
pgpgout: 10776312
swpout_zero: 825446
swpin_zero: 9055
refault_file: 173236
refault_anon: 614544

*** Executing round 5 ***
set swappiness to 175

real 1m52.356s
user 25m29.727s
sys 3m55.664s
pswpin: 698228
pswpout: 2908292
pgpgin: 3602884
pgpgout: 11945332
swpout_zero: 912127
swpin_zero: 10298
refault_file: 117625
refault_anon: 708478

The build script is available here if you want to have a try:
https://git.kernel.org/pub/scm/linux/kernel/git/baohua/linux.git/diff/tools/mm/build-kernel-with-increasing-swappiness.sh?h=zram-async-gc&id=d47888e9

I am also debugging this. One possibility is that placing dirty pages
in the youngest generation may have affected lruvec_evictable_size()?

Thanks
Barry
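
One hedged way to put a number on "up-and-down fluctuations" is to count
how often a counter changes direction as swappiness goes from 35 to 175.
The Python sketch below does this for the pswpout values reported above.
The helper name and the choice of metric are assumptions made for
illustration, not part of the reported test script.

```python
def direction_changes(values):
    """Count sign flips between consecutive non-zero deltas.

    0 means the series is monotonic; each flip is one up/down turn.
    """
    deltas = [b - a for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for d0, d1 in zip(deltas, deltas[1:]) if (d0 > 0) != (d1 > 0))


# pswpout at swappiness 35, 70, 105, 140, 175, copied from the report above.
pswpout = {
    "mainline MGLRU": [3365968, 3565887, 3396069, 3935186, 4256014],
    "mm-new MGLRU": [4932011, 5214587, 5073825, 5118895, 4789793],
    "classic LRU": [2258185, 2306901, 2343897, 2615227, 2908292],
}

for name, series in pswpout.items():
    print(f"{name}: {direction_changes(series)} direction change(s)")
```

A single run per swappiness value is noisy, so a count like this is
only a rough indicator and not a substitute for repeated runs.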

On Fri, Apr 24, 2026 at 6:32 PM Barry Song <baohua@kernel.org> wrote:
>
> On Fri, Apr 24, 2026 at 1:43 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > This series cleans up and slightly improves MGLRU's reclaim loop and
> > dirty writeback handling. As a result, we can see an up to ~30% increase
> > in some workloads like MongoDB with YCSB and a huge decrease in file
> > refault, no swap involved. Other common benchmarks have no regression,
> > and LOC is reduced, with less unexpected OOM, too.
> >
> > [...]
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
> Hi Kairui,
>
> I haven't identified the exact commit, but this patchset seems to
> make MGLRU's swappiness behavior more erratic.
>
> In mainline, MGLRU does not show as strong an effect as the
> active/inactive LRU, but it still behaves roughly linearly: higher
> swappiness leads to more swap activity and fewer file refaults.
>
> With this patchset, however, the behavior becomes non-monotonic as
> swappiness increases. I observed clear up-and-down fluctuations.
>
> I reproduced this by running a kernel build in a memcg limited to
> 1GB, with swappiness set to 35, 70, 105, 140, and 175.
> > this is mainline using MGLRU: > > *** Executing round 1 *** > set swappiness to 35 > > real 1m49.247s > user 25m30.484s > sys 3m37.203s > pswpin: 933731 > pswpout: 3365968 > pgpgin: 5649320 > pgpgout: 13786572 > swpout_zero: 794960 > swpin_zero: 10594 > refault_file: 354998 > refault_anon: 944323 > > *** Executing round 2 *** > set swappiness to 70 > > real 1m49.313s > user 25m31.643s > sys 3m40.661s > pswpin: 1049052 > pswpout: 3565887 > pgpgin: 5694288 > pgpgout: 14582200 > swpout_zero: 840947 > swpin_zero: 12029 > refault_file: 242973 > refault_anon: 1061033 > > *** Executing round 3 *** > set swappiness to 105 > > real 1m48.611s > user 25m32.198s > sys 3m37.210s > pswpin: 981095 > pswpout: 3396069 > pgpgin: 5283940 > pgpgout: 13898988 > swpout_zero: 795932 > swpin_zero: 11249 > refault_file: 202432 > refault_anon: 992295 > > *** Executing round 4 *** > set swappiness to 140 > > real 1m49.398s > user 25m35.650s > sys 3m50.656s > pswpin: 1222881 > pswpout: 3935186 > pgpgin: 6165024 > pgpgout: 16056664 > swpout_zero: 913808 > swpin_zero: 13251 > refault_file: 191564 > refault_anon: 1236083 > > *** Executing round 5 *** > set swappiness to 175 > > real 1m49.513s > user 25m35.442s > sys 3m55.869s > pswpin: 1343139 > pswpout: 4256014 > pgpgin: 6557152 > pgpgout: 17341452 > swpout_zero: 998107 > swpin_zero: 15692 > refault_file: 175795 > refault_anon: 1358782 > > this is mm-new using MGLRU: > > *** Executing round 1 *** > set swappiness to 35 > > real 1m51.804s > user 25m38.070s > sys 4m16.301s > pswpin: 1587728 > pswpout: 4932011 > pgpgin: 8788688 > pgpgout: 20062761 > swpout_zero: 1129975 > swpin_zero: 17944 > refault_file: 487923 > refault_anon: 1605670 > > *** Executing round 2 *** > set swappiness to 70 > > real 1m51.503s > user 25m37.581s > sys 4m18.161s > pswpin: 1743890 > pswpout: 5214587 > pgpgin: 8676728 > pgpgout: 21178716 > swpout_zero: 1185453 > swpin_zero: 20016 > refault_file: 317993 > refault_anon: 1763904 > > *** Executing round 3 *** > set 
swappiness to 105 > > real 1m51.154s > user 25m37.956s > sys 4m15.017s > pswpin: 1687517 > pswpout: 5073825 > pgpgin: 8173036 > pgpgout: 20608932 > swpout_zero: 1161806 > swpin_zero: 20069 > refault_file: 249769 > refault_anon: 1707538 > > *** Executing round 4 *** > set swappiness to 140 > > real 1m50.732s > user 25m37.686s > sys 4m16.066s > pswpin: 1671678 > pswpout: 5118895 > pgpgin: 7929960 > pgpgout: 20790468 > swpout_zero: 1171029 > swpin_zero: 19596 > refault_file: 193421 > refault_anon: 1691228 > > *** Executing round 5 *** > set swappiness to 175 > > real 1m49.518s > user 25m37.653s > sys 4m12.619s > pswpin: 1506888 > pswpout: 4789793 > pgpgin: 7270448 > pgpgout: 19479188 > swpout_zero: 1119251 > swpin_zero: 16699 > refault_file: 187304 > refault_anon: 1523585 > > The final one is classic active/inactive LRU: > > *** Executing round 1 *** > set swappiness to 35 > > real 1m50.038s > user 25m21.911s > sys 3m42.798s > pswpin: 476994 > pswpout: 2258185 > pgpgin: 5247280 > pgpgout: 9354640 > swpout_zero: 684759 > swpin_zero: 6387 > refault_file: 750021 > refault_anon: 483334 > > *** Executing round 2 *** > set swappiness to 70 > > real 1m48.781s > user 25m25.682s > sys 3m37.854s > pswpin: 515470 > pswpout: 2306901 > pgpgin: 4265500 > pgpgout: 9547436 > swpout_zero: 706437 > swpin_zero: 6960 > refault_file: 459740 > refault_anon: 522381 > > *** Executing round 3 *** > set swappiness to 105 > > real 1m48.233s > user 25m26.623s > sys 3m38.843s > pswpin: 519540 > pswpout: 2343897 > pgpgin: 3628788 > pgpgout: 9696500 > swpout_zero: 743576 > swpin_zero: 7782 > refault_file: 303701 > refault_anon: 527273 > > *** Executing round 4 *** > set swappiness to 140 > > real 1m48.800s > user 25m32.067s > sys 3m50.751s > pswpin: 605537 > pswpout: 2615227 > pgpgin: 3470540 > pgpgout: 10776312 > swpout_zero: 825446 > swpin_zero: 9055 > refault_file: 173236 > refault_anon: 614544 > > *** Executing round 5 *** > set swappiness to 175 > > real 1m52.356s > user 25m29.727s > sys 
3m55.664s > pswpin: 698228 > pswpout: 2908292 > pgpgin: 3602884 > pgpgout: 11945332 > swpout_zero: 912127 > swpin_zero: 10298 > refault_file: 117625 > refault_anon: 708478 > > > The build script is available here if you want to have a try: > > https://git.kernel.org/pub/scm/linux/kernel/git/baohua/linux.git/diff/tools/mm/build-kernel-with-increasing-swappiness.sh?h=zram-async-gc&id=d47888e9 > > I am also debugging this. One possibility is that placing > dirty pages in the youngest generation may have affected > lruvec_evictable_size()? I reverted the six commits below, but swappiness behavior is still very unusual on mm-new. 4ce85c040e0a mm/vmscan: unify writeback reclaim statistic and throttling f80a81552f50 mm/vmscan: remove sc->unqueued_dirty 9381a541a759 mm/vmscan: remove sc->file_taken f2e2a7ae7660 mm/mglru: remove no longer used reclaim argument for folio protection b052c4a752a5 mm/mglru: simplify and improve dirty writeback handling 831409284da1 mm/mglru: use the common routine for dirty/writeback reactivation After reverting patch 9-14: *** Executing round 1 *** set swappiness to 35 real 2m6.982s user 24m59.930s sys 9m1.374s pswpin: 1973368 pswpout: 4792167 pgpgin: 12471516 pgpgout: 19490361 swpout_zero: 992543 swpin_zero: 48166 refault_file: 1002114 refault_anon: 2021486 *** Executing round 2 *** set swappiness to 70 real 1m56.011s user 25m24.954s sys 5m31.730s pswpin: 1788750 pswpout: 4869145 pgpgin: 9745888 pgpgout: 19799848 swpout_zero: 1009680 swpin_zero: 35920 refault_file: 540060 refault_anon: 1824622 *** Executing round 3 *** set swappiness to 105 real 1m52.184s user 25m29.605s sys 5m19.031s pswpin: 1894596 pswpout: 5220326 pgpgin: 9844536 pgpgout: 21251668 swpout_zero: 1107839 swpin_zero: 33253 refault_file: 453966 refault_anon: 1927801 *** Executing round 4 *** set swappiness to 140 real 1m56.725s user 25m26.667s sys 6m7.878s pswpin: 2366033 pswpout: 5584223 pgpgin: 11962872 pgpgout: 22660564 swpout_zero: 1167419 swpin_zero: 56513 refault_file: 
442744 refault_anon: 2422500 *** Executing round 5 *** set swappiness to 175 real 2m16.219s user 24m32.728s sys 12m26.124s pswpin: 1990093 pswpout: 4568372 pgpgin: 13571748 pgpgout: 18604592 swpout_zero: 977963 swpin_zero: 52072 refault_file: 1289471 refault_anon: 2042117 So it is likely caused by an earlier commit than the six above. I need to get some sleep. Could this be because get_nr_to_scan() was moved out of the loop by [PATCH v6 04/14] mm/mglru: restructure the reclaim loop, while in mainline it is re-evaluated in each iteration? Will take a look tomorrow or the day after. Thanks Barry
On Fri, Apr 24, 2026 at 7:58 PM Barry Song <baohua@kernel.org> wrote: > Could this be because get_nr_to_scan() was moved out of the loop by > [PATCH v6 04/14] mm/mglru: restructure the reclaim loop, > while in mainline it is re-evaluated in each iteration? > > Will take a look tomorrow or the day after. > > Thanks > Barry > Hi Barry, I ran your test script a few times, and strangely I can't reproduce it. Swapniess behaves similarly after or before this series. I directly checked out the mm-new commit of this series (4ce85c040e0a) and compare to the mm-new commit right before this series (31a112f05f62). I also extended your script a bit to test more swappiness: Before: uname -a Linux localhost 7.0.0.mm-new-g31a112f05f62 #305 SMP PREEMPT_DYNAMIC Fri Apr 24 19:07:30 CST 2026 x86_64 GNU/Linux *** Executing swappiness 5 *** set swappiness to 5 Running as unit: run-p6675-i151640.scope; invocation ID: e8b79eab2854418b81c2ad62c3d121c4 real 1m50.652s user 38m58.129s sys 21m41.763s pswpin: 15917615 pswpout: 41199479 pgpgin: 76113864 pgpgout: 165640720 swpout_zero: 1546131 swpin_zero: 285359 refault_file: 1648508 refault_anon: 15829347 *** Executing swappiness 35 *** set swappiness to 35 Running as unit: run-p44771-i189005.scope; invocation ID: 562deebdc09647a290a2165870295ef7 real 1m50.623s user 38m50.505s sys 21m45.217s pswpin: 15915841 pswpout: 41297235 pgpgin: 75425648 pgpgout: 166025592 swpout_zero: 1546865 swpin_zero: 277756 refault_file: 1546926 refault_anon: 15778795 *** Executing swappiness 70 *** set swappiness to 70 Running as unit: run-p82859-i1819.scope; invocation ID: 82aead8ff78742bba6bf401a6534588e real 1m49.605s user 38m30.262s sys 21m49.417s pswpin: 16088273 pswpout: 41592875 pgpgin: 71469324 pgpgout: 167370708 swpout_zero: 1657002 swpin_zero: 319600 refault_file: 857694 refault_anon: 16175619 *** Executing swappiness 105 *** set swappiness to 105 Running as unit: run-p120948-i250916.scope; invocation ID: e7bc4e739cab4e8bbc5efb06b50ead14 real 1m48.951s user 
38m30.080s sys 21m48.815s pswpin: 16265236 pswpout: 41884071 pgpgin: 70923620 pgpgout: 168654260 swpout_zero: 1649763 swpin_zero: 316386 refault_file: 691818 refault_anon: 16368960 *** Executing swappiness 140 *** set swappiness to 140 Running as unit: run-p159035-i42797.scope; invocation ID: ea31084e41bf4c7b9427800bf78500fb real 1m49.596s user 38m32.881s sys 21m50.914s pswpin: 16434435 pswpout: 42315789 pgpgin: 70803660 pgpgout: 170344784 swpout_zero: 1659821 swpin_zero: 314116 refault_file: 563047 refault_anon: 16557648 *** Executing swappiness 175 *** set swappiness to 175 Running as unit: run-p197122-i325303.scope; invocation ID: 181af70e74b74bff901e10ed97d32c50 real 1m49.091s user 38m31.101s sys 21m56.002s pswpin: 16470200 pswpout: 42370706 pgpgin: 70388644 pgpgout: 170452728 swpout_zero: 1689922 swpin_zero: 330216 refault_file: 487950 refault_anon: 16613124 *** Executing swappiness 200 *** set swappiness to 200 Running as unit: run-p235206-i68603.scope; invocation ID: d6af085ecd9a47a88ddecfb326af4d58 real 1m49.458s user 38m37.696s sys 22m7.194s pswpin: 16473742 pswpout: 42413098 pgpgin: 70351964 pgpgout: 170454504 swpout_zero: 1653981 swpin_zero: 316654 refault_file: 539774 refault_anon: 16620315 After: uname -a Linux localhost 7.0.0.ptch-g4ce85c040e0a #2733 SMP PREEMPT_DYNAMIC Fri Apr 24 19:07:08 CST 2026 x86_64 GNU/Linux *** Executing swappiness 5 *** set swappiness to 5 Running as unit: run-p10510-i123098.scope; invocation ID: 7773cea9690140378786e496d7bf0523 real 1m50.042s user 38m59.555s sys 21m25.018s pswpin: 15913149 pswpout: 41183479 pgpgin: 76136184 pgpgout: 165910472 swpout_zero: 1557330 swpin_zero: 282842 refault_file: 1599524 refault_anon: 15876245 *** Executing swappiness 35 *** set swappiness to 35 Running as unit: run-p48606-i45593.scope; invocation ID: 66799f770f0944558b258fa42a62cd28 real 1m50.479s user 38m59.363s sys 21m27.155s pswpin: 15865488 pswpout: 41087641 pgpgin: 75103236 pgpgout: 165212060 swpout_zero: 1557445 swpin_zero: 287809 
refault_file: 1421868 refault_anon: 15790633 *** Executing swappiness 70 *** set swappiness to 70 Running as unit: run-p86689-i246403.scope; invocation ID: ac7303701c524901a64c9a5b945f9e20 real 1m49.409s user 38m36.037s sys 21m37.421s pswpin: 16198083 pswpout: 41876332 pgpgin: 71719832 pgpgout: 168490208 swpout_zero: 1622494 swpin_zero: 304799 refault_file: 847114 refault_anon: 16283181 *** Executing swappiness 105 *** set swappiness to 105 Running as unit: run-p124782-i205752.scope; invocation ID: 591727ffbc1d40c1b79a5a20567e1616 real 1m48.638s user 38m28.237s sys 21m31.590s pswpin: 16278124 pswpout: 41920058 pgpgin: 70591876 pgpgout: 168801512 swpout_zero: 1660539 swpin_zero: 321367 refault_file: 651376 refault_anon: 16400826 *** Executing swappiness 140 *** set swappiness to 140 Running as unit: run-p162871-i189610.scope; invocation ID: 3be1d782d9724eb79e7803eb72c47e4a real 1m48.602s user 38m31.708s sys 21m39.510s pswpin: 16376932 pswpout: 41921230 pgpgin: 70503000 pgpgout: 168528328 swpout_zero: 1684494 swpin_zero: 333271 refault_file: 524031 refault_anon: 16513067 *** Executing swappiness 175 *** set swappiness to 175 Running as unit: run-p200958-i133720.scope; invocation ID: 12da6771f02b485ea6cf0c6842bd5f73 real 1m48.905s user 38m28.887s sys 21m31.592s pswpin: 16535604 pswpout: 42244753 pgpgin: 70727968 pgpgout: 170088056 swpout_zero: 1675139 swpin_zero: 327897 refault_file: 489367 refault_anon: 16669662 *** Executing swappiness 200 *** set swappiness to 200 Running as unit: run-p239041-i341992.scope; invocation ID: 31ff32be856f41f4a3ad60ea7878a230 real 1m48.746s user 38m25.722s sys 21m50.843s pswpin: 16644335 pswpout: 42660348 pgpgin: 70911344 pgpgout: 171749444 swpout_zero: 1675432 swpin_zero: 321988 refault_file: 501717 refault_anon: 16775111 Since you mentioned it's mm-new vs mainline, and you have reverted part of this series and the problem is still there. Could it be related to something else in mm-new? 
I'll keep testing more stress and workloads to dig deeper too. Or maybe
the swappiness behavior just changed slightly, so it may perform better
or worse depending on timing and workload? Swappiness on MGLRU currently
only works as a factor for calculating the refault and reclaim balance
of anon / file, so it may behave a bit unpredictably. There isn't a
proportional calculation like active / inactive LRU. That's a problem
too, and we might fix that later.
On Fri, Apr 24, 2026 at 8:56 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Apr 24, 2026 at 7:58 PM Barry Song <baohua@kernel.org> wrote:
> > Could this be because get_nr_to_scan() was moved out of the loop by
> > [PATCH v6 04/14] mm/mglru: restructure the reclaim loop,
> > while in mainline it is re-evaluated in each iteration?
> >
> > Will take a look tomorrow or the day after.
> >
> > Thanks
> > Barry
> >
>
> Hi Barry,
>
> I ran your test script a few times, and strangely I can't reproduce
> it. Swappiness behaves similarly before and after this series. I
> directly checked out the mm-new commit of this series (4ce85c040e0a)
> and compared it to the mm-new commit right before this series
> (31a112f05f62). I also extended your script a bit to test more
> swappiness values:
Hi Kairui,
I reset the repository to commit 4ce85c040e0a using
git reset --hard, and I can still reproduce the
swappiness issue. My machine is:
barry@barry-desktop:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 5
CPU max MHz: 2800.0000
CPU min MHz: 800.0000
BogoMIPS: 5599.85
swap is zRAM only:
barry@barry-desktop:~$ cat /proc/swaps
Filename Type Size Used Priority
/dev/zram0 partition 12582908 280940 5
The data is as below,
*** Executing round 1 ***
set swappiness to 35
real 1m51.699s
user 25m31.134s
sys 4m13.127s
pswpin: 1562949
pswpout: 4840525
pgpgin: 8751872
pgpgout: 19741097
swpout_zero: 1095783
swpin_zero: 18079
refault_file: 515292
refault_anon: 1580980
*** Executing round 2 ***
set swappiness to 70
real 1m51.603s
user 25m33.600s
sys 4m21.738s
pswpin: 1786413
pswpout: 5350804
pgpgin: 8833652
pgpgout: 21715596
swpout_zero: 1230981
swpin_zero: 21051
refault_file: 313099
refault_anon: 1807417
*** Executing round 3 ***
set swappiness to 105
real 1m50.315s
user 25m40.863s
sys 4m12.446s
pswpin: 1555289
pswpout: 4911737
pgpgin: 7597548
pgpgout: 19956948
swpout_zero: 1125969
swpin_zero: 17594
refault_file: 237475
refault_anon: 1572835
*** Executing round 4 ***
set swappiness to 140
real 1m50.992s
user 25m34.774s
sys 4m14.068s
pswpin: 1642575
pswpout: 5027730
pgpgin: 7937214
pgpgout: 20426400
swpout_zero: 1155712
swpin_zero: 20248
refault_file: 215237
refault_anon: 1662775
*** Executing round 5 ***
set swappiness to 175
real 1m50.207s
user 25m38.244s
sys 4m7.655s
pswpin: 1522633
pswpout: 4788104
pgpgin: 7307172
pgpgout: 19464984
swpout_zero: 1109281
swpin_zero: 18085
refault_file: 186203
refault_anon: 1540669
I disabled turbo for the test:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
My bisect shows that the commit causing the swappiness issue is:
[PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
Before that, swappiness behaves as expected, and there is
also less swap-out/in activity (and much shorter sys time).
*** Executing round 1 ***
set swappiness to 35
real 1m49.406s
user 25m28.458s
sys 3m41.098s
pswpin: 984605
pswpout: 3329809
pgpgin: 5985696
pgpgout: 13648560
swpout_zero: 780136
swpin_zero: 11379
refault_file: 367629
refault_anon: 995982
*** Executing round 2 ***
set swappiness to 70
real 1m48.577s
user 25m34.994s
sys 3m42.694s
pswpin: 985650
pswpout: 3450097
pgpgin: 5468828
pgpgout: 14116020
swpout_zero: 820143
swpin_zero: 11808
refault_file: 245353
refault_anon: 997410
*** Executing round 3 ***
set swappiness to 105
real 1m49.262s
user 25m34.871s
sys 3m41.633s
pswpin: 998178
pswpout: 3553741
pgpgin: 5328896
pgpgout: 14535068
swpout_zero: 840706
swpin_zero: 10393
refault_file: 205514
refault_anon: 1008525
*** Executing round 4 ***
set swappiness to 140
real 1m49.417s
user 25m35.395s
sys 3m47.169s
pswpin: 1138043
pswpout: 3756034
pgpgin: 5807584
pgpgout: 15345816
swpout_zero: 884539
swpin_zero: 12652
refault_file: 185767
refault_anon: 1150649
*** Executing round 5 ***
set swappiness to 175
real 1m49.654s
user 25m35.244s
sys 3m53.330s
pswpin: 1235427
pswpout: 4058085
pgpgin: 6108792
pgpgout: 16547764
swpout_zero: 974086
swpin_zero: 14280
refault_file: 170452
refault_anon: 1249705
It’s too late today; I’ll continue debugging tomorrow.
[...]
> Since you mentioned it's mm-new vs mainline, and you have reverted
> part of this series and the problem is still there, could it be
> related to something else in mm-new? I'll keep testing more stress and
> workloads to dig deeper too. Or maybe the swappiness behavior just
> changed slightly, so it may perform better or worse depending on
> timing and workload? Swappiness on MGLRU currently only works as a
> factor for calculating the refault and reclaim balance of anon / file,
> so it may behave a bit unpredictably. There isn't a proportional
> calculation like active / inactive LRU. That's a problem too, and we
> might fix that later.
read_ctrl_pos() should also bias towards swappiness, as
both sp and pv gains are affected by it. Yes, we need to
fix the swappiness for mglru.
Best Regards
Barry
On Sat, Apr 25, 2026 at 8:18 PM Barry Song <baohua@kernel.org> wrote:
>
> On Fri, Apr 24, 2026 at 8:56 PM Kairui Song <ryncsn@gmail.com> wrote:
> > Hi Barry,
> >
> > I ran your test script a few times, and strangely I can't reproduce
> > it. Swappiness behaves similarly before and after this series. I
> > directly checked out the mm-new commit of this series (4ce85c040e0a)
> > and compared it to the mm-new commit right before this series
> > (31a112f05f62). I also extended your script a bit to test more
> > swappiness values:
>
> Hi Kairui,
> I reset the repository to commit 4ce85c040e0a using
> git reset --hard, and I can still reproduce the
> swappiness issue. My machine is:
>
> barry@barry-desktop:~$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 39 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 20
> On-line CPU(s) list: 0-19
> Vendor ID: GenuineIntel
> Model name: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz
> CPU family: 6
> Model: 165
> Thread(s) per core: 2
> Core(s) per socket: 10
> Socket(s): 1
> Stepping: 5
> CPU max MHz: 2800.0000
> CPU min MHz: 800.0000
> BogoMIPS: 5599.85
>
>
> swap is zRAM only:
> barry@barry-desktop:~$ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/zram0 partition 12582908 280940 5
>
> The data is as below,
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m51.699s
> user 25m31.134s
> sys 4m13.127s
> pswpin: 1562949
> pswpout: 4840525
> pgpgin: 8751872
> pgpgout: 19741097
> swpout_zero: 1095783
> swpin_zero: 18079
> refault_file: 515292
> refault_anon: 1580980
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m51.603s
> user 25m33.600s
> sys 4m21.738s
> pswpin: 1786413
> pswpout: 5350804
> pgpgin: 8833652
> pgpgout: 21715596
> swpout_zero: 1230981
> swpin_zero: 21051
> refault_file: 313099
> refault_anon: 1807417
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m50.315s
> user 25m40.863s
> sys 4m12.446s
> pswpin: 1555289
> pswpout: 4911737
> pgpgin: 7597548
> pgpgout: 19956948
> swpout_zero: 1125969
> swpin_zero: 17594
> refault_file: 237475
> refault_anon: 1572835
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m50.992s
> user 25m34.774s
> sys 4m14.068s
> pswpin: 1642575
> pswpout: 5027730
> pgpgin: 7937214
> pgpgout: 20426400
> swpout_zero: 1155712
> swpin_zero: 20248
> refault_file: 215237
> refault_anon: 1662775
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m50.207s
> user 25m38.244s
> sys 4m7.655s
> pswpin: 1522633
> pswpout: 4788104
> pgpgin: 7307172
> pgpgout: 19464984
> swpout_zero: 1109281
> swpin_zero: 18085
> refault_file: 186203
> refault_anon: 1540669
Hmm, but reading the result you just posted, isn't swappiness actually
working as expected? Here is the data you just posted:
swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669
The higher the swappiness, the lower the file refault.
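To make the trend check concrete, here is a minimal sketch (not part of the series) that tests whether a counter series falls as swappiness rises. The helper name is_non_increasing() and the array names are made up for illustration; the values are the refault_file and refault_anon counters from the five rounds at swappiness 35, 70, 105, 140 and 175 posted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if vals[] never increases from one entry to the next. */
bool is_non_increasing(const unsigned long *vals, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (vals[i] > vals[i - 1])
			return false;
	return true;
}

/* Counters after the bisected commit, at swappiness 35, 70, 105, 140, 175. */
const unsigned long refault_file_after[] = {
	515292, 313099, 237475, 215237, 186203,
};
const unsigned long refault_anon_after[] = {
	1580980, 1807417, 1572835, 1662775, 1540669,
};
```

With these inputs the check holds for refault_file and fails for refault_anon: file refaults follow swappiness, while anon refaults move up and down.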
> My bisect shows that the commit causing the swappiness issue is:
>
> [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
>
> Before that, swappiness behaves as expected, and there is
> also less swap-out/in activity (and much shorter sys time).
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m49.406s
> user 25m28.458s
> sys 3m41.098s
> pswpin: 984605
> pswpout: 3329809
> pgpgin: 5985696
> pgpgout: 13648560
> swpout_zero: 780136
> swpin_zero: 11379
> refault_file: 367629
> refault_anon: 995982
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m48.577s
> user 25m34.994s
> sys 3m42.694s
> pswpin: 985650
> pswpout: 3450097
> pgpgin: 5468828
> pgpgout: 14116020
> swpout_zero: 820143
> swpin_zero: 11808
> refault_file: 245353
> refault_anon: 997410
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m49.262s
> user 25m34.871s
> sys 3m41.633s
> pswpin: 998178
> pswpout: 3553741
> pgpgin: 5328896
> pgpgout: 14535068
> swpout_zero: 840706
> swpin_zero: 10393
> refault_file: 205514
> refault_anon: 1008525
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m49.417s
> user 25m35.395s
> sys 3m47.169s
> pswpin: 1138043
> pswpout: 3756034
> pgpgin: 5807584
> pgpgout: 15345816
> swpout_zero: 884539
> swpin_zero: 12652
> refault_file: 185767
> refault_anon: 1150649
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m49.654s
> user 25m35.244s
> sys 3m53.330s
> pswpin: 1235427
> pswpout: 4058085
> pgpgin: 6108792
> pgpgout: 16547764
> swpout_zero: 974086
> swpin_zero: 14280
> refault_file: 170452
> refault_anon: 1249705
>
> It’s too late today; I’ll continue debugging tomorrow.
Checking the data you just posted before that commit:
swappiness: 35 refault_file/anon: 367629 995982
swappiness: 70 refault_file/anon: 245353 997410
swappiness: 105 refault_file/anon: 205514 1008525
swappiness: 140 refault_file/anon: 185767 1150649
swappiness: 175 refault_file/anon: 170452 1249705
And after that commit:
swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669
So I think the problem is not swappiness itself, but that there are
more anon refaults after that commit.
Before (swappiness 105):
pswpin: 998178
pswpout: 3553741
pgpgin: 5328896
pgpgout: 14535068
After (swappiness 105):
pswpin: 1555289
pswpout: 4911737
pgpgin: 7597548
pgpgout: 19956948
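For a rough sense of scale, the relative change between the two sets of counters can be computed with a small helper. This is only an illustrative sketch; pct_change() is a made-up name and not something from the series or the test script.

```c
/* Hypothetical helper: relative change from 'before' to 'after', in percent. */
double pct_change(unsigned long before, unsigned long after)
{
	return ((double)after - (double)before) * 100.0 / (double)before;
}
```

Applied to the numbers above, this gives roughly +56% for pswpin, +38% for pswpout, +43% for pgpgin and +37% for pgpgout.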
I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
series, after this series) X 3 different memcg configs (-j96 3G, -j48
2G, -j24 1G), and none of these showed any regression, only
improvement. That's really odd.
One possibility is that I removed this check:
	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS >
	    lrugen->max_seq)
		scanned = 0;
Removing it makes the reclaim loop go further and trigger aging.
Previously, if reclaim drained the LRU's cold gens, it might go reclaim
slab instead. So idle inodes would be dropped with the mapping, which
reclaims more file, and we won't see any refault data from that since
the mapping itself is gone. Sys would be lower too, as IO isn't counted
as sys. Checking your data, although sys is higher, real is actually
lower, which matches my guess.
Will the following patch help? I'm not sure if this is the problem,
but it adds back that early abort. Personally I don't think this
really makes much sense, as it's more like a workaround for other
issues, but if it helps we had better keep it.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..c1e7c65ff3b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4837,6 +4837,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int scanned, reclaimed;
 	int isolated = 0, type, type_scanned;
 	bool skip_retry = false;
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

@@ -4852,6 +4853,10 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (scanned)
 		try_to_inc_min_seq(lruvec, swappiness);

+	/* Out of cold folios, return 0 to abort early and also trigger shrinkers beside LRU */
+	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
+		scanned = 0;
+
 	lruvec_unlock_irq(lruvec);

 	if (list_empty(&list))
And this could cause early OOM; we have observed that several times
due to the early return. So maybe we had better check sc->priority
too, or move this to should_abort_scan?
Or perhaps we should just restore the behavior of never running aging
at DEF_PRIORITY, which seems better and safer, like below:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..2080522ea924 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4917,14 +4917,14 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 {
 	DEFINE_MIN_SEQ(lruvec);

-	/* have to run aging, since eviction is not possible anymore */
-	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
-		return true;
-
 	/* try to avoid aging, do gentle reclaim at the default priority */
 	if (sc->priority == DEF_PRIORITY)
 		return false;

+	/* have to run aging, since eviction is not possible anymore */
+	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
+		return true;
+
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
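To see what the reordering changes, here is a simplified userspace model of the two orderings. It is a sketch under assumptions, not the kernel function: evictable_min stands in for the result of evictable_min_seq(), the function names are invented, and MIN_NR_GENS = 2 and DEF_PRIORITY = 12 are assumed to match the kernel's definitions.

```c
#include <stdbool.h>

#define MIN_NR_GENS	2UL	/* assumed to match the kernel's value */
#define DEF_PRIORITY	12	/* assumed to match the kernel's value */

/* Model of the current ordering: the "eviction impossible" check comes first. */
bool should_run_aging_current(unsigned long evictable_min,
			      unsigned long max_seq, int priority)
{
	if (evictable_min + MIN_NR_GENS > max_seq)
		return true;
	if (priority == DEF_PRIORITY)
		return false;
	return evictable_min + MIN_NR_GENS == max_seq;
}

/* Model of the proposed ordering: never run aging at the default priority. */
bool should_run_aging_reordered(unsigned long evictable_min,
				unsigned long max_seq, int priority)
{
	if (priority == DEF_PRIORITY)
		return false;
	if (evictable_min + MIN_NR_GENS > max_seq)
		return true;
	return evictable_min + MIN_NR_GENS == max_seq;
}
```

In this model the two versions differ only when eviction is no longer possible at DEF_PRIORITY: the current ordering runs aging, the reordered one does not.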
I'll keep testing with more filesystems and setups. What FS are you
using? This might be related to FS-side reclaim as well, if it's caused
by shrinker balance.
> [...]
> > Since you mentioned it's mm-new vs mainline, and you have reverted
> > part of this series and the problem is still there, could it be
> > related to something else in mm-new? I'll keep testing more stress and
> > workloads to dig deeper too. Or maybe the swappiness behavior just
> > changed slightly, so it may perform better or worse depending on
> > timing and workload? Swappiness on MGLRU currently only works as a
> > factor for calculating the refault and reclaim balance of anon / file,
> > so it may behave a bit unpredictably. There isn't a proportional
> > calculation like active / inactive LRU. That's a problem too, and we
> > might fix that later.
>
> read_ctrl_pos() should also bias towards swappiness, as
> both sp and pv gains are affected by it. Yes, we need to
> fix the swappiness for mglru.
Yes, read_ctrl_pos is the helper for calculating the refault and
reclaim balance that I was talking about.
On Sat, Apr 25, 2026 at 9:30 PM Kairui Song <ryncsn@gmail.com> wrote:
[...]
>
> I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
> series, after this series) X 3 different memcg configs (-j96 3G, -j48
> 2G, -j24 1G), and none of these showed any regression but all
> improvement. That's really odd.
>
> One possibility is that I removed the:
>
> if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS >
>     lrugen->max_seq)
>         scanned = 0;
>
> Which will make the reclaim loop go further and trigger aging.
> Previously if reclaim drained the LRU's cold gens, it may go reclaim
> slab instead. So idle inodes will be dropped with the mapping and
> reclaim more file, and we won't see any refault data from that since
> the mapping itself is gone. Sys will be lower too, as IO isn't counted
> as sys. Checking your data, despite sys is higher, real is actually
> lower, which matches my guess.
>
> Will the following patch help? I'm not sure if this is the problem,
> but this added back that early abort, personally I don't think this
> really makes much sense as it's more like a workaround for other
> issues, but if that helps we might better keep it.

Hi Kairui, after five hours of sleep I’m feeling much more
refreshed and should have identified the issue. I think the
problem will be clear once you look at the patch below.
Feel free to include it in the new version if you find it
helpful.

From e3a0b50dc53a3a5f2ef1adfb73111336e3c2d08b Mon Sep 17 00:00:00 2001
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
Date: Sun, 26 Apr 2026 08:34:21 +1200
Subject: [PATCH] mm/mglru: Avoid falling back to another type when
 scan_folios() leaves no remaining pages

While remaining reaches 0 in scan_folios(), we quickly fall back
to the other type in isolate_folios(). This is incorrect, as the
current type may still have sufficient folios. Falling back can
undermine the positive_ctrl_err() result from get_type_to_scan(),
which is derived from swappiness.

A simple fix is to continue scanning this type for another round.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/vmscan.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef45f3a4aa38..169fbbe17c7b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			*isolate_scanned = scanned;
 			break;
 		}
-
-		type = !type;
+		/*
+		 * If scanned > 0 and isolated == 0, avoid falling back to the
+		 * other type, as this type remains sufficient. Falling back
+		 * too readily can disrupt the positive_ctrl_err() bias.
+		 */
+		if (!scanned)
+			type = !type;
 	}
 
 	return total_scanned;
-- 
2.34.1

Thanks
Barry
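To make the effect of the one-line change in the patch above easier to see, here is a small userspace toy model of the fallback decision. It is a sketch, not the kernel code. It assumes only what the patch shows: the loop stops once something is isolated, and it used to flip `type` unconditionally. The names `toy_scan_result`, `toy_isolate()`, `NR_ROUNDS` and the scripted per-pass results are hypothetical and exist only for illustration.

```c
#include <assert.h>

enum { TYPE_ANON = 0, TYPE_FILE = 1, NR_TYPES = 2, NR_ROUNDS = 2 };

/* Hypothetical outcome of one scan pass over a single type. */
struct toy_scan_result {
	int scanned;	/* folios looked at in this pass */
	int isolated;	/* folios actually isolated in this pass */
};

/*
 * Toy version of the type-selection loop. script[type][n] is what the
 * n-th pass over that type returns. If stay_if_scanned is nonzero, the
 * loop switches to the other type only when a pass scanned nothing,
 * mirroring "if (!scanned) type = !type;". Otherwise it always switches,
 * mirroring the unconditional "type = !type;".
 *
 * Returns the type that produced isolated folios, or -1 if none did.
 */
static int toy_isolate(const struct toy_scan_result script[NR_TYPES][NR_ROUNDS],
		       int preferred, int stay_if_scanned)
{
	int passes[NR_TYPES] = { 0, 0 };
	int type = preferred;
	int i;

	assert(preferred == TYPE_ANON || preferred == TYPE_FILE);

	for (i = 0; i < NR_ROUNDS; i++) {
		struct toy_scan_result r = script[type][passes[type]++];

		if (r.isolated)
			return type;
		if (!stay_if_scanned || !r.scanned)
			type = !type;
	}
	return -1;
}
```

Suppose the first pass over the preferred file type scans 32 folios but isolates none, and a second pass would isolate some. The unconditional flip then ends up taking folios from anon, while the patched rule stays on file. If the first file pass scans nothing at all, both variants fall back to anon.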
On Sun, Apr 26, 2026 at 4:58 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> [...]
>
> Hi Kairui, after five hours of sleep I’m feeling much more
> refreshed and should have identified the issue. I think the
> problem will be clear once you look at the patch below.
> Feel free to include it in the new version if you find it
> helpful.
>
> [...]
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			*isolate_scanned = scanned;
>  			break;
>  		}
> -
> -		type = !type;
> +		/*
> +		 * If scanned > 0 and isolated == 0, avoid falling back to the
> +		 * other type, as this type remains sufficient. Falling back
> +		 * too readily can disrupt the positive_ctrl_err() bias.
> +		 */
> +		if (!scanned)
> +			type = !type;
>  	}

Thanks so much for catching this! The fix makes perfect sense; it
restores the pre-patch behavior. I was too aggressive with the
fallback; positive_ctrl_err() should win whenever the preferred type
is actually productive.

Would you prefer I squash this into the original patch (with a
Co-developed-by for you?), or keep it as a standalone fix on top? I'm
fine either way.

One related thought for a follow-up: scanned == 0 from scan_folios()
is essentially driven by get_nr_gens(lruvec, type) == MIN_NR_GENS. So
when the type get_type_to_scan() picked is the one that's
gen-exhausted, maybe the right response is also to age that type
rather than fall back. Right now should_run_aging() only looks at the
combined evictable_min_seq, so the ctrl_err preferred type can
silently stay starved while we evict from the other one. That could be
an old issue we can fix later.
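The starvation case described in the follow-up thought can be sketched with a toy model of the generation counters. This is an illustration only, under the assumptions stated in the message: a type whose generation count equals MIN_NR_GENS yields scanned == 0, and the aging decision looks only at the combined minimum sequence. The struct and helper names (`toy_lrugen`, `toy_nr_gens()`, `toy_type_exhausted()`, `toy_combined_wants_aging()`) are hypothetical, and both types are assumed to be evictable.

```c
#include <assert.h>

#define MIN_NR_GENS 2UL

enum { TYPE_ANON = 0, TYPE_FILE = 1, NR_TYPES = 2 };

/* Hypothetical, stripped-down stand-in for the generation counters. */
struct toy_lrugen {
	unsigned long max_seq;			/* shared by both types */
	unsigned long min_seq[NR_TYPES];	/* tracked per type */
};

/* Number of generations a type currently spans. */
static unsigned long toy_nr_gens(const struct toy_lrugen *g, int type)
{
	assert(g->max_seq >= g->min_seq[type]);
	return g->max_seq - g->min_seq[type] + 1;
}

/* Per the message above: such a type gives scanned == 0. */
static int toy_type_exhausted(const struct toy_lrugen *g, int type)
{
	return toy_nr_gens(g, type) == MIN_NR_GENS;
}

/* Combined check: only the older of the two min_seq values matters. */
static int toy_combined_wants_aging(const struct toy_lrugen *g)
{
	unsigned long min_seq = g->min_seq[TYPE_ANON] < g->min_seq[TYPE_FILE] ?
				g->min_seq[TYPE_ANON] : g->min_seq[TYPE_FILE];

	return min_seq + MIN_NR_GENS > g->max_seq;
}
```

With max_seq = 8, anon min_seq = 5 and file min_seq = 7, the file type is down to MIN_NR_GENS generations, yet the combined check does not ask for aging because anon still has four generations. Only a per-type check would notice that the preferred type is exhausted.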
On Sun, Apr 26, 2026 at 2:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sun, Apr 26, 2026 at 4:58 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
> >
> > [...]
>
> Thanks so much for catching this! The fix makes perfect sense; it
> restores the pre-patch behavior. I was too aggressive with the
> fallback; positive_ctrl_err() should win whenever the preferred type
> is actually productive.
>
> Would you prefer I squash this into the original patch (with a
> Co-developed-by for you?), or keep it as a standalone fix on top? I'm
> fine either way.

Either approach is fine; use whichever works better for organizing v7.
My boss would definitely be happier with the latter, as it would better
support and encourage more active engagement in upstream development.
It’s rare these days to find a boss like this who is open-minded and
insightful about upstream contributions, especially with everyone
moving toward AI. :-)

>
> One related thought for a follow-up: scanned == 0 from scan_folios()
> is essentially driven by get_nr_gens(lruvec, type) == MIN_NR_GENS. So
> when the type get_type_to_scan() picked is the one that's
> gen-exhausted, maybe the right response is also to age that type
> rather than fall back. Right now should_run_aging() only looks at the
> combined evictable_min_seq, so the ctrl_err preferred type can
> silently stay starved while we evict from the other one. That could be
> an old issue we can fix later.

Yep, I’ve been thinking about the same thing for quite a
few days. This might also help address swappiness. On the
other hand, it could lead to more aging, but it seems like
a necessary trade-off if we want a simple solution that
doesn’t require separate max_seq for file and anon to fix
swappiness for mglru.

I’m also trying another approach. For example, if the
number of folios in the old generation is too low relative
to swappiness, should_run_aging() could return true, similar
to Yu’s earliest patch as below, but with a swappiness bias.

+	/*
+	 * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1)
+	 * of the total number of pages for each generation. A reasonable range
+	 * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
+	 * aging cares about the upper bound of hot pages, while the eviction
+	 * cares about the lower bound of cold pages.
+	 */
+	if (young * MIN_NR_GENS > total)
+		return true;
+	if (old * (MIN_NR_GENS + 2) < total)
+		return true;

Hopefully, the above can resolve the problem before we have to
resort to separate max_seq, which would break the shared time
axis between file and anon.

Maybe I will send an RFC before LSF/MM/BPF if we have enough
time.

Thanks
Barry
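The two balance checks quoted in the message above can be tried out in isolation with a small userspace sketch. It assumes MIN_NR_GENS is 2 and that `young`, `old` and `total` are folio counts for a young generation, an old generation, and all generations together. The function name `toy_should_run_aging()` is hypothetical, and no swappiness bias is modeled because the message does not spell one out.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL

/*
 * Toy version of the two quoted checks. Returns true when the
 * distribution of folios across generations looks unbalanced enough
 * that aging should run.
 */
static bool toy_should_run_aging(unsigned long young, unsigned long old,
				 unsigned long total)
{
	assert(young + old <= total);

	/* Too many hot folios: young exceeds 1/MIN_NR_GENS of total. */
	if (young * MIN_NR_GENS > total)
		return true;
	/* Too few cold folios: old is below 1/(MIN_NR_GENS + 2) of total. */
	if (old * (MIN_NR_GENS + 2) < total)
		return true;
	return false;
}
```

For total = 1000, a young count of 600 trips the first check (1200 > 1000), an old count of 200 trips the second (800 < 1000), and young = 400 with old = 300 trips neither. A swappiness bias would presumably scale one of these thresholds, but how is left open here.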
On Sun, Apr 26, 2026 at 4:35 PM Barry Song <baohua@kernel.org> wrote:
>
> Either approach is fine; use whichever works better for organizing v7.
> My boss would definitely be happier with the latter, as it would better
> support and encourage more active engagement in upstream development.
> It’s rare these days to find a boss like this who is open-minded and
> insightful about upstream contributions, especially with everyone
> moving toward AI. :-)

Will attach your patch as a standalone one right next to patch 5 then.
The effect is not easy to hit I think, and it shouldn't break any bisect.

I just ran a set of tests in case it has any unexpected behavior, and
it seems identical to what we have in V5 with all my test cases:

FIO:
Before:                                      8968.76 MB/s
V5 (this series):                            9005.89 MB/s
V6 (with your fix and nitpick from sashiko): 8995.63 MB/s

Build kernel:
Before:                                      2873.52s
V5 (this series):                            2816.44s
V6 (with your fix and nitpick from sashiko): 2811.88s

MySQL:
Before:                                      17303.414444 tps
V5 (this series):                            17310.528750 tps
V6 (with your fix and nitpick from sashiko): 17291.500000 tps

MongoDB with YCSB workload b:
Before:                                      59227.33 ops/sec
V6:                                          79928.53 ops/sec

Still matches the V5 cover letter and everything looks great, also seems
alright on my android phone. Will send V6 later.

> [...]
>
> Hopefully, the above can resolve the problem before we have to
> resort to separate max_seq, which would break the shared time
> axis between file and anon.
>
> Maybe I will send an RFC before LSF/MM/BPF if we have enough
> time.

Good idea, that can also be discussed there.