[PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling

Kairui Song via B4 Relay posted 15 patches 1 month, 2 weeks ago
mm/vmscan.c | 341 ++++++++++++++++++++++++++----------------------------------
1 file changed, 149 insertions(+), 192 deletions(-)
Posted by Kairui Song via B4 Relay 1 month, 2 weeks ago
From: Kairui Song <kasong@tencent.com>

This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, throughput improves by up to ~30%
in some workloads, such as MongoDB with YCSB, with a large drop in file
refaults and no swap involved. Other common benchmarks show no
regression, LOC is reduced, and unexpected OOMs are less frequent.

Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.

This series addresses these issues with a scan budget: the number of
folios to scan is calculated at the beginning of the loop, and aging is
decoupled from the reclaim calculation helpers. The dirty flush logic is
then moved inside the reclaim loop so it can kick in more effectively.
Because the issues are related, the series handles them together.

Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVME as storage.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
the WiredTiger cache size is set to 4.5G, using NVME as storage.

Not using SWAP.

Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071

After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923                       (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366       (-43.3%, lower is better)

We can see a significant performance improvement after this series.
The test is done on NVME and the performance gap would be even larger
for slow devices, such as HDD or network storage. We observed over
100% gain for some workloads with slow IO.
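
For anyone re-checking the percentages above, they can be re-derived
from the raw counters. A minimal user-space sketch follows; pct_change()
and report() are illustrative helper names, not part of this series or
of YCSB:

```c
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print one comparison row, e.g. for throughput or pgpgin. */
static void report(const char *name, double before, double after)
{
	printf("%-24s %+6.1f%%\n", name, pct_change(before, after));
}
```

Calling report("Throughput(ops/sec)", 62485.02962831822,
79760.71784646061) or report("pgpgin", 159347462, 111093923) with the
Before/After values quoted above should give figures in line with the
annotated ones.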

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
nodes and 128G memory, using 256G ZRAM as swap and spawning 32 memcgs
and 64 workers:

Before:
Total requests:            79915
Per-worker 95% CI (mean):  [1233.9, 1263.5]
Per-worker stdev:          59.2
Jain's fairness:           0.997795 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26859   33.61%   33.61%
[1,2)s      7818    9.78%   43.39%
[2,4)s      5532    6.92%   50.31%
[4,8)s     39706   49.69%  100.00%

After:
Total requests:            81382
Per-worker 95% CI (mean):  [1241.9, 1301.3]
Per-worker stdev:          118.8
Jain's fairness:           0.991480 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26696   32.80%   32.80%
[1,2)s      8745   10.75%   43.55%
[2,4)s      6865    8.44%   51.98%
[4,8)s     39076   48.02%  100.00%

Reclaim is still fair and effective, and the total number of requests
seems slightly better.
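
As a reminder of what the "Jain's fairness" rows measure, the index is
(sum x)^2 / (n * sum x^2) over the per-worker request counts. A minimal
sketch, assuming the per-worker totals are available as an array
(jain_index() is an illustrative name, not taken from the test script):

```c
#include <stddef.h>

/*
 * Jain's fairness index: (sum x)^2 / (n * sum x^2).
 * Yields 1.0 when every worker completed the same number of requests
 * and 1/n when a single worker completed all of them.
 */
static double jain_index(const double *x, size_t n)
{
	double sum = 0.0, sum_sq = 0.0;

	for (size_t i = 0; i < n; i++) {
		sum += x[i];
		sum_sq += x[i] * x[i];
	}

	/* Treat empty or all-zero input as perfectly fair. */
	if (n == 0 || sum_sq == 0.0)
		return 1.0;

	return (sum * sum) / ((double)n * sum_sq);
}
```

A value close to 1.0, as in both runs above, means the workers completed
nearly the same number of requests each.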

OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a cgroup
limit, as demonstrated and fixed by a later patch in this series.

The aging OOM is a bit trickier. A specific reproducer can be used to
simulate what we encountered in our production environment [4]: it
spawns multiple workers that keep reading the given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given size of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, which is 22G vs a 16G memcg limit).

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with the following info after about 10-20 iterations:
    [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
    [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [   62.640823] Memory cgroup stats for /demo:
    [   62.641017] anon 10604879872
    [   62.641941] file 6574858240

  OOM occurs even though there are still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

It is worth noting that another OOM-related issue was reported against
V1 of this series; it has been retested and looks OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap, with the test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

Before:            17303.41 tps
After this series: 17291.50 tps

Only noise-level changes, no regression.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
6 test runs each:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
  --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
  --rw=randread --norandommap --time_based \
  --ramp_time=1m --runtime=5m --group_reporting

Before:            8968.76 MB/s
After this series: 8995.63 MB/s

Again only noise-level changes: no regression, or slightly better.

Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg
using make -j96 and defconfig, measuring system time, 12 test runs each.

Before:            2873.52s
After this series: 2811.88s

Again only noise-level changes: no regression, or very slightly better.

Android:
========
Xinyu reported a performance gain on Android, too, with this series. The
test consisted of cold-starting multiple applications sequentially under
moderate system load. [6]

Before:
Launch Time Summary (all apps, all runs)
  Mean 868.0ms
  P50 888.0ms
  P90 1274.2ms
  P95 1399.0ms

After:
Launch Time Summary (all apps, all runs)
  Mean 850.5ms (-2.07%)
  P50 861.5ms  (-3.04%)
  P90 1179.0ms (-8.05%)
  P95 1228.0ms (-12.2%)

Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v7:
- Fix swappiness not being effective with a standalone fix patch
  from Barry Song. It's OK to be a standalone fix since that is not a
  major bug but an unexpected behavior change, and shouldn't affect any
  bisecting. I slightly adjusted the commit message as the subject was
  too long and was getting truncated in mail:
  https://lore.kernel.org/linux-mm/20260425205759.1701-1-baohua@kernel.org/
- Remove the min limit for calculating nr_to_scan:
  https://lore.kernel.org/linux-mm/aet1hd9DfRH4aSOO@KASONG-MC4/
  Instead just revert to V1:
  https://sashiko.dev/#/message/20260318-mglru-reclaim-v1-3-2c46f9eb0508%40tencent.com
  Everyone was fine with that. The min limit in later versions was
  introduced to address sashiko's review on V1, but on reflection that
  is not actually a bug and could instead be beneficial. The min check
  doesn't always make sense, and no practical issue has been observed.
- Retested; results still look very good in every case.
- Link to v6: https://patch.msgid.link/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com

Changes in v6:
- Avoid potential over rotation of tiny cgroup (<16M):
  https://lore.kernel.org/linux-mm/CAMgjq7ArnmmoHOGRt6Wc8hu7tjx_t583-UVzJK+HOHgjjetQ9g@mail.gmail.com/
- Avoid potentially skewed stat counter:
  https://lore.kernel.org/linux-mm/CAMgjq7DCn8p_yMMhiejFjX6sdybZKYOw8qJbq=+OCsZ=AfJnFA@mail.gmail.com/
- Update a few comments and variable names as suggested by Barry Song.
- Tested over days, also tested on my Android phone; everything still
  matches the cover letter description. Also add test results from Xinyu.
- Link to v5: https://patch.msgid.link/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com

Changes in v5:
- Add back a more moderate minimal batch limit for each reclaim loop:
  https://lore.kernel.org/linux-mm/adYP81AhpNf0znp3@KASONG-MC4/
- Collect review-by.
- Link to v4: https://patch.msgid.link/20260407-mglru-reclaim-v4-0-98cf3dc69519@tencent.com

Changes in v4:
- Remove the minimal scan batch limit, and add rotate for
  unevictable memcg as reported by sashiko:
  https://lore.kernel.org/linux-mm/ac8xVN82LBLDZpIO@KASONG-MC4/
- Slightly improve a few commit messages.
- Reran the tests; results seem identical to before, so data are unchanged.
- Collect review-by.
- Link to v3: https://patch.msgid.link/20260403-mglru-reclaim-v3-0-a285efd6ff91@tencent.com

Changes in v3:
- Don't force scan at least SWAP_CLUSTER_MAX pages for each reclaim
  loop. If the LRU is too small, adjust it accordingly. Now the
  multi-cgroup scan balance looked even better for tiny cgroups:
  https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap constraint check in isolate_folio. In
  theory, it's fine, and both stress test and performance test didn't
  show any issue:
  https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests, all seem identical, so most data is kept.
  Intermediate test results are dropped. I ran tests on most patches
  individually, and there is no problem, but the series is getting too
  long, and posting them would be unnecessary and make it harder to read.
- Split the previous patch 8 into two patches as suggested [ Shakeel Butt ],
  also some test results are collected to support the design:
  https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
  I kept Axel's review-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid stale empty gen and drop
  its return value [ Baolin Wang ]
- Move a few lines of code between patches to where they fit better,
  the final result is identical [ Baolin Wang ].
- Collect tested by and update test setup [ Leno Hou ]
- Collect review by.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com

Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
  [ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
  review suggested that we shouldn't leave the counter and reclaim
  feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX limit in patch
  "restructure the reclaim loop", the change is trivial but might
  help avoid livelock for tiny cgroups.
- Redo the tests just in case, since the series now also solves the
  throttling issue, as discussed with reports from CachyOS. Most
  results are basically identical to before.
- Add a separate patch for variable renaming as suggested by [ Barry
  Song ]. No feature change.
- Fix several comment and code issues [ Axel Rasmussen ].
- Remove no longer needed variable [ Axel Rasmussen ].
- Collect review by.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com

---
Barry Song (Xiaomi) (1):
      mm/mglru: avoid reclaim type fall back when isolation makes no progress

Kairui Song (14):
      mm/mglru: consolidate common code for retrieving evictable size
      mm/mglru: rename variables related to aging and rotation
      mm/mglru: relocate the LRU scan batch limit to callers
      mm/mglru: restructure the reclaim loop
      mm/mglru: scan and count the exact number of folios
      mm/mglru: use a smaller batch for reclaim
      mm/mglru: don't abort scan immediately right after aging
      mm/mglru: remove redundant swap constrained check upon isolation
      mm/mglru: use the common routine for dirty/writeback reactivation
      mm/mglru: simplify and improve dirty writeback handling
      mm/mglru: remove no longer used reclaim argument for folio protection
      mm/vmscan: remove sc->file_taken
      mm/vmscan: remove sc->unqueued_dirty
      mm/vmscan: unify writeback reclaim statistic and throttling

 mm/vmscan.c | 341 ++++++++++++++++++++++++++----------------------------------
 1 file changed, 149 insertions(+), 192 deletions(-)
---
base-commit: 22f2053a471467342c51eb2e4ffd7daf601118d2
change-id: 20260314-mglru-reclaim-1c9d45ac57f6

Best regards,
--  
Kairui Song <kasong@tencent.com>
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Andrew Morton 1 month, 2 weeks ago
On Tue, 28 Apr 2026 02:06:51 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:

> From: Kairui Song <kasong@tencent.com>
> 
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.

Thanks, I've updated mm.git's mm-new branch to this version.

> Changes in v7:
> - Fix swappiness not being effective with a standalone fix patch
>   from Barry Song. It's OK to be a standalone fix since that is not a
>   major bug but an unexpected behavior change, and shouldn't effect any
>   bisecting. I slightly adjusted the commit message as the subjcect is too
>   long and getting truncated for mail:
>   https://lore.kernel.org/linux-mm/20260425205759.1701-1-baohua@kernel.org/
> - Remove the min limit for calculating nr_to_scan:
>   https://lore.kernel.org/linux-mm/aet1hd9DfRH4aSOO@KASONG-MC4/
>   Instead just revert to V1:
>   https://sashiko.dev/#/message/20260318-mglru-reclaim-v1-3-2c46f9eb0508%40tencent.com
>   Everyone was fine with that, the min limit in later version was
>   introduced to cover sashiko's review on V1, but now think again, that's
>   actually not a bug and instead could be beneficial. This min
>   check doesn't always make sense and there isn't any practical issue observed.
> - Retest still looking very good in every case.

Here's how v7 altered mm.git.   (Looks small - did I mess this up?)


--- a/mm/vmscan.c~b
+++ a/mm/vmscan.c
@@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long
 			*isolate_scanned = scanned;
 			break;
 		}
-
-		type = !type;
+		/*
+		 * If scanned > 0 and isolated == 0, avoid falling back to the
+		 * other type, as this type remains sufficient. Falling back
+		 * too readily can disrupt the positive_ctrl_err() bias.
+		 */
+		if (!scanned)
+			type = !type;
 	}
 
 	return total_scanned;
@@ -4909,18 +4914,14 @@ static long get_nr_to_scan(struct lruvec
 	unsigned long nr_to_scan, evictable;
 
 	evictable = lruvec_evictable_size(lruvec, swappiness);
-	nr_to_scan = evictable;
 
 	/* try to scrape all its memory if this memcg was deleted */
 	if (!mem_cgroup_online(memcg))
-		return nr_to_scan;
+		return evictable;
 
-	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+	nr_to_scan = apply_proportional_protection(memcg, sc, evictable);
 	nr_to_scan >>= sc->priority;
 
-	if (!nr_to_scan && sc->priority < DEF_PRIORITY)
-		nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
-
 	return nr_to_scan;
 }
 
_
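
To make the get_nr_to_scan() hunk above concrete, here is a minimal
user-space sketch of the calculation with and without the removed
lines. It is an illustration under stated assumptions, not the kernel
code: apply_proportional_protection() is treated as a no-op (no memcg
protection), the offline-memcg path is ignored, and nr_to_scan_old() and
nr_to_scan_v7() are made-up names. The two constants mirror the kernel's
DEF_PRIORITY and SWAP_CLUSTER_MAX.

```c
#define DEF_PRIORITY      12    /* same value as the kernel's DEF_PRIORITY */
#define SWAP_CLUSTER_MAX  32UL  /* same value as the kernel's SWAP_CLUSTER_MAX */

/* With the removed lines: fall back to a small batch once priority drops. */
static unsigned long nr_to_scan_old(unsigned long evictable, int priority)
{
	unsigned long nr = evictable >> priority;

	if (!nr && priority < DEF_PRIORITY)
		nr = evictable < SWAP_CLUSTER_MAX ? evictable : SWAP_CLUSTER_MAX;

	return nr;
}

/* After v7: a plain shift, which may be 0 for a small lruvec. */
static unsigned long nr_to_scan_v7(unsigned long evictable, int priority)
{
	return evictable >> priority;
}
```

For example, with 1000 evictable folios at priority 11 the old variant
yields 32 while the v7 variant yields 0; for larger lruvecs the two agree.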
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Kairui Song 3 weeks, 6 days ago
On Tue, Apr 28, 2026 at 2:07 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
>
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
>
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.

Hi All,

This is a supplementary test report that also explains why we are using
these cases. All tests below, unless explicitly declared otherwise,
are run at least six times, using the median result. Some tests are
also run against MGLRU-FG[1] as a reference. (MGLRU-FG is still under
development so the readings may change - hopefully for the better).

>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.

MongoDB with workloadb has mixed writeback and read pressure, which
tests the LRU's capability to handle writeback flushing while
protecting the workingset.

Using the same test setup, I retested everything with Classical LRU
included (which I will refer to as CLRU below, open to suggestions for
a better abbreviation :) ). I rebased it on top of the current 7.1 rc
with a clean test environment.

CLRU:
93713.640901 ops/sec
workingset_refault_file 15013443
pgpgin 85365614
pgpgout 5866508

MGLRU Before:
60653.502655 ops/sec
workingset_refault_file 12904916
pgpgin 165366622
pgpgout 5219588

MGLRU After:
82384.354760 ops/sec
workingset_refault_file 7128285
pgpgin 113170693
pgpgout 5639724

Before this series, MGLRU lagged CLRU by approximately 35% on this
workload. This is the case where MGLRU has historically struggled the
most. This series closes most of that gap (within ~13%), and
MGLRU-FG[1] will close the rest (within noise). The trajectory is
clear and the work is ongoing:

MGLRU-FG:
92930.697550 ops/sec
workingset_refault_file 10775748
pgpgin 98558215
pgpgout 5736764

It's very interesting that MGLRU-FG and CLRU both have a higher
workingset_refault_file but a lower pgpgin. I suspect this could be
related to the slab (inode) shrinking balance. I ran into a similar
issue before [2], which can be looked into later, but I think that's
irrelevant to this series and we are definitely on the right track.

The test results above basically match the cover letter as well. The
readings are a bit different due to a different test environment, which
isn't an issue, so I think there is no need to update that.

>
> Not using SWAP.
>
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
>
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923                       (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366       (-43.3%, lower is better)
>
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests:            79915
> Per-worker 95% CI (mean):  [1233.9, 1263.5]
> Per-worker stdev:          59.2
> Jain's fairness:           0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26859   33.61%   33.61%
> [1,2)s      7818    9.78%   43.39%
> [2,4)s      5532    6.92%   50.31%
> [4,8)s     39706   49.69%  100.00%
>
> After:
> Total requests:            81382
> Per-worker 95% CI (mean):  [1241.9, 1301.3]
> Per-worker stdev:          118.8
> Jain's fairness:           0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26696   32.80%   32.80%
> [1,2)s      8745   10.75%   43.55%
> [2,4)s      6865    8.44%   51.98%
> [4,8)s     39076   48.02%  100.00%
>
> Reclaim is still fair and effective, total requests number seems
> slightly better.

Chrome & Node.js is a very common workload for many users. Running these
workloads in different cgroups applies equal pressure to all cgroups
under global pressure, hence testing the LRU's ability to detect and
protect the working set, its efficiency, and how well it balances
reclaim between multiple tenants.

I'll post a summary of the test results since the raw output is way too long.

CLRU:
THROUGHPUT
Total requests:           62399
Per-worker mean:          975.0
Per-worker 95% CI (mean):       [   941.9,   1008.1]
LATENCY DISTRIBUTION (all workers aggregated)
      Bucket     Count      Pct    Cumul
      [0,1)s     20051   32.13%   32.13%
      [1,2)s      2255    3.61%   35.75%
      [2,4)s      6149    9.85%   45.60%
      [4,8)s     33927   54.37%   99.97%
     [8,16)s        17    0.03%  100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.982156  (1.0 = perfectly fair)

MGLRU before:
THROUGHPUT
Total requests:           81898
Per-worker mean:         1279.7
Per-worker 95% CI (mean):       [  1259.0,   1300.4]
LATENCY DISTRIBUTION (all workers aggregated)
      Bucket     Count      Pct    Cumul
      [0,1)s     28392   34.67%   34.67%
      [1,2)s      8022    9.80%   44.46%
      [2,4)s      6130    7.48%   51.95%
      [4,8)s     39354   48.05%  100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.995893  (1.0 = perfectly fair)

MGLRU after:
THROUGHPUT
Total requests:           82901
Per-worker mean:         1295.3
Per-worker 95% CI (mean):       [  1265.3,   1325.4]
LATENCY DISTRIBUTION (all workers aggregated)
      Bucket     Count      Pct    Cumul
      [0,1)s     28128   33.93%   33.93%
      [1,2)s      8756   10.56%   44.49%
      [2,4)s      7028    8.48%   52.97%
      [4,8)s     38989   47.03%  100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.991607  (1.0 = perfectly fair)

In summary, MGLRU performs very well, both before and after this
series, across throughput, latency, and fairness. I also tested
MGLRU-FG, which yielded similar results with a per-worker 95% CI
(mean) of [1275.5, 1333.9].

>
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated and fixed by a later patch in this series.

Skipping this one, aging/throttling OOM is an MGLRU-only issue, and
is fixed by this series.

>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
>   --tables=48 --table-size=2000000 --threads=48 --time=600 run
>
> Before:            17303.41 tps
> After this series: 17291.50 tps
>

MySQL with sysbench is a standard database benchmark. The 24G InnoDB
buffer pool inside a 2G memory cgroup forces aggressive eviction of
cached database anon pages, testing the LRU's ability to identify hot
pages and the eviction path's efficiency under swap pressure.

Here is the retested result (average of 6 test runs):

CLRU: 16245.330000 tps
MGLRU before: 17313.688333 tps
MGLRU after: 17286.195000 tps
MGLRU-FG: 17225.123333 tps

So MGLRU before/after/FG are all doing well with this one, ahead of
CLRU. It seems very slightly slower after this series, but this could
be noise, and I think it's fine to ignore. This series has no
noticeable effect on MGLRU for this kind of test.

>
> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
>
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
>   --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
>   --rw=randread --norandommap --time_based \
>   --ramp_time=1m --runtime=5m --group_reporting
>
> Before:            8968.76 MB/s
> After this series: 8995.63 MB/s

Random buffered FIO read on a ramdisk basically tests the LRU's
ability to evict the page cache efficiently. Results (average of 6
test runs):
CLRU: 8254.540000 MB/s
MGLRU before: 9033.908333 MB/s
MGLRU after: 9065.725000 MB/s
MGLRU-FG: 9067.105000 MB/s

MGLRU before / after / FG are all doing very well on this one.

>
> Also seem only noise level changes and no regression or slightly better.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
>
> Before:            2873.52s
> After this series: 2811.88s

The kernel build test is a classic test for us. Many performance
tests use it as a standard, and it is a real workload that stands in
for many compilation tasks.

I'll just post the system time for a more direct comparison, following
the setup described in the cover letter:

CLRU: 2760.50user 5023.50system 1:51.89elapsed
MGLRU before: 2924.41user 2823.13system 1:28.09elapsed
MGLRU after: 2938.42user 2801.26system 1:28.10elapsed
MGLRU-FG: 2936.42user 2781.65system 1:27.84elapsed

MGLRU before / after / FG are all doing very well on this one.

Testing on disk instead, using BTRFS with a 3G memcg:
CLRU:
real   1m51.325s
user   37m16.586s
sys    11m20.294s

MGLRU before:
real   1m49.649s
user   37m38.325s
sys    9m0.360s

MGLRU after:
real   1m49.223s
user   37m15.546s
sys    8m46.135s

MGLRU-FG:
real   1m49.908s
user   37m22.696s
sys    8m53.138s

Still, MGLRU before / after / FG are all doing very well.

>
> Also seem only noise level changes, no regression or very slightly better.
>
> Android:
> ========
> Xinyu reported a performance gain on Android, too, with this series. The
> test consisted of cold-starting multiple applications sequentially under
> moderate system load. [6]
>
> Before:
> Launch Time Summary (all apps, all runs)
>   Mean 868.0ms
>   P50 888.0ms
>   P90 1274.2ms
>   P95 1399.0ms
>
> After:
> Launch Time Summary (all apps, all runs)
>   Mean 850.5ms (-2.07%)
>   P50 861.5ms  (-3.04%)
>   P90 1179.0ms (-8.05%)
>   P95 1228.0ms (-12.2%)

I've seen many reports from Android that MGLRU provides better battery
life, and my personal experience backporting this series to my phone is
quite positive :). I currently lack a standard environment for Android
testing because I don't have any Android vendor support, so I'll have
to skip the comparison on this one. I think Xinyu's original
numbers are good enough for this series. (I remember that community
reports and historical reports at LPC or LSF/MM/BPF have all looked good
so far.)

I've also posted other tests previously that all show this series is
behaving correctly, but I don't think we should include all of them or
this will be ridiculously long:
https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/

======

In summary, I think these tests make a lot of sense, and re-testing
alongside CLRU indicates that MGLRU performs very well with this
series. In most cases MGLRU performs much better. MGLRU suffered the
most during the MongoDB writeback workload (YCSB workloadb), and that
is exactly what we are solving, and the gap is closing with a clear plan.

I can fold the per-benchmark rationale sentences and CLRU baselines
into a refreshed cover letter (no code changes), or should we just
add a link to this email instead? The existing cover letter is already
long and sufficiently supportive IMO, and the new test results match
what we already have.

In any case, I think we are headed in the right direction.

Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7BsY1tJeOZwSppxUN7Lha-_a7WLfhv1_bxTuU4EuiQyVg@mail.gmail.com/ [2]
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Shakeel Butt 1 month ago
Hi Kairui,

On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
> 
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
> 
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
> 
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
> 
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.

Please include traditional LRU results for all of the following experiments as
well (where it makes sense).

> 
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.

Can you add a sentence here on why this workload is chosen and is important for
evaluation?

> 
> Not using SWAP.

Any specific reason to not have swap in this test?

> 
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
> 
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923                       (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366       (-43.3%, lower is better)
> 
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
> 
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
> 
> Before:
> Total requests:            79915
> Per-worker 95% CI (mean):  [1233.9, 1263.5]
> Per-worker stdev:          59.2
> Jain's fairness:           0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26859   33.61%   33.61%
> [1,2)s      7818    9.78%   43.39%
> [2,4)s      5532    6.92%   50.31%
> [4,8)s     39706   49.69%  100.00%
> 
> After:
> Total requests:            81382
> Per-worker 95% CI (mean):  [1241.9, 1301.3]
> Per-worker stdev:          118.8
> Jain's fairness:           0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26696   32.80%   32.80%
> [1,2)s      8745   10.75%   43.55%
> [2,4)s      6865    8.44%   51.98%
> [4,8)s     39076   48.02%  100.00%
> 
> Reclaim is still fair and effective, total requests number seems
> slightly better.

Please add a reference to Jain's fairness and a sentence on why we should care
about it.

> 
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated and fixed by a later patch in this series.
> 
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]:
> Spawns multiple workers that keep reading the given file using mmap,
> and pauses for 120ms after one file read batch. It also spawns another
> set of workers that keep allocating and freeing a given size of
> anonymous memory. The total memory size exceeds the memory limit
> (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
> 
> - MGLRU disabled:
>   Finished 128 iterations.
> 
> - MGLRU enabled:
>   OOM with following info after about ~10-20 iterations:
>     [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>     [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
>     [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>     [   62.640823] Memory cgroup stats for /demo:
>     [   62.641017] anon 10604879872
>     [   62.641941] file 6574858240
> 
>   OOM occurs despite there being still evictable file folios.
> 
> - MGLRU enabled after this series:
>   Finished 128 iterations.
> 
> Worth noting there is another OOM related issue reported in V1 of
> this series, which is tested and looking OK now [5].

Oh this is good as it seems like you are already running traditional LRU.

> 
> MySQL:
> ======
> 
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
> 
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
>   --tables=48 --table-size=2000000 --threads=48 --time=600 run
> 
> Before:            17303.41 tps
> After this series: 17291.50 tps
> 
> Seems only noise level changes, no regression.
> 

Please add a sentence on why these specific params were chosen.

> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
> 
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
>   --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
>   --rw=randread --norandommap --time_based \
>   --ramp_time=1m --runtime=5m --group_reporting
> 
> Before:            8968.76 MB/s
> After this series: 8995.63 MB/s
> 
> Also seem only noise level changes and no regression or slightly better.

Same here.

> 
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
> 
> Before:            2873.52s
> After this series: 2811.88s
> 
> Also seem only noise level changes, no regression or very slightly better.

So, the kernel source code is on tmpfs, right? Also 3G memcg means memory.max is
3G, correct?

> 
> Android:
> ========
> Xinyu reported a performance gain on Android, too, with this series. The
> test consisted of cold-starting multiple applications sequentially under
> moderate system load. [6]
> 
> Before:
> Launch Time Summary (all apps, all runs)
>   Mean 868.0ms
>   P50 888.0ms
>   P90 1274.2ms
>   P95 1399.0ms
> 
> After:
> Launch Time Summary (all apps, all runs)
>   Mean 850.5ms (-2.07%)
>   P50 861.5ms  (-3.04%)
>   P90 1179.0ms (-8.05%)
>   P95 1228.0ms (-12.2%)

It would be awesome if Xinyu can gather traditional LRU numbers but if not then
it is fine.
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Kairui Song 1 month ago
On Tue, May 12, 2026 at 2:51 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
>
> Hi Kairui,

Hello,

>
> On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> > and a 128G memory machine using NVME as storage.
>
> Please include traditional LRU results for all of the following experiments as
> well (where it makes sense).

Sure, I've spawned a few test instances; I was busy travelling last
week. That specific test machine is occupied, so it might take a while.

A systematic test run takes roughly one or two days to complete for
one kernel version or config, e.g. the JS test takes at least 2 hours
to finish. Comparing versions/setups takes more time.

>
> >
> > MongoDB
> > =======
> > Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> > threads:32), which does 95% read and 5% update to generate mixed read
> > and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> > the WiredTiger cache size is set to 4.5G, using NVME as storage.
>
> Can you add a sentence here on why this workload is chosen and is important for
> evaluation?

Because that's exactly the workload where we observed the regression,
since it involves mixed read and dirty writeback, and it's a practical
case.

>
> >
> > Not using SWAP.
>
> Any specific reason to not have swap in this test?

Because we are testing writeback here, which is unrelated to SWAP, so
swap is left out to avoid noise from irrelevant parts.

A longer history involving SWAP is explained here:
https://lore.kernel.org/linux-mm/20230920190244.16839-1-ryncsn@gmail.com/

And a longer discussion on that:
https://lore.kernel.org/linux-mm/CAMgjq7BRaRgYLf2+8=+=nWtzkrHFKmudZPRm41PR6W+A+L=AKA@mail.gmail.com/

Neither is easy to reproduce, though. YCSB with MongoDB seems close
enough, and I believe we are on the right track.

In an internal workload, we observed that patched MGLRU is about 20%
faster than classical LRU with MongoDB. Upstream MGLRU is still
slightly behind classical LRU at this point, and will hopefully be
fixed soon by the RFC I posted:
https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/

>
> >
> > Before:
> > Throughput(ops/sec): 62485.02962831822
> > AverageLatency(us): 500.9746963330107
> > pgpgin 159347462
> > pgpgout 5413332
> > workingset_refault_anon 0
> > workingset_refault_file 34522071
> >
> > After:
> > Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> > AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> > pgpgin 111093923                       (-30.3%, lower is better)
> > pgpgout 5437456
> > workingset_refault_anon 0
> > workingset_refault_file 19566366       (-43.3%, lower is better)
> >
> > We can see a significant performance improvement after this series.
> > The test is done on NVME and the performance gap would be even larger
> > for slow devices, such as HDD or network storage. We observed over
> > 100% gain for some workloads with slow IO.
> >
> > Chrome & Node.js [3]
> > ====================
> > Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> > nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> > workers:
> >
> > Before:
> > Total requests:            79915
> > Per-worker 95% CI (mean):  [1233.9, 1263.5]
> > Per-worker stdev:          59.2
> > Jain's fairness:           0.997795 (1.0 = perfectly fair)
> > Latency:
> > Bucket     Count      Pct    Cumul
> > [0,1)s     26859   33.61%   33.61%
> > [1,2)s      7818    9.78%   43.39%
> > [2,4)s      5532    6.92%   50.31%
> > [4,8)s     39706   49.69%  100.00%
> >
> > After:
> > Total requests:            81382
> > Per-worker 95% CI (mean):  [1241.9, 1301.3]
> > Per-worker stdev:          118.8
> > Jain's fairness:           0.991480 (1.0 = perfectly fair)
> > Latency:
> > Bucket     Count      Pct    Cumul
> > [0,1)s     26696   32.80%   32.80%
> > [1,2)s      8745   10.75%   43.55%
> > [2,4)s      6865    8.44%   51.98%
> > [4,8)s     39076   48.02%  100.00%
> >
> > Reclaim is still fair and effective, total requests number seems
> > slightly better.
>
> Please add a reference to Jain's fairness and a sentence on why we should care
> about it.

So first, here is the previous test setup for that:
https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/

The basic idea is simple: if all memcgs are under similar pressure,
they should be reclaimed equally, which seems fair.

The fairness index measures the equality of resource allocation among
users. It is commonly used to evaluate network bandwidth distribution
for multiple users under pressure, which seems suitable here. We are
also measuring the reclaim ratio for multiple users under pressure.
I'm using a numeric index here so I don't need to post 500 lines of
raw test results every time. A reference for the index:
https://www.sciencedirect.com/topics/computer-science/fairness-index
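
For readers unfamiliar with the index, here is a minimal sketch of how
Jain's fairness index is usually defined: J = (sum x)^2 / (n * sum x^2),
which is 1.0 when every user gets the same share and 1/n when a single
user gets everything. The function names are illustrative and are not
taken from the test script; the five sample values are the first five
per-worker totals from the classical LRU raw result in this mail.

```python
from statistics import mean, pstdev


def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    values = list(values)
    n = len(values)
    squares = sum(v * v for v in values)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * squares)


def jain_from_cv(values):
    """Equivalent form: J = 1 / (1 + CV^2), CV = population stdev / mean."""
    cv = pstdev(values) / mean(values)
    return 1.0 / (1.0 + cv * cv)


print(jain_index([100, 100, 100, 100]))  # 1.0, perfectly fair
print(jain_index([400, 0, 0, 0]))        # 0.25, i.e. 1/n

# First five per-worker totals from the classical LRU raw result.
sample = [984, 882, 1051, 952, 921]
print(round(jain_index(sample), 6), round(jain_from_cv(sample), 6))
```

The second form shows why the index and the coefficient of variation
move together: a larger spread between workers always lowers the index.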

Also, here is the longer version of the test results I collected over
the past few days. The test closely mirrors real-world usage from
desktop to web services.

Classical LRU:
------------------------------------------------------------------------
THROUGHPUT
------------------------------------------------------------------------
  Total requests:           60226
  Per-worker mean:          941.0
  Per-worker median:          931
  Per-worker min:             678
  Per-worker max:            1252
  Per-worker stdev:         131.3
  95% CI (mean):       [   908.2,    973.9]
------------------------------------------------------------------------
LATENCY DISTRIBUTION (all workers aggregated)
------------------------------------------------------------------------
      Bucket     Count      Pct    Cumul
      [0,1)s     19493   32.37%   32.37%
      [1,2)s      2024    3.36%   35.73%
      [2,4)s      5621    9.33%   45.06%
      [4,8)s     32881   54.60%   99.66%
     [8,16)s       207    0.34%  100.00%
    [16,32)s         0    0.00%  100.00%
    [32,64)s         0    0.00%  100.00%
   [64,128)s         0    0.00%  100.00%
  [128,inf)s         0    0.00%  100.00%
------------------------------------------------------------------------
FAIRNESS (per-worker total requests)
------------------------------------------------------------------------
  Jain's fairness index: 0.981188  (1.0 = perfectly fair)
  Coeff of variation:    0.1396  (0.0 = perfectly fair)
  Min/Max ratio:         0.5415
  P10:                      780
  P25:                      855
  P50 (median):             931
  P75:                     1040
  P90:                     1112
------------------------------------------------------------------------
PER-MEMCG BREAKDOWN (sorted by total, top/bottom 5)
------------------------------------------------------------------------
  Memcgs: 32  mean=1882.1  95%CI=[1799.8, 1964.4]  Jain=0.9860
         --- Top 5 ---
    memcg  6:   2423 requests
    memcg 10:   2364 requests
    memcg 31:   2213 requests
    memcg 20:   2207 requests
    memcg 30:   2156 requests
      --- Bottom 5 ---
    memcg 27:   1658 requests
    memcg 19:   1645 requests
    memcg 12:   1610 requests
    memcg  0:   1566 requests
    memcg 28:   1533 requests
Raw result:
client: 8047 total:    984, 0:    293, 1:     44, 2:    108, 4:    538, 8:      1, 16:      0, 32:      0, 64:      0, 128:      0
client: 8058 total:    882, 0:    289, 1:     18, 2:     63, 4:    507, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8017 total:   1051, 0:    347, 1:     43, 2:    133, 4:    528, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8059 total:    952, 0:    274, 1:     41, 2:     92, 4:    545, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8005 total:    921, 0:    230, 1:     43, 2:    113, 4:    535, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8063 total:   1173, 0:    459, 1:     50, 2:    161, 4:    503, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8051 total:    986, 0:    296, 1:     34, 2:    122, 4:    534, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8043 total:    949, 0:    260, 1:     53, 2:     90, 4:    546, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8045 total:   1069, 0:    362, 1:     46, 2:    143, 4:    518, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8008 total:    857, 0:    259, 1:     25, 2:     69, 4:    500, 8:      4, 16:      0, 32:      0, 64:      0, 128:      0
client: 8023 total:   1049, 0:    348, 1:     44, 2:    136, 4:    521, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8015 total:    895, 0:    221, 1:     34, 2:    105, 4:    534, 8:      1, 16:      0, 32:      0, 64:      0, 128:      0
client: 8027 total:    899, 0:    219, 1:     42, 2:     96, 4:    542, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8061 total:   1093, 0:    396, 1:     31, 2:    157, 4:    509, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8038 total:    737, 0:    174, 1:      7, 2:     46, 4:    501, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8056 total:    678, 0:    133, 1:      5, 2:     32, 4:    501, 8:      7, 16:      0, 32:      0, 64:      0, 128:      0
client: 8040 total:   1039, 0:    423, 1:     37, 2:     93, 4:    477, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8036 total:    766, 0:    202, 1:      7, 2:     54, 4:    494, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8000 total:    697, 0:    136, 1:     13, 2:     48, 4:    495, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8030 total:    804, 0:    232, 1:     14, 2:     53, 4:    501, 8:      4, 16:      0, 32:      0, 64:      0, 128:      0
client: 8006 total:    852, 0:    267, 1:     18, 2:     62, 4:    495, 8:     10, 16:      0, 32:      0, 64:      0, 128:      0
client: 8062 total:   1040, 0:    437, 1:     43, 2:     61, 4:    489, 8:     10, 16:      0, 32:      0, 64:      0, 128:      0
client: 8014 total:    833, 0:    254, 1:     14, 2:     58, 4:    497, 8:     10, 16:      0, 32:      0, 64:      0, 128:      0
client: 8060 total:   1063, 0:    465, 1:     23, 2:     81, 4:    485, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8046 total:    814, 0:    244, 1:     18, 2:     40, 4:    508, 8:      4, 16:      0, 32:      0, 64:      0, 128:      0
client: 8049 total:   1080, 0:    388, 1:     40, 2:    123, 4:    529, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8022 total:   1001, 0:    422, 1:     22, 2:     62, 4:    486, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8019 total:    988, 0:    304, 1:     36, 2:    116, 4:    532, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8026 total:    780, 0:    218, 1:     12, 2:     47, 4:    500, 8:      3, 16:      0, 32:      0, 64:      0, 128:      0
client: 8024 total:    719, 0:    163, 1:      7, 2:     43, 4:    501, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8053 total:   1045, 0:    360, 1:     38, 2:    120, 4:    527, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8034 total:    873, 0:    286, 1:     19, 2:     57, 4:    508, 8:      3, 16:      0, 32:      0, 64:      0, 128:      0
client: 8048 total:    889, 0:    301, 1:     26, 2:     59, 4:    497, 8:      6, 16:      0, 32:      0, 64:      0, 128:      0
client: 8055 total:    871, 0:    199, 1:     40, 2:     89, 4:    543, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8001 total:    869, 0:    196, 1:     35, 2:     95, 4:    543, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8003 total:   1051, 0:    369, 1:     42, 2:    103, 4:    537, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8011 total:   1118, 0:    398, 1:     53, 2:    156, 4:    511, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8018 total:    762, 0:    192, 1:     15, 2:     45, 4:    503, 8:      7, 16:      0, 32:      0, 64:      0, 128:      0
client: 8021 total:   1112, 0:    410, 1:     41, 2:    145, 4:    516, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8050 total:    869, 0:    276, 1:     21, 2:     71, 4:    496, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8032 total:    823, 0:    238, 1:     21, 2:     54, 4:    505, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8044 total:   1030, 0:    433, 1:     31, 2:     66, 4:    496, 8:      4, 16:      0, 32:      0, 64:      0, 128:      0
client: 8035 total:    965, 0:    283, 1:     42, 2:    112, 4:    528, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8025 total:    891, 0:    212, 1:     43, 2:     90, 4:    546, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8039 total:    908, 0:    241, 1:     36, 2:     86, 4:    545, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8009 total:    963, 0:    286, 1:     36, 2:    108, 4:    533, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8037 total:    917, 0:    227, 1:     45, 2:    100, 4:    545, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8020 total:   1252, 0:    607, 1:     51, 2:    115, 4:    477, 8:      2, 16:      0, 32:      0, 64:      0, 128:      0
client: 8004 total:    818, 0:    245, 1:     16, 2:     47, 4:    501, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8052 total:    870, 0:    285, 1:     20, 2:     52, 4:    507, 8:      6, 16:      0, 32:      0, 64:      0, 128:      0
client: 8033 total:    925, 0:    269, 1:     28, 2:     83, 4:    545, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8010 total:    931, 0:    334, 1:     29, 2:     62, 4:    500, 8:      6, 16:      0, 32:      0, 64:      0, 128:      0
client: 8016 total:    990, 0:    388, 1:     27, 2:     70, 4:    500, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8012 total:   1173, 0:    556, 1:     51, 2:     78, 4:    480, 8:      8, 16:      0, 32:      0, 64:      0, 128:      0
client: 8028 total:    837, 0:    253, 1:     32, 2:     47, 4:    500, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
client: 8031 total:    992, 0:    315, 1:     33, 2:    119, 4:    525, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8041 total:   1168, 0:    452, 1:     52, 2:    162, 4:    502, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8054 total:    787, 0:    212, 1:     15, 2:     58, 4:    493, 8:      9, 16:      0, 32:      0, 64:      0, 128:      0
client: 8042 total:    799, 0:    227, 1:     13, 2:     45, 4:    508, 8:      6, 16:      0, 32:      0, 64:      0, 128:      0
client: 8002 total:   1034, 0:    449, 1:     32, 2:     59, 4:    488, 8:      6, 16:      0, 32:      0, 64:      0, 128:      0
client: 8057 total:    855, 0:    184, 1:     47, 2:     81, 4:    543, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8007 total:    965, 0:    269, 1:     36, 2:    135, 4:    525, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8013 total:   1250, 0:    536, 1:     53, 2:    143, 4:    518, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8029 total:    973, 0:    290, 1:     41, 2:    102, 4:    539, 8:      1, 16:      0, 32:      0, 64:      0, 128:      0
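
A rough sketch of how the raw per-client lines above could be turned
back into the summary tables. It assumes the keys 0, 1, 2, 4, ... are
the lower bounds (in seconds) of the latency buckets shown in the
LATENCY DISTRIBUTION table, which is an inference from matching the
two; the helper names are made up for illustration and are not from the
test script. Whitespace is matched loosely, so records wrapped across
two lines by a mail client parse the same way.

```python
import re
from collections import Counter

# One record: "client: <id> total: <n>, <bucket>: <count>, ..."
# \s also matches newlines, so mail-wrapped records are handled.
RECORD = re.compile(r"client:\s*(\d+)\s+total:\s*(\d+),((?:\s*\d+:\s*\d+,?)+)")
PAIR = re.compile(r"(\d+):\s*(\d+)")


def parse_raw(text):
    """Map client id -> (total requests, {bucket lower bound: count})."""
    parsed = {}
    for client, total, rest in RECORD.findall(text):
        buckets = {int(k): int(v) for k, v in PAIR.findall(rest)}
        parsed[int(client)] = (int(total), buckets)
    return parsed


def aggregate(parsed):
    """Sum the bucket counts over all clients."""
    agg = Counter()
    for _total, buckets in parsed.values():
        agg.update(buckets)
    return agg


# Two records copied from the classical LRU raw result, one of them wrapped.
SAMPLE = """\
client: 8047 total:    984, 0:    293, 1:     44, 2:    108, 4:
538, 8:      1, 16:      0, 32:      0, 64:      0, 128:      0
client: 8058 total:    882, 0:    289, 1:     18, 2:     63, 4:    507, 8:      5, 16:      0, 32:      0, 64:      0, 128:      0
"""

parsed = parse_raw(SAMPLE)
for client, (total, buckets) in parsed.items():
    # Each record is self-consistent: the buckets add up to the total.
    assert sum(buckets.values()) == total
print(dict(aggregate(parsed)))
```

Feeding the per-client totals from such a parse into a fairness or
percentile calculation is then a one-liner.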

MGLRU (after this series; results before it are similar, with slightly
lower throughput that may just be noise, see cover letter):
------------------------------------------------------------------------
THROUGHPUT
------------------------------------------------------------------------
  Total requests:           83926
  Per-worker mean:         1311.3
  Per-worker median:         1306
  Per-worker min:            1170
  Per-worker max:            1466
  Per-worker stdev:          70.8
  95% CI (mean):       [  1293.6,   1329.0]
------------------------------------------------------------------------
LATENCY DISTRIBUTION (all workers aggregated)
------------------------------------------------------------------------
      Bucket     Count      Pct    Cumul
      [0,1)s     27929   33.28%   33.28%
      [1,2)s      9075   10.81%   44.09%
      [2,4)s      8558   10.20%   54.29%
      [4,8)s     38364   45.71%  100.00%
     [8,16)s         0    0.00%  100.00%
    [16,32)s         0    0.00%  100.00%
    [32,64)s         0    0.00%  100.00%
   [64,128)s         0    0.00%  100.00%
  [128,inf)s         0    0.00%  100.00%
------------------------------------------------------------------------
FAIRNESS (per-worker total requests)
------------------------------------------------------------------------
  Jain's fairness index: 0.997138  (1.0 = perfectly fair)
  Coeff of variation:    0.0540  (0.0 = perfectly fair)
  Min/Max ratio:         0.7981
  P10:                     1220
  P25:                     1253
  P50 (median):            1306
  P75:                     1367
  P90:                     1398
------------------------------------------------------------------------
PER-MEMCG BREAKDOWN (sorted by total, top/bottom 5)
------------------------------------------------------------------------
  Memcgs: 32  mean=2622.7  95%CI=[2601.4, 2643.9]  Jain=0.9995
         --- Top 5 ---
    memcg 24:   2719 requests
    memcg  5:   2711 requests
    memcg 16:   2703 requests
    memcg  0:   2696 requests
    memcg 19:   2689 requests
      --- Bottom 5 ---
    memcg 20:   2550 requests
    memcg 21:   2534 requests
    memcg 23:   2521 requests
    memcg 22:   2514 requests
    memcg 27:   2514 requests
Raw result:
client: 8028 total:   1252, 0:    410, 1:    132, 2:    106, 4:    604, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8026 total:   1220, 0:    390, 1:    107, 2:    106, 4:    617, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8036 total:   1260, 0:    403, 1:    154, 2:     92, 4:    611, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8038 total:   1322, 0:    475, 1:    150, 2:     90, 4:    607, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8002 total:   1220, 0:    384, 1:    137, 2:     82, 4:    617, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8008 total:   1264, 0:    410, 1:    138, 2:    108, 4:    608, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8044 total:   1180, 0:    339, 1:    123, 2:     94, 4:    624, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8062 total:   1267, 0:    428, 1:    125, 2:    111, 4:    603, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8050 total:   1197, 0:    351, 1:    131, 2:    113, 4:    602, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8057 total:   1379, 0:    480, 1:    142, 2:    158, 4:    599, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8048 total:   1301, 0:    454, 1:    142, 2:    101, 4:    604, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8034 total:   1266, 0:    422, 1:    140, 2:     98, 4:    606, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8020 total:   1282, 0:    425, 1:    153, 2:     98, 4:    606, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8000 total:   1245, 0:    404, 1:    137, 2:     88, 4:    616, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8030 total:   1282, 0:    411, 1:    164, 2:    104, 4:    603, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8045 total:   1334, 0:    424, 1:    147, 2:    168, 4:    595, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8053 total:   1359, 0:    462, 1:    139, 2:    161, 4:    597, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8060 total:   1240, 0:    375, 1:    158, 2:    110, 4:    597, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8043 total:   1338, 0:    437, 1:    138, 2:    171, 4:    592, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8041 total:   1323, 0:    438, 1:    124, 2:    155, 4:    606, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8025 total:   1331, 0:    435, 1:    130, 2:    180, 4:    586, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8040 total:   1227, 0:    389, 1:    133, 2:     92, 4:    613, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8022 total:   1240, 0:    393, 1:    139, 2:    100, 4:    608, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8049 total:   1418, 0:    510, 1:    145, 2:    172, 4:    591, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8012 total:   1205, 0:    373, 1:    120, 2:     93, 4:    619, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8059 total:   1375, 0:    462, 1:    171, 2:    152, 4:    590, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8037 total:   1412, 0:    513, 1:    144, 2:    171, 4:    584, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8001 total:   1451, 0:    536, 1:    160, 2:    191, 4:    564, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8009 total:   1356, 0:    451, 1:    133, 2:    182, 4:    590, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8039 total:   1367, 0:    456, 1:    144, 2:    186, 4:    581, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8042 total:   1196, 0:    345, 1:    134, 2:     97, 4:    620, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8013 total:   1409, 0:    519, 1:    134, 2:    172, 4:    584, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8021 total:   1392, 0:    478, 1:    156, 2:    169, 4:    589, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8031 total:   1373, 0:    477, 1:    135, 2:    174, 4:    587, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8014 total:   1271, 0:    419, 1:    152, 2:     96, 4:    604, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8015 total:   1305, 0:    398, 1:    139, 2:    179, 4:    589, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8024 total:   1251, 0:    390, 1:    167, 2:     78, 4:    616, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8033 total:   1335, 0:    408, 1:    169, 2:    172, 4:    586, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8004 total:   1245, 0:    398, 1:    129, 2:    107, 4:    611, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8003 total:   1394, 0:    494, 1:    144, 2:    154, 4:    602, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8052 total:   1296, 0:    444, 1:    154, 2:    106, 4:    592, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8061 total:   1353, 0:    455, 1:    142, 2:    147, 4:    609, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8017 total:   1355, 0:    451, 1:    153, 2:    166, 4:    585, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8063 total:   1367, 0:    474, 1:    136, 2:    152, 4:    605, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8018 total:   1225, 0:    379, 1:    132, 2:     97, 4:    617, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8029 total:   1345, 0:    460, 1:    129, 2:    152, 4:    604, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8027 total:   1398, 0:    518, 1:    121, 2:    158, 4:    601, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8007 total:   1253, 0:    375, 1:    118, 2:    124, 4:    636, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8047 total:   1302, 0:    414, 1:    126, 2:    170, 4:    592, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8005 total:   1397, 0:    488, 1:    151, 2:    161, 4:    597, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8019 total:   1347, 0:    437, 1:    145, 2:    178, 4:    587, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8035 total:   1361, 0:    453, 1:    151, 2:    179, 4:    578, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8011 total:   1416, 0:    517, 1:    147, 2:    172, 4:    580, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8023 total:   1385, 0:    473, 1:    161, 2:    172, 4:    579, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8046 total:   1219, 0:    388, 1:    123, 2:     91, 4:    617, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8056 total:   1304, 0:    463, 1:    135, 2:     95, 4:    611, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8006 total:   1306, 0:    442, 1:    147, 2:    130, 4:    587, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8054 total:   1170, 0:    321, 1:    136, 2:    101, 4:    612, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8055 total:   1344, 0:    447, 1:    141, 2:    169, 4:    587, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8010 total:   1295, 0:    429, 1:    164, 2:    100, 4:    602, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8016 total:   1292, 0:    448, 1:    140, 2:    108, 4:    596, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8051 total:   1466, 0:    555, 1:    152, 2:    200, 4:    559, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8058 total:   1278, 0:    430, 1:    144, 2:     86, 4:    618, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0
client: 8032 total:   1368, 0:    502, 1:    168, 2:    113, 4:    585, 8:      0, 16:      0, 32:      0, 64:      0, 128:      0

The test kernel is rebased on 7.1 rc. MGLRU serves ~39% more requests
than classical LRU (83926 vs. 60226), with a better latency
distribution and better fairness too. On my x86 machine the gain is not
as large as the one Yu posted for ARM, but it still looks very good.

Ridong also reproduced with a much better result where MGLRU seems to
be much faster than classical LRU on ARM (or maybe using a different
time period?):
https://lore.kernel.org/linux-mm/20260120134256.2271710-1-chenridong@huaweicloud.com/

During one or two test runs, a single memcg might achieve much higher
throughput with MGLRU, causing fairness to look slightly worse.
However, overall performance still seems much better than classical
LRU. I suspect improvements are needed for aging or the random bucket
part, but I think that's a separate topic for now.

> >
> > MySQL:
> > ======
> >
> > Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> > ZRAM as swap and test command:
> >
> > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> >   --tables=48 --table-size=2000000 --threads=48 --time=600 run
> >
> > Before:            17303.41 tps
> > After this series: 17291.50 tps
> >
> > Seems only noise level changes, no regression.
> >
>
> Please add a sentence on why this specific params.
>
> > FIO:
> > ====
> > Testing with the following command, where /mnt/ramdisk is a
> > 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> > 6 test run each:
> >
> > fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> >   --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> >   --rw=randread --norandommap --time_based \
> >   --ramp_time=1m --runtime=5m --group_reporting
> >
> > Before:            8968.76 MB/s
> > After this series: 8995.63 MB/s
> >
> > Also seem only noise level changes and no regression or slightly better.
>
> Same here.

I tested the page cache performance with buffered read. There is
another test involving classical LRU, where MGLRU seems to
significantly outperform classical LRU. The case was provided by the
CachyOS community, I didn't include it here because the cover letter
is already getting tediously long.

https://lore.kernel.org/all/acgNCzRDVmSbXrOE@KASONG-MC4/

MGLRU seems to have significantly lower jitter and better performance with that.

BTW I also disabled OOMD or any related daemon to avoid noise during
that test. I repeated the test several times, and recorded one test
run as well since it's meant for a desktop test and I was discussing
with distro communities at that time. MGLRU TTL can completely avoid
jitter, however, it's not enabled during the test to prevent
confusion.

Classical LRU:
https://www.youtube.com/watch?v=pujboGNcBNI

MGLRU:
https://www.youtube.com/watch?v=ffnFUeaBQ_0

>
> >
> > Build kernel:
> > =============
> > Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> > using make -j96 and defconfig, measuring system time, 12 test run each.
> >
> > Before:            2873.52s
> > After this series: 2811.88s
> >
> > Also seem only noise level changes, no regression or very slightly better.
>
> So, the kernel source code is on tmpfs, right? Also 3G memcg means memory.max is
> 3G, correct?

Right. That's to avoid I/O noise. I also tested with source code on
disk, I didn't post that because I think the MySQL test already shows
a workload of mixed anon / file.
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Shakeel Butt 1 month ago
On Tue, May 12, 2026 at 01:08:49PM +0800, Kairui Song wrote:
> On Tue, May 12, 2026 at 2:51 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> >
> > Hi Kairui,
> 
> Hello,
> 
> >
> > On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> > > and a 128G memory machine using NVME as storage.
> >
> > Please include traditional LRU results for all of the following experiments as
> > well (where it makes sense).
> 
> Sure, I've spawn a few test instances, was busy travelling last week.
> That specific test machine is occupied so it might take a while.
> 
> A systematic test run takes roughly one or two days to complete for
> one kernel version or config, e.g. the JS test takes at least 2 hours
> to finish. Comparing versions/setups takes more time.
> 

No worries, we have a couple of weeks before the next merge window, so no urgency.
I will go through the series in depth. Hopefully there will not be a need for a
next version, and in that case please just resend the cover letter with the
information you provided below and don't worry about the length of the cover
letter.

> >
> > >
> > > MongoDB
> > > =======
> > > Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> > > threads:32), which does 95% read and 5% update to generate mixed read
> > > and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> > > the WiredTiger cache size is set to 4.5G, using NVME as storage.
> >
> > Can you add a sentence here on why this workload is chosen and is important for
> > evaluation?
> 
> Because that's exactly the one we observed with regression since it
> involves mixed writeback, and it's a pratical case.
> 

Sure, add this sentence in the cover letter.

> >
> > >
> > > Not using SWAP.
> >
> > Any specific reason to not have swap in this test?
> 
> Because we are testing the writeback here, not related to SWAP, so
> just to avoid noise and irrelevant parts.
> 
> A longer history involving SWAP is explained here:
> https://lore.kernel.org/linux-mm/20230920190244.16839-1-ryncsn@gmail.com/
> 
> And a longer discussion on that:
> https://lore.kernel.org/linux-mm/CAMgjq7BRaRgYLf2+8=+=nWtzkrHFKmudZPRm41PR6W+A+L=AKA@mail.gmail.com/
> 
> Both are not easy to reproduce, though. YCSB with MongoDB seems close
> enough and I believe we are heading in the right track.
> 
> In an internal workload, we observed that patched MGLRU is about 20%
> faster than classical LRU with MongoDB. Upstream MGLRU is still
> slightly behind classical LRU at this point, and will hopefully be
> patched soon, which is the RFC I posted:
> https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/
> 

Same here but don't need to go in such details.

> >
> > >
> > > Before:
> > > Throughput(ops/sec): 62485.02962831822
> > > AverageLatency(us): 500.9746963330107
> > > pgpgin 159347462
> > > pgpgout 5413332
> > > workingset_refault_anon 0
> > > workingset_refault_file 34522071
> > >
> > > After:
> > > Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> > > AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> > > pgpgin 111093923                       (-30.3%, lower is better)
> > > pgpgout 5437456
> > > workingset_refault_anon 0
> > > workingset_refault_file 19566366       (-43.3%, lower is better)
> > >
> > > We can see a significant performance improvement after this series.
> > > The test is done on NVME and the performance gap would be even larger
> > > for slow devices, such as HDD or network storage. We observed over
> > > 100% gain for some workloads with slow IO.
> > >
> > > Chrome & Node.js [3]
> > > ====================
> > > Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> > > nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> > > workers:
> > >
> > > Before:
> > > Total requests:            79915
> > > Per-worker 95% CI (mean):  [1233.9, 1263.5]
> > > Per-worker stdev:          59.2
> > > Jain's fairness:           0.997795 (1.0 = perfectly fair)
> > > Latency:
> > > Bucket     Count      Pct    Cumul
> > > [0,1)s     26859   33.61%   33.61%
> > > [1,2)s      7818    9.78%   43.39%
> > > [2,4)s      5532    6.92%   50.31%
> > > [4,8)s     39706   49.69%  100.00%
> > >
> > > After:
> > > Total requests:            81382
> > > Per-worker 95% CI (mean):  [1241.9, 1301.3]
> > > Per-worker stdev:          118.8
> > > Jain's fairness:           0.991480 (1.0 = perfectly fair)
> > > Latency:
> > > Bucket     Count      Pct    Cumul
> > > [0,1)s     26696   32.80%   32.80%
> > > [1,2)s      8745   10.75%   43.55%
> > > [2,4)s      6865    8.44%   51.98%
> > > [4,8)s     39076   48.02%  100.00%
> > >
> > > Reclaim is still fair and effective, total requests number seems
> > > slightly better.
> >
> > Please add a reference to Jain's fairness and a sentence on why we should care
> > about it.
> 
> So first, Here is the previous test setup for that:
> https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
> 
> The basical idea is simple: if all memcgs are under similar pressure,
> they should be reclaimed equally, which seems fair.

I think this is too much information. Just summarize this in a couple of sentences
in the cover letter. You can refer to your email in the cover letter for more
details.

[...]

> > >
> > > MySQL:
> > > ======
> > >
> > > Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> > > ZRAM as swap and test command:
> > >
> > > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> > >   --tables=48 --table-size=2000000 --threads=48 --time=600 run
> > >
> > > Before:            17303.41 tps
> > > After this series: 17291.50 tps
> > >
> > > Seems only noise level changes, no regression.
> > >
> >
> > Please add a sentence on why this specific params.
> >
> > > FIO:
> > > ====
> > > Testing with the following command, where /mnt/ramdisk is a
> > > 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> > > 6 test run each:
> > >
> > > fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> > >   --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> > >   --rw=randread --norandommap --time_based \
> > >   --ramp_time=1m --runtime=5m --group_reporting
> > >
> > > Before:            8968.76 MB/s
> > > After this series: 8995.63 MB/s
> > >
> > > Also seem only noise level changes and no regression or slightly better.
> >
> > Same here.
> 
> I tested the page cache performance with buffered read. There is
> another test involving classical LRU, where MGLRU seems to
> significantly outperform classical LRU. The case was provided by the
> CachyOS community, I didn't include it here because the cover letter
> is already getting tediously long.
> 
> https://lore.kernel.org/all/acgNCzRDVmSbXrOE@KASONG-MC4/
> 
> MGLRU seems to have significantly lower jitter and better performance with that.
> 
> BTW I also disabled OOMD or any related daemon to avoid noise during
> that test. I repeated the test several times, and recorded one test
> run as well since it's meant for a desktop test and I was discussing
> with distro communities at that time. MGLRU TTL can completely avoid
> jitter, however, it's not enabled during the test to prevent
> confusion.
> 
> Classical LRU:
> https://www.youtube.com/watch?v=pujboGNcBNI
> 
> MGLRU:
> https://www.youtube.com/watch?v=ffnFUeaBQ_0

The point is not which is better but documenting the performance difference
between them for the given workload.

At the high level, I am just asking for a given benchmark/workload, let's add a
sentence why we think this specific workload is important to measure and
evaluate reclaim mechanism.
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Andrew Morton 2 weeks, 1 day ago
On Mon, 11 May 2026 22:56:21 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> > > Please include traditional LRU results for all of the following experiments as
> > > well (where it makes sense).
> > 
> > Sure, I've spawn a few test instances, was busy travelling last week.
> > That specific test machine is occupied so it might take a while.
> > 
> > A systematic test run takes roughly one or two days to complete for
> > one kernel version or config, e.g. the JS test takes at least 2 hours
> > to finish. Comparing versions/setups takes more time.
> > 
> 
> No worries, we have couple of weeks before the next merge window, so no urgency.

Well, no, not really.  Some schmuck wants to get our
stable-non-rebasing branch into upstreamable shape well before the next
merge window.

This series was issued a month ago!

Sorry to crack the whip, but let's please all be aware of our
upstreaming timing.

> I will go through the series in depth, hopefully there will not be a need for
> next version and in that case, please just resend the cover letter with the
> information you provided below and don't worry about the length of the cover
> letter.

That's a plan.

Happily, MGLRU changes are well-isolated so I was able to trivially
move this series to the tail of mm-unstable.

It isn't a problem at all for me to defer this until the next cycle -
please let me know.

I'd like to know this as early as possible so I can hide the series
until after -rc1.  We shouldn't have "not for next merge window"
material in there possibly invalidating our ongoing testing.
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Kairui Song 2 weeks, 1 day ago
On Tue, May 26, 2026 at 06:35:06PM +0800, Andrew Morton wrote:
> On Mon, 11 May 2026 22:56:21 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > 
> > No worries, we have couple of weeks before the next merge window, so no urgency.
> 
> Well, no, not really.  Some schmuck wants to get our
> stable-non-rebasing branch into upstreamable shape well before the next
> merge window.
> 
> This series was issued a month ago!
> 
> Sorry to crack the whip, but let's please all be aware or our
> upstreaming timing.
> 
> > I will go through the series in depth, hopefully there will not be a need for
> > next version and in that case, please just resend the cover letter with the
> > information you provided below and don't worry about the length of the cover
> > letter.
> 
> That's a plan.
> 
> Happily, MGRLU changes are well-isolated so I was able to trivially
> move this series to the tail of mm-unstable.
> 
> It isn't a problem at all for me to defer this until the next cycle -
> please let me know.
> 
> I'd like to know this as early as possible so I can hide the series
> until after -rc1.  We shouldn't have "not for next merge window"
> material in there possibly invalidating our ongoing testing.

Hi Andrew,

From my side I didn't see any major reason to block this. There has
been plenty of review and the series has been tested many times. I
also re-ran with more rounds, with CLRU baseline, and the numbers
are very close to what was already posted.

The CLRU baseline can be useful as a reference. CLRU is unaffected by
this series as we don't touch it, and the comparison data Shakeel
asked for is on lore alongside the patch link. For that reason
I originally did not think a refreshed cover letter was strictly
needed.

But anyway, looking forward to Shakeel's review, and here is an
updated cover letter folding in the additions. There is not
much change: refreshed numbers, an "MGLRU disabled" (classical LRU)
row added to each benchmark, a short note on why each benchmark is
used, and three extra Link: tags. No code changes:

From: Kairui Song <kasong@tencent.com>

This series cleans up and slightly improves MGLRU's reclaim loop and dirty
writeback handling.  As a result, we can see an up to ~30% increase in
some workloads like MongoDB with YCSB and a huge decrease in file refault,
no swap involved.  Other common benchmarks have no regression, and LOC is
reduced, with less unexpected OOM, too.

Some of the problems were found in our production environment, and others
were mostly exposed while stress testing during the development of the
LSM/MM/BPF topic on improving MGLRU [1].  This series cleans up the code
base and fixes several performance issues, preparing for further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
related to each other.  The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.

This series slightly cleans up and improves on these issues by using a
scan budget, calculating the number of folios to scan at the beginning of
the loop, and by decoupling aging from the reclaim calculation helpers.
It then moves the dirty flush logic inside the reclaim loop so it can kick
in more effectively.  Because these issues are related, this series
handles them together and improves MGLRU reclaim in many ways.

Test results: All tests are done on a 48c96t NUMA machine with 2 nodes and
128G of memory, using NVME as storage.  Classical (non-MGLRU) LRU
numbers are included as "MGLRU disabled" for each benchmark below; see
[8] and [9] for the longer write-up.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read and
dirty writeback.  MongoDB is set up in a 10G cgroup using Docker, and the
WiredTiger cache size is set to 4.5G, using NVME as storage.  This is
close to the case we observed regressing in our production environment:
mixed read and writeback pressure, so it is a practical case for
evaluation.

Not using SWAP.  The intent is to isolate the file LRU writeback path.
Enabling SWAP would just add noise from anonymous reclaim.

MGLRU Before:
Throughput(ops/sec): 60653.502655
workingset_refault_file 12904916
pgpgin 165366622
pgpgout 5219588

MGLRU After:
Throughput(ops/sec): 82384.354760 (+35.8%, higher is better)
workingset_refault_file 7128285   (-44.7%, lower is better)
pgpgin 113170693                  (-31.5%, lower is better)
pgpgout 5639724

MGLRU Disabled:
Throughput(ops/sec): 93713.640901
workingset_refault_file 15013443
pgpgin 85365614
pgpgout 5866508

We can see a significant performance improvement after this series.  The
test is done on NVME and the performance gap would be even larger for slow
devices, such as HDD or network storage.  We observed over 100% gain for
some workloads with slow IO.

Note that classical LRU is still faster for this benchmark; MGLRU may
catch up later with further work [7].
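
For readers who want to re-derive the percentages in the MongoDB block,
here is a minimal sketch of the relative-change arithmetic, assuming each
percentage is computed against the "MGLRU Before" row.  The helper name
pct_change is hypothetical and not part of any test tooling used for this
series.

```python
def pct_change(before, after):
    """Relative change of `after` versus `before`, in percent."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100.0


# Throughput figures copied from the MongoDB results above.
mglru_before = 60653.502655
mglru_after = 82384.354760
mglru_disabled = 93713.640901

# Gain of this series over the unpatched MGLRU baseline.
print(f"after vs before:   {pct_change(mglru_before, mglru_after):+.1f}%")
# Remaining gap between patched MGLRU and classical LRU.
print(f"after vs disabled: {pct_change(mglru_disabled, mglru_after):+.1f}%")
```

The first line reproduces the +35.8% quoted above.  The second shows
patched MGLRU is still about 12% below classical LRU on this benchmark,
which is the gap the note above refers to.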

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
nodes and 128G memory, using 256G ZRAM as swap and spawning 32 memcgs and
64 workers.  Having many memcgs each apply roughly equal pressure
exercises the LRU's ability to detect/protect each tenant's working set
and to balance reclamation fairly between tenants, which makes this a
meaningful test for the reclaim mechanism.

Fairness is reported via Jain's fairness index (1.0 means all tenants get
exactly equal allocation, lower is worse). Under equal pressure, all
memcgs should make roughly equal forward progress.  See [8] for the
longer rationale and per-memcg breakdown.
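
As a reference for the metric, Jain's fairness index over n per-tenant
values x_i is (sum x_i)^2 / (n * sum x_i^2).  The sketch below is only an
illustration of that formula: the function name jain_fairness and the
sample inputs are hypothetical, and it is not the script that produced
the numbers in this letter.

```python
def jain_fairness(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all values are equal and approaches 1/n when a
    single tenant gets everything.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    if sum_sq == 0:
        # All tenants made zero progress: treat as (trivially) equal.
        return 1.0
    return (total * total) / (len(values) * sum_sq)


# Hypothetical per-worker request counts, for illustration only.
print(jain_fairness([1300, 1280, 1250, 1310]))
```

Because the index only looks at how evenly progress is spread, it should
be read together with the total request count: a run can be very fair
and still slow.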

MGLRU before:
Total requests:           81898
Per-worker mean:         1279.7
Per-worker 95% CI (mean):       [  1259.0,   1300.4]
Jain's fairness index: 0.995893  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     28392   34.67%   34.67%
      [1,2)s      8022    9.80%   44.46%
      [2,4)s      6130    7.48%   51.95%
      [4,8)s     39354   48.05%  100.00%

MGLRU after:
Total requests:           82901
Per-worker mean:         1295.3
Per-worker 95% CI (mean):       [  1265.3,   1325.4]
Jain's fairness index: 0.991607  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     28128   33.93%   33.93%
      [1,2)s      8756   10.56%   44.49%
      [2,4)s      7028    8.48%   52.97%
      [4,8)s     38989   47.03%  100.00%

MGLRU disabled:
Total requests:           62399
Per-worker mean:          975.0
Per-worker 95% CI (mean):       [   941.9,   1008.1]
Jain's fairness index: 0.982156  (1.0 = perfectly fair)
Latency:
      Bucket     Count      Pct    Cumul
      [0,1)s     20051   32.13%   32.13%
      [1,2)s      2255    3.61%   35.75%
      [2,4)s      6149    9.85%   45.60%
      [4,8)s     33927   54.37%   99.97%
     [8,16)s        17    0.03%  100.00%

Reclaim is still fair and effective, and the total number of requests
seems slightly better.

OOM issue with aging and throttling
===================================
For the throttling OOM issue, it can be easily reproduced using dd and
cgroup limit as demonstrated and fixed by a later patch in this series.

The aging OOM is a bit tricky.  A specific reproducer can be used to
simulate what we encountered in our production environment [4]: it spawns
multiple workers that keep reading the given file using mmap, pausing
for 120ms after each file read batch.  It also spawns another set of
workers that keep allocating and freeing a given size of anonymous memory.
The total memory size exceeds the memory limit (e.g. 14G anon + 8G file,
which is 22G vs a 16G memcg limit).

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with the following info after about 10-20 iterations:
    [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
    [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [   62.640823] Memory cgroup stats for /demo:
    [   62.641017] anon 10604879872
    [   62.641941] file 6574858240

  OOM occurs despite there being still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.
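
To make the access pattern of the aging OOM reproducer concrete, here is
a heavily simplified sketch.  It is not the reproducer from [4]: all
function names are hypothetical, it uses threads rather than separate
worker processes, and it does not set up the memcg limit.  It only shows
the two kinds of workers described above: mmap file readers that pause
for 120ms after each batch, and workers that allocate and free anonymous
memory.

```python
import mmap
import os
import threading
import time

PAGE = mmap.PAGESIZE


def read_file_batch(path):
    """Map the file read-only and touch one byte per page.

    Returns the number of pages touched.
    """
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m), PAGE):
                _ = m[off]  # fault the page in through the mapping
                touched += 1
    return touched


def anon_churn(nbytes):
    """Allocate private anonymous memory, dirty each page, then free it.

    Returns the number of pages dirtied.
    """
    touched = 0
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    with mmap.mmap(-1, nbytes, flags=flags) as m:
        for off in range(0, nbytes, PAGE):
            m[off] = 1
            touched += 1
    return touched


def file_worker(path, iterations, pause_s=0.120):
    """Read the whole file via mmap, then pause, as described above."""
    for _ in range(iterations):
        read_file_batch(path)
        time.sleep(pause_s)


def anon_worker(nbytes, iterations):
    """Keep allocating and freeing a fixed amount of anonymous memory."""
    for _ in range(iterations):
        anon_churn(nbytes)


def run(path, anon_bytes, iterations, n_file=2, n_anon=2):
    """Start both kinds of workers and wait for them to finish."""
    threads = [threading.Thread(target=file_worker, args=(path, iterations))
               for _ in range(n_file)]
    threads += [threading.Thread(target=anon_worker,
                                 args=(anon_bytes, iterations))
                for _ in range(n_anon)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

To approximate the scenario above, one would run this inside a memcg
whose limit is smaller than the combined file and anonymous footprint;
the worker counts and sizes here are placeholders, not the values used
in the real test.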

Worth noting there is another OOM related issue reported in V1 of this
series, which is tested and looking OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

A 24G InnoDB buffer pool inside a 2G memcg with ZRAM as swap forces
aggressive eviction of cached database anon pages, which exercises the
LRU's hot page detection and the eviction path under swap pressure.  The
workload is practical, and the pressure is higher than what we usually
see in production but it is intended to expose the extreme case.

MGLRU before:   17313.688333 tps
MGLRU after:    17286.195000 tps
MGLRU disabled: 16245.330000 tps

Only noise-level changes between before and after, no regression.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a 64G EXT4
ramdisk, each test file is 3G, in a 10G memcg, 6 test runs each:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
  --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
  --rw=randread --norandommap --time_based \
  --ramp_time=1m --runtime=5m --group_reporting

Random buffered mmap read on a ramdisk strips out storage variance and
stresses purely the LRU's ability to evict and recycle the page cache
under heavy random read pressure.

MGLRU before:      9033.91 MB/s
MGLRU after:       9065.72 MB/s
MGLRU disabled:    8254.54 MB/s

Again only noise-level changes between before and after: no regression,
or slightly better.

Build kernel:
=============
Build kernel test using ZRAM as swap, kernel source on tmpfs, in a memcg
with memory.max=3G, using make -j96 and defconfig, measuring system time,
6 test runs each.  Building the kernel is a classical mixed anon + file
workload (lots of small file reads/writes plus parallel anon allocations
from cc/ld) and is representative of many real compilation jobs.

MGLRU before:     2823.13s
MGLRU after:      2801.26s
MGLRU disabled:   5023.50s

Again only noise-level changes between before and after: no regression,
or very slightly better.

Android:
========
Xinyu reported a performance gain on Android, too, with this series.  The
test consisted of cold-starting multiple applications sequentially under
moderate system load [6]; this is a real Android user-visible scenario,
dominated by the LRU's ability to keep the right working set resident
and re-fault launch-critical pages quickly.

Before:
Launch Time Summary (all apps, all runs)
  Mean 868.0ms
  P50 888.0ms
  P90 1274.2ms
  P95 1399.0ms

After:
Launch Time Summary (all apps, all runs)
  Mean 850.5ms (-2.07%)
  P50 861.5ms  (-3.04%)
  P90 1179.0ms (-8.05%)
  P95 1228.0ms (-12.2%)

Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-0-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-1-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/ [7]
Link: https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/ [8]
Link: https://lore.kernel.org/linux-mm/CAMgjq7D+4QmiWe73OPFuH0s+ZKCUJoo+MfcWOdJcV+VO-T2Wmg@mail.gmail.com/ [9]
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Andrew Morton 2 weeks ago
On Wed, 27 May 2026 13:36:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> On Tue, May 26, 2026 at 06:35:06PM +0800, Andrew Morton wrote:
> > On Mon, 11 May 2026 22:56:21 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > 
> > > No worries, we have couple of weeks before the next merge window, so no urgency.
> > 
> > Well, no, not really.  Some schmuck wants to get our
> > stable-non-rebasing branch into upstreamable shape well before the next
> > merge window.
> > 
> > This series was issued a month ago!
> > 
> > Sorry to crack the whip, but let's please all be aware or our
> > upstreaming timing.
> > 
> > > I will go through the series in depth, hopefully there will not be a need for
> > > next version and in that case, please just resend the cover letter with the
> > > information you provided below and don't worry about the length of the cover
> > > letter.
> > 
> > That's a plan.
> > 
> > Happily, MGRLU changes are well-isolated so I was able to trivially
> > move this series to the tail of mm-unstable.
> > 
> > It isn't a problem at all for me to defer this until the next cycle -
> > please let me know.
> > 
> > I'd like to know this as early as possible so I can hide the series
> > until after -rc1.  We shouldn't have "not for next merge window"
> > material in there possibly invalidating our ongoing testing.
> 
> Hi Andrew,
> 
> From my side I didn't see any major reason to block this. There has
> been plenty of review and the series has been tested many times. I
> also re-ran with more rounds, with CLRU baseline, and the numbers
> are very close to what was already posted.

OK...

> But anyway, looking forward to Shakeel's review, and here is an
> updated cover letter folding in the additions. There is not
> much change: refreshed numbers, an "MGLRU disabled" (classical LRU)
> row added to each benchmark, a short note on why each benchmark is
> used, and three extra Link: tags. No code changes:

Updated, thanks.
Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
Posted by Shakeel Butt 2 weeks, 1 day ago
On Tue, May 26, 2026 at 06:35:06PM -0700, Andrew Morton wrote:
> On Mon, 11 May 2026 22:56:21 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > > > Please include traditional LRU results for all of the following experiments as
> > > > well (where it makes sense).
> > > 
> > > Sure, I've spawn a few test instances, was busy travelling last week.
> > > That specific test machine is occupied so it might take a while.
> > > 
> > > A systematic test run takes roughly one or two days to complete for
> > > one kernel version or config, e.g. the JS test takes at least 2 hours
> > > to finish. Comparing versions/setups takes more time.
> > > 
> > 
> > No worries, we have couple of weeks before the next merge window, so no urgency.
> 
> Well, no, not really.  Some schmuck wants to get our
> stable-non-rebasing branch into upstreamable shape well before the next
> merge window.
> 
> This series was issued a month ago!
> 
> Sorry to crack the whip, but let's please all be aware or our
> upstreaming timing.

Thanks for the reminder.

> 
> > I will go through the series in depth, hopefully there will not be a need for
> > next version and in that case, please just resend the cover letter with the
> > information you provided below and don't worry about the length of the cover
> > letter.
> 
> That's a plan.
> 
> Happily, MGRLU changes are well-isolated so I was able to trivially
> move this series to the tail of mm-unstable.
> 
> It isn't a problem at all for me to defer this until the next cycle -
> please let me know.

I am on it and will aim to be done by weekend.

> 
> I'd like to know this as early as possible so I can hide the series
> until after -rc1.  We shouldn't have "not for next merge window"
> material in there possibly invalidating our ongoing testing.

Ack.