[v5] mm/mglru: improve reclaim loop and dirty folio handling

[PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by Kairui Song via B4 Relay 2 months, 1 week ago

This series is based on mm-unstable, also applies to mm-new.

This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, throughput improves by up to ~30%
in some workloads, such as MongoDB with YCSB, and file refaults drop
sharply, with no swap involved. Other common benchmarks show no
regression, LOC is reduced, and unexpected OOMs are less frequent.

Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.

This series cleans up and improves these areas by using a scan budget:
the number of folios to scan is calculated at the beginning of the
loop, and aging is decoupled from the reclaim calculation helpers. The
dirty flush logic is then moved inside the reclaim loop so it can kick
in more effectively. Because these issues are related, this series
handles them together and improves MGLRU reclaim in several ways.
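
The scan-budget idea described above can be sketched as a small
userspace model. This is only an illustration under stated assumptions,
not the code from the series: struct model_lru, MODEL_BATCH_MAX,
model_scan_budget() and model_reclaim() are hypothetical names, the
"evictable size shifted by priority" budget is an assumed policy, and
the model pretends that half of each scanned batch is reclaimable.

```c
/* Hypothetical per-batch cap; the real limit used by the series may differ. */
#define MODEL_BATCH_MAX 64UL

/* Toy LRU that only tracks how many folios are evictable. */
struct model_lru {
	unsigned long evictable;
};

/* The budget is computed once, before the loop starts (assumed policy). */
static unsigned long model_scan_budget(const struct model_lru *lru, int priority)
{
	return lru->evictable >> priority;
}

/*
 * Scan in batches until the budget is spent or the target is met.
 * Returns the number of folios "reclaimed" and reports the number
 * scanned through *nr_scanned.
 */
static unsigned long model_reclaim(struct model_lru *lru, int priority,
				   unsigned long nr_to_reclaim,
				   unsigned long *nr_scanned)
{
	unsigned long budget = model_scan_budget(lru, priority);
	unsigned long reclaimed = 0;

	*nr_scanned = 0;
	while (budget && reclaimed < nr_to_reclaim) {
		unsigned long batch = budget < MODEL_BATCH_MAX ?
				      budget : MODEL_BATCH_MAX;
		/* Model assumption: every second scanned folio is freed. */
		unsigned long freed = batch / 2;

		budget -= batch;
		*nr_scanned += batch;
		reclaimed += freed;
		lru->evictable -= freed;
	}

	return reclaimed;
}
```

In this model the loop always terminates, because the budget is fixed up
front and shrinks on every pass, no matter how few folios a batch frees.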

Test results: All tests were done on a 48c96t NUMA machine with 2 nodes
and 128G of memory, using NVMe as storage.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
the WiredTiger cache size is set to 4.5G, using NVME as storage.

Not using SWAP.

Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071

After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923                       (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366       (-43.3%, lower is better)
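
The percentages quoted above are plain relative changes against the
"Before" value. A minimal sketch of that arithmetic, assuming nothing
beyond the numbers listed here (pct_change() is a hypothetical helper
name, not part of any test tool):

```c
/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, pct_change(62485.02962831822, 79760.71784646061) gives the
throughput gain, and pct_change(34522071, 19566366) gives the drop in
workingset_refault_file.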

We can see a significant performance improvement after this series.
The test was done on NVMe, and the performance gap should be even larger
on slow devices, such as HDD or network storage. We observed gains of
over 100% for some workloads with slow IO.

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with 2
nodes and 128G memory, using 256G ZRAM as swap and spawning 32 memcgs and
64 workers:

Before:
Total requests:            79915
Per-worker 95% CI (mean):  [1233.9, 1263.5]
Per-worker stdev:          59.2
Jain's fairness:           0.997795 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26859   33.61%   33.61%
[1,2)s      7818    9.78%   43.39%
[2,4)s      5532    6.92%   50.31%
[4,8)s     39706   49.69%  100.00%

After:
Total requests:            81382
Per-worker 95% CI (mean):  [1241.9, 1301.3]
Per-worker stdev:          118.8
Jain's fairness:           0.991480 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26696   32.80%   32.80%
[1,2)s      8745   10.75%   43.55%
[2,4)s      6865    8.44%   51.98%
[4,8)s     39076   48.02%  100.00%

Reclaim is still fair and effective, and the total number of requests
is slightly higher.
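
For reference, Jain's fairness index over n per-worker values x_i is
(sum of x_i)^2 / (n * sum of x_i^2), which is 1.0 when all workers get
the same share and 1/n when a single worker gets everything. A minimal
sketch of the formula follows; jain_fairness() is a hypothetical helper,
and how the test script itself computes the value is not shown in this
thread.

```c
#include <stddef.h>

/*
 * Jain's fairness index: (sum x)^2 / (n * sum x^2).
 * Returns 0.0 for an empty or all-zero input, where the index
 * is undefined.
 */
static double jain_fairness(const double *x, size_t n)
{
	double sum = 0.0, sum_sq = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		sum += x[i];
		sum_sq += x[i] * x[i];
	}

	if (n == 0 || sum_sq == 0.0)
		return 0.0;

	return (sum * sum) / ((double)n * sum_sq);
}
```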

OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a
cgroup limit, as demonstrated in patch 14, and is fixed by this series.

The aging OOM is a bit tricky. A specific reproducer can be used to
simulate what we encountered in our production environment [4]:
it spawns multiple workers that keep reading a given file using mmap
and pause for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given size of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, which is 22G vs. a 16G memcg limit).

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with the following info after about 10-20 iterations:
    [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
    [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [   62.640823] Memory cgroup stats for /demo:
    [   62.641017] anon 10604879872
    [   62.641941] file 6574858240

  OOM occurs even though there are still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

It is worth noting that another OOM-related issue was reported against
V1 of this series; it has been tested and looks OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap, with the following test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

Before:            17260.781429 tps
After this series: 17266.842857 tps

MySQL is anon-folio heavy and also involves writeback and file folios,
and it still looks good. The change is within noise level; no regression.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
6 test runs each:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
  --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
  --rw=randread --norandommap --time_based \
  --ramp_time=1m --runtime=5m --group_reporting

Before:            9196.481429 MB/s
After this series: 9256.105000 MB/s

This also looks like a noise-level change: no regression, or slightly better.

Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg
using make -j96 and defconfig, measuring system time, 12 test runs each.

Before:            2589.63s
After this series: 2543.58s

Again a noise-level change: no regression, or very slightly better.

Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v5:
- Add back a more moderate minimal batch limit for each reclaim loop:
  https://lore.kernel.org/linux-mm/adYP81AhpNf0znp3@KASONG-MC4/
- Collect review-by.
- Link to v4: https://patch.msgid.link/20260407-mglru-reclaim-v4-0-98cf3dc69519@tencent.com

Changes in v4:
- Remove the minimal scan batch limit, and add rotate for
  unevictable memcg as reported by sashiko:
  https://lore.kernel.org/linux-mm/ac8xVN82LBLDZpIO@KASONG-MC4/
- Slightly improve a few commit messages.
- Reran the tests; results look identical to before, so the data is unchanged.
- Collect review-by.
- Link to v3: https://patch.msgid.link/20260403-mglru-reclaim-v3-0-a285efd6ff91@tencent.com

Changes in v3:
- Don't force scan at least SWAP_CLUSTER_MAX pages for each reclaim
  loop. If the LRU is too small, adjust it accordingly. Now the
  multi-cgroup scan balance looked even better for tiny cgroups:
  https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap constraint check in isolate_folio. In
  theory, it's fine, and both stress test and performance test didn't
  show any issue:
  https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests, all seem identical, so most data is kept.
  Intermediate test results are dropped. I ran tests on most patches
  individually, and there is no problem, but the series is getting too
  long, and posting them makes it harder to read and unnecessary.
- Split the previous patch 8 into two patches as suggested [ Shakeel Butt ];
  some test results were also collected to support the design:
  https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
  I kept Axel's review-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid stale empty gen and drop
  its return argument [ Baolin Wang ]
- Move a few lines of code between patches to where they fit better;
  the final result is identical [ Baolin Wang ].
- Collect tested by and update test setup [ Leno Hou ]
- Collect review by.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com

Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
  [ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
  review suggested that we shouldn't leave the counter and reclaim
  feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX limit in patch
  "restructure the reclaim loop", the change is trivial but might
  help avoid livelock for tiny cgroups.
- Redo the tests. Most results are basically identical to before, but
  they were rerun just in case, since the series now also solves the
  throttling issue, which was discussed along with reports from CachyOS.
- Add a separate patch for variable renaming as suggested by [ Barry
  Song ]. No feature change.
- Improve several comments and fix code issues [ Axel Rasmussen ].
- Remove no longer needed variable [ Axel Rasmussen ].
- Collect review by.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com

---
Kairui Song (14):
      mm/mglru: consolidate common code for retrieving evictable size
      mm/mglru: rename variables related to aging and rotation
      mm/mglru: relocate the LRU scan batch limit to callers
      mm/mglru: restructure the reclaim loop
      mm/mglru: scan and count the exact number of folios
      mm/mglru: use a smaller batch for reclaim
      mm/mglru: don't abort scan immediately right after aging
      mm/mglru: remove redundant swap constrained check upon isolation
      mm/mglru: use the common routine for dirty/writeback reactivation
      mm/mglru: simplify and improve dirty writeback handling
      mm/mglru: remove no longer used reclaim argument for folio protection
      mm/vmscan: remove sc->file_taken
      mm/vmscan: remove sc->unqueued_dirty
      mm/vmscan: unify writeback reclaim statistic and throttling

 mm/vmscan.c | 330 ++++++++++++++++++++++++++----------------------------------
 1 file changed, 142 insertions(+), 188 deletions(-)
---
base-commit: 196ab4af58d724f24335fed3da62920c3cea945f
change-id: 20260314-mglru-reclaim-1c9d45ac57f6

Best regards,
--  
Kairui Song <kasong@tencent.com>

Re: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by wangxinyu19 2 months ago

On Mon, 13 Apr 2026 00:48:14 +0800, Kairui Song wrote:
> This series is based on mm-unstable, also applies to mm-new.
> 
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
> 
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
> 
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.

> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
> 
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.

Hi Kairui,

We have tested this patch series on an Android device under a typical scenario.

The test consisted of cold-starting multiple applications sequentially
under moderate system load (some services running in the background,
such as map navigation and an AI voice assistant). Each test round
cold-starts a fixed set of apps one by one and records the cold start
latency. A total of 100 rounds were conducted to ensure statistical
significance.

Before:
  /proc/vmstat info:
    pgpgin 269,224
    pgpgout 226,078
    workingset_refault_anon 237
    workingset_refault_file 27689

  Launch Time Summary (all apps, all runs)
    Mean 868.0ms
    P50 888.0ms
    P90 1274.2ms
    P95 1399.0ms

After:
  /proc/vmstat info:
    pgpgin 223,801                (-16.9%)
    pgpgout 308,873
    workingset_refault_anon 498
    workingset_refault_file 17075 (-38.3%)

  Launch Time Summary (all apps, all runs)
    Mean 850.5ms (-2.07%)
    P50 861.5ms  (-3.04%)
    P90 1179.0ms (-8.05%)
    P95 1228.0ms (-12.2%)
    
--
Best regards,
Xinyu

RE: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by wangzicheng 2 months ago

> Hi Kairui,
> 
> We have tested this patch series on Android device under a typical scenario.
> 
> The test consisted of cold-starting multiple applications sequentially
> under moderate system load (some services running on the background,
> such as map navigating, AI voice-assistant). Each test round cold-starts
> a fixed set of apps one by one and records the cold start latency.
> A total of 100 rounds were conducted to ensure statistical significance.
> 

Hi Xinyu and Kairui,

We have tested the patch under a **heavy** load benchmark for the camera.

> Before:
>   /proc/vmstat info:
>     pgpgin 269,224
>     pgpgout 226,078
>     workingset_refault_anon 237
>     workingset_refault_file 27689
> 
>   Launch Time Summary (all apps, all runs)
>     Mean 868.0ms
>     P50 888.0ms
>     P90 1274.2ms
>     P95 1399.0ms
> 
> After:
>   /proc/vmstat info:
>     pgpgin 223,801                (-16.9%)
>     pgpgout 308,873
>     workingset_refault_anon 498
>     workingset_refault_file 17075 (-38.3%)
> 
>   Launch Time Summary (all apps, all runs)
>     Mean 850.5ms (-2.07%)
>     P50 861.5ms  (-3.04%)
>     P90 1179.0ms (-8.05%)
>     P95 1228.0ms (-12.2%)
> 
> --
> Best regards,
> Xinyu
> 

We evaluated the backported patches on android16-6.12 using a **heavy**
mobile workload on a Qualcomm 8850 device (16GB RAM + 16GB zram).
(vmscan code in this tree is largely similar to v6.18)

The workload simulates real user behavior by sequentially
cold-starting 23 apps. For each application we perform the related
operations (short‑video swiping, background music playback, and
navigation). After one application exits, the next is launched
within 1s. After all apps complete, the camera is launched
and a photo is taken.

Baseline and patched kernels were tested under identical conditions.
(with a fan kept cooling the testbed)
Full system traces were collected for three runs in each
configuration, and ten additional traces were recorded for the final
camera launch stage.

Overall application keepalive behavior shows no noticeable
difference. However, we observed performance deviations in some
memory‑pressure scenarios.

Before:
Meminfo (100 ms per sample, average result)
MemAvailable: 5420
MemFree: 1421
Cached: 3862
AnonPages: 3804
Dirty: 62
vmstat counters (last sample)
pgpgin: 3,701,869
pgpgout: 3,545,058
workingset_refault_anon: 390,967
workingset_refault_file: 79,927
Total app launch time (23 apps + launcher × 23): 7702 ms
Camera launch time: 684 ms

After:
Meminfo (100 ms per sample, average result)
MemAvailable: 5058 (-7%)
MemFree: 1382 (-3%)
Cached: 3213 (-17%)
AnonPages: 3637 (-4%)
Dirty: 35 (-44%)
vmstat counters (last sample)
pgpgin: 5,752,429 (+55%)
pgpgout: 3,668,788 (+3%)
workingset_refault_anon: 1,492,964 (+282%)
workingset_refault_file: 590,505 (+639%)
Total app launch time (23 apps + launcher × 23): 8872 ms (+15%)
Among the tested apps, 11 improved while 14 regressed.
Camera launch time: 980 ms (+43%), which is also the stage with the
highest memory pressure.

From whole-trace analysis, direct reclaim appears to run slower.
Before vs. after:
total duration: 11659 ms / 57006 ms
total reclaimed: 3953 MB / 6344 MB
speed: 0.339 MB/ms / 0.111 MB/ms
times: 16117 / 27562
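
The "speed" figures above are total reclaimed memory divided by total
direct reclaim duration. A minimal sketch of that arithmetic follows,
assuming "times" is the number of direct reclaim events (an
interpretation, not something the trace summary states); struct
reclaim_sample and both helpers are hypothetical names.

```c
/* One before/after column of the direct reclaim summary. */
struct reclaim_sample {
	double reclaimed_mb;	/* "total reclaimed" */
	double duration_ms;	/* "total duration" */
	unsigned long events;	/* "times", assumed to be an event count */
};

/* Reclaim speed in MB per millisecond. */
static double reclaim_speed(const struct reclaim_sample *s)
{
	return s->reclaimed_mb / s->duration_ms;
}

/* Average time spent per direct reclaim event, in milliseconds. */
static double reclaim_ms_per_event(const struct reclaim_sample *s)
{
	return s->duration_ms / (double)s->events;
}
```

With the numbers above, total duration grows by roughly 4.9x while the
reclaim speed drops by roughly 3x, because more memory is reclaimed in
total after the change.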

The performance might behave differently on devices with smaller memory
(e.g. 8–16GB) compared to servers with 100+GB memory, or under
moderate to heavy memory pressure.
Could this be related to patch 09/14[1] which removes folio_inc_gen()
when ` writeback || (type == LRU_GEN_FILE && dirty)`?

Any comments or suggestions would be appreciated.

[1] https://lore.kernel.org/linux-mm/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com/T/#m568eba84d35d8d5ff519d3e29237de6d64f67659

Best,
Zicheng

Re: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by Barry Song 2 months ago

[...]
>
> The performance might behave differently on devices with smaller memory
> (e.g. 8–16GB) compared to servers with 100+GB memory, or under
> moderate to heavy memory pressure.
> Could this be related to patch 09/14[1] which removes folio_inc_gen()
> when ` writeback || (type == LRU_GEN_FILE && dirty)`?

Very unlikely. I think this should be a positive change. Placing dirty
pages in the youngest generation helps avoid unnecessary scanning, and
they can still be rotated to the oldest generation once writeback is
complete.

>
> Any comments or suggestions would be appreciated.
>
> [1] https://lore.kernel.org/linux-mm/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com/T/#m568eba84d35d8d5ff519d3e29237de6d64f67659
>

Thanks
Barry

Re: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by Kairui Song 2 months ago

On Sat, Apr 18, 2026 at 3:38 PM wangzicheng <wangzicheng@honor.com> wrote:
>
> [...]
>
> We evaluated the backported patches on android16-6.12 using a **heavy**

Hi Zicheng

I'm not sure how you did that; this series applies to mm-unstable, and
there is a large gap between that and 6.12.

> mobile workload on a Qualcomm 8850 device (16GB RAM + 16GB zram).
> (vmscan code in this tree is largely similar to v6.18)
>
> The workload simulates real user behavior by sequentially
> cold-starting 23 apps. For each application we perform the related
> operations (short‑video swiping, background music playback, and
> navigation). After exiting one application the next is launched
> immediately in 1s. After all apps complete, the camera is launched
> and a photo is taken.
>
> Baseline and patched kernels were tested under identical conditions.
> (with a fan kept cooling the testbed)
> Full system traces were collected for three runs in each
> configuration, and ten additional traces were recorded for the final
> camera launch stage.
>
> Overall application keepalive behavior shows no noticeable
> difference. However, we observed performance deviations in some
> memory‑pressure scenarios.
>
> Before:
> Meminfo (100 ms per sample, average result)
> MemAvailable: 5420
> MemFree: 1421
> Cached: 3862
> AnonPages: 3804
> Dirty: 62
> vmstat counters (last sample)
> pgpgin: 3,701,869
> pgpgout: 3,545,058
> workingset_refault_anon: 390,967
> workingset_refault_file: 79,927
> Total app launch time (23 apps + launcher × 23): 7702 ms
> Camera launch time: 684 ms
>
> After:
> Meminfo (100 ms per sample, average result)
> MemAvailable: 5058 (-7%)
> MemFree: 1382 (-3%)
> Cached: 3213 (-17%)
> AnonPages: 3637 (-4%)
> Dirty: 35 (-44%)
> vmstat counters (last sample)
> pgpgin: 5,752,429 (+55%)
> pgpgout: 3,668,788 (+3%)
> workingset_refault_anon: 1,492,964 (+282%)
> workingset_refault_file: 590,505 (+639%)
> Total app launch time (23 apps + launcher × 23): 8872 ms (+15%)
> Among the tested apps, 11 improved while 14 regressed.
> Camera launch time: 980 ms (+43%), which is also the stage with the
> highest memory pressure.
>
> From whole trace analysis, direct reclaim appears to run slower.
> Before v.s. after
> total duration: 11659 ms / 57006 ms

Being 5 times slower seems really horrible, but I'm not sure what is
causing it, as there seem to be very few dirty folios in your test
case. I know there are some vendor hooks for Android. Since MGLRU now
uses the common routine, could these hooks also be affecting MGLRU
without the modules being aware of it, causing the strange behavior?

> total reclaimed: 3953 MB / 6344 MB
> speed: 0.339 MB/ms / 0.111 MB/ms
> times: 16117 / 27562
>
> The performance might behave differently on devices with smaller memory
> (e.g. 8–16GB) compared to servers with 100+GB memory, or under
> moderate to heavy memory pressure.
> Could this be related to patch 09/14[1] which removes folio_inc_gen()
> when ` writeback || (type == LRU_GEN_FILE && dirty)`?
>
> Any comments or suggestions would be appreciated.

Can you share the code you actually tested, or maybe test it on
mm-unstable vs. mm-unstable + this series? Or how can we reproduce that?
Or maybe share a full log or dump of the lru_gen info and vmstat?

>
> [1] https://lore.kernel.org/linux-mm/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com/T/#m568eba84d35d8d5ff519d3e29237de6d64f67659
>
> Best,
> Zicheng
>

RE: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by wangzicheng 2 months ago

> On Sat, Apr 18, 2026 at 3:38 PM wangzicheng <wangzicheng@honor.com>
> wrote:
> >
> > [...]
> >
> > We evaluated the backported patches on android16-6.12 using a
> **heavy**
> 
> Hi Zicheng
> 
> I'm not sure how you did that, this series applies on mm-unstable and
> there is a large gap between that and 6.12.
> 
> > mobile workload on a Qualcomm 8850 device (16GB RAM + 16GB zram).
> > (vmscan code in this tree is largely similar to v6.18)
> >
Thanks for pointing that out.

There is indeed a relatively large gap between mm-unstable and our
android16-6.12 tree. The series was backported manually and we only
applied the changes required to make it build and run in our tree.

Because of this, it is possible that some related changes from
mm-unstable were not included, which may have affected the behavior or
performance we observed. If this caused misleading results, we
apologize for the confusion.

Regarding vendor hooks, in our tree there is only one hook in
get_nr_to_scan(). We tested with that hook disabled.

The performance data was collected using Perfetto traces.
Unfortunately those traces contain a large amount of runtime
information and are not easy to share externally.

If needed, we can also try to reproduce the test on a tree closer to
mm-unstable once our chipset platform kernel tree gets updated to
a newer version, to see whether the behavior still reproduces.

Below is the patch we manually applied during the backport.




diff --git a/mm/vmscan.c b/mm/vmscan.c
index f78cfe059f14..50109cd5e94c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1987,6 +1987,44 @@ static int current_may_throttle(void)
 	return !(current->flags & PF_LOCAL_THROTTLE);
 }
 
+static void handle_reclaim_writeback(unsigned long nr_taken,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc,
+				     struct reclaim_stat *stat)
+{
+	/*
+	 * If dirty folios are scanned that are not queued for IO, it
+	 * implies that flushers are not doing their job. This can
+	 * happen when memory pressure pushes dirty folios to the end of
+	 * the LRU before the dirty limits are breached and the dirty
+	 * data has expired. It can also happen when the proportion of
+	 * dirty folios grows not through writes but through memory
+	 * pressure reclaiming all the clean cache. And in some cases,
+	 * the flushers simply cannot keep up with the allocation
+	 * rate. Nudge the flusher threads in case they are asleep.
+	 */
+	if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 *
+		 * Flusher may not be able to issue writeback quickly
+		 * enough for cgroupv1 writeback throttling to work
+		 * on a large system.
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
+	sc->nr.dirty += stat->nr_dirty;
+	sc->nr.congested += stat->nr_congested;
+	sc->nr.writeback += stat->nr_writeback;
+	sc->nr.immediate += stat->nr_immediate;
+	sc->nr.taken += nr_taken;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
@@ -2054,41 +2092,15 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 
 	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
 
-	/*
-	 * If dirty folios are scanned that are not queued for IO, it
-	 * implies that flushers are not doing their job. This can
-	 * happen when memory pressure pushes dirty folios to the end of
-	 * the LRU before the dirty limits are breached and the dirty
-	 * data has expired. It can also happen when the proportion of
-	 * dirty folios grows not through writes but through memory
-	 * pressure reclaiming all the clean cache. And in some cases,
-	 * the flushers simply cannot keep up with the allocation
-	 * rate. Nudge the flusher threads in case they are asleep.
-	 */
-	if (stat.nr_unqueued_dirty == nr_taken) {
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 *
-		 * Flusher may not be able to issue writeback quickly
-		 * enough for cgroupv1 writeback throttling to work
-		 * on a large system.
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
+	
+	// sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+	// leave nr_unqueued_dirty in scan_control to keep integrity
 
-	sc->nr.dirty += stat.nr_dirty;
-	sc->nr.congested += stat.nr_congested;
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
-	sc->nr.writeback += stat.nr_writeback;
-	sc->nr.immediate += stat.nr_immediate;
-	sc->nr.taken += nr_taken;
-	if (file)
-		sc->nr.file_taken += nr_taken;
+	// if (file)
+	// 	sc->nr.file_taken += nr_taken;
+	// leave nr_taken in scan_control to keep integrity
 
+	handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
 	return nr_reclaimed;
@@ -3291,7 +3303,7 @@ static int folio_update_gen(struct folio *folio, int gen)
 }
 
 /* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio)
 {
 	int type = folio_is_file_lru(folio);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -3310,9 +3322,6 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 
 		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
 		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
-		/* for folio_end_writeback() */
-		if (reclaiming)
-			new_flags |= BIT(PG_reclaim);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 
 	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
@@ -3918,7 +3927,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
 
-			new_gen = folio_inc_gen(lruvec, folio, false);
+			new_gen = folio_inc_gen(lruvec, folio);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
 			/* don't count the workingset being lazily promoted */
@@ -3941,10 +3950,10 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 	return true;
 }
 
-static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
+static void try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	bool success = false;
+	bool seq_inc_flag = false;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
@@ -3961,11 +3970,19 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 			}
 
 			min_seq[type]++;
+			seq_inc_flag = true;
 		}
 next:
 		;
 	}
 
+	/*
+	 * If min_seq[type] of both anonymous and file is not increased,
+	 * return here to avoid unnecessary checking overhead later.
+ 	 */
+	if (!seq_inc_flag)
+		return;
+
 	/* see the comment on lru_gen_folio */
 	if (swappiness && swappiness <= MAX_SWAPPINESS) {
 		unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
@@ -3982,10 +3999,8 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 
 		reset_ctrl_pos(lruvec, type, true);
 		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
-		success = true;
 	}
 
-	return success;
 }
 
 static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
@@ -4137,27 +4152,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
 	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
 }
 
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	unsigned long total = 0;
-	int swappiness = get_swappiness(lruvec, sc);
+	unsigned long seq, total = 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
 	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
 			gen = lru_gen_from_seq(seq);
-
 			for (zone = 0; zone < MAX_NR_ZONES; zone++)
 				total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
 		}
 	}
 
+	return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long total;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	total = lruvec_evictable_size(lruvec, swappiness);
+
 	/* whether the size is big enough to be helpful */
 	return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
 }
@@ -4475,7 +4496,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		       int tier_idx)
 {
 	bool success;
-	bool dirty, writeback;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -4505,7 +4525,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* protected */
 	if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(lruvec, folio);
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 
 		/* don't count the workingset being lazily promoted */
@@ -4520,26 +4540,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* ineligible */
 	if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(lruvec, folio);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
 	}
 
-	dirty = folio_test_dirty(folio);
-	writeback = folio_test_writeback(folio);
-	if (type == LRU_GEN_FILE && dirty) {
-		sc->nr.file_taken += delta;
-		if (!writeback)
-			sc->nr.unqueued_dirty += delta;
-	}
-
-	/* waiting for writeback */
-	if (writeback || (type == LRU_GEN_FILE && dirty)) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
-		return true;
-	}
-
 	return false;
 }
 
@@ -4547,12 +4552,6 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
 {
 	bool success;
 
-	/* swap constrained */
-	if (!(sc->gfp_mask & __GFP_IO) &&
-	    (folio_test_dirty(folio) ||
-	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
-		return false;
-
 	/* raced with release_pages() */
 	if (!folio_try_get(folio))
 		return false;
@@ -4567,8 +4566,6 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
 	if (!folio_test_referenced(folio))
 		set_mask_bits(&folio->flags, LRU_REFS_MASK, 0);
 
-	/* for shrink_folio_list() */
-	folio_clear_reclaim(folio);
 
 	success = lru_gen_del_folio(lruvec, folio, true);
 	VM_WARN_ON_ONCE_FOLIO(!success, folio);
@@ -4577,8 +4574,9 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
 }
 EXPORT_SYMBOL_GPL(isolate_folio);
 
-static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
-		       int type, int tier, struct list_head *list)
+static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc,
+		       int type, int tier, 
+			   struct list_head *list, int *isolatedp)
 {
 	int i;
 	int gen;
@@ -4587,10 +4585,11 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 	int scanned = 0;
 	int isolated = 0;
 	int skipped = 0;
-	int remaining = MAX_LRU_BATCH;
+	unsigned long remaining = nr_to_scan;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
+	VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
 	VM_WARN_ON_ONCE(!list_empty(list));
 
 	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4647,16 +4646,12 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 	__count_memcg_events(memcg, item, isolated);
 	__count_memcg_events(memcg, PGREFILL, sorted);
 	__count_vm_events(PGSCAN_ANON + type, isolated);
-	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-	if (type == LRU_GEN_FILE)
-		sc->nr.file_taken += isolated;
-	/*
-	 * There might not be eligible folios due to reclaim_idx. Check the
-	 * remaining to prevent livelock if it's not making progress.
-	 */
-	return isolated || !remaining ? scanned : 0;
+
+	*isolatedp = isolated;
+	return scanned;
 }
 
 static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4698,33 +4693,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 	return positive_ctrl_err(&sp, &pv);
 }
 
-static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
-			  int *type_scanned, struct list_head *list)
+static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, int swappiness,
+			   struct list_head *list, int *isolated,
+			  int *isolate_type, int *isolate_scanned)
 {
 	int i;
+	int scanned = 0;
 	int type = get_type_to_scan(lruvec, swappiness);
 
 	for_each_evictable_type(i, swappiness) {
-		int scanned;
+		int type_scan;
 		int tier = get_tier_idx(lruvec, type);
 
-		*type_scanned = type;
+		type_scan = scan_folios(nr_to_scan, lruvec, sc,
+					type, tier, list, isolated);
 
-		scanned = scan_folios(lruvec, sc, type, tier, list);
-		if (scanned)
-			return scanned;
+		scanned += type_scan;
+		if (*isolated) {
+			*isolate_type = type;
+			*isolate_scanned = type_scan;
+			break;
+		}
 
 		type = !type;
 	}
 
-	return 0;
+	return scanned;
 }
 
-static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 {
-	int type;
-	int scanned;
-	int reclaimed;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
 	struct folio *folio;
@@ -4732,19 +4730,23 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	enum vm_event_item item;
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
+	int scanned, reclaimed;
+	int isolated = 0, type, type_scanned;
 	bool skip_retry = false;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	spin_lock_irq(&lruvec->lru_lock);
 
-	scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
+	/* In case folio deletion left empty old gens, flush them */
+	try_to_inc_min_seq(lruvec, swappiness);
 
-	scanned += try_to_inc_min_seq(lruvec, swappiness);
+	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
+				 &list, &isolated, &type, &type_scanned);
 
-	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
-		scanned = 0;
+	/* Isolation might create empty gen, flush them */
+	if (scanned)
+		try_to_inc_min_seq(lruvec, swappiness);
 
 	spin_unlock_irq(&lruvec->lru_lock);
 
@@ -4752,10 +4754,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 		return scanned;
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
+	handle_reclaim_writeback(isolated, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			scanned, reclaimed, &stat, sc->priority,
+			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
@@ -4804,6 +4806,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	if (!list_empty(&list)) {
 		skip_retry = true;
+		isolated = 0;
 		goto retry;
 	}
 
@@ -4813,28 +4816,14 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 			     int swappiness, unsigned long *nr_to_scan)
 {
-	int gen, type, zone;
-	unsigned long size = 0;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
-	*nr_to_scan = 0;
 	/* have to run aging, since eviction is not possible anymore */
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
-		for (seq = min_seq[type]; seq <= max_seq; seq++) {
-			gen = lru_gen_from_seq(seq);
+	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 
-			for (zone = 0; zone < MAX_NR_ZONES; zone++)
-				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-		}
-	}
-
-	*nr_to_scan = size;
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
@@ -4844,27 +4833,55 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
  * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
  *    reclaim.
  */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
-{
-	bool success;
-	unsigned long nr_to_scan;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
+// static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+// 			   struct mem_cgroup *memcg, int swappiness)
+// {
+// 	unsigned long nr_to_scan, evictable;
+// 	bool bypass = false;
+// 	bool young = false;
+// 	DEFINE_MAX_SEQ(lruvec);
+
+// 	evictable = lruvec_evictable_size(lruvec, swappiness);
+// 	nr_to_scan = evictable;
+
+// 	/* try to scrape all its memory if this memcg was deleted */
+// 	if (!mem_cgroup_online(memcg))
+// 		return nr_to_scan;
+
+// 	// nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+// 	// not exist in the android code
+// 	nr_to_scan >>= sc->priority;
+
+// 	if (!nr_to_scan && sc->priority < DEF_PRIORITY)
+// 		nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
+
+// 	trace_android_vh_mglru_aging_bypass(lruvec, max_seq,
+// 		swappiness, &bypass, &young);
+// 	if (bypass)
+// 		return young ? -1 : 0;
+
+// 	return nr_to_scan;
+// }
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+			   struct mem_cgroup *memcg, int swappiness)
+{
+	unsigned long nr_to_scan, evictable;
 	bool bypass = false;
 	bool young = false;
+	DEFINE_MAX_SEQ(lruvec);
 
-	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
-		return -1;
-
-	success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+	evictable = lruvec_evictable_size(lruvec, swappiness);
+	nr_to_scan = evictable;
 
 	/* try to scrape all its memory if this memcg was deleted */
-	if (nr_to_scan && !mem_cgroup_online(memcg))
+	if (!mem_cgroup_online(memcg))
 		return nr_to_scan;
 
+	nr_to_scan >>= sc->priority;
+
 	/* try to get away with not aging at the default priority */
-	if (!success || sc->priority == DEF_PRIORITY)
-		return nr_to_scan >> sc->priority;
+	if (!nr_to_scan && sc->priority < DEF_PRIORITY)
+		nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
 
 	trace_android_vh_mglru_aging_bypass(lruvec, max_seq,
 		swappiness, &bypass, &young);
@@ -4872,7 +4889,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
 		return young ? -1 : 0;
 
 	/* stop scanning this lruvec as it's low on cold folios */
-	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+	return nr_to_scan;
 }
 
 static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4909,47 +4926,58 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	return true;
 }
 
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ *    reclaim.
+ */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	long nr_to_scan;
-	unsigned long scanned = 0;
+	bool need_rotate = false, should_age = false;
+	long nr_batch, nr_to_scan;
 	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
-	while (true) {
+	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+	if (!nr_to_scan)
+		need_rotate = true;
+
+	while (nr_to_scan > 0) {
 		int delta;
+		DEFINE_MAX_SEQ(lruvec);
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
-		if (nr_to_scan <= 0)
+		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+			need_rotate = true;
 			break;
+		}
 
-		delta = evict_folios(lruvec, sc, swappiness);
+		if (should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan)) {
+			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+				need_rotate = true;
+			should_age = true;
+		}
+
+		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
+		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
-		scanned += delta;
-		if (scanned >= nr_to_scan)
+		if (should_abort_scan(lruvec, sc))
 			break;
 
-		if (should_abort_scan(lruvec, sc))
+		/* For cgroup reclaim, fairness is handled by iterator, not rotation */
+		if (root_reclaim(sc) && should_age)
 			break;
 
 		cond_resched();
 	}
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
-	/* whether this lruvec should be rotated */
-	return nr_to_scan < 0;
+	return need_rotate;
 }
 
 static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool success;
+	bool need_rotate;
 	unsigned long scanned = sc->nr_scanned;
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -4967,7 +4995,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 		memcg_memory_event(memcg, MEMCG_LOW);
 	}
 
-	success = try_to_shrink_lruvec(lruvec, sc);
+	need_rotate = try_to_shrink_lruvec(lruvec, sc);
 
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
@@ -4977,10 +5005,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	flush_reclaim_state(sc);
 
-	if (success && mem_cgroup_online(memcg))
+	if (need_rotate && mem_cgroup_online(memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (!success && lruvec_is_sizable(lruvec, sc))
+	if (!need_rotate && lruvec_is_sizable(lruvec, sc))
 		return 0;
 
 	/* one retry if offlined or too small */
@@ -5532,6 +5560,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
 static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
 			int swappiness, unsigned long nr_to_reclaim)
 {
+	int nr_batch;
 	DEFINE_MAX_SEQ(lruvec);
 
 	if (seq + MIN_NR_GENS > max_seq)
@@ -5548,7 +5577,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			return 0;
 
-		if (!evict_folios(lruvec, sc, swappiness))
+		nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+		if (!evict_folios(nr_batch, lruvec, sc, swappiness))
 			return 0;
 
 		cond_resched();

Re: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by Kairui Song 2 months ago

On Sat, Apr 18, 2026 at 5:08 PM wangzicheng <wangzicheng@honor.com> wrote:
> There is indeed a relatively large gap between mm-unstable and our
> android16-6.12 tree. The series was backported manually and we only
> applied the changes required to make it build and run in our tree.
>
> Because of this, it is possible that some related changes from
> mm-unstable were not included, which may have affected the behavior or
> performance we observed. If this caused misleading results, we
> apologize for the confusion.
>
> Regarding vendor hooks, in our tree there is only one hook in
> get_nr_to_scan(). We tested with that hook disabled.
>
> The performance data was collected using Perfetto traces.
> Unfortunately those traces contain a large amount of runtime
> information and are not easy to share externally.
>
> If needed, we can also try to reproduce the test on a tree closer to
> mm-unstable once our chipset platform kernel tree gets updated to
> a newer version, to see whether the behavior still reproduces.
>
> Below is the patch we manually applied during the backport.
>

Hi Zicheng!

Thanks for sharing this. It helps a lot!

I'm still not sure how I can reproduce your issue, though. Android has
many adaptive behaviors, and vendors (in userspace) have many
customized policies too, so a change in some metric may trigger
unexpected behavior.

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f78cfe059f14..50109cd5e94c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1987,6 +1987,44 @@ static int current_may_throttle(void)
>         return !(current->flags & PF_LOCAL_THROTTLE);
>  }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> +                                    struct pglist_data *pgdat,
> +                                    struct scan_control *sc,
> +                                    struct reclaim_stat *stat)
> +{
> +       /*
> +        * If dirty folios are scanned that are not queued for IO, it
> +        * implies that flushers are not doing their job. This can
> +        * happen when memory pressure pushes dirty folios to the end of
> +        * the LRU before the dirty limits are breached and the dirty
> +        * data has expired. It can also happen when the proportion of
> +        * dirty folios grows not through writes but through memory
> +        * pressure reclaiming all the clean cache. And in some cases,
> +        * the flushers simply cannot keep up with the allocation
> +        * rate. Nudge the flusher threads in case they are asleep.
> +        */
> +       if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> +               /*
> +                * For cgroupv1 dirty throttling is achieved by waking up
> +                * the kernel flusher here and later waiting on folios
> +                * which are in writeback to finish (see shrink_folio_list()).
> +                *
> +                * Flusher may not be able to issue writeback quickly
> +                * enough for cgroupv1 writeback throttling to work
> +                * on a large system.
> +                */
> +               if (!writeback_throttling_sane(sc))
> +                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +       }
> +
> +       sc->nr.dirty += stat->nr_dirty;
> +       sc->nr.congested += stat->nr_congested;
> +       sc->nr.writeback += stat->nr_writeback;
> +       sc->nr.immediate += stat->nr_immediate;
> +       sc->nr.taken += nr_taken;
> +}
> +
>  /*
>   * shrink_inactive_list() is a helper for shrink_node().  It returns the number
>   * of reclaimed pages
> @@ -2054,41 +2092,15 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>
>         lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
>
> -       /*
> -        * If dirty folios are scanned that are not queued for IO, it
> -        * implies that flushers are not doing their job. This can
> -        * happen when memory pressure pushes dirty folios to the end of
> -        * the LRU before the dirty limits are breached and the dirty
> -        * data has expired. It can also happen when the proportion of
> -        * dirty folios grows not through writes but through memory
> -        * pressure reclaiming all the clean cache. And in some cases,
> -        * the flushers simply cannot keep up with the allocation
> -        * rate. Nudge the flusher threads in case they are asleep.
> -        */
> -       if (stat.nr_unqueued_dirty == nr_taken) {
> -               wakeup_flusher_threads(WB_REASON_VMSCAN);
> -               /*
> -                * For cgroupv1 dirty throttling is achieved by waking up
> -                * the kernel flusher here and later waiting on folios
> -                * which are in writeback to finish (see shrink_folio_list()).
> -                *
> -                * Flusher may not be able to issue writeback quickly
> -                * enough for cgroupv1 writeback throttling to work
> -                * on a large system.
> -                */
> -               if (!writeback_throttling_sane(sc))
> -                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> -       }
> +
> +       // sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> +       // leave nr_unqueued_dirty in scan_control to keep integrity
>
> -       sc->nr.dirty += stat.nr_dirty;
> -       sc->nr.congested += stat.nr_congested;
> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> -       sc->nr.writeback += stat.nr_writeback;
> -       sc->nr.immediate += stat.nr_immediate;
> -       sc->nr.taken += nr_taken;
> -       if (file)
> -               sc->nr.file_taken += nr_taken;
> +       // if (file)
> +       //      sc->nr.file_taken += nr_taken;
> +       // leave nr_taken in scan_control to keep integrity
>
> +       handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);

Since it's not a full backport, the backport itself might be buggy or
be missing changes or dependencies. Take this part, for example: I
dropped nr_unqueued_dirty and file_taken in this series, which is
perfectly fine for upstream mainline after 2f05435df932 (6.19). But I
just checked the android16-6.12 branch of AOSP, and if you remove this
counter update there, some dirty reactivation path may be completely
broken, and any downstream metrics or users that rely on these
counters would break too.
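
To make the difference concrete, here is a minimal userspace sketch of
the two flusher wakeup conditions that appear in this thread: the old
MGLRU check on counters accumulated in scan_control, and the per-batch
check in the quoted handle_reclaim_writeback(). The function names and
plain parameters are hypothetical stand-ins for illustration, not
kernel APIs, and the sketch models only the conditions, not the
surrounding reclaim code.

```c
#include <stdbool.h>

/*
 * Old MGLRU check, evaluated once after the reclaim loop on counters
 * accumulated in scan_control:
 *   sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken
 */
bool old_should_wake_flusher(unsigned long acc_unqueued_dirty,
			     unsigned long acc_file_taken)
{
	return acc_unqueued_dirty && acc_unqueued_dirty == acc_file_taken;
}

/*
 * Per-batch check from the quoted handle_reclaim_writeback():
 *   stat->nr_unqueued_dirty == nr_taken && nr_taken
 */
bool new_should_wake_flusher(unsigned long batch_unqueued_dirty,
			     unsigned long nr_taken)
{
	return nr_taken && batch_unqueued_dirty == nr_taken;
}
```

The per-batch form no longer reads the accumulated counters at all,
which is why dropping them is safe only if nothing else in the tree
still depends on them.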

> -static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
>  {
>         int gen, type, zone;
> -       unsigned long total = 0;
> -       int swappiness = get_swappiness(lruvec, sc);
> +       unsigned long seq, total = 0;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> -       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
>         for_each_evictable_type(type, swappiness) {
> -               unsigned long seq;
> -
>                 for (seq = min_seq[type]; seq <= max_seq; seq++) {
>                         gen = lru_gen_from_seq(seq);
> -
>                         for (zone = 0; zone < MAX_NR_ZONES; zone++)
>                                 total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
>                 }
>         }
>
> +       return total;
> +}
> +
> +static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +       unsigned long total;
> +       int swappiness = get_swappiness(lruvec, sc);
> +       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +       total = lruvec_evictable_size(lruvec, swappiness);
> +
>         /* whether the size is big enough to be helpful */
>         return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
>  }
> @@ -4475,7 +4496,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                        int tier_idx)
>  {
>         bool success;
> -       bool dirty, writeback;
>         int gen = folio_lru_gen(folio);
>         int type = folio_is_file_lru(folio);
>         int zone = folio_zonenum(folio);
> @@ -4505,7 +4525,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>
>         /* protected */
>         if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
> -               gen = folio_inc_gen(lruvec, folio, false);
> +               gen = folio_inc_gen(lruvec, folio);
>                 list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
>
>                 /* don't count the workingset being lazily promoted */
> @@ -4520,26 +4540,11 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>
>         /* ineligible */
>         if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
> -               gen = folio_inc_gen(lruvec, folio, false);
> +               gen = folio_inc_gen(lruvec, folio);
>                 list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
>                 return true;
>         }
>
> -       dirty = folio_test_dirty(folio);
> -       writeback = folio_test_writeback(folio);
> -       if (type == LRU_GEN_FILE && dirty) {
> -               sc->nr.file_taken += delta;
> -               if (!writeback)
> -                       sc->nr.unqueued_dirty += delta;
> -       }
> -
> -       /* waiting for writeback */
> -       if (writeback || (type == LRU_GEN_FILE && dirty)) {
> -               gen = folio_inc_gen(lruvec, folio, true);
> -               list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> -               return true;
> -       }
> -
>         return false;
>  }
>
> @@ -4547,12 +4552,6 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
>  {
>         bool success;
>
> -       /* swap constrained */
> -       if (!(sc->gfp_mask & __GFP_IO) &&
> -           (folio_test_dirty(folio) ||
> -            (folio_test_anon(folio) && !folio_test_swapcache(folio))))
> -               return false;
> -
>         /* raced with release_pages() */
>         if (!folio_try_get(folio))
>                 return false;
> @@ -4567,8 +4566,6 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
>         if (!folio_test_referenced(folio))
>                 set_mask_bits(&folio->flags, LRU_REFS_MASK, 0);
>
> -       /* for shrink_folio_list() */
> -       folio_clear_reclaim(folio);
>
>         success = lru_gen_del_folio(lruvec, folio, true);
>         VM_WARN_ON_ONCE_FOLIO(!success, folio);
> @@ -4577,8 +4574,9 @@ bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_contr
>  }
>  EXPORT_SYMBOL_GPL(isolate_folio);
>
> -static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
> -                      int type, int tier, struct list_head *list)
> +static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc,
> +                      int type, int tier,
> +                          struct list_head *list, int *isolatedp)
>  {
>         int i;
>         int gen;
> @@ -4587,10 +4585,11 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         int scanned = 0;
>         int isolated = 0;
>         int skipped = 0;
> -       int remaining = MAX_LRU_BATCH;
> +       unsigned long remaining = nr_to_scan;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> +       VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
>         VM_WARN_ON_ONCE(!list_empty(list));
>
>         if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> @@ -4647,16 +4646,12 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>         __count_memcg_events(memcg, item, isolated);
>         __count_memcg_events(memcg, PGREFILL, sorted);
>         __count_vm_events(PGSCAN_ANON + type, isolated);
> -       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
> +       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>                                 scanned, skipped, isolated,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -       if (type == LRU_GEN_FILE)
> -               sc->nr.file_taken += isolated;
> -       /*
> -        * There might not be eligible folios due to reclaim_idx. Check the
> -        * remaining to prevent livelock if it's not making progress.
> -        */
> -       return isolated || !remaining ? scanned : 0;
> +
> +       *isolatedp = isolated;
> +       return scanned;
>  }
>
>  static int get_tier_idx(struct lruvec *lruvec, int type)
> @@ -4698,33 +4693,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
>         return positive_ctrl_err(&sp, &pv);
>  }
>
> -static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
> -                         int *type_scanned, struct list_head *list)
> +static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, int swappiness,
> +                          struct list_head *list, int *isolated,
> +                         int *isolate_type, int *isolate_scanned)
>  {
>         int i;
> +       int scanned = 0;
>         int type = get_type_to_scan(lruvec, swappiness);
>
>         for_each_evictable_type(i, swappiness) {
> -               int scanned;
> +               int type_scan;
>                 int tier = get_tier_idx(lruvec, type);
>
> -               *type_scanned = type;
> +               type_scan = scan_folios(nr_to_scan, lruvec, sc,
> +                                       type, tier, list, isolated);
>
> -               scanned = scan_folios(lruvec, sc, type, tier, list);
> -               if (scanned)
> -                       return scanned;
> +               scanned += type_scan;
> +               if (*isolated) {
> +                       *isolate_type = type;
> +                       *isolate_scanned = type_scan;
> +                       break;
> +               }
>
>                 type = !type;
>         }
>
> -       return 0;
> +       return scanned;
>  }
>
> -static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, int swappiness)

The signature change upstream comes together with the proportional
protection. Changing only the signature downstream might miss things,
and it means we are not on the same baseline.
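
For reference, the scan budget that the backported get_nr_to_scan()
computes can be modeled in userspace as below. This is only a sketch
under stated assumptions: scan_budget() is a hypothetical name,
`evictable` stands in for the result of lruvec_evictable_size(),
`online` for mem_cgroup_online(), and the vendor hook is left out. The
proportional protection step is omitted because the backport has it
commented out.

```c
#include <stdbool.h>

/* Values mirrored from the kernel definitions. */
#define DEF_PRIORITY     12
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical model of the backported budget calculation. */
unsigned long scan_budget(unsigned long evictable, int priority, bool online)
{
	unsigned long nr_to_scan = evictable;

	/* try to scrape all its memory if this memcg was deleted */
	if (!online)
		return nr_to_scan;

	nr_to_scan >>= priority;

	/* below the default priority, keep a small non-zero budget */
	if (!nr_to_scan && priority < DEF_PRIORITY)
		nr_to_scan = evictable < SWAP_CLUSTER_MAX ?
			     evictable : SWAP_CLUSTER_MAX;

	return nr_to_scan;
}
```

Any step that upstream applies between the size calculation and the
priority shift would change every value this function returns, so
results from the two trees are hard to compare directly.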

>  {
> -       int type;
> -       int scanned;
> -       int reclaimed;
>         LIST_HEAD(list);
>         LIST_HEAD(clean);
>         struct folio *folio;
> @@ -4732,19 +4730,23 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>         enum vm_event_item item;
>         struct reclaim_stat stat;
>         struct lru_gen_mm_walk *walk;
> +       int scanned, reclaimed;
> +       int isolated = 0, type, type_scanned;
>         bool skip_retry = false;
> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
>         spin_lock_irq(&lruvec->lru_lock);
>
> -       scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
> +       /* In case folio deletion left empty old gens, flush them */
> +       try_to_inc_min_seq(lruvec, swappiness);
>
> -       scanned += try_to_inc_min_seq(lruvec, swappiness);
> +       scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
> +                                &list, &isolated, &type, &type_scanned);
>
> -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
> -               scanned = 0;
> +       /* Isolation might create empty gen, flush them */
> +       if (scanned)
> +               try_to_inc_min_seq(lruvec, swappiness);
>
>         spin_unlock_irq(&lruvec->lru_lock);
>
> @@ -4752,10 +4754,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>                 return scanned;
>  retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>         sc->nr_reclaimed += reclaimed;
> +       handle_reclaim_writeback(isolated, pgdat, sc, &stat);
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> -                       scanned, reclaimed, &stat, sc->priority,
> +                       type_scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
>         list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> @@ -4804,6 +4806,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>
>         if (!list_empty(&list)) {
>                 skip_retry = true;
> +               isolated = 0;
>                 goto retry;
>         }
>
> @@ -4813,28 +4816,14 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>                              int swappiness, unsigned long *nr_to_scan)
>  {
> -       int gen, type, zone;
> -       unsigned long size = 0;
> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         DEFINE_MIN_SEQ(lruvec);
>
> -       *nr_to_scan = 0;
>         /* have to run aging, since eviction is not possible anymore */
>         if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
>                 return true;

And you lost the DEF_PRIORITY early-return here.

>
> -       for_each_evictable_type(type, swappiness) {
> -               unsigned long seq;
> -
> -               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> -                       gen = lru_gen_from_seq(seq);
> +       *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
>
> -                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> -                               size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> -               }
> -       }
> -
> -       *nr_to_scan = size;
>         /* better to run aging even though eviction is still possible */
>         return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
>  }
> @@ -4844,27 +4833,55 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>   * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
>   *    reclaim.
>   */
> -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> -{
> -       bool success;
> -       unsigned long nr_to_scan;
> -       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> -       DEFINE_MAX_SEQ(lruvec);
> +// static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> +//                        struct mem_cgroup *memcg, int swappiness)
> +// {
> +//     unsigned long nr_to_scan, evictable;
> +//     bool bypass = false;
> +//     bool young = false;
> +//     DEFINE_MAX_SEQ(lruvec);
> +
> +//     evictable = lruvec_evictable_size(lruvec, swappiness);
> +//     nr_to_scan = evictable;
> +
> +//     /* try to scrape all its memory if this memcg was deleted */
> +//     if (!mem_cgroup_online(memcg))
> +//             return nr_to_scan;
> +
> +//     // nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> +//     // not exist in the android code
> +//     nr_to_scan >>= sc->priority;
> +
> +//     if (!nr_to_scan && sc->priority < DEF_PRIORITY)
> +//             nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
> +
> +//     trace_android_vh_mglru_aging_bypass(lruvec, max_seq,
> +//             swappiness, &bypass, &young);

This part looks really hackish... I'm not sure if anything is wrong.

> +//     if (bypass)
> +//             return young ? -1 : 0;
> +
> +//     return nr_to_scan;
> +// }
> +/*
> + * For future optimizations:
> + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> + *    reclaim.
> + */
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -       long nr_to_scan;
> -       unsigned long scanned = 0;
> +       bool need_rotate = false, should_age = false;
> +       long nr_batch, nr_to_scan;
>         int swappiness = get_swappiness(lruvec, sc);
> +       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> -       while (true) {
> +       nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> +       if (!nr_to_scan)
> +               need_rotate = true;
> +
> +       while (nr_to_scan > 0) {
>                 int delta;
> +               DEFINE_MAX_SEQ(lruvec);
>
> -               nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> -               if (nr_to_scan <= 0)
> +               if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> +                       need_rotate = true;
>                         break;
> +               }
>
> -               delta = evict_folios(lruvec, sc, swappiness);
> +               if (should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan)) {

Here should_run_aging() clobbers the same nr_to_scan that the loop
uses, which changes the reclaim behavior dramatically compared to this
series.
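
To illustrate the concern, here is a minimal userspace sketch, not
kernel code. All names (model_lruvec, aging_check_clobbering,
aging_check_pure, evict_batch, reclaim) are hypothetical stand-ins, and
the numbers are arbitrary. It only models an aging check that writes the
evictable size through an out-parameter that happens to be the loop
counter:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an lruvec: only tracks how much is evictable. */
struct model_lruvec {
	long evictable;
};

/* Models the backported helper: overwrites the caller's counter. */
static bool aging_check_clobbering(const struct model_lruvec *v,
				   long *nr_to_scan)
{
	*nr_to_scan = v->evictable;
	return false;
}

/* Models a helper that only answers "should we age?". */
static bool aging_check_pure(const struct model_lruvec *v)
{
	(void)v;
	return false;
}

/* Evicts up to nr folios and returns how many were taken. */
static long evict_batch(struct model_lruvec *v, long nr)
{
	long n = nr < v->evictable ? nr : v->evictable;

	assert(nr >= 0);
	v->evictable -= n;
	return n;
}

/* Returns the total scanned before the loop stops. */
static long reclaim(struct model_lruvec *v, long budget, long batch,
		    bool clobber)
{
	long nr_to_scan = budget, total = 0;

	while (nr_to_scan > 0) {
		long delta;

		if (clobber)
			aging_check_clobbering(v, &nr_to_scan);
		else
			aging_check_pure(v);

		delta = evict_batch(v, nr_to_scan < batch ? nr_to_scan : batch);
		if (!delta)
			break;

		total += delta;
		nr_to_scan -= delta;
	}
	return total;
}
```

In this model, with 1000 evictable folios and a budget of 100, the pure
variant stops after scanning 100, while the clobbering variant resets
the counter on every pass and only stops once everything is gone.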

> +                       if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> +                               need_rotate = true;
> +                       should_age = true;
> +               }
> +
> +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
> +               delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>                 if (!delta)
>                         break;
>
> -               scanned += delta;
> -               if (scanned >= nr_to_scan)
> +               if (should_abort_scan(lruvec, sc))
>                         break;
>
> -               if (should_abort_scan(lruvec, sc))
> +               /* For cgroup reclaim, fairness is handled by iterator, not rotation */
> +               if (root_reclaim(sc) && should_age)
>                         break;
>
>                 cond_resched();

And here you are not doing "nr_to_scan -= delta". Maybe the reclaim
will keep going in an extremely aggressive way? The new design is meant
to use nr_to_scan as a budget, not a bool.
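
A rough userspace sketch of the difference, under stated assumptions:
MODEL_BATCH is an illustrative stand-in for MIN_LRU_BATCH, and
model_evict / model_shrink are hypothetical helpers, not the functions
in this series. It counts how many eviction calls happen when the
budget is consumed versus when it is only tested for being non-zero:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_BATCH 64L /* illustrative stand-in for MIN_LRU_BATCH */

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Hypothetical eviction step: takes up to nr folios from the pool. */
static long model_evict(long *pool, long nr)
{
	long delta = min_long(nr, *pool);

	assert(nr > 0);
	*pool -= delta;
	return delta;
}

/*
 * Returns the number of eviction calls that made progress; *scanned
 * receives the total scanned. With consume_budget == false, nr_to_scan
 * is only ever compared against zero, i.e. it acts as a bool.
 */
static int model_shrink(long pool, long nr_to_scan, bool consume_budget,
			long *scanned)
{
	int calls = 0;

	*scanned = 0;
	while (nr_to_scan > 0) {
		long delta = model_evict(&pool,
					 min_long(nr_to_scan, MODEL_BATCH));

		if (!delta)
			break;

		calls++;
		*scanned += delta;
		if (consume_budget)
			nr_to_scan -= delta;
	}
	return calls;
}
```

With a pool of 1000 and a budget of 150, consuming the budget gives
three batches (64, 64, 22) in this model. Without the decrement, the
loop keeps issuing full batches until the pool is empty.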

>         }
>
> -       /*
> -        * If too many file cache in the coldest generation can't be evicted
> -        * due to being dirty, wake up the flusher.
> -        */
> -       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> -               wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> -       /* whether this lruvec should be rotated */
> -       return nr_to_scan < 0;
> +       return need_rotate;
>  }
>

So I took a quick look at this backport. It does look very buggy
itself, with several inconsistent parts that I can identify on the
spot, and besides, I'm not sure if there are more gaps in other parts
of the downstream tree with this series.

I think you are simply not testing the same thing as I posted. That it
passes the build doesn't mean it's correct; at least the reactivation,
budget, and aging parts might be kind of broken.

Don't worry, your workload and concern definitely make sense, but I
think we really need to come up with some reproducible tests that can
be benchmarked upstream to avoid confusion and inaccuracy, so all
our cases can be better covered.

I'll also try to do a few more tests on my Android phone. And feel
free to provide more suggestions or cases :)

Re: [PATCH v5 00/14] mm/mglru: improve reclaim loop and dirty folio handling

Posted by Kairui Song 2 months ago

On Fri, Apr 17, 2026 at 10:53 AM wangxinyu19 <wxy2009nrrr@163.com> wrote:
>
> On Mon, 13 Apr 2026 00:48:14 +0800, Kairui Song wrote:
> > This series is based on mm-unstable, also applies to mm-new.
> >
> > This series cleans up and slightly improves MGLRU's reclaim loop and
> > dirty writeback handling. As a result, we can see an up to ~30% increase
> > in some workloads like MongoDB with YCSB and a huge decrease in file
> > refault, no swap involved. Other common benchmarks have no regression,
> > and LOC is reduced, with less unexpected OOM, too.
> >
> > Some of the problems were found in our production environment, and
> > others were mostly exposed while stress testing during the development
> > of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> > the code base and fixes several performance issues, preparing for
> > further work.
> >
> > MGLRU's reclaim loop is a bit complex, and hence these problems are
> > somehow related to each other. The aging, scan number calculation, and
> > reclaim loop are coupled together, and the dirty folio handling logic is
> > quite different, making the reclaim loop hard to follow and the dirty
> > flush ineffective.
>
> > This series slightly cleans up and improves these issues using a scan
> > budget by calculating the number of folios to scan at the beginning of
> > the loop, and decouples aging from the reclaim calculation helpers.
> > Then, move the dirty flush logic inside the reclaim loop so it can kick
> > in more effectively. These issues are somehow related, and this series
> > handles them and improves MGLRU reclaim in many ways.
> >
> > Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> > and a 128G memory machine using NVME as storage.
>
> Hi Kairui,

Hello Xinyu,

> After:
>   /proc/vmstat info:
>     pgpgin 223,801                (-16.9%)
>     pgpgout 308,873
>     workingset_refault_anon 498
>     workingset_refault_file 17075 (-38.3%)
>
>   Launch Time Summary (all apps, all runs)
>     Mean 850.5ms (-2.07%)
>     P50 861.5ms  (-3.04%)
>     P90 1179.0ms (-8.05%)
>     P95 1228.0ms (-12.2%)

Thanks a lot for testing! Results are looking good: fewer refaults and
less pgpgin, better performance. pgpgout is a bit higher, maybe because
retaining the flags on dirty protected folios helped identify or
protect more file folios, or maybe anon / cold file reclaim is more
effective since writeback pending folios are activated to the hottest
gen instead of being stuck on the tail gen; or maybe the reclaim loop
is better structured so there are fewer wasted loops and less
unnecessary slab reclaim. In any case, it's a good thing. Will mention
this in the next update.