[PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song via B4 Relay (12 patches in series) 4 days, 16 hours ago
From: Kairui Song <kasong@tencent.com>

The current handling of dirty and writeback folios is not working well for
file page heavy workloads: dirty folios are protected by being moved to the
next gen upon isolation, instead of getting throttled or reactivated upon
pageout (shrink_folio_list).

This might help reduce LRU lock contention slightly, but as a result, the
ping-pong effect of folios between the head and tail of the last two gens
is serious: the shrinker runs into protected dirty/writeback folios far
more often than it would with activation. The dirty flush wakeup condition
is also much more passive than the active/inactive LRU's. The
active/inactive LRU wakes the flusher if one whole batch of folios passed
to shrink_folio_list is unevictable due to being under writeback, but MGLRU
instead has to check this after the whole reclaim loop is done, comparing
the number of folios protected at isolation against the total number of
folios reclaimed.

We previously saw OOM problems with this behavior too; they were fixed,
but the fix is still not perfect [1].

So instead, drop the special handling for dirty/writeback folios and
simply re-activate them, as the active/inactive LRU does. Also move the
dirty flush wakeup check to right after shrink_folio_list. This should
improve both throttling and performance.

Testing with YCSB workloadb showed a major performance improvement:

Before this series:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us):  507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988

After this commit:
Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227                        (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186              (-52.9%, lower is better)

The refault rate is ~50% lower, and throughput is ~30% higher, which
is a huge gain. We also observed significant performance gain for
other real-world workloads.

We were concerned that the dirty flush could cause more wear on SSDs, but
that should not be a problem here: the wakeup condition fires only when
dirty folios have been pushed to the tail of the LRU, which indicates that
memory pressure is already so high that writeback is blocking the
workload.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
 1 file changed, 16 insertions(+), 41 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8de5c8d5849e..17b5318fad39 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		       int tier_idx)
 {
 	bool success;
-	bool dirty, writeback;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		return true;
 	}
 
-	dirty = folio_test_dirty(folio);
-	writeback = folio_test_writeback(folio);
-	if (type == LRU_GEN_FILE && dirty) {
-		sc->nr.file_taken += delta;
-		if (!writeback)
-			sc->nr.unqueued_dirty += delta;
-	}
-
-	/* waiting for writeback */
-	if (writeback || (type == LRU_GEN_FILE && dirty)) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
-		return true;
-	}
-
 	return false;
 }
 
@@ -4754,8 +4738,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-	if (type == LRU_GEN_FILE)
-		sc->nr.file_taken += isolated;
 
 	*isolatedp = isolated;
 	return scanned;
@@ -4858,12 +4840,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return scanned;
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
+	/*
+	 * If too much file cache in the coldest generation can't be evicted
+	 * due to being dirty, wake up the flusher.
+	 */
+	if (stat.nr_unqueued_dirty == isolated) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
@@ -5020,28 +5017,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 	}
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 *
-		 * Flusher may not be able to issue writeback quickly
-		 * enough for cgroupv1 writeback throttling to work
-		 * on a large system.
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
-
 	return need_rotate;
 }
 

-- 
2.53.0
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Shakeel Butt 13 hours ago
On Sun, Mar 29, 2026 at 03:52:34AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current handling of dirty writeback folios is not working well for
> file page heavy workloads: Dirty folios are protected and move to next
> gen upon isolation of getting throttled or reactivation upon pageout
> (shrink_folio_list).
> 
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between head and tail of last two
> gens is serious as the shrinker will run into protected dirty writeback
> folios more frequently compared to activation. The dirty flush wakeup
> condition is also much more passive compared to active/inactive LRU.
> Active / inactve LRU wakes the flusher if one batch of folios passed to
> shrink_folio_list is unevictable due to under writeback, but MGLRU
> instead has to check this after the whole reclaim loop is done, and then
> count the isolation protection number compared to the total reclaim
> number.

I was just ranting about this on Baolin's patch and thanks for unifying them.

> 
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
> 
> So instead, just drop the special handling for dirty writeback, just
> re-activate it like active / inactive LRU. And also move the dirty flush
> wake up check right after shrink_folio_list. This should improve both
> throttling and performance.

Please divide this patch into two separate ones: one for moving the
flusher wakeup (and the v1 throttling) into evict_folios(), and a second
one for the above dirty writeback heuristic.
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song 55 minutes ago
On Wed, Apr 01, 2026 at 04:37:14PM +0800, Shakeel Butt wrote:
> On Sun, Mar 29, 2026 at 03:52:34AM +0800, Kairui Song via B4 Relay wrote:
> > [...]
> 
> I was just ranting about this on Baolin's patch and thanks for unifying them.
> 
> > 
> > And we previously saw OOM problems with it, too, which were fixed but
> > still not perfect [1].
> > 
> > So instead, just drop the special handling for dirty writeback, just
> > re-activate it like active / inactive LRU. And also move the dirty flush
> > wake up check right after shrink_folio_list. This should improve both
> > throttling and performance.
> 
> Please divide this patch into two separate ones. One for moving the flusher
> waker (& v1 throttling) within evict_folios() and second the above heuristic of
> dirty writeback.

OK, but throttling is not handled by this commit; it is handled by the
last commit. And using the common routine in shrink_folio_list and
activating the folios there is supposed to be done before moving the
flusher wakeup and throttling, as I observed reclaim becoming
inefficient, or overly aggressive / passive, when we don't do that first.
We would run into these folios again and again very frequently, and
shrink_folio_list also has better dirty / writeback detection.

I tested these two changes separately again in case I remembered it
wrongly, using the MongoDB YCSB case:

Before this series (or just this commit; the numbers are similar):
Throughput(ops/sec), 63414.891930455

Applying only the part of this commit that removes folio_inc_gen and
uses shrink_folio_list to activate folios:
Throughput(ops/sec), 68580.83394294075

Skipping the folio_inc_gen part but applying the rest:
Throughput(ops/sec), 61614.29451632779

After the two fixes together (apply this commit fully):
Throughput(ops/sec), 80857.08510208207

And the whole series:
Throughput(ops/sec), 79760.71784646061

The test is a bit noisy, but after the whole series the throttling
already seems to be slightly slowing down the workload. That is still
acceptable IMO, and it is also why activating the folios here is a good
idea; otherwise we would run into problematic throttling.

I think this can be further improved later: as I observed previously with
the LFU-like rework I mentioned, it helps promote folios to a younger gen
more proactively and gives even better performance:
https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/

For now I can split this into two in V3: first a commit to use the
common routine for activating the folios, then one to move the flusher
wakeup.
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Baolin Wang 2 days, 3 hours ago

On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> [...]
> @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>   		return true;
>   	}
>   
> -	dirty = folio_test_dirty(folio);
> -	writeback = folio_test_writeback(folio);
> -	if (type == LRU_GEN_FILE && dirty) {
> -		sc->nr.file_taken += delta;
> -		if (!writeback)
> -			sc->nr.unqueued_dirty += delta;
> -	}
> -
> -	/* waiting for writeback */
> -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> -		return true;
> -	}

I'm a bit concerned about the handling of dirty folios.

In the original logic, if we encounter a dirty folio, we increment its 
generation counter by 1 and move it to the *second oldest generation*.

However, with your patch, shrink_folio_list() will activate the dirty 
folio by calling folio_set_active(). Then, evict_folios() -> 
move_folios_to_lru() will put the dirty folio back into the MGLRU list.

But because the folio_test_active() is true for this dirty folio, the 
dirty folio will now be placed into the *second youngest generation* 
(see lru_gen_folio_seq()).

As a result, during the next eviction, these dirty folios won't be 
scanned again (because they are in the second youngest generation). 
Wouldn't this lead to a situation where the flusher cannot be woken up 
in time, making OOM more likely?
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song 2 days, 3 hours ago
On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
> 
> 
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > [...]
> > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >   		return true;
> >   	}
> > -	dirty = folio_test_dirty(folio);
> > -	writeback = folio_test_writeback(folio);
> > -	if (type == LRU_GEN_FILE && dirty) {
> > -		sc->nr.file_taken += delta;
> > -		if (!writeback)
> > -			sc->nr.unqueued_dirty += delta;
> > -	}
> > -
> > -	/* waiting for writeback */
> > -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > -		gen = folio_inc_gen(lruvec, folio, true);
> > -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > -		return true;
> > -	}
> 
> I'm a bit concerned about the handling of dirty folios.
> 
> In the original logic, if we encounter a dirty folio, we increment its
> generation counter by 1 and move it to the *second oldest generation*.
> 
> However, with your patch, shrink_folio_list() will activate the dirty folio
> by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> will put the dirty folio back into the MGLRU list.
> 
> But because the folio_test_active() is true for this dirty folio, the dirty
> folio will now be placed into the *second youngest generation* (see
> lru_gen_folio_seq()).

Yeah, and that's exactly what we want. Otherwise, these folios would
stay at the oldest gen, and following scans would keep seeing them and
hence keep bouncing them again and again to a younger gen, since they
are not reclaimable.

The writeback completion callback (folio_rotate_reclaimable) will move
them back to the tail once they are actually reclaimable, so we are not
losing any ability to reclaim them. Am I missing anything?

> 
> As a result, during the next eviction, these dirty folios won't be scanned
> again (because they are in the second youngest generation). Wouldn't this
> lead to a situation where the flusher cannot be woken up in time, making OOM
> more likely?

No? The flusher has already been woken up by the time they are seen for
the first time. If we see these folios again very soon, the LRU is
congested; a following patch handles the congested case too, by
throttling (which was completely missing previously). And now we are
actually a bit more proactive about waking up the flusher, since the
wakeup hook is moved inside the loop instead of after the whole loop is
finished.

These two behavior changes are basically just unifying MGLRU with what
the classical LRU has been doing for years, and the result looks really
good.

The global congestion handling is still missing after this series,
though. Have to fix that later I guess...
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Barry Song 12 hours ago
On Tue, Mar 31, 2026 at 5:18 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
> >
> >
> > On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > > [...]
> >
> > I'm a bit concerned about the handling of dirty folios.
> >
> > In the original logic, if we encounter a dirty folio, we increment its
> > generation counter by 1 and move it to the *second oldest generation*.
> >
> > However, with your patch, shrink_folio_list() will activate the dirty folio
> > by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> > will put the dirty folio back into the MGLRU list.
> >
> > But because the folio_test_active() is true for this dirty folio, the dirty
> > folio will now be placed into the *second youngest generation* (see
> > lru_gen_folio_seq()).
>
> Yeah, and that's exactly what we want. Or else, these folios will
> stay at oldest gen, following scan will keep seeing them and hence
> keep bouncing these folios again and again to a younger gen since
> they are not reclaimable.
>
> The writeback callback (folio_rotate_reclaimable) will move them
> back to tail once they are actually reclaimable. So we are not
> losing any ability to reclaim them. Am I missing anything?
>

This makes sense to me. As long as folio_rotate_reclaimable()
exists, we can move those folios back to the tail once they are
clean and ready for reclaim.

This reminds me of Ridong's patch, which tried to emulate MGLRU's
behavior by 'rotating' folios whose IO completed during isolation and
thus missed folio_rotate_reclaimable() in the active/inactive LRUs [1].
Not sure if that patch has managed to land since v7.

		/* retry folios that may have missed folio_rotate_reclaimable() */
		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
			list_move(&folio->lru, &clean);
			continue;
		}

[1] https://lore.kernel.org/linux-mm/20250111091504.1363075-1-chenridong@huaweicloud.com/

Best Regards
Barry
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Baolin Wang 1 day, 9 hours ago

On 3/31/26 5:18 PM, Kairui Song wrote:
> On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
>>> [...]
>>
>> I'm a bit concerned about the handling of dirty folios.
>>
>> In the original logic, if we encounter a dirty folio, we increment its
>> generation counter by 1 and move it to the *second oldest generation*.
>>
>> However, with your patch, shrink_folio_list() will activate the dirty folio
>> by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
>> will put the dirty folio back into the MGLRU list.
>>
>> But because the folio_test_active() is true for this dirty folio, the dirty
>> folio will now be placed into the *second youngest generation* (see
>> lru_gen_folio_seq()).
> 
> Yeah, and that's exactly what we want. Or else, these folios will
> stay at oldest gen, following scan will keep seeing them and hence

Not the oldest gen; instead, they will be moved into the second oldest
gen, right?

if (writeback || (type == LRU_GEN_FILE && dirty)) {
	gen = folio_inc_gen(lruvec, folio, true);
	list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
	return true;
}

> keep bouncing these folios again and again to a younger gen since
> they are not reclaimable.
> 
> The writeback callback (folio_rotate_reclaimable) will move them
> back to tail once they are actually reclaimable. So we are not
> losing any ability to reclaim them. Am I missing anything?

Right.

>> As a result, during the next eviction, these dirty folios won't be scanned
>> again (because they are in the second youngest generation). Wouldn't this
>> lead to a situation where the flusher cannot be woken up in time, making OOM
>> more likely?
> 
> No? Flusher is already woken up by the time they are seen for the
> first time. If we see these folios again very soon, the LRU is
> congested, one following patch handles the congested case too by
> throttling (which was completely missing previously). And now we

Yes, throttling is what we expect.

My concern is that if all dirty folios are requeued into the *second 
youngest generation*, it might lead to the throttling mechanism in 
shrink_folio_list() becoming ineffective (because these dirty folios are 
no longer scanned again), resulting in a failure to throttle reclamation 
and leaving no reclaimable folios to scan, potentially causing premature 
OOM.

Specifically, if the reclaimer scans a memcg's MGLRU for the first time,
all dirty folios are moved into the *second youngest generation*, so the
*oldest generation* will be empty and will be removed by
try_to_inc_min_seq(), leaving only 3 generations now.

Then on the next scan, we cannot find any file folios to scan, and if 
the writeback of the memcg’s dirty folios has not yet completed, this 
can lead to a premature OOM.

If, as in the original logic, these dirty folios were scanned by
shrink_folio_list() and moved into the *second oldest generation*, then
when the *oldest generation* becomes empty and is removed, the reclaimer
could still continue scanning the dirty folios (the former second oldest
generation becomes the oldest generation), thereby continuing to trigger
shrink_folio_list()'s writeback throttling and avoiding a premature OOM.

Am I overthinking this?

> are actually a bit more proactive about waking up the flusher,
> since the wakeup hook is moved inside the loop instead of after
> the whole loop is finished.
> 
> These two behavior change above is basically just unifying MGLRU to do
> what the classical LRU has been doing for years, and result looks
> really good.

One difference is that, for the classical LRU, if the inactive list is
low, we will run shrink_active_list() to refill the inactive list.

But for MGLRU, after your changes, we might not perform aging (e.g., 
DEF_PRIORITY will skip aging), which could make shrink_folio_list()’s 
throttling less effective than expected, as I mentioned above.
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song 1 day, 7 hours ago
On Wed, Apr 01, 2026 at 10:52:54AM +0800, Baolin Wang wrote:
> 
> 
> On 3/31/26 5:18 PM, Kairui Song wrote:
> > On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > > 
> > > > [ ... ]
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 8de5c8d5849e..17b5318fad39 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > >    		       int tier_idx)
> > > >    {
> > > >    	bool success;
> > > > -	bool dirty, writeback;
> > > >    	int gen = folio_lru_gen(folio);
> > > >    	int type = folio_is_file_lru(folio);
> > > >    	int zone = folio_zonenum(folio);
> > > > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > >    		return true;
> > > >    	}
> > > > -	dirty = folio_test_dirty(folio);
> > > > -	writeback = folio_test_writeback(folio);
> > > > -	if (type == LRU_GEN_FILE && dirty) {
> > > > -		sc->nr.file_taken += delta;
> > > > -		if (!writeback)
> > > > -			sc->nr.unqueued_dirty += delta;
> > > > -	}
> > > > -
> > > > -	/* waiting for writeback */
> > > > -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > > > -		gen = folio_inc_gen(lruvec, folio, true);
> > > > -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > > -		return true;
> > > > -	}
> > > 
> > > I'm a bit concerned about the handling of dirty folios.
> > > 
> > > In the original logic, if we encounter a dirty folio, we increment its
> > > generation counter by 1 and move it to the *second oldest generation*.
> > > 
> > > However, with your patch, shrink_folio_list() will activate the dirty folio
> > > by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> > > will put the dirty folio back into the MGLRU list.
> > > 
> > > But because the folio_test_active() is true for this dirty folio, the dirty
> > > folio will now be placed into the *second youngest generation* (see
> > > lru_gen_folio_seq()).
> > 
> > Yeah, and that's exactly what we want. Or else, these folios will
> > stay at oldest gen, following scan will keep seeing them and hence
> 
> Not the oldest gen, instead, they will be moved into the second oldest gen,
> right?
> 
> if (writeback || (type == LRU_GEN_FILE && dirty)) {
> 	gen = folio_inc_gen(lruvec, folio, true);
> 	list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> 	return true;
> }

Right, it is still similar though: the scanner will see these folios
again very soon once the oldest gen is drained.

> > > As a result, during the next eviction, these dirty folios won't be scanned
> > > again (because they are in the second youngest generation). Wouldn't this
> > > lead to a situation where the flusher cannot be woken up in time, making OOM
> > > more likely?
> > 
> > No? Flusher is already woken up by the time they are seen for the
> > first time. If we see these folios again very soon, the LRU is
> > congested, one following patch handles the congested case too by
> > throttling (which was completely missing previously). And now we
> 
> Yes, throttling is what we expect.
> 
> My concern is that if all dirty folios are requeued into the *second
> youngest generation*, it might lead to the throttling mechanism in
> shrink_folio_list() becoming ineffective (because these dirty folios are no
> longer scanned again), resulting in a failure to throttle reclamation and
> leaving no reclaimable folios to scan, potentially causing premature OOM.

They are scanned again just fine once the older gens are drained.
MGLRU uses PID and protection, so it might seem harder for promoted
folios to get demoted to the tail - but we are not activating them to
the head either; the second youngest gen is not that far away from
the tail.

The classic LRU simply moves these pages to the head of the active
list, so it takes a full scan iteration of the whole lruvec before
these folios are seen again, and we don't throttle unless there is
really no way to make progress.

> Specifically, if the reclaimer scan a memcg's MGLRU first time, all dirty
> folios are moved into the *second youngest generation*, the *oldest
> generation* will be empty and will be removed by try_to_inc_min_seq(),
> leaving only 3 generations now.
> 
> Then on the next scan, we cannot find any file folios to scan, and if the
> writeback of the memcg’s dirty folios has not yet completed, this can lead
> to a premature OOM.

Let's walk through this concretely. Assume gen 4 is youngest, gen 1 is
oldest. Dirty folios are activated to gen 3 (second youngest). Then
gen 1 is drained and removed. Gen 2 becomes the new oldest, and it
is still evictable.

If we are so unlucky that gen 2 is empty or unevictable, anon reclaim
is still available. And if anon is unevictable (no swap, swap full,
or getting recycled), then file eviction proceeds - MGLRU's forced
aging is performed as the anon gen is drained.

Gen 3's content (demoted) is reached after old gen 2 is dropped, by
which point the flusher could have been running for two full
generation-drain cycles and finished. We are all good.

Overall I think these issues seem trivial considering the chance
and time window of reclaim rotation vs aging, and the worst we
get here is a bit more anon reclaim. The anon / file balance
and swappiness issue when the gen gap is >= 2 is worth a separate fix.

> 
> If, as in the original logic, these dirty folios are scanned by
> shrink_folio_list() and moved them into the *second oldest generation*, then
> when the *oldest generation* becomes empty and is removed, reclaimer can
> still continue scanning the dirty folios (the former second oldest
> generation becomes the oldest generation), thereby continuing to trigger
> shrink_folio_list()’s writeback throttling and avoiding a premature OOM.

Moving them to gen 2 (second oldest) blocks reclaim of gen 2 and starts
throttling early, while gen 2 is very likely reclaimable with clean
folios. The classic LRU scans the whole LRU before starting to
throttle, precisely to avoid that.

So I even hesitated about moving these folios to the youngest gen here.
It might be fine as the youngest gen in theory should be hottest, so
skipping it might not be a bad idea.

> 
> Am I overthinking this?

We lived without throttling or proper dirty writeback handling for
years (e.g. the benchmark represents a lot of real workloads).
Things are getting much better, so I think we are fine :)

Have been testing this new design on servers and my Android phone,
so far everything looks good.

> > are actually a bit more proactive about waking up the flusher,
> > since the wakeup hook is moved inside the loop instead of after
> > the whole loop is finished.
> > 
> > These two behavior change above is basically just unifying MGLRU to do
> > what the classical LRU has been doing for years, and result looks
> > really good.
> 
> One difference is that, For classical LRU, if the inactive list is low, we
> will run shrink_active_list() to refill the inactive list.
> 
> But for MGLRU, after your changes, we might not perform aging (e.g.,
> DEF_PRIORITY will skip aging), which could make shrink_folio_list()’s
> throttling less effective than expected, as I mentioned above.

That refill doesn't change the order of folios, it just shifts the
LRU as a whole. So essentially it needs to scan the whole LRU before
throttling. I think we might even still be a bit too aggressive, since
gen 4 is not touched before throttling starts, but gen 4 being
protected seems sane, so the whole picture looks alright.

DEF_PRIORITY gets escalated easily if the scan fails to satisfy the
reclaimer's need.
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song 4 days, 4 hours ago
Forwarding sashiko review:

https://sashiko.dev/#/message/20260329-mglru-reclaim-v2-8-b53a3678513c%40tencent.com

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8de5c8d5849e..17b5318fad39 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >  		return true;
> >  	}
> >  
> > -	dirty = folio_test_dirty(folio);
> > -	writeback = folio_test_writeback(folio);
> > -	if (type == LRU_GEN_FILE && dirty) {
> > -		sc->nr.file_taken += delta;
> > -		if (!writeback)
> > -			sc->nr.unqueued_dirty += delta;
> > -	}
> > -
> > -	/* waiting for writeback */
> > -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > -		gen = folio_inc_gen(lruvec, folio, true);
> > -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > -		return true;
> > -	}
> > -
> >  	return false;
> >  }
>
> Could removing this proactive promotion cause a livelock and premature OOM
> during GFP_NOIO or GFP_NOFS direct reclaim?
> Without this check, dirty file folios fall through to isolate_folio(). If
> the reclaim does not have __GFP_IO, isolate_folio() will skip them.
> These skipped folios are spliced back to the front of the oldest generation
> list in scan_folios(). Because isolate_folio() isolates 0 folios,
> evict_folios() returns early.
> This completely bypasses the new wakeup_flusher_threads() call, leaving the
> dirty folios unqueued for writeback, and the oldest generation can never
> advance.

This is a nice find. For GFP_NOIO (or actually !__GFP_IO), we also need
to activate and set reclaim on the dirty folios. It's a narrow case though.

Following update should work:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8170aee096e9..342ba3afe77c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4641,8 +4641,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 
        /* swap constrained */
        if (!(sc->gfp_mask & __GFP_IO) &&
-           (folio_test_dirty(folio) ||
-            (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+            (folio_test_anon(folio) && !folio_test_swapcache(folio)))
                return false;

> [ ... ]
> > @@ -4858,12 +4840,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  		return scanned;
> >  retry:
> >  	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> > -	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> >  	sc->nr_reclaimed += reclaimed;
> >  	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> >  			type_scanned, reclaimed, &stat, sc->priority,
> >  			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >  
> > +	/*
> > +	 * If too many file cache in the coldest generation can't be evicted
> > +	 * due to being dirty, wake up the flusher.
> > +	 */
> > +	if (stat.nr_unqueued_dirty == isolated) {
>
> Is the isolated variable stale when evaluated on the retry path?
> If evict_folios() jumps back to the retry label, shrink_folio_list()
> processes a smaller list of only clean folios. The isolated variable retains
> the size of the original list, while stat.nr_unqueued_dirty can only be as
> large as the new, smaller list.
> Does this logically impossible condition cause any unintended behavior, or
> should the check be moved outside the retry loop to avoid confusion?

This is fine: stat.nr_unqueued_dirty is always at most isolated. The
"retry" label above is only used to handle folios that failed to be
reclaimed after isolation. Meanwhile, I do think we should clean up this
retry logic, as it will also confuse the tracepoint. Better to do it later.
Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
Posted by Kairui Song 4 days, 3 hours ago
On Sun, Mar 29, 2026 at 4:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Forwarding sashiko review:
>
> https://sashiko.dev/#/message/20260329-mglru-reclaim-v2-8-b53a3678513c%40tencent.com
>
> > > [ ... ]
> >
> > Could removing this proactive promotion cause a livelock and premature OOM
> > during GFP_NOIO or GFP_NOFS direct reclaim?
> > Without this check, dirty file folios fall through to isolate_folio(). If
> > the reclaim does not have __GFP_IO, isolate_folio() will skip them.
> > These skipped folios are spliced back to the front of the oldest generation
> > list in scan_folios(). Because isolate_folio() isolates 0 folios,
> > evict_folios() returns early.
> > This completely bypasses the new wakeup_flusher_threads() call, leaving the
> > dirty folios unqueued for writeback, and the oldest generation can never
> > advance.
>
> This is a nice find. For GFP_NOIO (or actually !__GFP_IO), we also need
> to activate and set reclaim on the dirty folios. It's a narrow case though.
>
> Following update should work:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8170aee096e9..342ba3afe77c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4641,8 +4641,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>
>         /* swap constrained */
>         if (!(sc->gfp_mask & __GFP_IO) &&
> -           (folio_test_dirty(folio) ||
> -            (folio_test_anon(folio) && !folio_test_swapcache(folio))))
> +            (folio_test_anon(folio) && !folio_test_swapcache(folio)))

Or this check should just be removed. shrink_folio_list already has a
check for swap and a more accurate may_enter_fs check.