[PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions

Posted by Zhang Peng via B4 Relay 1 month ago
From: bruzzhang <bruzzhang@tencent.com>

Currently we flush TLB for every dirty folio, which is a bottleneck for
systems with many cores as this causes heavy IPI usage.

So instead, batch the folios and flush once for every 31 folios (one
folio_batch). These folios are held in a folio_batch with their locks
released; when the folio_batch is full, do the following steps:

- For each folio: lock - check still evictable - unlock
  - If no longer evictable, return the folio to the caller.
- Flush TLB once for the batch
- Pageout the folios (refcount freeze happens in the pageout path)

Note we can't hold a frozen folio in folio_batch for long as it will
cause filemap/swapcache lookup to livelock. Fortunately pageout usually
won't take too long; sync IO is fast, and non-sync IO will be issued
with the folio marked writeback.

Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: bruzzhang <bruzzhang@tencent.com>
---
 mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a336f7fc7dae..69cdd3252ff8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1240,6 +1240,48 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
 	VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
 			folio_test_unevictable(folio), folio);
 }
+
+static void pageout_batch(struct folio_batch *fbatch,
+			  struct list_head *ret_folios,
+			  struct folio_batch *free_folios,
+			  struct scan_control *sc, struct reclaim_stat *stat,
+			  struct swap_iocb **plug, struct list_head *folio_list)
+{
+	int i = 0, count = folio_batch_count(fbatch);
+	struct folio *folio;
+
+	folio_batch_reinit(fbatch);
+	do {
+		folio = fbatch->folios[i];
+		if (!folio_trylock(folio)) {
+			list_add(&folio->lru, ret_folios);
+			continue;
+		}
+
+		if (folio_test_writeback(folio) || folio_test_lru(folio) ||
+		    folio_mapped(folio))
+			goto next;
+		folio_batch_add(fbatch, folio);
+		continue;
+next:
+		folio_unlock(folio);
+		list_add(&folio->lru, ret_folios);
+	} while (++i != count);
+
+	i = 0;
+	count = folio_batch_count(fbatch);
+	if (!count)
+		return;
+	/* One TLB flush for the batch */
+	try_to_unmap_flush_dirty();
+	do {
+		folio = fbatch->folios[i];
+		pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
+			    folio_list);
+	} while (++i != count);
+	folio_batch_reinit(fbatch);
+}
+
 /*
  * Reclaimed folios are counted in stat->nr_reclaimed.
  */
@@ -1249,6 +1291,8 @@ static void shrink_folio_list(struct list_head *folio_list,
 		struct mem_cgroup *memcg)
 {
 	struct folio_batch free_folios;
+	struct folio_batch flush_folios;
+
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_demoted = 0;
@@ -1257,6 +1301,8 @@ static void shrink_folio_list(struct list_head *folio_list,
 	struct swap_iocb *plug = NULL;
 
 	folio_batch_init(&free_folios);
+	folio_batch_init(&flush_folios);
+
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
@@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;
-
 			/*
-			 * Folio is dirty. Flush the TLB if a writable entry
-			 * potentially exists to avoid CPU writes after I/O
-			 * starts and then write it out here.
+			 * For anon, we should only see swap cache (anon) and
+			 * the list pinning the page. For file page, the filemap
+			 * and the list pins it. Combined with the page_ref_freeze
+			 * in pageout_batch ensure nothing else touches the page
+			 * during lock unlocked.
 			 */
-			try_to_unmap_flush_dirty();
-			pageout_one(folio, &ret_folios, &free_folios, sc, stat,
-				&plug, folio_list);
+			folio_unlock(folio);
+			if (!folio_batch_add(&flush_folios, folio))
+				pageout_batch(&flush_folios,
+							&ret_folios, &free_folios,
+							sc, stat, &plug,
+							folio_list);
 			goto next;
 		}
 
@@ -1614,6 +1664,10 @@ static void shrink_folio_list(struct list_head *folio_list,
 next:
 		continue;
 	}
+	if (folio_batch_count(&flush_folios)) {
+		pageout_batch(&flush_folios, &ret_folios, &free_folios, sc,
+			      stat, &plug, folio_list);
+	}
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */

-- 
2.43.7
Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
Posted by Usama Arif 1 month ago
On Mon, 09 Mar 2026 16:17:42 +0800 Zhang Peng via B4 Relay <devnull+zippermonkey.icloud.com@kernel.org> wrote:

> From: bruzzhang <bruzzhang@tencent.com>
> 
> Currently we flush TLB for every dirty folio, which is a bottleneck for
> systems with many cores as this causes heavy IPI usage.
> 
> So instead, batch the folios, and flush once for every 31 folios (one
> folio_batch). These folios will be held in a folio_batch releasing their
> lock, then when folio_batch is full, do following steps:
> 
> - For each folio: lock - check still evictable - unlock
>   - If no longer evictable, return the folio to the caller.
> - Flush TLB once for the batch
> - Pageout the folios (refcount freeze happens in the pageout path)
> 
> Note we can't hold a frozen folio in folio_batch for long as it will
> cause filemap/swapcache lookup to livelock. Fortunately pageout usually
> won't take too long; sync IO is fast, and non-sync IO will be issued
> with the folio marked writeback.
> 
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: bruzzhang <bruzzhang@tencent.com>
> ---
>  mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a336f7fc7dae..69cdd3252ff8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1240,6 +1240,48 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>  			folio_test_unevictable(folio), folio);
>  }
> +
> +static void pageout_batch(struct folio_batch *fbatch,
> +			  struct list_head *ret_folios,
> +			  struct folio_batch *free_folios,
> +			  struct scan_control *sc, struct reclaim_stat *stat,
> +			  struct swap_iocb **plug, struct list_head *folio_list)
> +{
> +	int i = 0, count = folio_batch_count(fbatch);
> +	struct folio *folio;
> +
> +	folio_batch_reinit(fbatch);
> +	do {
> +		folio = fbatch->folios[i];
> +		if (!folio_trylock(folio)) {
> +			list_add(&folio->lru, ret_folios);
> +			continue;
> +		}
> +
> +		if (folio_test_writeback(folio) || folio_test_lru(folio) ||
> +		    folio_mapped(folio))
> +			goto next;
> +		folio_batch_add(fbatch, folio);
> +		continue;
> +next:
> +		folio_unlock(folio);
> +		list_add(&folio->lru, ret_folios);
> +	} while (++i != count);

Hello!

Instead of do {} while (++i != count), a standard for loop would be
better for code readability.

> +
> +	i = 0;
> +	count = folio_batch_count(fbatch);
> +	if (!count)
> +		return;
> +	/* One TLB flush for the batch */
> +	try_to_unmap_flush_dirty();
> +	do {
> +		folio = fbatch->folios[i];
> +		pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
> +			    folio_list);
> +	} while (++i != count);
> +	folio_batch_reinit(fbatch);
> +}
> +
>  /*
>   * Reclaimed folios are counted in stat->nr_reclaimed.
>   */
> @@ -1249,6 +1291,8 @@ static void shrink_folio_list(struct list_head *folio_list,
>  		struct mem_cgroup *memcg)
>  {
>  	struct folio_batch free_folios;
> +	struct folio_batch flush_folios;
> +
>  	LIST_HEAD(ret_folios);
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_demoted = 0;
> @@ -1257,6 +1301,8 @@ static void shrink_folio_list(struct list_head *folio_list,
>  	struct swap_iocb *plug = NULL;
>  
>  	folio_batch_init(&free_folios);
> +	folio_batch_init(&flush_folios);
> +
>  	memset(stat, 0, sizeof(*stat));
>  	cond_resched();
>  	do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
> @@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
>  				goto keep_locked;
>  			if (!sc->may_writepage)
>  				goto keep_locked;
> -
>  			/*
> -			 * Folio is dirty. Flush the TLB if a writable entry
> -			 * potentially exists to avoid CPU writes after I/O
> -			 * starts and then write it out here.
> +			 * For anon, we should only see swap cache (anon) and
> +			 * the list pinning the page. For file page, the filemap
> +			 * and the list pins it. Combined with the page_ref_freeze
> +			 * in pageout_batch ensure nothing else touches the page
> +			 * during lock unlocked.
>  			 */

page_ref_freeze happens inside pageout_one() -> pageout() -> __remove_mapping(),
which runs after the folio is re-locked and after the TLB flush.  During
the unlocked window, the refcount is not frozen. Right?

With this patch, the folio is unlocked before try_to_unmap_flush_dirty() runs
in pageout_batch(). During this window, stale TLB entries on other CPUs could
still allow writes to the folio after it has been selected for pageout. My
understanding is that the original code intentionally flushed the TLB while
the folio was locked to prevent this. Could data corruption result if a write
through a stale TLB entry races with the pageout I/O?


> -			try_to_unmap_flush_dirty();
> -			pageout_one(folio, &ret_folios, &free_folios, sc, stat,
> -				&plug, folio_list);
> +			folio_unlock(folio);
> +			if (!folio_batch_add(&flush_folios, folio))
> +				pageout_batch(&flush_folios,
> +							&ret_folios, &free_folios,
> +							sc, stat, &plug,
> +							folio_list);
>  			goto next;
>  		}
>  
> @@ -1614,6 +1664,10 @@ static void shrink_folio_list(struct list_head *folio_list,
>  next:
>  		continue;
>  	}
> +	if (folio_batch_count(&flush_folios)) {
> +		pageout_batch(&flush_folios, &ret_folios, &free_folios, sc,
> +			      stat, &plug, folio_list);
> +	}
>  	/* 'folio_list' is always empty here */
>  
>  	/* Migrate folios selected for demotion */
> 
> -- 
> 2.43.7
Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
Posted by Zhang Peng 1 month ago
Hi Usama,

Thanks for the review!

You are right that the comment is wrong, page_ref_freeze does not exist in
pageout_batch(). I will fix the comment in v2.

Regarding the data corruption concern: try_to_unmap_flush_dirty() is called
before pageout_one(), so all stale writable TLB entries are invalidated
before IO starts. Any writes through stale TLB entries during the unlocked
window will have completed and landed in physical memory before the flush,
and will be correctly captured by the subsequent IO.

pageout_batch() relocks each folio and rechecks its state (writeback, lru,
mapped, dma_pinned) before proceeding. If any of these conditions have
changed during the unlocked window, the folio is not written out and is put
back to the LRU list for a future reclaim attempt. So there should be no
data corruption issue.

I will also add a folio_maybe_dma_pinned() check in v2 as suggested by
Kairui Song.
Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
Posted by Kairui Song 1 month ago
On Mon, Mar 9, 2026 at 8:42 PM Usama Arif <usama.arif@linux.dev> wrote:
>
> On Mon, 09 Mar 2026 16:17:42 +0800 Zhang Peng via B4 Relay <devnull+zippermonkey.icloud.com@kernel.org> wrote:
>
> > From: bruzzhang <bruzzhang@tencent.com>
> >
> > Currently we flush TLB for every dirty folio, which is a bottleneck for
> > systems with many cores as this causes heavy IPI usage.
> >
> > So instead, batch the folios, and flush once for every 31 folios (one
> > folio_batch). These folios will be held in a folio_batch releasing their
> > lock, then when folio_batch is full, do following steps:
> >
> > - For each folio: lock - check still evictable - unlock
> >   - If no longer evictable, return the folio to the caller.
> > - Flush TLB once for the batch
> > - Pageout the folios (refcount freeze happens in the pageout path)
> >
> > Note we can't hold a frozen folio in folio_batch for long as it will
> > cause filemap/swapcache lookup to livelock. Fortunately pageout usually
> > won't take too long; sync IO is fast, and non-sync IO will be issued
> > with the folio marked writeback.
> >
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: bruzzhang <bruzzhang@tencent.com>
> > ---
> >  mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 61 insertions(+), 7 deletions(-)

...

> >       folio_batch_init(&free_folios);
> > +     folio_batch_init(&flush_folios);
> > +
> >       memset(stat, 0, sizeof(*stat));
> >       cond_resched();
> >       do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
> > @@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
> >                               goto keep_locked;
> >                       if (!sc->may_writepage)
> >                               goto keep_locked;
> > -
> >                       /*
> > -                      * Folio is dirty. Flush the TLB if a writable entry
> > -                      * potentially exists to avoid CPU writes after I/O
> > -                      * starts and then write it out here.
> > +                      * For anon, we should only see swap cache (anon) and
> > +                      * the list pinning the page. For file page, the filemap
> > +                      * and the list pins it. Combined with the page_ref_freeze
> > +                      * in pageout_batch ensure nothing else touches the page
> > +                      * during lock unlocked.
> >                        */
>
> page_ref_freeze happens inside pageout_one() -> pageout() -> __remove_mapping(),
> which runs after the folio is re-locked and after the TLB flush.  During
> the unlocked window, the refcount is not frozen. Right?
>
> With this patch, the folio is unlocked before try_to_unmap_flush_dirty() runs
> in pageout_batch(). During this window, TLB entries on other CPUs could allow
> writes to the folio after it has been selected for pageout. My understanding
> is that the original code intentionally flushed TLB while the folio was locked
> to prevent this? Could there be data corruption can result if a write through
> a stale TLB entry races with the pageout I/O?

Hi Usama,

Thanks for the review. Yeah the comment here seems wrong, I agree with you.

Hi Peng, I think you might have copied in a stale comment; at least
page_ref_freeze doesn't exist here, and that doesn't seem to be how
this patch currently works. Can you help double check and update?

These folios are kept in the batch unlocked, unfrozen, and unmapped.
They could get mapped or touched again, so the batch flush has to
relock the folios and redo some of the checks that were done before the
unmap; only if they are still ready to be freed does it flush the TLB,
do the IO, and free them.

BTW, some checks seem to be missing in the batch recheck, e.g.
folio_maybe_dma_pinned().