From: Zhang Peng <bruzzhang@tencent.com>
Currently we flush TLB for every dirty folio, which is a bottleneck for
systems with many cores as this causes heavy IPI usage.
So instead, batch the folios and flush once for every 31 folios (one
folio_batch). These folios are held in a folio_batch with their locks
released; then, when the folio_batch is full, do the following steps:
- For each folio: lock it, then re-check that it is still evictable
  (i.e. not under writeback, not on the LRU, not mapped, and not
  DMA-pinned)
- If no longer evictable, put it back on the LRU
- Flush TLB once for the batch
- Pageout the folios
Note we can't hold a frozen folio in the folio_batch for long, as that
would cause filemap/swapcache lookups to livelock. Fortunately, pageout
usually won't take too long: sync IO is fast, and non-sync IO will be
issued with the folio marked writeback.
Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Zhang Peng <bruzzhang@tencent.com>
---
mm/vmscan.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 62 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63cc88c875e8..27de8034f582 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1217,6 +1217,47 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
folio_test_unevictable(folio), folio);
}
+static void pageout_batch(struct folio_batch *fbatch,
+ struct list_head *ret_folios,
+ struct folio_batch *free_folios,
+ struct scan_control *sc, struct reclaim_stat *stat,
+ struct swap_iocb **plug, struct list_head *folio_list)
+{
+ int i, count = folio_batch_count(fbatch);
+ struct folio *folio;
+
+ folio_batch_reinit(fbatch);
+ for (i = 0; i < count; ++i) {
+ folio = fbatch->folios[i];
+ if (!folio_trylock(folio)) {
+ list_add(&folio->lru, ret_folios);
+ continue;
+ }
+
+ if (folio_test_writeback(folio) || folio_test_lru(folio) ||
+ folio_mapped(folio) || folio_maybe_dma_pinned(folio)) {
+ folio_unlock(folio);
+ list_add(&folio->lru, ret_folios);
+ continue;
+ }
+
+ folio_batch_add(fbatch, folio);
+ }
+
+ i = 0;
+ count = folio_batch_count(fbatch);
+ if (!count)
+ return;
+ /* One TLB flush for the batch */
+ try_to_unmap_flush_dirty();
+ for (i = 0; i < count; ++i) {
+ folio = fbatch->folios[i];
+ pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
+ folio_list);
+ }
+ folio_batch_reinit(fbatch);
+}
+
static bool folio_try_unmap(struct folio *folio, struct reclaim_stat *stat,
unsigned int nr_pages)
{
@@ -1264,6 +1305,8 @@ static void shrink_folio_list(struct list_head *folio_list,
struct mem_cgroup *memcg)
{
struct folio_batch free_folios;
+ struct folio_batch flush_folios;
+
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
unsigned int nr_demoted = 0;
@@ -1272,6 +1315,8 @@ static void shrink_folio_list(struct list_head *folio_list,
struct swap_iocb *plug = NULL;
folio_batch_init(&free_folios);
+ folio_batch_init(&flush_folios);
+
memset(stat, 0, sizeof(*stat));
cond_resched();
do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
@@ -1565,15 +1610,21 @@ static void shrink_folio_list(struct list_head *folio_list,
goto keep_locked;
if (!sc->may_writepage)
goto keep_locked;
-
/*
- * Folio is dirty. Flush the TLB if a writable entry
- * potentially exists to avoid CPU writes after I/O
- * starts and then write it out here.
+ * For anon, we should only see swap cache (anon) and
+ * the list pinning the page. For file page, the filemap
+ * and the list pins it. The folio is unlocked while
+ * held in the batch, so pageout_batch() relocks each
+ * folio and rechecks its state. If the folio is under
+ * writeback, on LRU, mapped, or DMA-pinned, it will
+ * not be written out and is put back to LRU list.
*/
- try_to_unmap_flush_dirty();
- pageout_one(folio, &ret_folios, &free_folios, sc, stat,
- &plug, folio_list);
+ folio_unlock(folio);
+ if (!folio_batch_add(&flush_folios, folio))
+ pageout_batch(&flush_folios,
+ &ret_folios, &free_folios,
+ sc, stat, &plug,
+ folio_list);
goto next;
}
@@ -1603,6 +1654,10 @@ static void shrink_folio_list(struct list_head *folio_list,
next:
continue;
}
+ if (folio_batch_count(&flush_folios)) {
+ pageout_batch(&flush_folios, &ret_folios, &free_folios, sc,
+ stat, &plug, folio_list);
+ }
/* 'folio_list' is always empty here */
/* Migrate folios selected for demotion */
--
2.43.7
On Thu, Mar 26, 2026 at 04:36:21PM +0800, Zhang Peng wrote:
> From: Zhang Peng <bruzzhang@tencent.com>
>
> Currently we flush TLB for every dirty folio, which is a bottleneck for
> systems with many cores as this causes heavy IPI usage.
>
> So instead, batch the folios, and flush once for every 31 folios (one
> folio_batch). These folios will be held in a folio_batch releasing their
> lock, then when folio_batch is full, do following steps:
>
> - For each folio: lock - check still evictable (writeback, lru, mapped,
> dma_pinned)
> - If no longer evictable, put back to LRU
> - Flush TLB once for the batch
> - Pageout the folios
>
> Note we can't hold a frozen folio in folio_batch for long as it will
> cause filemap/swapcache lookup to livelock. Fortunately pageout usually
> won't take too long; sync IO is fast, and non-sync IO will be issued
> with the folio marked writeback.
>
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Zhang Peng <bruzzhang@tencent.com>
> ---
> mm/vmscan.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 62 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 63cc88c875e8..27de8034f582 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1217,6 +1217,47 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
> folio_test_unevictable(folio), folio);
> }
>
> +static void pageout_batch(struct folio_batch *fbatch,
> + struct list_head *ret_folios,
> + struct folio_batch *free_folios,
> + struct scan_control *sc, struct reclaim_stat *stat,
> + struct swap_iocb **plug, struct list_head *folio_list)
> +{
> + int i, count = folio_batch_count(fbatch);
> + struct folio *folio;
> +
> + folio_batch_reinit(fbatch);
> + for (i = 0; i < count; ++i) {
> + folio = fbatch->folios[i];
> + if (!folio_trylock(folio)) {
> + list_add(&folio->lru, ret_folios);
> + continue;
> + }
> +
> + if (folio_test_writeback(folio) || folio_test_lru(folio) ||
If PG_lru is set here, we're in a world of trouble as we're actively using
folio->lru. I don't think it's possible for it to be set, as isolating folios
clears lru, and refcount bump means the folio cannot be reused or reinserted
back on the LRU. So perhaps:
VM_WARN_ON_FOLIO(folio_test_lru(folio), folio);
> + folio_mapped(folio) || folio_maybe_dma_pinned(folio)) {
> + folio_unlock(folio);
> + list_add(&folio->lru, ret_folios);
> + continue;
> + }
> +
> + folio_batch_add(fbatch, folio);
> + }
> +
> + i = 0;
> + count = folio_batch_count(fbatch);
> + if (!count)
> + return;
> + /* One TLB flush for the batch */
> + try_to_unmap_flush_dirty();
> + for (i = 0; i < count; ++i) {
> + folio = fbatch->folios[i];
> + pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
> + folio_list);
Would be lovely if we could pass the batch down to the swap layer.
> + }
> + folio_batch_reinit(fbatch);
The way you keep reinitializing fbatch is a bit confusing.
Probably worth a comment or two (or kdocs for pageout_batch documenting
that the folio batch is reset, etc).
> +}
> +
> static bool folio_try_unmap(struct folio *folio, struct reclaim_stat *stat,
> unsigned int nr_pages)
> {
> @@ -1264,6 +1305,8 @@ static void shrink_folio_list(struct list_head *folio_list,
> struct mem_cgroup *memcg)
> {
> struct folio_batch free_folios;
> + struct folio_batch flush_folios;
> +
> LIST_HEAD(ret_folios);
> LIST_HEAD(demote_folios);
> unsigned int nr_demoted = 0;
> @@ -1272,6 +1315,8 @@ static void shrink_folio_list(struct list_head *folio_list,
> struct swap_iocb *plug = NULL;
>
> folio_batch_init(&free_folios);
> + folio_batch_init(&flush_folios);
> +
> memset(stat, 0, sizeof(*stat));
> cond_resched();
> do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
> @@ -1565,15 +1610,21 @@ static void shrink_folio_list(struct list_head *folio_list,
> goto keep_locked;
> if (!sc->may_writepage)
> goto keep_locked;
> -
> /*
> - * Folio is dirty. Flush the TLB if a writable entry
> - * potentially exists to avoid CPU writes after I/O
> - * starts and then write it out here.
> + * For anon, we should only see swap cache (anon) and
> + * the list pinning the page. For file page, the filemap
> + * and the list pins it. The folio is unlocked while
> + * held in the batch, so pageout_batch() relocks each
> + * folio and rechecks its state. If the folio is under
> + * writeback, on LRU, mapped, or DMA-pinned, it will
> + * not be written out and is put back to LRU list.
> */
> - try_to_unmap_flush_dirty();
> - pageout_one(folio, &ret_folios, &free_folios, sc, stat,
> - &plug, folio_list);
> + folio_unlock(folio);
Why is the folio unlocked? I don't see the need to take the lock trip twice.
Is there something I'm missing?
--
Pedro
On Thu, Mar 26, 2026 at 12:40:50PM +0000, Pedro Falcato wrote:
> On Thu, Mar 26, 2026 at 04:36:21PM +0800, Zhang Peng wrote:
> > + folio_batch_reinit(fbatch);
> > + for (i = 0; i < count; ++i) {
> > + folio = fbatch->folios[i];
> > + if (!folio_trylock(folio)) {
> > + list_add(&folio->lru, ret_folios);
> > + continue;
> > + }
> > +
> > + if (folio_test_writeback(folio) || folio_test_lru(folio) ||
>
> If PG_lru is set here, we're in a world of trouble as we're actively
using
> folio->lru. I don't think it's possible for it to be set, as
isolating folios
> clears lru, and refcount bump means the folio cannot be reused or
reinserted
> back on the LRU. So perhaps:
> VM_WARN_ON_FOLIO(folio_test_lru(folio), folio);
Agreed. The folio was isolated from the LRU by shrink_folio_list()'s
caller with PG_lru cleared, and we hold a reference throughout. There is
no path that can re-set PG_lru while we hold the folio. Will replace the
folio_test_lru() check with VM_WARN_ON_FOLIO().
> > + folio_mapped(folio) || folio_maybe_dma_pinned(folio)) {
> > + folio_unlock(folio);
> > + list_add(&folio->lru, ret_folios);
> > + continue;
> > + }
> > +
> > + folio_batch_add(fbatch, folio);
> > + }
> > +
> > + i = 0;
> > + count = folio_batch_count(fbatch);
> > + if (!count)
> > + return;
> > + /* One TLB flush for the batch */
> > + try_to_unmap_flush_dirty();
> > + for (i = 0; i < count; ++i) {
> > + folio = fbatch->folios[i];
> > + pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
> > + folio_list);
>
> Would be lovely if we could pass the batch down to the swap layer.
Agreed, that would be the logical next step. For now I kept the scope
small to just batch the TLB flush, but submitting swap IO in batches
could further reduce per-folio overhead. Will look into that as a
follow-up.
> > + }
> > + folio_batch_reinit(fbatch);
>
> The way you keep reinitializing fbatch is a bit confusing.
> Probably worth a comment or two (or kdocs for pageout_batch documenting
> that the folio batch is reset, etc).
Will add kdocs. The first reinit only resets the batch's count (the
folio pointers remain in the array, and the original count is saved in a
local variable), so the batch can be refilled in place with the
still-evictable subset. The second reinit cleans up after pageout_one()
has consumed the entries. Will make this two-phase usage explicit in the
documentation.
> > + folio_unlock(folio);
>
> Why is the folio unlocked? I don't see the need to take the lock trip
twice.
> Is there something I'm missing?
We should not hold the folio lock longer than necessary. The folio sits
in flush_folios while the main loop continues scanning the remaining
folios -- accumulating a full batch can mean processing up to 31 more
folios through trylock, unmap, swap allocation, etc.
During that window the folio is still in swap cache and findable by
other CPUs. For example, do_swap_page() can look it up via
swap_cache_get_folio() and then block at folio_lock_or_retry() waiting
for us to finish accumulating. That is a direct stall on a process
trying to fault in its own memory.
Unlocking here and relocking in pageout_batch() is the same pattern used
by the demote path a few lines above.
© 2016 - 2026 Red Hat, Inc.