[v3] mm/mglru: improve reclaim loop and dirty folio handling

[PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim

Posted by Kairui Song via B4 Relay 3 days, 18 hours ago

From: Kairui Song <kasong@tencent.com>

With a fixed number to reclaim calculated at the beginning, making each
following step smaller should reduce the lock contention and avoid
over-aggressive reclaim of folios, as it will abort earlier when the
number of folios to be reclaimed is reached.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 643f9fc10214..9c28afb0219c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			break;
 		}
 
-		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;

-- 
2.53.0
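
As an illustration of the commit message's reasoning, here is a minimal userspace sketch, not the kernel code. It assumes that every scanned folio is reclaimed, that the loop stops as soon as the reclaim target is met, and that MIN_LRU_BATCH is 64 and MAX_LRU_BATCH is 64 * 64 as on a 64-bit build. The names sketch_reclaim, sketch_result and the SKETCH_* constants are hypothetical.

```c
#include <assert.h>

/* Assumed 64-bit values; the kernel derives these from BITS_PER_LONG. */
#define SKETCH_MIN_LRU_BATCH 64UL
#define SKETCH_MAX_LRU_BATCH (SKETCH_MIN_LRU_BATCH * 64UL)

struct sketch_result {
	unsigned long reclaimed;	/* folios reclaimed in total */
	unsigned long batches;		/* number of batch steps taken */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Consume a fixed scan budget in steps of at most batch_cap folios and
 * stop once nr_to_reclaim is reached. Simplification: each step is
 * assumed to reclaim everything it scans.
 */
static struct sketch_result sketch_reclaim(unsigned long nr_to_scan,
					   unsigned long nr_to_reclaim,
					   unsigned long batch_cap)
{
	struct sketch_result res = { 0, 0 };

	assert(batch_cap > 0);

	while (nr_to_scan > 0) {
		unsigned long nr_batch = min_ul(nr_to_scan, batch_cap);

		res.reclaimed += nr_batch;
		res.batches++;
		nr_to_scan -= nr_batch;

		/* Abort early once the target is met. */
		if (res.reclaimed >= nr_to_reclaim)
			break;
	}

	return res;
}
```

Under these assumptions, a budget of 8192 and a target of 100 reclaims 4096 folios in one step with SKETCH_MAX_LRU_BATCH, but only 128 folios in two steps with SKETCH_MIN_LRU_BATCH, so the overshoot is bounded by the batch size.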

Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim

Posted by Barry Song 3 days, 5 hours ago

On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> With a fixed number to reclaim calculated at the beginning, making each
> following step smaller should reduce the lock contention and avoid
> over-aggressive reclaim of folios, as it will abort earlier when the
> number of folios to be reclaimed is reached.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 643f9fc10214..9c28afb0219c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                         break;
>                 }
>
> -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);

I’m fine with the smaller batch size, but I wonder if
MIN_LRU_BATCH is too small.

Just curious if we are calling get_nr_to_scan() more frequently
before we can abort the while (true) loop if reclamation
is not making good progress.

Assume get_nr_to_scan() also has a cost. Not sure if a
value between MIN_LRU_BATCH and MAX_LRU_BATCH
would be better.

Thanks
Barry

Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim

Posted by Kairui Song 3 days, 4 hours ago

On Fri, Apr 03, 2026 at 03:50:37PM +0800, Barry Song wrote:
> On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > With a fixed number to reclaim calculated at the beginning, making each
> > following step smaller should reduce the lock contention and avoid
> > over-aggressive reclaim of folios, as it will abort earlier when the
> > number of folios to be reclaimed is reached.
> >
> > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 643f9fc10214..9c28afb0219c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                         break;
> >                 }
> >
> > -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
> 
> I’m fine with the smaller batch size, but I wonder if
> MIN_LRU_BATCH is too small.

Thanks for the review, Barry!

It's quite a reasonable value I think. For comparison, the classical LRU's
batch size is SWAP_CLUSTER_MAX (32), even smaller than
MIN_LRU_BATCH (64).

I ran many different benchmarks on this, which can be found in the
V2 / V1 cover letters (it was getting too long, so I didn't include these
results in V3, but I did retest). The new value looked good from large
servers to small VMs.

It's also a much more reasonable value for batch throttling and dirty
writeback IMO.

> 
> Just curious if we are calling get_nr_to_scan() more frequently
> before we can abort the while (true) loop if reclamation
> is not making good progress.
> 
> Assume get_nr_to_scan() also has a cost. Not sure if a
> value between MIN_LRU_BATCH and MAX_LRU_BATCH
> would be better.

We are actually calling that less frequently: in a previous
commit it was moved out of the loop to act like a budget
control. That's also where using a smaller batch starts
to make more sense.
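
To make the difference concrete, here is a rough userspace sketch based only on the description in this thread, not on the actual patch. sketch_budget is a hypothetical stand-in for get_nr_to_scan(), and both loops assume every step consumes its full batch.

```c
#include <assert.h>

/* Hypothetical stand-in for get_nr_to_scan(); counts how often it runs. */
static unsigned long sketch_budget(unsigned long remaining,
				   unsigned long *calls)
{
	(*calls)++;
	return remaining;
}

/* Budget recomputed on every iteration; returns the number of budget calls. */
static unsigned long sketch_calls_in_loop(unsigned long total,
					  unsigned long batch)
{
	unsigned long calls = 0;

	assert(batch > 0);

	while (total > 0) {
		unsigned long budget = sketch_budget(total, &calls);
		unsigned long step = budget < batch ? budget : batch;

		total -= step;
	}

	return calls;
}

/* Budget computed once up front, then consumed in batches. */
static unsigned long sketch_calls_hoisted(unsigned long total,
					  unsigned long batch)
{
	unsigned long calls = 0;
	unsigned long budget = sketch_budget(total, &calls);

	assert(batch > 0);

	while (budget > 0) {
		unsigned long step = budget < batch ? budget : batch;

		budget -= step;
	}

	return calls;
}
```

In this model, recomputing the budget inside the loop costs one call per batch (64 calls for 4096 folios at a batch of 64), while the hoisted form costs a single call regardless of the batch size.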

The overhead of other function calls also seems trivial.

I also wonder if we can unify or remove some
SWAP_CLUSTER_MAX usage; that value might no longer be
suitable in many places.

Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim

Posted by Barry Song 3 days, 4 hours ago

On Fri, Apr 3, 2026 at 5:09 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Apr 03, 2026 at 03:50:37PM +0800, Barry Song wrote:
> > On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
> > <devnull+kasong.tencent.com@kernel.org> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > With a fixed number to reclaim calculated at the beginning, making each
> > > following step smaller should reduce the lock contention and avoid
> > > over-aggressive reclaim of folios, as it will abort earlier when the
> > > number of folios to be reclaimed is reached.
> > >
> > > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > > Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/vmscan.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 643f9fc10214..9c28afb0219c 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                         break;
> > >                 }
> > >
> > > -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > > +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
> >
> > I’m fine with the smaller batch size, but I wonder if
> > MIN_LRU_BATCH is too small.
>
> Thanks for the review, Barry!
>
> It's quite a reasonable value I think. For comparison, the classical LRU's
> batch size is SWAP_CLUSTER_MAX (32), even smaller than
> MIN_LRU_BATCH (64).
>
> I ran many different benchmarks on this, which can be found in the
> V2 / V1 cover letters (it was getting too long, so I didn't include these
> results in V3, but I did retest). The new value looked good from large
> servers to small VMs.
>
> It's also a much more reasonable value for batch throttling and dirty
> writeback IMO.
>
> >
> > Just curious if we are calling get_nr_to_scan() more frequently
> > before we can abort the while (true) loop if reclamation
> > is not making good progress.
> >
> > Assume get_nr_to_scan() also has a cost. Not sure if a
> > value between MIN_LRU_BATCH and MAX_LRU_BATCH
> > would be better.
>
> We are actually calling that less frequently: in a previous
> commit it was moved out of the loop to act like a budget
> control. That's also where using a smaller batch starts
> to make more sense.

Sorry, I missed your earlier change of moving get_nr_to_scan()
out of the loop[1]. It makes a lot of sense to me now.

It would be easier to review if moving it out and decreasing the batch
size were put in the same patch. Anyway, I understand the story now,
thanks!

[1] https://lore.kernel.org/linux-mm/20260403-mglru-reclaim-v3-4-a285efd6ff91@tencent.com/

>
> The overhead of other function calls also seems trivial.
>
> I also wonder if we can unify or remove some
> SWAP_CLUSTER_MAX usage; that value might no longer be
> suitable in many places.