[PATCH v2 04/12] mm/mglru: restructure the reclaim loop

From: Kairui Song <kasong@tencent.com>

The current loop recalculates the scan number on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors: e.g., it only shifts the scan number by the reclaim priority
at the default priority, and it couples the number calculation with
aging and rotation.

Simplify it and decouple aging from rotation: calculate the scan number
once at the beginning of the reclaim, always respect the reclaim
priority, and make the aging and rotation more explicit.

This slightly changes how offline memcg aging works: previously, an
offline memcg wouldn't be aged unless it had no evictable folios. Now,
we might age it if it has only 3 generations and the reclaim priority is
less than DEF_PRIORITY, which should be fine. An offline memcg might
still hold long-term folios, and in fact, a long-existing offline memcg
must be pinned by some long-term folios like shmem. These folios might
be used by other memcgs, so aging them as with an ordinary memcg doesn't
seem wrong. Besides, aging enables further reclaim of an offline memcg,
which will certainly happen if we keep shrinking it. And offline memcgs
may soon no longer be an issue once reparenting is ready.

Overall, the memcg LRU rotation, as described in mmzone.h,
remains the same.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 70 +++++++++++++++++++++++++++++++------------------------------
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 963362523782..ab81ffdb241a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4913,49 +4913,40 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 }
 
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
-			     int swappiness, unsigned long *nr_to_scan)
+			     struct scan_control *sc, int swappiness)
 {
 	DEFINE_MIN_SEQ(lruvec);
 
-	*nr_to_scan = 0;
 	/* have to run aging, since eviction is not possible anymore */
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+	/* try to get away with not aging at the default priority */
+	if (sc->priority == DEF_PRIORITY)
+		return false;
+
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
 
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- *    reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+			   struct mem_cgroup *memcg, int swappiness)
 {
-	bool need_aging;
 	unsigned long nr_to_scan;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
-
-	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
-		return -1;
-
-	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
 
+	nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 	/* try to scrape all its memory if this memcg was deleted */
-	if (nr_to_scan && !mem_cgroup_online(memcg))
+	if (!mem_cgroup_online(memcg))
 		return nr_to_scan;
 
 	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
 
-	/* try to get away with not aging at the default priority */
-	if (!need_aging || sc->priority == DEF_PRIORITY)
-		return nr_to_scan >> sc->priority;
-
-	/* stop scanning this lruvec as it's low on cold folios */
-	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+	/*
+	 * Always respect scan priority, minimally target
+	 * SWAP_CLUSTER_MAX pages to keep reclaim moving forwards.
+	 */
+	nr_to_scan >>= sc->priority;
+	return max(nr_to_scan, SWAP_CLUSTER_MAX);
 }
 
 static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4985,31 +4976,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	return true;
 }
 
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ *    reclaim.
+ */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
+	bool need_rotate = false;
 	long nr_batch, nr_to_scan;
-	unsigned long scanned = 0;
 	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
-	while (true) {
+	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+	while (nr_to_scan > 0) {
 		int delta;
+		DEFINE_MAX_SEQ(lruvec);
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
-		if (nr_to_scan <= 0)
+		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+			need_rotate = true;
 			break;
+		}
+
+		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+				need_rotate = true;
+			break;
+		}
 
 		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
-		scanned += delta;
-		if (scanned >= nr_to_scan)
-			break;
-
 		if (should_abort_scan(lruvec, sc))
 			break;
 
+		nr_to_scan -= delta;
 		cond_resched();
 	}
 
@@ -5035,8 +5038,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
-	/* whether this lruvec should be rotated */
-	return nr_to_scan < 0;
+	return need_rotate;
 }
 
 static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

-- 
2.53.0
Re: [PATCH v2 04/12] mm/mglru: restructure the reclaim loop
Posted by Kairui Song 4 days, 5 hours ago
So sashiko provided some review; I'll post it below:

https://sashiko.dev/#/patchset/20260329-mglru-reclaim-v2-0-b53a3678513c%40tencent.com?patch=12986

> From sashiko:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4913,49 +4913,40 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  }
> >  
> >  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> > -			     int swappiness, unsigned long *nr_to_scan)
> > +			     struct scan_control *sc, int swappiness)
> >  {
> >  	DEFINE_MIN_SEQ(lruvec);
> >  
> > -	*nr_to_scan = 0;
> >  	/* have to run aging, since eviction is not possible anymore */
> >  	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> >  		return true;
> >  
> > -	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> > +	/* try to get away with not aging at the default priority */
> > +	if (sc->priority == DEF_PRIORITY)
> > +		return false;
> 
> Is this a regression that unconditionally forces aging even at DEF_PRIORITY?
> In the original code, if eviction wasn't possible, it set need_aging to
> true, but the subsequent check against DEF_PRIORITY could still bypass aging
> and allow MGLRU to skip memcgs without cold pages.
> With the new logic, the mandatory aging check precedes the DEF_PRIORITY
> check, returning true immediately. Will this cause try_to_shrink_lruvec()
> to inappropriately force hot pages into older generations on the gentlest
> reclaim pass instead of trying other memcgs first?

Right, this part is a bit confusing indeed. I tried to split it
into two patches, but it seems that could be even messier.

This is more of an intentional change. When eviction is not possible
(fewer than MIN_NR_GENS generations), aging is the only way to make
progress. The old code always skips aging at DEF_PRIORITY.

For a single cgroup setup, this is not helpful in any way: it wastes a
reclaim iteration only to come back at a higher priority and do the
aging then. And it causes over-reclaim of slab.

For a multi cgroup setup, deferring the aging and reclaiming other
cgroups may seem like a good idea, but we will have to proceed with
aging anyway, potentially when all cgroups have run out of gens,
causing larger jitter and breaking reclaim balance. And the escalated
priority may also cause over-reclaim.

And right now we still have 3 gens as the buffer.

Testing the multi cgroup setup with the JavaScript workload in the
cover letter showed that after this series, multi cgroup reclaim
balance and latency are good, and the total throughput seems better.

Testing the reclaim balance of a cgroup tree (see below) also
looks better.

I think that to avoid the latency of aging, we really need async
aging, not deferred aging. Async aging might be easier to integrate
here after this series.

I could add some words on that in the commit message.
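To make the new ordering concrete, here is a minimal userspace sketch
(not kernel code; the names, the MIN_NR_GENS value of 2, and
DEF_PRIORITY of 12 are stand-ins for illustration) of the per-iteration
decision the restructured try_to_shrink_lruvec() loop makes: protection
check first, then the should_run_aging() checks, then eviction:

```c
#include <assert.h>

#define MODEL_MIN_NR_GENS	2	/* illustrative, matches MIN_NR_GENS */
#define MODEL_DEF_PRIORITY	12	/* illustrative, matches DEF_PRIORITY */

enum model_action { MODEL_ROTATE, MODEL_AGE, MODEL_EVICT };

/*
 * Simplified model of the decision order in the restructured loop.
 * min_seq here stands for the evictable min seq of the lruvec.
 */
static enum model_action model_next_action(int below_min,
					   unsigned long min_seq,
					   unsigned long max_seq,
					   int priority)
{
	if (below_min)
		return MODEL_ROTATE;	/* protected memcg: stop, rotate */

	/* have to run aging, since eviction is not possible anymore */
	if (min_seq + MODEL_MIN_NR_GENS > max_seq)
		return MODEL_AGE;

	/* try to get away with not aging at the default priority */
	if (priority == MODEL_DEF_PRIORITY)
		return MODEL_EVICT;

	/* better to run aging even though eviction is still possible */
	if (min_seq + MODEL_MIN_NR_GENS == max_seq)
		return MODEL_AGE;

	return MODEL_EVICT;
}
```

Note how the mandatory-aging check (no evictable gens left) fires even
at DEF_PRIORITY, which is exactly the intentional change discussed
above.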

> From sashiko:
>
> > -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > +			   struct mem_cgroup *memcg, int swappiness)
> >  {
> > -	bool need_aging;
> >  	unsigned long nr_to_scan;
> > -	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > -	DEFINE_MAX_SEQ(lruvec);
> > -
> > -	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> > -		return -1;
> > -
> > -	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> >  
> > +	nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> >  	/* try to scrape all its memory if this memcg was deleted */
> > -	if (nr_to_scan && !mem_cgroup_online(memcg))
> > +	if (!mem_cgroup_online(memcg))
> >  		return nr_to_scan;
>
> Will this cause a regression resulting in a permanent memory leak for
> offline memcgs that have zero evictable folios?
> If an offline memcg has all its folios in the unevictable youngest
> generation, lruvec_evictable_size() returns 0. The early return here
> then evaluates to 0.
> When try_to_shrink_lruvec() receives 0, its while (nr_to_scan > 0) loop
> is completely bypassed. Since the aging logic was moved inside that loop,
> the offline memcg is never aged, meaning its folios will remain in the
> unevictable youngest generation indefinitely.

That seems to be a hallucination. lruvec_evictable_size() returns the
evictable folios across all gens, not excluding the youngest one, since
aging may make them evictable. Perhaps we can add a comment for this;
if it confuses an AI, it may confuse humans too.

And aging is not out of the loop; it's just moved into a different
helper.

> From sashiko:
> >
> >  	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> >  
> > -	/* try to get away with not aging at the default priority */
> > -	if (!need_aging || sc->priority == DEF_PRIORITY)
> > -		return nr_to_scan >> sc->priority;
> > -
> > -	/* stop scanning this lruvec as it's low on cold folios */
> > -	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> > +	/*
> > +	 * Always respect scan priority, minimally target
> > +	 * SWAP_CLUSTER_MAX pages to keep reclaim moving forwards.
> > +	 */
> > +	nr_to_scan >>= sc->priority;
> > +	return max(nr_to_scan, SWAP_CLUSTER_MAX);
> >  }
>
> Does this regression break proportional reclaim and memory protection
> semantics for small memcgs by forcing a minimum scan size?

Actually this change was inspired by sashiko's review for V1:

https://sashiko.dev/#/patchset/20260318-mglru-reclaim-v1-0-2c46f9eb0508%40tencent.com?patch=2909

Without this, for cgroups smaller than 16M, a default-priority scan
will just do nothing. That's still OK, but to make it more efficient I
added a minimal batch. Thinking about it again, though, it had better be:

if (!nr_to_scan)
	nr_to_scan = min(lruvec_evictable_size, SWAP_CLUSTER_MAX);

Using max() here could get very small cgroups over reclaimed.
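For illustration, a small userspace sketch (not kernel code; it assumes
SWAP_CLUSTER_MAX == 32 and uses stand-in names) of the difference
between the v2 max() clamp and the proposed min() fallback:

```c
#include <assert.h>

#define MODEL_SWAP_CLUSTER_MAX	32UL	/* matches SWAP_CLUSTER_MAX */

/* v2 behavior: clamp the scan target up to SWAP_CLUSTER_MAX */
static unsigned long model_scan_v2(unsigned long evictable, int priority)
{
	unsigned long n = evictable >> priority;

	return n > MODEL_SWAP_CLUSTER_MAX ? n : MODEL_SWAP_CLUSTER_MAX;
}

/* proposed v3 behavior: fall back to min(evictable, SWAP_CLUSTER_MAX) */
static unsigned long model_scan_v3(unsigned long evictable, int priority)
{
	unsigned long n = evictable >> priority;

	if (!n)
		n = evictable < MODEL_SWAP_CLUSTER_MAX ?
			evictable : MODEL_SWAP_CLUSTER_MAX;
	return n;
}
```

With a 16-folio cgroup at priority 12, the v2 version targets 32
folios, twice the cgroup's evictable size, while the proposed version
targets only 16; for large cgroups the two behave identically.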
I did test V2 using test_memcg_min suggested by af827e090489:

Before:
Proportional reclaim results:
  c[0]     actual=  29069312 (27M)  ideal=  30408704 (29M)  err=4.4%
  c[1]     actual=  23257088 (22M)  ideal=  22020096 (21M)  err=5.6%
  c[2]     actual=   1552384 (1M)  (expected ~0)
  c[3]     actual=         0 (0M)  (expected =0)

After:
Proportional reclaim results:
  c[0]     actual=  31391744 (29M)  ideal=  30408704 (29M)  err=3.2%
  c[1]     actual=  21028864 (20M)  ideal=  22020096 (21M)  err=4.5%
  c[2]     actual=   1515520 (1M)  (expected ~0)
  c[3]     actual=         0 (0M)  (expected =0)

In both cases the result is somewhat unstable; I ran the test 7 times
and used the median result. After this series the result sometimes
seems even better, but that's likely just noise. I didn't see a
regression.

The 32-folio minimal batch already seems small enough for typical
usage, but min(evictable_size, SWAP_CLUSTER_MAX) is definitely better.
I'll send a V3 to update this.

I think none of the benchmarks or tests would be affected by this.