[PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim

Chen Ridong posted 5 patches 2 months ago
There is a newer version of this series
Posted by Chen Ridong 2 months ago
From: Chen Ridong <chenridong@huawei.com>

The memcg LRU was originally introduced for global reclaim to enhance
scalability. However, its implementation complexity has led to performance
regressions when dealing with a large number of memory cgroups [1].

As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
cookie-based iteration for global reclaim, aligning with the approach
already used in shrink_node_memcgs. This simplification removes the
dedicated memcg LRU tracking while maintaining the core functionality.
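
For readers unfamiliar with cookie-based iteration, the idea can be
modeled in userspace. The sketch below is not kernel code: toy_cookie,
toy_iter() and toy_walk() are hypothetical names, and a flat array stands
in for the memcg hierarchy. It assumes that a reclaim cookie selects a
walk position shared between reclaimers, so a partial walk resumes where
the previous one stopped, while a NULL cookie always walks from the start.

```c
#include <stddef.h>

#define NR_GROUPS 5

/* Hypothetical stand-in for the shared per-node walk position. */
struct toy_cookie {
	int pos;	/* next group to hand out; NR_GROUPS once the end was hit */
};

/*
 * Toy model of the iterator: returns the next group index, or -1 when the
 * walk is over. prev is the previously returned index, or -1 to start.
 */
static int toy_iter(int prev, struct toy_cookie *cookie)
{
	if (!cookie) {
		/* Full walk: always starts at group 0. */
		int next = prev + 1;

		return next < NR_GROUPS ? next : -1;
	}

	if (cookie->pos == NR_GROUPS) {
		/* The shared position hit the end: rewind for the next round. */
		cookie->pos = 0;
		/* A walk already in progress ends here; a new one restarts. */
		if (prev >= 0)
			return -1;
	}

	return cookie->pos++;
}

/*
 * Visit at most budget groups and record them in out[]. Returns how many
 * groups were visited. Stopping early models reaching the reclaim target.
 */
static int toy_walk(struct toy_cookie *cookie, int budget, int *out)
{
	int n = 0;
	int g = toy_iter(-1, cookie);

	while (g >= 0 && n < budget) {
		out[n++] = g;
		if (n == budget)
			break;	/* target met, leave the rest for the next walker */
		g = toy_iter(g, cookie);
	}

	return n;
}
```

With a shared cookie, two consecutive toy_walk(&cookie, 2, out) calls
visit groups 0-1 and then 2-3; with a NULL cookie, both visit 0-1. That is
the fairness property partial walks rely on in this model.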

I performed a stress test based on Yu Zhao's methodology [2] on a
1 TB, 4-node NUMA system. The results are summarized below:

	pgsteal:
						memcg LRU    memcg iter
	stddev(pgsteal) / mean(pgsteal)		106.03%		93.20%
	sum(pgsteal) / sum(requested)		98.10%		99.28%

	workingset_refault_anon:
						memcg LRU    memcg iter
	stddev(refault) / mean(refault)		193.97%		134.67%
	sum(refault)				1963229		2027567

The new implementation shows a clear fairness improvement: stddev/mean of
pgsteal drops by 12.8 percentage points (106.03% to 93.20%), and
stddev/mean of refaults drops by 59.3 percentage points (193.97% to
134.67%). The pgsteal ratio is also closer to 100%. Refault counts
increased by about 3.3% (from 1,963,229 to 2,027,567).
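
The figures in this paragraph can be rechecked from the table. The helpers
below are only an illustrative sketch with hypothetical names; they
compute a drop in percentage points and a relative change in percent.

```c
/* Drop of a ratio in percentage points; both inputs are percentages. */
static double pct_point_drop(double before, double after)
{
	return before - after;
}

/* Relative change of a counter, in percent of the old value. */
static double rel_change_pct(double old_val, double new_val)
{
	return (new_val - old_val) / old_val * 100.0;
}
```

For example, pct_point_drop(106.03, 93.20) gives the pgsteal fairness
change and rel_change_pct(1963229, 2027567) gives the refault increase.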

The primary benefits of this change are:
1. Simplified codebase by removing custom memcg LRU infrastructure
2. Improved fairness in memory reclaim across multiple cgroups
3. Better performance when creating many memory cgroups

[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 117 ++++++++++++++++------------------------------------
 1 file changed, 36 insertions(+), 81 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..70b0e7e5393c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	return nr_to_scan < 0;
 }
 
-static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool success;
 	unsigned long scanned = sc->nr_scanned;
 	unsigned long reclaimed = sc->nr_reclaimed;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
-	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
-	if (mem_cgroup_below_min(NULL, memcg))
-		return MEMCG_LRU_YOUNG;
-
-	if (mem_cgroup_below_low(NULL, memcg)) {
-		/* see the comment on MEMCG_NR_GENS */
-		if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
-			return MEMCG_LRU_TAIL;
-
-		memcg_memory_event(memcg, MEMCG_LOW);
-	}
-
-	success = try_to_shrink_lruvec(lruvec, sc);
+	try_to_shrink_lruvec(lruvec, sc);
 
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
@@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 			   sc->nr_reclaimed - reclaimed);
 
 	flush_reclaim_state(sc);
-
-	if (success && mem_cgroup_online(memcg))
-		return MEMCG_LRU_YOUNG;
-
-	if (!success && lruvec_is_sizable(lruvec, sc))
-		return 0;
-
-	/* one retry if offlined or too small */
-	return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
-	       MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
 }
 
 static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
 {
-	int op;
-	int gen;
-	int bin;
-	int first_bin;
-	struct lruvec *lruvec;
-	struct lru_gen_folio *lrugen;
+	struct mem_cgroup *target = sc->target_mem_cgroup;
+	struct mem_cgroup_reclaim_cookie reclaim = {
+		.pgdat = pgdat,
+	};
+	struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
 	struct mem_cgroup *memcg;
-	struct hlist_nulls_node *pos;
 
-	gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
-	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
-restart:
-	op = 0;
-	memcg = NULL;
-
-	rcu_read_lock();
+	if (current_is_kswapd() || sc->memcg_full_walk)
+		cookie = NULL;
 
-	hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
-		if (op) {
-			lru_gen_rotate_memcg(lruvec, op);
-			op = 0;
-		}
+	memcg = mem_cgroup_iter(target, NULL, cookie);
+	while (memcg) {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		mem_cgroup_put(memcg);
-		memcg = NULL;
+		cond_resched();
 
-		if (gen != READ_ONCE(lrugen->gen))
-			continue;
+		mem_cgroup_calculate_protection(target, memcg);
 
-		lruvec = container_of(lrugen, struct lruvec, lrugen);
-		memcg = lruvec_memcg(lruvec);
+		if (mem_cgroup_below_min(target, memcg))
+			goto next;
 
-		if (!mem_cgroup_tryget(memcg)) {
-			lru_gen_release_memcg(memcg);
-			memcg = NULL;
-			continue;
+		if (mem_cgroup_below_low(target, memcg)) {
+			if (!sc->memcg_low_reclaim) {
+				sc->memcg_low_skipped = 1;
+				goto next;
+			}
+			memcg_memory_event(memcg, MEMCG_LOW);
 		}
 
-		rcu_read_unlock();
+		shrink_one(lruvec, sc);
 
-		op = shrink_one(lruvec, sc);
-
-		rcu_read_lock();
-
-		if (should_abort_scan(lruvec, sc))
+		if (should_abort_scan(lruvec, sc)) {
+			if (cookie)
+				mem_cgroup_iter_break(target, memcg);
 			break;
-	}
-
-	rcu_read_unlock();
-
-	if (op)
-		lru_gen_rotate_memcg(lruvec, op);
-
-	mem_cgroup_put(memcg);
-
-	if (!is_a_nulls(pos))
-		return;
+		}
 
-	/* restart if raced with lru_gen_rotate_memcg() */
-	if (gen != get_nulls_value(pos))
-		goto restart;
+next:
+		if (cookie && sc->nr_reclaimed >= sc->nr_to_reclaim) {
+			mem_cgroup_iter_break(target, memcg);
+			break;
+		}
 
-	/* try the rest of the bins of the current generation */
-	bin = get_memcg_bin(bin + 1);
-	if (bin != first_bin)
-		goto restart;
+		memcg = mem_cgroup_iter(target, memcg, cookie);
+	}
 }
 
 static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -5019,8 +4975,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 
 	set_mm_walk(NULL, sc->proactive);
 
-	if (try_to_shrink_lruvec(lruvec, sc))
-		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+	try_to_shrink_lruvec(lruvec, sc);
 
 	clear_mm_walk();
 
-- 
2.34.1
Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
Posted by Shakeel Butt 1 month, 2 weeks ago
On Tue, Dec 09, 2025 at 01:25:53AM +0000, Chen Ridong wrote:
> [...]
>  static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>  {
> -	int op;
> -	int gen;
> -	int bin;
> -	int first_bin;
> -	struct lruvec *lruvec;
> -	struct lru_gen_folio *lrugen;
> +	struct mem_cgroup *target = sc->target_mem_cgroup;
> +	struct mem_cgroup_reclaim_cookie reclaim = {
> +		.pgdat = pgdat,
> +	};
> +	struct mem_cgroup_reclaim_cookie *cookie = &reclaim;

Please keep the naming the same as in shrink_node_memcgs, i.e. use
'partial' here.

>  	struct mem_cgroup *memcg;
> -	struct hlist_nulls_node *pos;
>  
> -	gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
> -	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
> -restart:
> -	op = 0;
> -	memcg = NULL;
> -
> -	rcu_read_lock();
> +	if (current_is_kswapd() || sc->memcg_full_walk)
> +		cookie = NULL;
>  
> -	hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
> -		if (op) {
> -			lru_gen_rotate_memcg(lruvec, op);
> -			op = 0;
> -		}
> +	memcg = mem_cgroup_iter(target, NULL, cookie);
> +	while (memcg) {

Please use a do-while loop, the same as shrink_node_memcgs, and then
change the 'goto next' below to 'continue', as shrink_node_memcgs does.

> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>  
> -		mem_cgroup_put(memcg);
> -		memcg = NULL;
> +		cond_resched();
>  
> -		if (gen != READ_ONCE(lrugen->gen))
> -			continue;
> +		mem_cgroup_calculate_protection(target, memcg);
>  
> -		lruvec = container_of(lrugen, struct lruvec, lrugen);
> -		memcg = lruvec_memcg(lruvec);
> +		if (mem_cgroup_below_min(target, memcg))
> +			goto next;
>  
> -		if (!mem_cgroup_tryget(memcg)) {
> -			lru_gen_release_memcg(memcg);
> -			memcg = NULL;
> -			continue;
> +		if (mem_cgroup_below_low(target, memcg)) {
> +			if (!sc->memcg_low_reclaim) {
> +				sc->memcg_low_skipped = 1;
> +				goto next;
> +			}
> +			memcg_memory_event(memcg, MEMCG_LOW);
>  		}
>  
> -		rcu_read_unlock();
> +		shrink_one(lruvec, sc);
>  
> -		op = shrink_one(lruvec, sc);
> -
> -		rcu_read_lock();
> -
> -		if (should_abort_scan(lruvec, sc))
> +		if (should_abort_scan(lruvec, sc)) {
> +			if (cookie)
> +				mem_cgroup_iter_break(target, memcg);
>  			break;

This seems buggy as we may break the loop without calling
mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
if should_abort_scan() returns true, we will break the loop without
calling mem_cgroup_iter_break() and will leak a reference to memcg.
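
The leak described above can be reproduced with a small userspace model.
This is a sketch, not kernel code: toy_group, toy_iter(), toy_iter_break()
and the two walkers are hypothetical names. It assumes the iterator takes
a reference on each group it returns and drops it when that group is
passed back as prev, regardless of whether a cookie is used.

```c
#include <stddef.h>

#define NR_GROUPS 4

/* Hypothetical stand-in for a memcg; refs models its reference count. */
struct toy_group {
	int refs;
	int abort_here;	/* pretend should_abort_scan() fires on this group */
};

/* Drops the reference on prev and takes one on the next group. */
static struct toy_group *toy_iter(struct toy_group *groups,
				  struct toy_group *prev)
{
	struct toy_group *next = prev ? prev + 1 : groups;

	if (prev)
		prev->refs--;
	if (next == groups + NR_GROUPS)
		return NULL;
	next->refs++;
	return next;
}

/* Ends a walk early by dropping the reference still held on prev. */
static void toy_iter_break(struct toy_group *prev)
{
	if (prev)
		prev->refs--;
}

/* Sum of outstanding references; 0 means nothing leaked. */
static int toy_total_refs(const struct toy_group *groups)
{
	int i, sum = 0;

	for (i = 0; i < NR_GROUPS; i++)
		sum += groups[i].refs;
	return sum;
}

/* Mirrors the posted hunk: the abort path only cleans up with a cookie. */
static void walk_as_posted(struct toy_group *groups, int have_cookie)
{
	struct toy_group *g = toy_iter(groups, NULL);

	while (g) {
		if (g->abort_here) {
			if (have_cookie)
				toy_iter_break(g);
			break;	/* without a cookie, g->refs stays elevated */
		}
		g = toy_iter(groups, g);
	}
}

/* One possible shape of a fix: every early exit drops the reference. */
static void walk_fixed(struct toy_group *groups)
{
	/* NR_GROUPS > 0, so the first call always yields a group here. */
	struct toy_group *g = toy_iter(groups, NULL);

	do {
		if (g->abort_here) {
			toy_iter_break(g);
			break;
		}
	} while ((g = toy_iter(groups, g)));
}
```

In this model, walk_as_posted(groups, 0) leaves toy_total_refs() at 1 when
a group aborts the scan, while walk_fixed() leaves it at 0.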
Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
Posted by Chen Ridong 1 month, 2 weeks ago

On 2025/12/22 11:12, Shakeel Butt wrote:
> On Tue, Dec 09, 2025 at 01:25:53AM +0000, Chen Ridong wrote:
>> [...]
>>  static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>>  {
>> -	int op;
>> -	int gen;
>> -	int bin;
>> -	int first_bin;
>> -	struct lruvec *lruvec;
>> -	struct lru_gen_folio *lrugen;
>> +	struct mem_cgroup *target = sc->target_mem_cgroup;
>> +	struct mem_cgroup_reclaim_cookie reclaim = {
>> +		.pgdat = pgdat,
>> +	};
>> +	struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
> 
> Please keep the naming same as shrink_node_memcgs i.e. use 'partial'
> here.
> 

Thank you, will update.

>>  	struct mem_cgroup *memcg;
>> -	struct hlist_nulls_node *pos;
>>  
>> -	gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
>> -	bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
>> -restart:
>> -	op = 0;
>> -	memcg = NULL;
>> -
>> -	rcu_read_lock();
>> +	if (current_is_kswapd() || sc->memcg_full_walk)
>> +		cookie = NULL;
>>  
>> -	hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
>> -		if (op) {
>> -			lru_gen_rotate_memcg(lruvec, op);
>> -			op = 0;
>> -		}
>> +	memcg = mem_cgroup_iter(target, NULL, cookie);
>> +	while (memcg) {
> 
> Please use the do-while loop same as shrink_node_memcgs and then change
> the goto next below to continue similar to shrink_node_memcgs.
> 

Will update.

>> +		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>  
>> -		mem_cgroup_put(memcg);
>> -		memcg = NULL;
>> +		cond_resched();
>>  
>> -		if (gen != READ_ONCE(lrugen->gen))
>> -			continue;
>> +		mem_cgroup_calculate_protection(target, memcg);
>>  
>> -		lruvec = container_of(lrugen, struct lruvec, lrugen);
>> -		memcg = lruvec_memcg(lruvec);
>> +		if (mem_cgroup_below_min(target, memcg))
>> +			goto next;
>>  
>> -		if (!mem_cgroup_tryget(memcg)) {
>> -			lru_gen_release_memcg(memcg);
>> -			memcg = NULL;
>> -			continue;
>> +		if (mem_cgroup_below_low(target, memcg)) {
>> +			if (!sc->memcg_low_reclaim) {
>> +				sc->memcg_low_skipped = 1;
>> +				goto next;
>> +			}
>> +			memcg_memory_event(memcg, MEMCG_LOW);
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		shrink_one(lruvec, sc);
>>  
>> -		op = shrink_one(lruvec, sc);
>> -
>> -		rcu_read_lock();
>> -
>> -		if (should_abort_scan(lruvec, sc))
>> +		if (should_abort_scan(lruvec, sc)) {
>> +			if (cookie)
>> +				mem_cgroup_iter_break(target, memcg);
>>  			break;
> 
> This seems buggy as we may break the loop without calling
> mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
> if should_abort_scan() returns true, we will break the loop without
> calling mem_cgroup_iter_break() and will leak a reference to memcg.
> 

Thank you for catching that, my mistake.

This also brings up another point: in kswapd, the traditional LRU walks
all memcgs, but the multi-gen LRU (MGLRU) stops as soon as
should_abort_scan() returns true (i.e., enough pages are reclaimed or the
watermark is satisfied). Shouldn't both behave consistently?

Perhaps we should also call should_abort_scan(lruvec, sc) in
shrink_node_memcgs() for the traditional LRU?

-- 
Best regards,
Ridong

Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
Posted by Shakeel Butt 1 month, 2 weeks ago
On Mon, Dec 22, 2025 at 03:27:26PM +0800, Chen Ridong wrote:
> 
[...]
> 
> >> -		if (should_abort_scan(lruvec, sc))
> >> +		if (should_abort_scan(lruvec, sc)) {
> >> +			if (cookie)
> >> +				mem_cgroup_iter_break(target, memcg);
> >>  			break;
> > 
> > This seems buggy as we may break the loop without calling
> > mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
> > if should_abort_scan() returns true, we will break the loop without
> > calling mem_cgroup_iter_break() and will leak a reference to memcg.
> > 
> 
> Thank you for catching that—my mistake.
> 
> This also brings up another point: In kswapd, the traditional LRU iterates through all memcgs, but
> stops for the generational LRU (GENLRU) when should_abort_scan is met (i.e., enough pages are
> reclaimed or the watermark is satisfied). Shouldn't both behave consistently?
> 
> Perhaps we should add should_abort_scan(lruvec, sc) in shrink_node_memcgs for the traditional LRU as
> well?

We should definitely discuss should_abort_scan() for traditional
reclaim, but to keep things simple, let's do that after this series. For
now, follow Johannes' suggestion of lru_gen_should_abort_scan().
Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
Posted by Chen Ridong 1 month, 2 weeks ago

On 2025/12/23 5:18, Shakeel Butt wrote:
> On Mon, Dec 22, 2025 at 03:27:26PM +0800, Chen Ridong wrote:
>>
> [...]
>>
>>>> -		if (should_abort_scan(lruvec, sc))
>>>> +		if (should_abort_scan(lruvec, sc)) {
>>>> +			if (cookie)
>>>> +				mem_cgroup_iter_break(target, memcg);
>>>>  			break;
>>>
>>> This seems buggy as we may break the loop without calling
>>> mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
>>> if should_abort_scan() returns true, we will break the loop without
>>> calling mem_cgroup_iter_break() and will leak a reference to memcg.
>>>
>>
>> Thank you for catching that—my mistake.
>>
>> This also brings up another point: In kswapd, the traditional LRU iterates through all memcgs, but
>> stops for the generational LRU (GENLRU) when should_abort_scan is met (i.e., enough pages are
>> reclaimed or the watermark is satisfied). Shouldn't both behave consistently?
>>
>> Perhaps we should add should_abort_scan(lruvec, sc) in shrink_node_memcgs for the traditional LRU as
>> well?
> 
> We definitely should discuss about should_abort_scan() for traditional
> reclaim but to keep things simple, let's do that after this series. For
> now, follow Johannes' suggestion of lru_gen_should_abort_scan().
> 

Okay, understood.

-- 
Best regards,
Ridong