From: Qi Zheng <zhengqi.arch@bytedance.com>
Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when
the memcg is offlined.
However, there are the following challenges:
1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, so the
   parent and child memcg may have different numbers of generations; we
   cannot simply transfer MGLRU folios in the child memcg to the parent
   memcg as we do for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
   traverse these folios while holding the lru lock, otherwise it may
   cause a softlockup.
3. In walk_update_folio(), the gen of a folio and the corresponding lru
   size may be updated, but the folio is not immediately moved to the
   corresponding lru list. Therefore, there may be folios of different
   generations on an LRU list.
4. In lru_gen_del_folio(), the generation to which the folio belongs is
   found based on the generation information in folio->flags, and the
   corresponding lru size is updated. Therefore, we need to update the
   lru size correctly during reparenting; otherwise it may be updated
   incorrectly in lru_gen_del_folio().
Finally, this patch chooses a compromise: splice each lru list in the
child memcg onto the lru list of the same generation in the parent memcg
during reparenting. To ensure that the parent memcg has matching
generations, the number of generations in the parent memcg is increased
to MAX_NR_GENS before reparenting.
Of course, the same generation has different meanings in the parent and
child memcg, so this will mix up the hot and cold information of folios.
But other than that, this method is simple enough, the lru sizes stay
correct, and there is no need to consider certain concurrency issues
(such as lru_gen_del_folio()).
To prepare for the above work, this commit implements the helper
functions that will be used during reparenting.
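For illustration, the offline path is expected to use these helpers
roughly as follows. The call site is not part of this patch; the function
name, retry logic and (omitted) locking below are only a sketch:

static void memcg_reparent_mglru_folios(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

	/* Raise the parent to MAX_NR_GENS so every child gen has a counterpart. */
	max_lru_gen_memcg(parent);

	/*
	 * Reclaim may retire old generations concurrently; if the parent
	 * dropped below MAX_NR_GENS again, bump it before splicing.
	 * (Locking around the splice is omitted in this sketch.)
	 */
	while (!recheck_lru_gen_max_memcg(parent))
		max_lru_gen_memcg(parent);

	/* Splice the child's per-generation lru lists onto the parent's. */
	lru_gen_reparent_memcg(memcg, parent);
}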
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mmzone.h | 16 +++++
mm/vmscan.c | 144 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 160 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1014b5a93c09c..a41f4f0ae5eb7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
void lru_gen_offline_memcg(struct mem_cgroup *memcg);
void lru_gen_release_memcg(struct mem_cgroup *memcg);
void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg);
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent);
#else /* !CONFIG_LRU_GEN */
@@ -668,6 +671,19 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
{
}
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ return true;
+}
+
+static inline void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+
#endif /* CONFIG_LRU_GEN */
struct lruvec {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e738082874878..6bc8047b7aec5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4445,6 +4445,150 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
}
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+ return false;
+ }
+ }
+
+ return true;
+}
+
+static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
+ struct lruvec *lruvec)
+{
+ struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+ struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+ int swappiness = mem_cgroup_swappiness(memcg);
+ DEFINE_MAX_SEQ(lruvec);
+ bool success = false;
+
+ /*
+ * We are not iterating the mm_list here, updating mm_state->seq is just
+ * to make mm walkers work properly.
+ */
+ if (mm_state) {
+ spin_lock(&mm_list->lock);
+ VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+ if (max_seq > mm_state->seq) {
+ WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+ success = true;
+ }
+ spin_unlock(&mm_list->lock);
+ } else {
+ success = true;
+ }
+
+ if (success)
+ inc_max_seq(lruvec, max_seq, swappiness);
+}
+
+/*
+ * We need to ensure that the folios of child memcg can be reparented to the
+ * same gen of the parent memcg, so the gens of the parent memcg need to be
+ * incremented to MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+ try_to_inc_max_seq_nowalk(memcg, lruvec);
+ cond_resched();
+ }
+ }
+ }
+}
+
+/*
+ * Compared to traditional LRU, MGLRU faces the following challenges:
+ *
+ * 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
+ * number of generations of the parent and child memcg may be different,
+ * so we cannot simply transfer MGLRU folios in the child memcg to the
+ * parent memcg as we did for traditional LRU folios.
+ * 2. The generation information is stored in folio->flags, but we cannot
+ * traverse these folios while holding the lru lock, otherwise it may
+ * cause softlockup.
+ * 3. In walk_update_folio(), the gen of folio and corresponding lru size
+ * may be updated, but the folio is not immediately moved to the
+ * corresponding lru list. Therefore, there may be folios of different
+ * generations on an LRU list.
+ * 4. In lru_gen_del_folio(), the generation to which the folio belongs is
+ * found based on the generation information in folio->flags, and the
+ * corresponding LRU size will be updated. Therefore, we need to update
+ * the lru size correctly during reparenting, otherwise the lru size may
+ * be updated incorrectly in lru_gen_del_folio().
+ *
+ * Finally, we choose a compromise method, which is to splice the lru list in
+ * the child memcg to the lru list of the same generation in the parent memcg
+ * during reparenting.
+ *
+ * The same generation has different meanings in the parent and child memcg,
+ * so this compromise method will cause the LRU inversion problem. But as the
+ * system runs, this problem will be fixed automatically.
+ */
+static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
+ int zone, int type)
+{
+ struct lru_gen_folio *child_lrugen, *parent_lrugen;
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ int i;
+
+ child_lrugen = &child_lruvec->lrugen;
+ parent_lrugen = &parent_lruvec->lrugen;
+
+ for (i = 0; i < get_nr_gens(child_lruvec, type); i++) {
+ int gen = lru_gen_from_seq(child_lrugen->max_seq - i);
+ long nr_pages = child_lrugen->nr_pages[gen][type][zone];
+ int dst_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0;
+
+ /* Assuming that child pages are colder than parent pages */
+ list_splice_init(&child_lrugen->folios[gen][type][zone],
+ &parent_lrugen->folios[gen][type][zone]);
+
+ WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
+ WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
+ parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+ update_lru_size(parent_lruvec, lru + dst_lru_active, zone, nr_pages);
+ }
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *child_lruvec, *parent_lruvec;
+ int type, zid;
+ struct zone *zone;
+
+ child_lruvec = get_lruvec(memcg, nid);
+ parent_lruvec = get_lruvec(parent, nid);
+
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+ for (type = 0; type < ANON_AND_FILE; type++)
+ __lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
+ mem_cgroup_update_lru_size(parent_lruvec, LRU_UNEVICTABLE, zid,
+ mem_cgroup_get_zone_lru_size(child_lruvec, LRU_UNEVICTABLE, zid));
+ }
+ }
+}
+
#endif /* CONFIG_MEMCG */
/******************************************************************************
--
2.20.1
I asked AI to review this patch and it found the following potential issues:
Author: Qi Zheng <zhengqi.arch@bytedance.com>
mm: vmscan: prepare for reparenting MGLRU folios
This commit adds infrastructure for reparenting MGLRU folios from dying
memcg to parent memcg, including functions to ensure parent has MAX_NR_GENS
generations before reparenting.
Link: https://lkml.kernel.org/r/92e0728fed3d68855173352416cf8077670610f0.1768389889.git.zhengqi.arch@bytedance.com
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f1a45c94c2fc..284292653d14 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[ ... ]
> +static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
> + struct lruvec *lruvec)
> +{
> + struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
> + struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
> + int swappiness = mem_cgroup_swappiness(memcg);
> + DEFINE_MAX_SEQ(lruvec);
> + bool success = false;
> +
> + /*
> + * We are not iterating the mm_list here, updating mm_state->seq is just
> + * to make mm walkers work properly.
> + */
> + if (mm_state) {
> + spin_lock(&mm_list->lock);
> + VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
> + if (max_seq > mm_state->seq) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When mm_state->seq equals max_seq (which can happen when mm walkers have
completed their iteration), this condition is false, success remains false,
and inc_max_seq() is never called.
The caller max_lru_gen_memcg() loops with:
while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
try_to_inc_max_seq_nowalk(memcg, lruvec);
cond_resched();
}
If try_to_inc_max_seq_nowalk() fails to increment max_seq, get_nr_gens()
returns the same value, and the while loop continues indefinitely.
Can this cause an infinite loop during memcg offline when mm_state->seq has
caught up to max_seq?
Suggested fix: Remove the success variable and always call inc_max_seq().
The mm_state->seq update is only for coordination with mm walkers - it should
not gate whether we increment the generation. inc_max_seq() has its own
staleness check (seq < READ_ONCE(lrugen->max_seq)) so calling it unconditionally
is safe:
if (mm_state) {
spin_lock(&mm_list->lock);
VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
if (max_seq > mm_state->seq)
WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
spin_unlock(&mm_list->lock);
}
inc_max_seq(lruvec, max_seq, swappiness);
> + WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
> + success = true;
> + }
> + spin_unlock(&mm_list->lock);
> + } else {
> + success = true;
> + }
> +
> + if (success)
> + inc_max_seq(lruvec, max_seq, swappiness);
> +}
[ ... ]
> +static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
> + int zone, int type)
> +{
> + struct lru_gen_folio *child_lrugen, *parent_lrugen;
> + enum lru_list lru = type * LRU_INACTIVE_FILE;
> + int i;
[ ... ]
> + WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
> + WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
> + parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
> +
> + update_lru_size(parent_lruvec, lru + dst_lru_active, zone, nr_pages);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This update_lru_size() call updates both __update_lru_size() counters and
mem_cgroup_update_lru_size(). However, lru_gen_reparent_memcg() also calls
mem_cgroup_update_lru_size() for LRU_UNEVICTABLE. The combination appears
to cause incorrect accounting.
Note: This issue was fixed in later commit ("mm: mglru: do not call
update_lru_size() during reparenting").
> + }
> +}
On 1/18/26 11:29 AM, Shakeel Butt wrote:
> I asked AI to review this patch and it found the following potential issues:
Thanks.
>
> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>
> mm: vmscan: prepare for reparenting MGLRU folios
>
> This commit adds infrastructure for reparenting MGLRU folios from dying
> memcg to parent memcg, including functions to ensure parent has MAX_NR_GENS
> generations before reparenting.
>
> Link: https://lkml.kernel.org/r/92e0728fed3d68855173352416cf8077670610f0.1768389889.git.zhengqi.arch@bytedance.com
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f1a45c94c2fc..284292653d14 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>
> [ ... ]
>
>> +static void try_to_inc_max_seq_nowalk(struct mem_cgroup *memcg,
>> + struct lruvec *lruvec)
>> +{
>> + struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
>> + struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
>> + int swappiness = mem_cgroup_swappiness(memcg);
>> + DEFINE_MAX_SEQ(lruvec);
>> + bool success = false;
>> +
>> + /*
>> + * We are not iterating the mm_list here, updating mm_state->seq is just
>> + * to make mm walkers work properly.
>> + */
>> + if (mm_state) {
>> + spin_lock(&mm_list->lock);
>> + VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
>> + if (max_seq > mm_state->seq) {
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> When mm_state->seq equals max_seq (which can happen when mm walkers have
> completed their iteration), this condition is false, success remains false,
> and inc_max_seq() is never called.
I expect that once mm walkers complete their iterations, the number of
generations will have increased, so it won't cause an infinite loop.
>
> The caller max_lru_gen_memcg() loops with:
>
> while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
> try_to_inc_max_seq_nowalk(memcg, lruvec);
> cond_resched();
> }
>
> If try_to_inc_max_seq_nowalk() fails to increment max_seq, get_nr_gens()
> returns the same value, and the while loop continues indefinitely.
>
> Can this cause an infinite loop during memcg offline when mm_state->seq has
> caught up to max_seq?
>
> Suggested fix: Remove the success variable and always call inc_max_seq().
> The mm_state->seq update is only for coordination with mm walkers - it should
> not gate whether we increment the generation. inc_max_seq() has its own
> staleness check (seq < READ_ONCE(lrugen->max_seq)) so calling it unconditionally
> is safe:
>
> if (mm_state) {
> spin_lock(&mm_list->lock);
> VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
> if (max_seq > mm_state->seq)
> WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
> spin_unlock(&mm_list->lock);
> }
>
> inc_max_seq(lruvec, max_seq, swappiness);
>
>> + WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
>> + success = true;
>> + }
>> + spin_unlock(&mm_list->lock);
>> + } else {
>> + success = true;
>> + }
>> +
>> + if (success)
>> + inc_max_seq(lruvec, max_seq, swappiness);
>> +}
>
> [ ... ]
>
>> +static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec *parent_lruvec,
>> + int zone, int type)
>> +{
>> + struct lru_gen_folio *child_lrugen, *parent_lrugen;
>> + enum lru_list lru = type * LRU_INACTIVE_FILE;
>> + int i;
>
> [ ... ]
>
>> + WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
>> + WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
>> + parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
>> +
>> + update_lru_size(parent_lruvec, lru + dst_lru_active, zone, nr_pages);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This update_lru_size() call updates both __update_lru_size() counters and
> mem_cgroup_update_lru_size(). However, lru_gen_reparent_memcg() also calls
> mem_cgroup_update_lru_size() for LRU_UNEVICTABLE. The combination appears
> to cause incorrect accounting.
>
> Note: This issue was fixed in later commit ("mm: mglru: do not call
> update_lru_size() during reparenting").
Right.
>
>> + }
>> +}
>
Axel, Yuanchu & Wei, please help reviewing this patch.
On Wed, Jan 14, 2026 at 07:32:53PM +0800, Qi Zheng wrote:
[ ... ]
From: Qi Zheng <zhengqi.arch@bytedance.com>
Only non-hierarchical lruvec_stats->state_local needs to be reparented,
so handle it in reparent_state_local(), and remove the unreasonable
update_lru_size() call in __lru_gen_reparent_memcg().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56714f3bc6f88..5e7a32e3cffbc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4544,7 +4544,6 @@ static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec
int zone, int type)
{
struct lru_gen_folio *child_lrugen, *parent_lrugen;
- enum lru_list lru = type * LRU_INACTIVE_FILE;
int i;
child_lrugen = &child_lruvec->lrugen;
@@ -4553,7 +4552,6 @@ static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec
for (i = 0; i < get_nr_gens(child_lruvec, type); i++) {
int gen = lru_gen_from_seq(child_lrugen->max_seq - i);
long nr_pages = child_lrugen->nr_pages[gen][type][zone];
- int dst_lru_active = lru_gen_is_active(parent_lruvec, gen) ? LRU_ACTIVE : 0;
/* Assuming that child pages are colder than parent pages */
list_splice_init(&child_lrugen->folios[gen][type][zone],
@@ -4562,8 +4560,6 @@ static void __lru_gen_reparent_memcg(struct lruvec *child_lruvec, struct lruvec
WRITE_ONCE(child_lrugen->nr_pages[gen][type][zone], 0);
WRITE_ONCE(parent_lrugen->nr_pages[gen][type][zone],
parent_lrugen->nr_pages[gen][type][zone] + nr_pages);
-
- update_lru_size(parent_lruvec, lru + dst_lru_active, zone, nr_pages);
}
}
@@ -4575,15 +4571,21 @@ void lru_gen_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent)
struct lruvec *child_lruvec, *parent_lruvec;
int type, zid;
struct zone *zone;
+ enum lru_list lru;
child_lruvec = get_lruvec(memcg, nid);
parent_lruvec = get_lruvec(parent, nid);
- for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1)
for (type = 0; type < ANON_AND_FILE; type++)
__lru_gen_reparent_memcg(child_lruvec, parent_lruvec, zid, type);
- mem_cgroup_update_lru_size(parent_lruvec, LRU_UNEVICTABLE, zid,
- mem_cgroup_get_zone_lru_size(child_lruvec, LRU_UNEVICTABLE, zid));
+
+ for_each_lru(lru) {
+ for_each_managed_zone_pgdat(zone, NODE_DATA(nid), zid, MAX_NR_ZONES - 1) {
+ unsigned long size = mem_cgroup_get_zone_lru_size(child_lruvec, lru, zid);
+
+ mem_cgroup_update_lru_size(parent_lruvec, lru, zid, size);
+ }
}
}
}
--
2.20.1
On Thu, Jan 15, 2026 at 06:44:44PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Only non-hierarchical lruvec_stats->state_local needs to be reparented,
> so handle it in reparent_state_local(), and remove the unreasonable
> update_lru_size() call in __lru_gen_reparent_memcg().

Hmm well, how are the hierarchical statistics consistent when pages are
reparented from an "active" gen to an "inactive" gen, or the other way
around?

They'll become inconsistent when those pages are reclaimed or
moved between generations?

--
Cheers,
Harry / Hyeonggon
On Wed, Jan 21, 2026 at 12:53:28PM +0900, Harry Yoo wrote:
> On Thu, Jan 15, 2026 at 06:44:44PM +0800, Qi Zheng wrote:
> > From: Qi Zheng <zhengqi.arch@bytedance.com>
> >
> > Only non-hierarchical lruvec_stats->state_local needs to be reparented,
> > so handle it in reparent_state_local(), and remove the unreasonable
> > update_lru_size() call in __lru_gen_reparent_memcg().
>
> Hmm well, how are the hierarchical statistics consistent when pages are
> reparented from an "active" gen to an "inactive" gen, or the other way
> around?
>
> They'll become inconsistent when those pages are reclaimed or
> moved between generations?

FYI we've observed this while testing downstream implementation
as it led to MemAvailable being unreasonably high due to inconsistent
statistics.

The solution was, if lru_gen_is_active(child, gen) and
lru_gen_is_active(parent, gen) do not match, # of pages being
reparented must be subtracted from the child's statistics
(and up to the root, as it's hierarchical), and added to the parent's
statistics for the generation.

--
Cheers,
Harry / Hyeonggon
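A minimal sketch of the adjustment Harry describes above, assuming it
would live next to __lru_gen_reparent_memcg() in mm/vmscan.c; the helper
name and the use of __mod_lruvec_state() for the hierarchical counters
are assumptions, not the actual downstream code:

static void reparent_fixup_lru_stats(struct lruvec *child_lruvec,
				     struct lruvec *parent_lruvec,
				     int gen, int type, long nr_pages)
{
	/*
	 * Illustration only: adjust the active/inactive accounting when the
	 * child and parent disagree on whether this generation is active.
	 */
	bool child_active = lru_gen_is_active(child_lruvec, gen);
	bool parent_active = lru_gen_is_active(parent_lruvec, gen);
	enum node_stat_item child_idx, parent_idx;

	if (child_active == parent_active || !nr_pages)
		return;

	child_idx = NR_LRU_BASE + type * LRU_INACTIVE_FILE +
		    (child_active ? LRU_ACTIVE : 0);
	parent_idx = NR_LRU_BASE + type * LRU_INACTIVE_FILE +
		     (parent_active ? LRU_ACTIVE : 0);

	/*
	 * Remove the pages from the child's (in)active counter; hierarchical
	 * accounting propagates this up to the root.
	 */
	__mod_lruvec_state(child_lruvec, child_idx, -nr_pages);
	/*
	 * Re-add them under the parent, according to the parent's view of
	 * this generation.
	 */
	__mod_lruvec_state(parent_lruvec, parent_idx, nr_pages);
}

__lru_gen_reparent_memcg() already has the per-generation nr_pages at
hand, so such a helper could be called right after the list_splice_init().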
On 1/21/26 12:19 PM, Harry Yoo wrote:
> On Wed, Jan 21, 2026 at 12:53:28PM +0900, Harry Yoo wrote:
>> On Thu, Jan 15, 2026 at 06:44:44PM +0800, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> Only non-hierarchical lruvec_stats->state_local needs to be reparented,
>>> so handle it in reparent_state_local(), and remove the unreasonable
>>> update_lru_size() call in __lru_gen_reparent_memcg().
>>
>> Hmm well, how are the hierarchical statistics consistent when pages are
>> reparented from an "active" gen to an "inactive" gen, or the other way
>> around?

Oh, I completely forgot about that. If update_lru_size() is not called
during the reparenting, this issue should be considered separately.

>>
>> They'll become inconsistent when those pages are reclaimed or
>> moved between generations?
>
> FYI we've observed this while testing downstream implementation
> as it led to MemAvailable being unreasonably high due to inconsistent
> statistics.
>
> The solution was, if lru_gen_is_active(child, gen) and
> lru_gen_is_active(parent, gen) do not match, # of pages being
> reparented must be subtracted from the child's statistics
> (and up to the root, as it's hierarchical), and added to the parent's
> statistics for the generation.

Makes sense, will fix it in v4.

Thanks!

>
On Thu, 15 Jan 2026 18:44:44 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Only non-hierarchical lruvec_stats->state_local needs to be reparented,
> so handle it in reparent_state_local(), and remove the unreasonable
> update_lru_size() call in __lru_gen_reparent_memcg().

Thanks, added as a squashable fix against "mm: vmscan: prepare for
reparenting MGLRU folios".