[PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 3 weeks ago
From: Qi Zheng <zhengqi.arch@bytedance.com>

Changes in v2:
 - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
   (pointed out by Harry Yoo)
 - recheck objcg in [PATCH v2 07/28] (pointed out by Harry Yoo)
 - modify the commit messages in [PATCH v2 12/28] and [PATCH v2 21/28]
   (pointed out by Harry Yoo)
 - use the rcu lock to protect mm_state in [PATCH v2 14/28] (pointed out by
   Harry Yoo)
 - fix a bad unlock balance warning in [PATCH v2 23/28]
 - change the nr_pages type to long in [PATCH v2 25/28] (pointed out by
   Harry Yoo)
 - increase mm_state->seq during reparenting to make the mm walker work
   properly in [PATCH v2 25/28] (pointed out by Harry Yoo)
 - add [PATCH v2 18/28] to fix a WARNING in folio_memcg() (pointed out by
   Harry Yoo)
 - collect Reviewed-bys
 - rebase onto next-20251216

Changes in v1:
 - drop [PATCH RFC 02/28]
 - drop the THP split queue related part, which has been merged as a separate
   patchset [2]
 - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
   [PATCH v1 16/26]
 - separate the reparenting of traditional LRU folios into [PATCH v1 22/26]
 - adapt to the MGLRU scenarios in [PATCH v1 23/26]
 - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
 - collect Acked-bys and Reviewed-bys
 - rebase onto next-20251028

Hi all,

Introduction
============

This patchset transfers LRU pages to the object cgroup so that they no
longer hold a reference to their original memory cgroup, in order to
address the dying memory cgroup problem. A consensus on this approach was
reached recently [1].

Background
==========

A dying memory cgroup is one that is no longer in use from the users'
perspective, but whose memory (the metadata associated with the memory
cgroup) remains allocated. This situation may result in memory leaks or
inefficiencies in memory reclamation and has persisted as an issue for
several years. Any memory allocation that outlives the lifespan (from the
users' perspective) of a memory cgroup can produce a dying memory cgroup.
Considerable effort has already gone into tackling this problem through
the object cgroup infrastructure [2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. LRU pages (anonymous
pages and file pages) are the remaining case: they are charged at
allocation time and continue to hold a reference to the original memory
cgroup until they are reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.

Fundamentals
============

A folio will no longer pin its corresponding memory cgroup. It is
therefore necessary to ensure that the memory cgroup, or the lruvec
associated with it, is not released while a user holds a pointer returned
by folio_memcg() or folio_lruvec(). Users who do not care about the
stability of the binding between the folio and its memory cgroup only
need to hold the RCU read lock, or take a reference on the memory cgroup,
to prevent its release. However, some users of folio_lruvec() (i.e., of
the lruvec lock) do require a stable binding between the folio and its
memory cgroup. An approach is needed to ensure the stability of the
binding while the lruvec lock is held, and to detect the case where the
wrong lruvec lock is held due to a race with memory cgroup reparenting.
The following three steps achieve these goals.

1. First, identify all users of folio_memcg() and folio_lruvec() that are
   not concerned about binding stability, and apply appropriate measures
   (such as holding the RCU read lock or briefly taking a reference to the
   memory cgroup) to prevent the release of the memory cgroup, as sketched
   below.
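
   As a minimal sketch of this pattern (the helper below is hypothetical;
   folio_memcg(), count_memcg_events() and the RCU primitives are real
   kernel APIs):

   static void folio_account_activation(struct folio *folio)
   {
           struct mem_cgroup *memcg;

           rcu_read_lock();
           /* The binding may change under reparenting; RCU only
            * guarantees the memcg is not freed while we use it. */
           memcg = folio_memcg(folio);
           if (memcg)
                   count_memcg_events(memcg, PGACTIVATE,
                                      folio_nr_pages(folio));
           rcu_read_unlock();
   }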

2. Second, the following refactoring of folio_lruvec_lock() demonstrates
   how binding stability is ensured from the perspective of a
   folio_lruvec() user.

   struct lruvec *folio_lruvec_lock(struct folio *folio)
   {
           struct lruvec *lruvec;

           rcu_read_lock();
   retry:
           lruvec = folio_lruvec(folio);
           spin_lock(&lruvec->lru_lock);
           if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                   spin_unlock(&lruvec->lru_lock);
                   goto retry;
           }

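           /*
            * Note that the RCU read lock is deliberately still held here;
            * the matching lruvec unlock helper is expected to drop it.
            */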
           return lruvec;
   }

   From the perspective of memory cgroup removal, the entire reparenting
   process (changing the binding between a folio and its memory cgroup and
   moving the LRU lists to the parent memory cgroup) must be carried out
   while holding both the lruvec lock of the memory cgroup being removed
   and the lruvec lock of its parent.

3. Finally, transfer the LRU pages to the object cgroup so that they no
   longer hold a reference to the original memory cgroup, as sketched
   below.
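
   To make the locking order concrete, a simplified sketch of the
   reparenting side (names are illustrative, not the series' actual code):

   static void reparent_lru_folios(struct mem_cgroup *memcg)
   {
           struct mem_cgroup *parent = parent_mem_cgroup(memcg);
           int nid;

           for_each_node(nid) {
                   struct lruvec *src = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
                   struct lruvec *dst = mem_cgroup_lruvec(parent, NODE_DATA(nid));
                   enum lru_list lru;

                   /* Hold both locks so that folio_lruvec_lock()'s
                    * recheck observes the binding change atomically. */
                   spin_lock_irq(&src->lru_lock);
                   spin_lock_nested(&dst->lru_lock, SINGLE_DEPTH_NESTING);

                   for_each_lru(lru)
                           list_splice_tail_init(&src->lists[lru],
                                                 &dst->lists[lru]);
                   /* ... here folio->memcg_data is switched over to the
                    * parent's objcg ... */

                   spin_unlock(&dst->lru_lock);
                   spin_unlock_irq(&src->lru_lock);
           }
   }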

Effect
======

With this patchset applied, the number of dying memory cgroups no longer
increases significantly when the following test script is used to
reproduce the issue.

```bash
#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
    mkdir /sys/fs/cgroup/memory/test$i
    echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

    # Append 'temp' file content to 'log'
    cat temp >> log

    echo $$ > /sys/fs/cgroup/memory/cgroup.procs

    # Potentially create a dying memory cgroup
    rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log
```

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2]. https://lwn.net/Articles/895431/
[3]. https://github.com/systemd/systemd/pull/36827

Muchun Song (22):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in lru_gen_clear_refs()
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

Qi Zheng (6):
  mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  mm: thp: prevent memory cgroup release in
    folio_split_queue_lock{_irqsave}()
  mm: zswap: prevent memory cgroup release in zswap_compress()
  mm: vmscan: prepare for reparenting traditional LRU folios
  mm: vmscan: prepare for reparenting MGLRU folios
  mm: memcontrol: refactor memcg_reparent_objcgs()

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 159 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/linux/mmzone.h           |  20 ++
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 +++-
 mm/huge_memory.c                 |  18 +-
 mm/memcontrol-v1.c               |  15 +-
 mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  20 +-
 mm/vmscan.c                      | 267 ++++++++++++++++----
 mm/workingset.c                  |  26 +-
 mm/zswap.c                       |   5 +
 19 files changed, 677 insertions(+), 356 deletions(-)

-- 
2.20.1
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Qi Zheng <qi.zheng@linux.dev> writes:

Hey!

I ran this patchset through AI review and it found a few regressions
(which can of course be false positives). When you have time, can you
please take a look and comment on which are real and which are not?

Thank you!

--

# Task
Date: 2025-12-29 19:55:20
Model: gemini-3-pro-preview
Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
Commits to review:
- e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
- 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
- a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
- d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
- 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
- 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
- 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
- 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
- c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
- 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
- f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
- 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
- bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
- a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
- 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
- de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
- c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
- 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
- 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
- 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
- d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
- 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
- e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
- d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
- dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
- 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
- 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
- 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")


# Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
No detailed regression report generated.

# Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
No detailed regression report generated.

# Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
No detailed regression report generated.

# Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
No detailed regression report generated.

# Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
Author: Muchun Song <muchun.song@linux.dev>

mm: vmscan: refactor move_folios_to_lru()

Refactors move_folios_to_lru() to handle its own locking via
folio_lruvec_relock_irq(), removing the requirement for the caller to hold
lruvec->lru_lock. Updates call sites to release the lock before calling.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 49e566174621..354b19f7365d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c

[ ... ]

> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>  	}
>  
> -	spin_lock_irq(&lruvec->lru_lock);
> -
> -	move_folios_to_lru(lruvec, &list);
> +	move_folios_to_lru(&list);
>  
>  	walk = current->reclaim_state->mm_walk;
>  	if (walk && walk->batched) {
>  		walk->lruvec = lruvec;
> +		spin_lock(&lruvec->lru_lock);

move_folios_to_lru() drops the lock and enables interrupts via
lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
use spin_lock() here with interrupts enabled? If an interrupt occurs
and tries to acquire lru_lock, could this lead to a deadlock?

>  		reset_batch_size(walk);
> +		spin_unlock(&lruvec->lru_lock);
>  	}
>  
>  	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),

[ ... ]
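
For context, the self-deadlock this question alludes to looks like the
following (a generic illustration, not code from the series):

```c
static void process_context_path(struct lruvec *lruvec)
{
	/* Process context: lock taken without disabling interrupts. */
	spin_lock(&lruvec->lru_lock);
	/*
	 * If an interrupt fires here on this CPU and its handler also does
	 *
	 *	spin_lock_irqsave(&lruvec->lru_lock, flags);
	 *
	 * the handler spins forever: it preempted the lock holder, which
	 * cannot run again to release the lock. This is why an IRQ-safe
	 * lock must be taken with spin_lock_irq()/spin_lock_irqsave() in
	 * process context unless interrupts are already disabled.
	 */
	spin_unlock(&lruvec->lru_lock);
}
```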

# Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
No detailed regression report generated.

# Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
No detailed regression report generated.

# Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
Author: Muchun Song <muchun.song@linux.dev>

mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()

This patch modifies get_mem_cgroup_from_folio() to retry the lookup if
locking the memcg fails, serving as preparation for future changes where
folios may not pin their memcg.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21b5aad34cae..431b3154c70c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>   */
>  struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>  {
> -	struct mem_cgroup *memcg = folio_memcg(folio);
> +	struct mem_cgroup *memcg;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
>  
> +	if (!folio_memcg_charged(folio))
> +		return root_mem_cgroup;
> +
>  	rcu_read_lock();
> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
> -		memcg = root_mem_cgroup;
> +retry:
> +	memcg = folio_memcg(folio);
> +	if (unlikely(!css_tryget(&memcg->css)))
> +		goto retry;

If css_tryget() fails (e.g. refcount is 0), this loop spins indefinitely
with the RCU read lock held. Is it guaranteed that folio_memcg() will
return a different, alive memcg in subsequent iterations?

If the folio is isolated (e.g. via migrate_misplaced_folio()), it might be
missed by reparenting logic that iterates LRU lists. In that case, the
folio would continue pointing to the dying memcg, leading to a hard lockup.

Also, folio_memcg() calls __folio_memcg(), which reads folio->memcg_data
without READ_ONCE(). Since this loop waits for memcg_data to be updated
by another CPU (reparenting), could the compiler hoist the load out of
the loop, preventing the update from being seen?

Finally, the previous code fell back to root_mem_cgroup on failure. Is it
safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
seems more severe than the previous behavior of warning and falling back.

>  	rcu_read_unlock();
>  	return memcg;
>  }
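
A hedged sketch of the loop shape these questions point toward (purely
illustrative, not the series' code; whether a READ_ONCE() is also needed
inside folio_memcg() is a separate question):

```c
	struct mem_cgroup *memcg;

	rcu_read_lock();
retry:
	memcg = folio_memcg(folio);	/* reparenting may rewrite memcg_data */
	if (unlikely(!css_tryget(&memcg->css))) {
		cpu_relax();	/* back off until the new binding is visible */
		goto retry;
	}
	rcu_read_unlock();
	return memcg;
```
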
# Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
No detailed regression report generated.

# Commit 65610d739afc ("writeback: prevent memory cgroup release in writeback module")

<regression>
  <description>
    Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
  </description>
  <file>
    mm/memcontrol.c
  </file>
  <lines>
    991-993
  </lines>
  <reason>
    The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
    and tries to get a css reference. There are two issues:
    1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
       hoist this load out of the loop, leading to an infinite loop if the memcg
       is offlined and the loop condition depends on reparenting updating memcg_data.
    2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
       is delayed (e.g., reparenting takes time), this creates a tight loop that
       burns CPU and stalls the RCU read-side critical section, potentially causing
       RCU stalls or soft lockups.
  </reason>
  <classification>
    Performance/Stability (Soft Lockup / Infinite Loop)
  </classification>
</regression>

<regression>
  <description>
    Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
  </description>
  <file>
    include/linux/memcontrol.h
  </file>
  <lines>
    1557
  </lines>
  <reason>
    The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
    `if (unlikely(&memcg->css != wb->memcg_css))`.
    While folio_memcg_charged() is checked earlier, there is a race window between that check
    and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
    If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
    (since css is at offset 0). The previous code safely handled NULL memcg.
  </reason>
  <classification>
    Robustness / Null Pointer Dereference
  </classification>
</regression>

# Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
No detailed regression report generated.

# Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
No detailed regression report generated.

# Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
No detailed regression report generated.

# Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
commit a1189dd21a568b45eaddafc17b982f6200ba28c0
Author: Muchun Song <muchun.song@linux.dev>

mm: mglru: prevent memory cgroup release in mglru

This patch uses the rcu read lock to safeguard against the release of
the memory cgroup in mglru, as a preparatory measure for the reparenting
of the LRU pages.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 354b19f7365d..814498a2c1bd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[ ... ]
> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  		}
>  	}
>  
> +	rcu_read_lock();
> +	memcg = folio_memcg(folio);
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
> +	gen = lru_gen_from_seq(max_seq);
> +	mm_state = get_mm_state(lruvec);
> +
>  	arch_enter_lazy_mmu_mode();
>  
>  	pte -= (addr - start) / PAGE_SIZE;
[ ... ]
> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  	if (mm_state && suitable_to_scan(i, young))
>  		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>  
> +	rcu_read_unlock();
> +
>  	return true;
>  }

This critical section appears to encompass calls to walk_update_folio(),
which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
to acquire lruvec->lru_lock.

Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock on
PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
read-side critical section on PREEMPT_RT?


# Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
No detailed regression report generated.

# Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
No detailed regression report generated.

# Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
Author: Qi Zheng <zhengqi.arch@bytedance.com>

mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()

This patch wraps the queue lookup in rcu_read_lock/unlock to safeguard
against memory cgroup release during the lookup, as folios may no longer
pin the memcg.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 12b46215b30c..b9e6855ec0b6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>  
>  static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>  {
> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	struct deferred_split *queue;
> +
> +	rcu_read_lock();
> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
> +	rcu_read_unlock();
> +
> +	return queue;
>  }

If the folio does not pin the memcg (as suggested by the commit message),
is it safe to return the queue pointer here? The queue is embedded in the
memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
pointer become invalid while the caller still holds the lock?

>  
>  static struct deferred_split *
>  folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>  {
> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
> +	struct deferred_split *queue;
> +
> +	rcu_read_lock();
> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
> +	rcu_read_unlock();
> +
> +	return queue;
>  }

# Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
No detailed regression report generated.

# Commit 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
commit 80bbd804adde7198f6c7c5fdf7b4349996586597
Author: Muchun Song <muchun.song@linux.dev>
mm: workingset: prevent lruvec release in workingset_refault()

In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.

> diff --git a/mm/workingset.c b/mm/workingset.c
> index 445fc634196d..427ca1a5625e 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>  	 * locked to guarantee folio_memcg() stability throughout.
>  	 */
>  	nr = folio_nr_pages(folio);
> +	rcu_read_lock();
>  	lruvec = folio_lruvec(folio);
>  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>  
>  	if (!workingset_test_recent(shadow, file, &workingset, true))
> -		return;
> +		goto out;

Does calling workingset_test_recent() with flush=true while holding
rcu_read_lock() risk sleeping in atomic context?

workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited() when
flush is true.  The flush path leads to css_rstat_flush(), which calls
might_sleep() and cond_resched().

>  
>  	folio_set_active(folio);
>  	workingset_age_nonresident(lruvec, nr);
> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>  		lru_note_cost_refault(folio);
>  		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>  	}
> +out:
> +	rcu_read_unlock();
>  }
>  
>  /**

# Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
No detailed regression report generated.

# Commit d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
No detailed regression report generated.

# Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
No detailed regression report generated.

# Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
No detailed regression report generated.

# Commit d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
commit d04921029e6dc92824024356a9ed83a4f82d68dc
Author: Qi Zheng <zhengqi.arch@bytedance.com>

mm: vmscan: prepare for reparenting traditional LRU folios

This commit implements a specific function to reparent LRU folios of a
child memcg to its parent memcg during the reparenting process.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 814498a2c1bd..5fd0f97c3719 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>  			  lruvec_memcg(lruvec));
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
> +				enum lru_list lru)
> +{
> +	int zid;
> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
> +
> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
> +
> +	if (lru != LRU_UNEVICTABLE)
> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);

Is it safe to modify src->lists and dst->lists here without holding
src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
likely live and subject to concurrent access by reclaim or allocation paths.

> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
> +		mz_src->lru_zone_size[zid][lru] = 0;
> +	}

Also, lru_zone_size is typically protected by lru_lock.  Is it safe to
update these counters without locking?

> +}
> +
> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
> +{
> +	int nid;
> +
> +	for_each_node(nid) {
> +		enum lru_list lru;
> +		struct lruvec *src_lruvec, *dst_lruvec;
> +
> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
> +		dst_lruvec->file_cost += src_lruvec->file_cost;
> +
> +		for_each_lru(lru)
> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
> +	}
> +}
> +#endif

# Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
Here are the findings for the provided patch.

1.  **Missing locking in `lru_gen_reparent_memcg`**
    In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its helper `__lru_gen_reparent_memcg`) modifies the LRU lists and statistics of `lruvec` structures without holding the `lru_lock`.

    Specifically:
    - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
    - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
    - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.

    The destination lruvec (`dst_lruvec`) belongs to the parent memcg, which is active and shared. Modifying its lists and counters without locking will lead to data corruption (list corruption) and statistics drift, as well as triggering lockdep warnings.

    ```c
    void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
    {
            int nid;

            for_each_node(nid) {
                    struct lruvec *src_lruvec, *dst_lruvec;
                    /* ... */
                    src_lruvec = get_lruvec(src, nid);
                    dst_lruvec = get_lruvec(dst, nid);

                    for (zone = 0; zone < MAX_NR_ZONES; zone++)
                            for (type = 0; type < ANON_AND_FILE; type++)
                                    __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
            }
    }
    ```

    The `lruvec` lock must be acquired for each node before calling `__lru_gen_reparent_memcg`.

# Commit 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
No detailed regression report generated.

# Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
file: mm/memcontrol.c
line: 224
type: Bug
category: Locking
description:
The `reparent_locks` function takes `lru_lock` for all NUMA nodes in a loop, utilizing `spin_lock_nested` with an incrementing `nest` counter. The `nest` counter increments for each lock taken (2 per node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code will trigger a Lockdep violation (and potential panic if `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4 nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes simultaneously is a scalability regression, blocking LRU operations globally during reparenting.
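
For reference, a generic illustration of how two nested locks per node
exhaust the lockdep subclass space (hypothetical code, not the series'):

```c
static void lock_all_nodes(struct mem_cgroup *src_memcg,
			   struct mem_cgroup *dst_memcg)
{
	int nid, nest = 0;

	for_each_node(nid) {
		struct lruvec *src = mem_cgroup_lruvec(src_memcg, NODE_DATA(nid));
		struct lruvec *dst = mem_cgroup_lruvec(dst_memcg, NODE_DATA(nid));

		spin_lock_nested(&src->lru_lock, nest++);	/* 0, 2, 4, ... */
		spin_lock_nested(&dst->lru_lock, nest++);	/* 1, 3, 5, ... */
	}
	/* With more than 4 nodes, nest reaches MAX_LOCKDEP_SUBCLASSES (8)
	 * and lockdep flags the 5th node's locks. */
}
```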

file: include/linux/memcontrol.h
line: 430
type: Risk
category: API
description:
The implementation of `folio_memcg` has changed to rely on `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU folios, the memcg pointer was directly embedded and stable under the folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on the folio lock for stability. While some callers may hold RCU, others might not, leading to lockdep warnings or races where `folio_memcg` returns a pointer to a memcg that is being reparented or freed. Additionally, the return value of `folio_memcg` is no longer constant for a locked folio; it can change if reparenting occurs, potentially breaking logic that assumes identity equality over time.

# Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
No detailed regression report generated.

# Summary

| Commit                                                                                        | Regressions |
| :-------------------------------------------------------------------------------------------- | :---------- |
| e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
| 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
| a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
| d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
| 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
| 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
| 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
| 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
| c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
| 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
| f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
| 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
| bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
| a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
| 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
| de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
| c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
| 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
| 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
| 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
| d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
| 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
| e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
| d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
| dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
| 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
| 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
| 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 1 week ago
On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
> Hey!
> 
> I ran this patchset through AI review and it found a few regressions
> (which can of course be false positives). When you have time, can you
> please take a look and comment on which are real and which are not?
> 
> Thank you!

Hi Roman, this is really good. I assume this is the Gemini model. I see
BPF and networking folks have automated the AI review process, which is
really good. I think MM should also adopt that model. Are you looking
into automating an MM review bot?

thanks,
Shakeel
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
>> Qi Zheng <qi.zheng@linux.dev> writes:
>> 
>> Hey!
>> 
>> I ran this patchset through AI review and it found a few regressions
>> (which can of course be false positives). When you have time, can you
>> please take a look and comment on which are real and which are not?
>> 
>> Thank you!
>
> Hi Roman, this is really good. I assume this is the Gemini model. I see
> BPF and networking folks have automated the AI review process, which is
> really good. I think MM should also adopt that model. Are you looking
> into automating an MM review bot?

Yes, absolutely. We're working on it, hopefully in January/February we
can have something reasonably solid.

Thanks
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 1 week ago
On Mon, Dec 29, 2025 at 08:11:22PM -0800, Roman Gushchin wrote:
> Shakeel Butt <shakeel.butt@linux.dev> writes:
> 
> > On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
> >> Qi Zheng <qi.zheng@linux.dev> writes:
> >> 
> >> Hey!
> >> 
> >> I ran this patchset through AI review and it found a few regressions
> >> (which can of course be false positives). When you have time, can you
> >> please take a look and comment on which are real and which are not?
> >> 
> >> Thank you!
> >
> > Hi Roman, this is really good. I assume this is the Gemini model. I see
> > BPF and networking folks have automated the AI review process, which is
> > really good. I think MM should also adopt that model. Are you looking
> > into automating an MM review bot?
> 
> Yes, absolutely. We're working on it, hopefully in January/February we
> can have something reasonably solid.

Can you share a bit more about the plan? Are you working more on the
infra side of things or also iterating on the prompts? (I assume you are
using Chris Mason's review prompts)

On the infra side, one thing I would love to have is to get early
feedback/review on my patches before posting on the list. Will it be
possible to support such a scenario?

On the prompt side, what kind of experiments are you doing to reduce the
false positives? I wonder if we can come up with some recommendations
which help maintainers to describe relevant prompts for their sub-area
and be helpful for AI reviews.

On the automation side, I assume we will start with some manual process.
I definitely see back-and-forth to improve prompts for MM and someone
needs to manually review the results generated by AI and may have to
update prompts for better results. Also we can start with opt-in
approach i.e. someone adds a tag to the subject for AI to review the
series.

Anyways, a good topic to start a separate email for discussion.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Mon, Dec 29, 2025 at 08:11:22PM -0800, Roman Gushchin wrote:
>> Shakeel Butt <shakeel.butt@linux.dev> writes:
>> 
>> > On Tue, Dec 30, 2025 at 01:36:12AM +0000, Roman Gushchin wrote:
>> >> Qi Zheng <qi.zheng@linux.dev> writes:
>> >> 
>> >> Hey!
>> >> 
>> >> I ran this patchset through AI review and it found a few regressions
>> >> (which can of course be false positives). When you have time, can you
>> >> please take a look and comment on which are real and which are not?
>> >> 
>> >> Thank you!
>> >
>> > Hi Roman, this is really good. I assume this is the Gemini model. I see
>> > BPF and networking folks have automated the AI review process, which is
>> > really good. I think MM should also adopt that model. Are you looking
>> > into automating an MM review bot?
>> 
>> Yes, absolutely. We're working on it, hopefully in January/February we
>> can have something reasonably solid.
>
> Can you share a bit more about the plan? Are you working more on the
> infra side of things or also iterating on the prompts? (I assume you are
> using Chris Mason's review prompts)

Mostly on the infra side. I want to look closer into mm and cgroups
patches, but I need a bit more data and infra to do this.

As of now we're building some basic infra to run it at scale.

> On the infra side, one thing I would love to have is to get early
> feedback/review on my patches before posting on the list. Will it be
> possible to support such a scenario?

Absolutely.
You can do it already using a local setup, assuming you have access to
some decent LLM. In my experience, a standard entry-level pro
subscription which is like $20/month these days is good enough to review
a number of patchsets per day. This works well for reviewing personal
patches and/or some targeted upstream reviews (like this one), but for
covering entire subsystems we need more infra and more token budgets
(and this is what we're building).

Btw, I put together a pre-configured environment which saves time on setting
things up: https://github.com/rgushchin/kengp .
I've been testing it with Gemini CLI, but it should work with other
similar tools with some minimal changes (and I'd love to accept patches
adding this support).

Actually I think there will be several rounds of AI reviews:
1) Developers can review their patches while working on the code and
before sending anything upstream.
2) AI bots can send feedback on proposed patchsets, as already happens
for bpf.
3) Maintainers and/or CI systems will run AI reviews to ensure the
quality of changes (e.g. eventually we can re-review mm-next nightly
to make sure it's not creating new regressions with other changes in
linux-next).

> On the prompt side, what kind of experiments are you doing to reduce the
> false positives? I wonder if we can come up with some recommendations
> which help maintainers to describe relevant prompts for their sub-area
> and be helpful for AI reviews.

I like Chris's approach here: let's look into specific examples of
false positives and missed bugs and tailor our prompt systems to handle
these cases correctly. The smarter LLMs are, the fewer tricks we really
need, so it's better to keep prompts minimal. We don't really need to
write long AI-specific texts about subsystems, but it is sometimes worth
codifying some non-obvious design principles and limitations.

> On the automation side, I assume we will start with some manual process.
> I definitely see back-and-forth to improve prompts for MM and someone
> needs to manually review the results generated by AI and may have to
> update prompts for better results. Also we can start with opt-in
> approach i.e. someone adds a tag to the subject for AI to review the
> series.

Yeah, there are some things to figure out, like how to make sure we're not
creating a lot of noise and not repeating the same feedback if it's not
useful.

Thanks!
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 1 week ago
Hi Roman,

On 12/30/25 9:36 AM, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
> Hey!
> 
> I ran this patchset through AI review and it found a few regressions
> (which can of course be false positives). When you have time, can you
> please take a look and comment on which are real and which are not?

Thank you for running the AI review for this patchset, but please do not
directly send the raw data from the AI review to the community, as this
is no different from automated review by a robot.

Thanks,
Qi

> 
> Thank you!
> 
> --
> 
> # Task
> Date: 2025-12-29 19:55:20
> Model: gemini-3-pro-preview
> Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
> Commits to review:
> - e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
> - 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
> - a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
> - d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
> - 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
> - 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
> - 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
> - 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
> - c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
> - 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
> - f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
> - 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
> - bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
> - a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
> - 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
> - de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
> - c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
> - 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
> - 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
> - 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
> - d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
> - 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
> - e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
> - d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
> - dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
> - 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
> - 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
> - 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
> 
> 
> # Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
> No detailed regression report generated.
> 
> # Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
> No detailed regression report generated.
> 
> # Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
> No detailed regression report generated.
> 
> # Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
> No detailed regression report generated.
> 
> # Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
> commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: vmscan: refactor move_folios_to_lru()
> 
> Refactors move_folios_to_lru() to handle its own locking via
> folio_lruvec_relock_irq(), removing the requirement for the caller to hold
> lruvec->lru_lock. Updates call sites to release the lock before calling.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 49e566174621..354b19f7365d 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> 
> [ ... ]
> 
>> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>   			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>>   	}
>>   
>> -	spin_lock_irq(&lruvec->lru_lock);
>> -
>> -	move_folios_to_lru(lruvec, &list);
>> +	move_folios_to_lru(&list);
>>   
>>   	walk = current->reclaim_state->mm_walk;
>>   	if (walk && walk->batched) {
>>   		walk->lruvec = lruvec;
>> +		spin_lock(&lruvec->lru_lock);
> 
> move_folios_to_lru() drops the lock and enables interrupts via
> lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
> an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
> use spin_lock() here with interrupts enabled? If an interrupt occurs
> and tries to acquire lru_lock, could this lead to a deadlock?
> 
>>   		reset_batch_size(walk);
>> +		spin_unlock(&lruvec->lru_lock);
>>   	}
>>   
>>   	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
> 
> [ ... ]
> 
> # Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
> No detailed regression report generated.
> 
> # Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
> No detailed regression report generated.
> 
> # Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
> commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
> 
> This patch modifies get_mem_cgroup_from_folio() to retry the lookup if
> locking the memcg fails, serving as preparation for future changes where
> folios may not pin their memcg.
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 21b5aad34cae..431b3154c70c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>    */
>>   struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>   {
>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>> +	struct mem_cgroup *memcg;
>>   
>>   	if (mem_cgroup_disabled())
>>   		return NULL;
>>   
>> +	if (!folio_memcg_charged(folio))
>> +		return root_mem_cgroup;
>> +
>>   	rcu_read_lock();
>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>> -		memcg = root_mem_cgroup;
>> +retry:
>> +	memcg = folio_memcg(folio);
>> +	if (unlikely(!css_tryget(&memcg->css)))
>> +		goto retry;
> 
> If css_tryget() fails (e.g. refcount is 0), this loop spins indefinitely
> with the RCU read lock held. Is it guaranteed that folio_memcg() will
> return a different, alive memcg in subsequent iterations?
> 
> If the folio is isolated (e.g. via migrate_misplaced_folio()), it might be
> missed by reparenting logic that iterates LRU lists. In that case, the
> folio would continue pointing to the dying memcg, leading to a hard lockup.
> 
> Also, folio_memcg() calls __folio_memcg(), which reads folio->memcg_data
> without READ_ONCE(). Since this loop waits for memcg_data to be updated
> by another CPU (reparenting), could the compiler hoist the load out of
> the loop, preventing the update from being seen?
> 
> Finally, the previous code fell back to root_mem_cgroup on failure. Is it
> safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
> seems more severe than the previous behavior of warning and falling back.
> 
>>   	rcu_read_unlock();
>>   	return memcg;
>>   }
> # Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
> No detailed regression report generated.
> 
> # Commit 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
> 
> <regression>
>    <description>
>      Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
>    </description>
>    <file>
>      mm/memcontrol.c
>    </file>
>    <lines>
>      991-993
>    </lines>
>    <reason>
>      The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
>      and tries to get a css reference. There are two issues:
>      1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
>         hoist this load out of the loop, leading to an infinite loop if the memcg
>         is offlined and the loop condition depends on reparenting updating memcg_data.
>      2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
>         is delayed (e.g., reparenting takes time), this creates a tight loop that
>         burns CPU and stalls the RCU read-side critical section, potentially causing
>         RCU stalls or soft lockups.
>    </reason>
>    <classification>
>      Performance/Stability (Soft Lockup / Infinite Loop)
>    </classification>
> </regression>
> 
> <regression>
>    <description>
>      Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
>    </description>
>    <file>
>      include/linux/memcontrol.h
>    </file>
>    <lines>
>      1557
>    </lines>
>    <reason>
>      The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
>      `if (unlikely(&memcg->css != wb->memcg_css))`.
>      While folio_memcg_charged() is checked earlier, there is a race window between that check
>      and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
>      If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
>      (since css is at offset 0). The previous code safely handled NULL memcg.
>    </reason>
>    <classification>
>      Robustness / Null Pointer Dereference
>    </classification>
> </regression>
> 
> # Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
> No detailed regression report generated.
> 
> # Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
> No detailed regression report generated.
> 
> # Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
> No detailed regression report generated.
> 
> # Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
> commit a1189dd21a568b45eaddafc17b982f6200ba28c0
> Author: Muchun Song <muchun.song@linux.dev>
> 
> mm: mglru: prevent memory cgroup release in mglru
> 
> This patch uses the rcu read lock to safeguard against the release of
> the memory cgroup in mglru, as a preparatory measure for the reparenting
> of the LRU pages.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 354b19f7365d..814498a2c1bd 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> [ ... ]
>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   		}
>>   	}
>>   
>> +	rcu_read_lock();
>> +	memcg = folio_memcg(folio);
>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>> +	gen = lru_gen_from_seq(max_seq);
>> +	mm_state = get_mm_state(lruvec);
>> +
>>   	arch_enter_lazy_mmu_mode();
>>   
>>   	pte -= (addr - start) / PAGE_SIZE;
> [ ... ]
>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>   	if (mm_state && suitable_to_scan(i, young))
>>   		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>   
>> +	rcu_read_unlock();
>> +
>>   	return true;
>>   }
> 
> This critical section appears to encompass calls to walk_update_folio(),
> which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
> to acquire lruvec->lru_lock.
> 
> Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock on
> PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
> read-side critical section on PREEMPT_RT?
> 
> 
> # Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
> No detailed regression report generated.
> 
> # Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
> No detailed regression report generated.
> 
> # Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
> commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
> Author: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
> 
> This patch wraps the queue lookup in rcu_read_lock/unlock to safeguard
> against memory cgroup release during the lookup, as folios may no longer
> pin the memcg.
> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 12b46215b30c..b9e6855ec0b6 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>   
>>   static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>   {
>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>> +	struct deferred_split *queue;
>> +
>> +	rcu_read_lock();
>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>> +	rcu_read_unlock();
>> +
>> +	return queue;
>>   }
> 
> If the folio does not pin the memcg (as suggested by the commit message),
> is it safe to return the queue pointer here? The queue is embedded in the
> memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
> pointer become invalid while the caller still holds the lock?
> 
>>   
>>   static struct deferred_split *
>>   folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>   {
>> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>> +	struct deferred_split *queue;
>> +
>> +	rcu_read_lock();
>> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>> +	rcu_read_unlock();
>> +
>> +	return queue;
>>   }
> 
> # Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
> No detailed regression report generated.
> 
> # Commit 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
> commit 80bbd804adde7198f6c7c5fdf7b4349996586597
> Author: Muchun Song <muchun.song@linux.dev>
> mm: workingset: prevent lruvec release in workingset_refault()
> 
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
> 
>> diff --git a/mm/workingset.c b/mm/workingset.c
>> index 445fc634196d..427ca1a5625e 100644
>> --- a/mm/workingset.c
>> +++ b/mm/workingset.c
>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>   	 * locked to guarantee folio_memcg() stability throughout.
>>   	 */
>>   	nr = folio_nr_pages(folio);
>> +	rcu_read_lock();
>>   	lruvec = folio_lruvec(folio);
>>   	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>   
>>   	if (!workingset_test_recent(shadow, file, &workingset, true))
>> -		return;
>> +		goto out;
> 
> Does calling workingset_test_recent() with flush=true while holding
> rcu_read_lock() risk sleeping in atomic context?
> 
> workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited() when
> flush is true.  The flush path leads to css_rstat_flush(), which calls
> might_sleep() and cond_resched().
> 
>>   
>>   	folio_set_active(folio);
>>   	workingset_age_nonresident(lruvec, nr);
>> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>>   		lru_note_cost_refault(folio);
>>   		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>>   	}
>> +out:
>> +	rcu_read_unlock();
>>   }
>>   
>>   /**
> 
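If the sleep-in-atomic concern above holds, one illustrative rearrangement (a sketch, not a proposal from the series) is to perform the potentially sleeping flush before entering the RCU section and then call workingset_test_recent() with flushing disabled. `eviction_memcg` is a stand-in name here; the real flush target is resolved from the shadow entry inside workingset_test_recent().

```c
	nr = folio_nr_pages(folio);

	/* Illustrative only: the (possibly sleeping) ratelimited flush
	 * happens while sleeping is still allowed. */
	mem_cgroup_flush_stats_ratelimited(eviction_memcg);

	rcu_read_lock();
	lruvec = folio_lruvec(folio);
	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);

	if (!workingset_test_recent(shadow, file, &workingset, false))
		goto out;
```
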
> # Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
> No detailed regression report generated.
> 
> # Commit d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
> No detailed regression report generated.
> 
> # Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
> No detailed regression report generated.
> 
> # Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
> No detailed regression report generated.
> 
> # Commit d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
> commit d04921029e6dc92824024356a9ed83a4f82d68dc
> Author: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> mm: vmscan: prepare for reparenting traditional LRU folios
> 
> This commit implements a specific function to reparent LRU folios of a
> child memcg to its parent memcg during the reparenting process.
> 
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 814498a2c1bd..5fd0f97c3719 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>   			  lruvec_memcg(lruvec));
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>> +				enum lru_list lru)
>> +{
>> +	int zid;
>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>> +
>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>> +
>> +	if (lru != LRU_UNEVICTABLE)
>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
> 
> Is it safe to modify src->lists and dst->lists here without holding
> src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
> likely live and subject to concurrent access by reclaim or allocation paths.
> 
>> +
>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>> +		mz_src->lru_zone_size[zid][lru] = 0;
>> +	}
> 
> Also, lru_zone_size is typically protected by lru_lock.  Is it safe to
> update these counters without locking?
> 
>> +}
>> +
>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>> +{
>> +	int nid;
>> +
>> +	for_each_node(nid) {
>> +		enum lru_list lru;
>> +		struct lruvec *src_lruvec, *dst_lruvec;
>> +
>> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
>> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
>> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
>> +		dst_lruvec->file_cost += src_lruvec->file_cost;
>> +
>> +		for_each_lru(lru)
>> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
>> +	}
>> +}
>> +#endif
> 
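On the two locking questions above: a per-node, pairwise scheme is one illustrative way to cover both the list splices and the lru_zone_size updates without unbounded lock nesting. A sketch, under the assumption that reparenting of a given memcg is serialized elsewhere so the child-before-parent order is stable:

```c
void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
{
	int nid;

	for_each_node(nid) {
		enum lru_list lru;
		struct lruvec *src_lruvec, *dst_lruvec;

		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));

		/* Child before parent; a fixed subclass keeps lockdep
		 * happy regardless of the number of nodes. */
		spin_lock_irq(&src_lruvec->lru_lock);
		spin_lock_nested(&dst_lruvec->lru_lock,
				 SINGLE_DEPTH_NESTING);

		dst_lruvec->anon_cost += src_lruvec->anon_cost;
		dst_lruvec->file_cost += src_lruvec->file_cost;
		for_each_lru(lru)
			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);

		spin_unlock(&dst_lruvec->lru_lock);
		spin_unlock_irq(&src_lruvec->lru_lock);
	}
}
```
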
> # Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
> Here are the findings for the provided patch.
> 
> 1.  **Missing locking in `lru_gen_reparent_memcg`**
>      In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its helper `__lru_gen_reparent_memcg`) modifies the LRU lists and statistics of `lruvec` structures without holding the `lru_lock`.
> 
>      Specifically:
>      - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
>      - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
>      - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.
> 
>      The destination lruvec (`dst_lruvec`) belongs to the parent memcg, which is active and shared. Modifying its lists and counters without locking will lead to data corruption (list corruption) and statistics drift, as well as triggering lockdep warnings.
> 
>      ```c
>      void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>      {
>              int nid;
> 
>              for_each_node(nid) {
>                      struct lruvec *src_lruvec, *dst_lruvec;
>                      /* ... */
>                      src_lruvec = get_lruvec(src, nid);
>                      dst_lruvec = get_lruvec(dst, nid);
> 
>                      for (zone = 0; zone < MAX_NR_ZONES; zone++)
>                              for (type = 0; type < ANON_AND_FILE; type++)
>                                      __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
>              }
>      }
>      ```
> 
>      The `lruvec` lock must be acquired for each node before calling `__lru_gen_reparent_memcg`.
> 
> # Commit 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
> No detailed regression report generated.
> 
> # Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
> file: mm/memcontrol.c
> line: 224
> type: Bug
> category: Locking
> description:
> The `reparent_locks` function takes `lru_lock` for all NUMA nodes in a loop, utilizing `spin_lock_nested` with an incrementing `nest` counter. The `nest` counter increments for each lock taken (2 per node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code will trigger a Lockdep violation (and potential panic if `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4 nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes simultaneously is a scalability regression, blocking LRU operations globally during reparenting.
> 
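For reference, the limit being tripped: lockdep only validates subclasses below MAX_LOCKDEP_SUBCLASSES (8). A reconstruction of the reported pattern (src_memcg/dst_memcg are stand-in names, not the patch's identifiers):

```c
int nest = 0;
int nid;

for_each_node(nid) {
	struct lruvec *src = mem_cgroup_lruvec(src_memcg, NODE_DATA(nid));
	struct lruvec *dst = mem_cgroup_lruvec(dst_memcg, NODE_DATA(nid));

	/* Two subclasses per node: 0,1 on node 0, 2,3 on node 1, ...
	 * Node 4 requests subclasses 8 and 9, which lockdep rejects. */
	spin_lock_nested(&src->lru_lock, nest++);
	spin_lock_nested(&dst->lru_lock, nest++);
}
```
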
> file: include/linux/memcontrol.h
> line: 430
> type: Risk
> category: API
> description:
> The implementation of `folio_memcg` has changed to rely on `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU folios, the memcg pointer was directly embedded and stable under the folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on the folio lock for stability. While some callers may hold RCU, others might not, leading to lockdep warnings or races where `folio_memcg` returns a pointer to a memcg that is being reparented or freed. Additionally, the return value of `folio_memcg` is no longer constant for a locked folio; it can change if reparenting occurs, potentially breaking logic that assumes identity equality over time.
> 
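To illustrate the second finding: under the new rules, a caller that needs the memcg to stay valid beyond an RCU section has to pin it explicitly, roughly as below. This is a sketch; the failure policy on a dying memcg is an assumption, not something the patch dictates.

```c
struct mem_cgroup *memcg;

rcu_read_lock();
memcg = folio_memcg(folio);        /* stable only inside this section */
if (memcg && !css_tryget(&memcg->css))
	memcg = NULL;              /* raced with offlining; the caller
	                            * must pick a policy (retry, fall
	                            * back to root, or skip)           */
rcu_read_unlock();

if (memcg) {
	/* memcg is pinned here, but may already have been reparented:
	 * a later folio_memcg(folio) can return a different cgroup. */
	css_put(&memcg->css);
}
```
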
> # Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
> No detailed regression report generated.
> 
> # Summary
> 
> | Commit                                                                                        | Regressions |
> | :-------------------------------------------------------------------------------------------- | :---------- |
> | e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
> | 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
> | a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
> | d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
> | 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
> | 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
> | 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
> | 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
> | c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
> | 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
> | f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
> | 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
> | bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
> | a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
> | 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
> | de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
> | c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
> | 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
> | 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
> | 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
> | d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
> | 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
> | e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
> | d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
> | dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
> | 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
> | 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
> | 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Qi Zheng <qi.zheng@linux.dev> writes:

> Hi Roman,
>
> On 12/30/25 9:36 AM, Roman Gushchin wrote:
>> Qi Zheng <qi.zheng@linux.dev> writes:
>> Hey!
>> I ran this patchset through AI review and it found a few regressions
>> (which can of course be false positives). When you have time, can you
>> please take a look and comment on which are real and which are not?
>
> Thank you for running the AI review for this patchset, but please do not
> directly send the raw data from the AI review to the community, as this
> is no different from automated review by a robot.

Hi Qi,

I don't know why you're so negative towards it. It's been great at
finding pretty tricky bugs often missed by human reviewers. In no way
is it a replacement for human reviews, but if a robot can find real
issues and make the kernel more reliable and safe, I'm in.

In my experience it finds real problems pretty often. In my measurements
at least 50% of the reported issues are real, and this matches the data
reported by others. Some subsystems (e.g. bpf) pass all patches through
the AI review.

In any case feel free to ignore it.

>
> Thanks,
> Qi
>
>> Thank you!
>> --
>> [ ... ]
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 1 week ago

On 12/30/25 12:20 PM, Roman Gushchin wrote:
> Qi Zheng <qi.zheng@linux.dev> writes:
> 
>> Hi Roman,
>>
>> On 12/30/25 9:36 AM, Roman Gushchin wrote:
>>> Qi Zheng <qi.zheng@linux.dev> writes:
>>> Hey!
>>> I ran this patchset through AI review and it found a few regressions
>>> (which can of course be false positives). When you have time, can you
>>> please take a look and comment on which are real and which are not?
>>
>> Thank you for running the AI review for this patchset, but please do not
>> directly send the raw data from the AI review to the community, as this
>> is no different from automated review by a robot.
> 
> Hi Qi,
> 
> I don't know why you're so negative towards it. It's been great at

No, I don't object to having a dedicated robot do this.

> finding pretty tricky bugs often missed by human reviewers. In no way
> is it a replacement for human reviews, but if a robot can find real
> issues and make the kernel more reliable and safe, I'm in.

I just think you should do a preliminary review of the AI review results
instead of sending them out directly. Otherwise, if everyone does this,
the community will be full of bots.

No?

> 
> In my experience it finds real problems pretty often. In my measurements
> at least 50% of the reported issues are real, and this matches the data
> reported by others. Some subsystems (e.g. bpf) pass all patches through
> the AI review.
> 
> In any case feel free to ignore it.
> 
>>
>> Thanks,
>> Qi
>>
>>> Thank you!
>>> --
>>> # Task
>>> Date: 2025-12-29 19:55:20
>>> Model: gemini-3-pro-preview
>>> Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation about bpf kfunc parameter validation")
>>> Commits to review:
>>> - e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")
>>> - 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")
>>> - a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")
>>> - d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")
>>> - 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>>> - 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")
>>> - 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")
>>> - 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()")
>>> - c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>>> - 65610d739afc ("writeback: prevent memory cgroup release in writeback module")
>>> - f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")
>>> - 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")
>>> - bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")
>>> - a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")
>>> - 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")
>>> - de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")
>>> - c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()")
>>> - 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")
>>> - 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")
>>> - 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")
>>> - d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")
>>> - 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")
>>> - e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
>>> - d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")
>>> - dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")
>>> - 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")
>>> - 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")
>>> - 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")
>>> # Commit e416d881eea4 ("mm: memcontrol: remove dead code of checking
>>> parent memory cgroup")
>>> No detailed regression report generated.
>>> # Commit 8e00ae594254 ("mm: workingset: use folio_lruvec() in
>>> workingset_refault()")
>>> No detailed regression report generated.
>>> # Commit a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its
>>> variants")
>>> No detailed regression report generated.
>>> # Commit d57d548a3d6b ("mm: vmscan: prepare for the refactoring the
>>> move_folios_to_lru()")
>>> No detailed regression report generated.
>>> # Commit 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")
>>> commit 9b02a45b6fc8a2253e2edcccb24c8485d3dca672
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: vmscan: refactor move_folios_to_lru()
>>> Refactors move_folios_to_lru() to handle its own locking via
>>> folio_lruvec_relock_irq(), removing the requirement for the caller to hold
>>> lruvec->lru_lock. Updates call sites to release the lock before calling.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 49e566174621..354b19f7365d 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>> [ ... ]
>>>
>>>> @@ -4736,14 +4734,14 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>>    			set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>>>>    	}
>>>>    -	spin_lock_irq(&lruvec->lru_lock);
>>>> -
>>>> -	move_folios_to_lru(lruvec, &list);
>>>> +	move_folios_to_lru(&list);
>>>>      	walk = current->reclaim_state->mm_walk;
>>>>    	if (walk && walk->batched) {
>>>>    		walk->lruvec = lruvec;
>>>> +		spin_lock(&lruvec->lru_lock);
>>> move_folios_to_lru() drops the lock and enables interrupts via
>>> lruvec_unlock_irq() (which calls spin_unlock_irq()). Since lru_lock is
>>> an IRQ-safe lock (used elsewhere with spin_lock_irq()), is it safe to
>>> use spin_lock() here with interrupts enabled? If an interrupt occurs
>>> and tries to acquire lru_lock, could this lead to a deadlock?
>>>
>>>>    		reset_batch_size(walk);
>>>> +		spin_unlock(&lruvec->lru_lock);
>>>>    	}
>>>>      	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD +
>>>> reclaimer_offset(sc),
>>> [ ... ]
>>> # Commit 057fca991b78 ("mm: memcontrol: allocate object cgroup for
>>> non-kmem case")
>>> No detailed regression report generated.
>>> # Commit 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup
>>> for root memory cgroup")
>>> No detailed regression report generated.
>>> # Commit 8479f2eef536 ("mm: memcontrol: prevent memory cgroup
>>> release in get_mem_cgroup_from_folio()")
>>> commit 8479f2eef5363fb93cc593038e9bc54dd5068ef2
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: memcontrol: prevent memory cgroup release in
>>> get_mem_cgroup_from_folio()
>>> This patch modifies get_mem_cgroup_from_folio() to retry the lookup
>>> if
>>> locking the memcg fails, serving as preparation for future changes where
>>> folios may not pin their memcg.
>>>
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 21b5aad34cae..431b3154c70c 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
>>>>     */
>>>>    struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
>>>>    {
>>>> -	struct mem_cgroup *memcg = folio_memcg(folio);
>>>> +	struct mem_cgroup *memcg;
>>>>      	if (mem_cgroup_disabled())
>>>>    		return NULL;
>>>>    +	if (!folio_memcg_charged(folio))
>>>> +		return root_mem_cgroup;
>>>> +
>>>>    	rcu_read_lock();
>>>> -	if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
>>>> -		memcg = root_mem_cgroup;
>>>> +retry:
>>>> +	memcg = folio_memcg(folio);
>>>> +	if (unlikely(!css_tryget(&memcg->css)))
>>>> +		goto retry;
>>> If css_tryget() fails (e.g. refcount is 0), this loop spins
>>> indefinitely
>>> with the RCU read lock held. Is it guaranteed that folio_memcg() will
>>> return a different, alive memcg in subsequent iterations?
>>> If the folio is isolated (e.g. via migrate_misplaced_folio()), it
>>> might be
>>> missed by reparenting logic that iterates LRU lists. In that case, the
>>> folio would continue pointing to the dying memcg, leading to a hard lockup.
>>> Also, folio_memcg() calls __folio_memcg(), which reads
>>> folio->memcg_data
>>> without READ_ONCE(). Since this loop waits for memcg_data to be updated
>>> by another CPU (reparenting), could the compiler hoist the load out of
>>> the loop, preventing the update from being seen?
>>> Finally, the previous code fell back to root_mem_cgroup on
>>> failure. Is it
>>> safe to remove that fallback? If css_tryget() fails unexpectedly, hanging
>>> seems more severe than the previous behavior of warning and falling back.
>>>
>>>>    	rcu_read_unlock();
>>>>    	return memcg;
>>>>    }
>>> # Commit c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")
>>> No detailed regression report generated.
>>> # Commit 65610d739afc ("writeback: prevent memory cgroup release in
>>> writeback module")
>>> <regression>
>>>     <description>
>>>       Tight infinite loop in get_mem_cgroup_from_folio due to missing READ_ONCE and cpu_relax
>>>     </description>
>>>     <file>
>>>       mm/memcontrol.c
>>>     </file>
>>>     <lines>
>>>       991-993
>>>     </lines>
>>>     <reason>
>>>       The retry loop in get_mem_cgroup_from_folio() continuously checks folio_memcg(folio)
>>>       and tries to get a css reference. There are two issues:
>>>       1. folio_memcg() reads folio->memcg_data without READ_ONCE(). The compiler may
>>>          hoist this load out of the loop, leading to an infinite loop if the memcg
>>>          is offlined and the loop condition depends on reparenting updating memcg_data.
>>>       2. The loop does not contain cpu_relax(). If the update to folio->memcg_data
>>>          is delayed (e.g., reparenting takes time), this creates a tight loop that
>>>          burns CPU and stalls the RCU read-side critical section, potentially causing
>>>          RCU stalls or soft lockups.
>>>     </reason>
>>>     <classification>
>>>       Performance/Stability (Soft Lockup / Infinite Loop)
>>>     </classification>
>>> </regression>
>>> <regression>
>>>     <description>
>>>       Removal of NULL check in mem_cgroup_track_foreign_dirty introduces potential crash
>>>     </description>
>>>     <file>
>>>       include/linux/memcontrol.h
>>>     </file>
>>>     <lines>
>>>       1557
>>>     </lines>
>>>     <reason>
>>>       The check `if (unlikely(memcg && &memcg->css != wb->memcg_css))` was replaced with
>>>       `if (unlikely(&memcg->css != wb->memcg_css))`.
>>>       While folio_memcg_charged() is checked earlier, there is a race window between that check
>>>       and the call to folio_memcg() where the folio could be uncharged (setting memcg_data to 0).
>>>       If folio_memcg() returns NULL, accessing `&memcg->css` will dereference the NULL pointer
>>>       (since css is at offset 0). The previous code safely handled NULL memcg.
>>>     </reason>
>>>     <classification>
>>>       Robustness / Null Pointer Dereference
>>>     </classification>
>>> </regression>
>>> # Commit f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup
>>> release in count_memcg_folio_events()")
>>> No detailed regression report generated.
>>> # Commit 91e4b3924291 ("mm: page_io: prevent memory cgroup release
>>> in page_io module")
>>> No detailed regression report generated.
>>> # Commit bb45e352bb34 ("mm: migrate: prevent memory cgroup release
>>> in folio_migrate_mapping()")
>>> No detailed regression report generated.
>>> # Commit a1189dd21a56 ("mm: mglru: prevent memory cgroup release in
>>> mglru")
>>> commit a1189dd21a568b45eaddafc17b982f6200ba28c0
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: mglru: prevent memory cgroup release in mglru
>>> This patch uses the rcu read lock to safeguard against the release
>>> of
>>> the memory cgroup in mglru, as a preparatory measure for the reparenting
>>> of the LRU pages.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 354b19f7365d..814498a2c1bd 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>> [ ... ]
>>>> @@ -4242,6 +4244,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>>    		}
>>>>    	}
>>>>    +	rcu_read_lock();
>>>> +	memcg = folio_memcg(folio);
>>>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>> +	max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
>>>> +	gen = lru_gen_from_seq(max_seq);
>>>> +	mm_state = get_mm_state(lruvec);
>>>> +
>>>>    	arch_enter_lazy_mmu_mode();
>>>>      	pte -= (addr - start) / PAGE_SIZE;
>>> [ ... ]
>>>> @@ -4282,6 +4291,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>>>>    	if (mm_state && suitable_to_scan(i, young))
>>>>    		update_bloom_filter(mm_state, max_seq, pvmw->pmd);
>>>>    +	rcu_read_unlock();
>>>> +
>>>>    	return true;
>>>>    }
>>> This critical section appears to encompass calls to
>>> walk_update_folio(),
>>> which calls folio_activate(), which eventually calls folio_lruvec_lock_irq()
>>> to acquire lruvec->lru_lock.
>>> Since lruvec->lru_lock is a spinlock_t, it becomes a sleeping lock
>>> on
>>> PREEMPT_RT kernels. Is it safe to acquire a sleeping lock inside an RCU
>>> read-side critical section on PREEMPT_RT?
>>> # Commit 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup
>>> release in mem_cgroup_swap_full()")
>>> No detailed regression report generated.
>>> # Commit de63e2b7a03e ("mm: workingset: prevent memory cgroup
>>> release in lru_gen_eviction()")
>>> No detailed regression report generated.
>>> # Commit c0cce04fd4dc ("mm: thp: prevent memory cgroup release in
>>> folio_split_queue_lock{_irqsave}()")
>>> commit c0cce04fd4dca3a2b00e8846cbbeb52bd2f27625
>>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>>> mm: thp: prevent memory cgroup release in
>>> folio_split_queue_lock{_irqsave}()
>>> This patch wraps the queue lookup in rcu_read_lock/unlock to
>>> safeguard
>>> against memory cgroup release during the lookup, as folios may no longer
>>> pin the memcg.
>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 12b46215b30c..b9e6855ec0b6 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1154,13 +1154,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
>>>>      static struct deferred_split *folio_split_queue_lock(struct
>>>> folio *folio)
>>>>    {
>>>> -	return split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>>> +	struct deferred_split *queue;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return queue;
>>>>    }
>>> If the folio does not pin the memcg (as suggested by the commit
>>> message),
>>> is it safe to return the queue pointer here? The queue is embedded in the
>>> memcg, so if the memcg is freed after rcu_read_unlock(), wouldn't this
>>> pointer become invalid while the caller still holds the lock?
>>>
>>>>      static struct deferred_split *
>>>>    folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>>>    {
>>>> -	return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>>> +	struct deferred_split *queue;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return queue;
>>>>    }
>>> # Commit 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in
>>> zswap_compress()")
>>> No detailed regression report generated.
>>> # Commit 80bbd804adde ("mm: workingset: prevent lruvec release in
>>> workingset_refault()")
>>> commit 80bbd804adde7198f6c7c5fdf7b4349996586597
>>> Author: Muchun Song <muchun.song@linux.dev>
>>> mm: workingset: prevent lruvec release in workingset_refault()
>>> In the near future, a folio will no longer pin its corresponding
>>> memory cgroup. So an lruvec returned by folio_lruvec() could be
>>> released without the rcu read lock or a reference to its memory
>>> cgroup.
>>>
>>>> diff --git a/mm/workingset.c b/mm/workingset.c
>>>> index 445fc634196d..427ca1a5625e 100644
>>>> --- a/mm/workingset.c
>>>> +++ b/mm/workingset.c
>>>> @@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>>    	 * locked to guarantee folio_memcg() stability throughout.
>>>>    	 */
>>>>    	nr = folio_nr_pages(folio);
>>>> +	rcu_read_lock();
>>>>    	lruvec = folio_lruvec(folio);
>>>>    	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>>>>      	if (!workingset_test_recent(shadow, file, &workingset, true))
>>>> -		return;
>>>> +		goto out;
>>> Does calling workingset_test_recent() with flush=true while holding
>>> rcu_read_lock() risk sleeping in atomic context?
>>> workingset_test_recent() calls mem_cgroup_flush_stats_ratelimited()
>>> when
>>> flush is true.  The flush path leads to css_rstat_flush(), which calls
>>> might_sleep() and cond_resched().
>>>
>>>>      	folio_set_active(folio);
>>>>    	workingset_age_nonresident(lruvec, nr);
>>>> @@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
>>>>    		lru_note_cost_refault(folio);
>>>>    		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
>>>>    	}
>>>> +out:
>>>> +	rcu_read_unlock();
>>>>    }
>>>>      /**
>>> # Commit 9d232388a8e3 ("mm: zswap: prevent lruvec release in
>>> zswap_folio_swapin()")
>>> No detailed regression report generated.
>>> # Commit d7cb66b9350d ("mm: swap: prevent lruvec release in
>>> lru_gen_clear_refs()")
>>> No detailed regression report generated.
>>> # Commit 3e71e5543c8f ("mm: workingset: prevent lruvec release in
>>> workingset_activation()")
>>> No detailed regression report generated.
>>> # Commit e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU
>>> pages for lruvec lock")
>>> No detailed regression report generated.
>>> # Commit d04921029e6d ("mm: vmscan: prepare for reparenting
>>> traditional LRU folios")
>>> commit d04921029e6dc92824024356a9ed83a4f82d68dc
>>> Author: Qi Zheng <zhengqi.arch@bytedance.com>
>>> mm: vmscan: prepare for reparenting traditional LRU folios
>>> This commit implements a specific function to reparent LRU folios of
>>> a
>>> child memcg to its parent memcg during the reparenting process.
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 814498a2c1bd..5fd0f97c3719 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -2648,6 +2648,44 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
>>>>    			  lruvec_memcg(lruvec));
>>>>    }
>>>>    +#ifdef CONFIG_MEMCG
>>>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>>>> +				enum lru_list lru)
>>>> +{
>>>> +	int zid;
>>>> +	struct mem_cgroup_per_node *mz_src, *mz_dst;
>>>> +
>>>> +	mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>>>> +	mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>>>> +
>>>> +	if (lru != LRU_UNEVICTABLE)
>>>> +		list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
>>> Is it safe to modify src->lists and dst->lists here without holding
>>> src->lru_lock or dst->lru_lock?  Since dst is the parent memcg, it is
>>> likely live and subject to concurrent access by reclaim or allocation paths.
>>>
>>>> +
>>>> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>>>> +		mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>>>> +		mz_src->lru_zone_size[zid][lru] = 0;
>>>> +	}
>>> Also, lru_zone_size is typically protected by lru_lock.  Is it safe
>>> to
>>> update these counters without locking?
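>>>
>>> If locking is required here, one conventional shape would be the
>>> address-ordered double-lock idiom (a sketch only, not from the patch):
>>>
>>> ```c
>>> if (&src->lru_lock < &dst->lru_lock) {
>>> 	spin_lock_irq(&src->lru_lock);
>>> 	spin_lock_nested(&dst->lru_lock, SINGLE_DEPTH_NESTING);
>>> } else {
>>> 	spin_lock_irq(&dst->lru_lock);
>>> 	spin_lock_nested(&src->lru_lock, SINGLE_DEPTH_NESTING);
>>> }
>>> ```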
>>>
>>>> +}
>>>> +
>>>> +void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>>> +{
>>>> +	int nid;
>>>> +
>>>> +	for_each_node(nid) {
>>>> +		enum lru_list lru;
>>>> +		struct lruvec *src_lruvec, *dst_lruvec;
>>>> +
>>>> +		src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
>>>> +		dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
>>>> +		dst_lruvec->anon_cost += src_lruvec->anon_cost;
>>>> +		dst_lruvec->file_cost += src_lruvec->file_cost;
>>>> +
>>>> +		for_each_lru(lru)
>>>> +			lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
>>>> +	}
>>>> +}
>>>> +#endif
>>> # Commit dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU
>>> folios")
>>> Here are the findings for the provided patch.
>>> 1.  **Missing locking in `lru_gen_reparent_memcg`**
>>> In `mm/vmscan.c`, the function `lru_gen_reparent_memcg` (and its
>>> helper `__lru_gen_reparent_memcg`) modifies the LRU lists and
>>> statistics of `lruvec` structures without holding the `lru_lock`.
>>>       Specifically:
>>>       - `__lru_gen_reparent_memcg` calls `__update_lru_size`.
>>>       - `__update_lru_size` has a `lockdep_assert_held(&lruvec->lru_lock)`.
>>>       - `__lru_gen_reparent_memcg` calls `list_splice_tail_init` to move folios from the source lruvec to the destination lruvec.
>>>       The destination lruvec (`dst_lruvec`) belongs to the parent
>>> memcg, which is active and shared. Modifying its lists and counters
>>> without locking will lead to data corruption (list corruption) and
>>> statistics drift, as well as triggering lockdep warnings.
>>>       ```c
>>>       void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
>>>       {
>>>               int nid;
>>>               for_each_node(nid) {
>>>                       struct lruvec *src_lruvec, *dst_lruvec;
>>>                       /* ... */
>>>                       src_lruvec = get_lruvec(src, nid);
>>>                       dst_lruvec = get_lruvec(dst, nid);
>>>                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
>>>                               for (type = 0; type < ANON_AND_FILE; type++)
>>>                                       __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
>>>               }
>>>       }
>>>       ```
>>>       The `lruvec` lock must be acquired for each node before calling
>>> `__lru_gen_reparent_memcg`.
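>>>       That is, something of the following shape (a sketch reusing the
>>> names from the report; not from the patch):
>>>       ```c
>>>       for_each_node(nid) {
>>>               struct lruvec *src_lruvec = get_lruvec(src, nid);
>>>               struct lruvec *dst_lruvec = get_lruvec(dst, nid);
>>>
>>>               spin_lock_irq(&src_lruvec->lru_lock);
>>>               spin_lock_nested(&dst_lruvec->lru_lock, SINGLE_DEPTH_NESTING);
>>>
>>>               /* ... __lru_gen_reparent_memcg() for each zone/type ... */
>>>
>>>               spin_unlock(&dst_lruvec->lru_lock);
>>>               spin_unlock_irq(&src_lruvec->lru_lock);
>>>       }
>>>       ```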
>>> # Commit 18791acfe86b ("mm: memcontrol: refactor
>>> memcg_reparent_objcgs()")
>>> No detailed regression report generated.
>>> # Commit 19f79273a812 ("mm: memcontrol: eliminate the problem of
>>> dying memory cgroup for LRU folios")
>>> file: mm/memcontrol.c
>>> line: 224
>>> type: Bug
>>> category: Locking
>>> description:
>>> The `reparent_locks` function takes `lru_lock` for all NUMA nodes in
>>> a loop, utilizing `spin_lock_nested` with an incrementing `nest`
>>> counter. The `nest` counter increments for each lock taken (2 per
>>> node: src and dst). Since `MAX_LOCKDEP_SUBCLASSES` is 8, this code
>>> will trigger a lockdep violation (and potential panic if
>>> `panic_on_warn` is set) on systems with more than 4 NUMA nodes (4
>>> nodes * 2 locks = 8 subclasses). Furthermore, locking all nodes
>>> simultaneously is a scalability regression, blocking LRU operations
>>> globally during reparenting.
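>>> A sketch of the flagged pattern (reconstructed from this description;
>>> the variable names are assumed):
>>>
>>> ```c
>>> int nest = 0;
>>>
>>> for_each_node(nid) {
>>> 	spin_lock_nested(&src_lruvec->lru_lock, nest++);
>>> 	spin_lock_nested(&dst_lruvec->lru_lock, nest++);
>>> }
>>> /* With more than 4 nodes, nest reaches MAX_LOCKDEP_SUBCLASSES (8),
>>>  * which lockdep rejects as an out-of-range subclass. */
>>> ```
>>>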
>>> file: include/linux/memcontrol.h
>>> line: 430
>>> type: Risk
>>> category: API
>>> description:
>>> The implementation of `folio_memcg` has changed to rely on
>>> `obj_cgroup_memcg`, which enforces that `rcu_read_lock` or
>>> `cgroup_mutex` is held via a lockdep assertion. Previously, for LRU
>>> folios, the memcg pointer was directly embedded and stable under the
>>> folio lock. Existing callers (e.g., in `mm/workingset.c`) relied on
>>> the folio lock for stability. While some callers may hold RCU, others
>>> might not, leading to lockdep warnings or races where `folio_memcg`
>>> returns a pointer to a memcg that is being reparented or
>>> freed. Additionally, the return value of `folio_memcg` is no longer
>>> constant for a locked folio; it can change if reparenting occurs,
>>> potentially breaking logic that assumes identity equality over time.
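>>> In other words, callers would need a pattern like the sketch below,
>>> and must not cache the result across the RCU section:
>>>
>>> ```c
>>> struct mem_cgroup *memcg;
>>>
>>> rcu_read_lock();
>>> memcg = folio_memcg(folio);	/* may change if the folio is reparented */
>>> /* ... use memcg only within this RCU section ... */
>>> rcu_read_unlock();
>>> ```
>>>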
>>> # Commit 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru
>>> maintenance helpers")
>>> No detailed regression report generated.
>>> # Summary
>>> | Commit                                                                                         | Regressions |
>>> | :-------------------------------------------------------------------------------------------- | :---------- |
>>> | e416d881eea4 ("mm: memcontrol: remove dead code of checking parent memory cgroup")            | 0           |
>>> | 8e00ae594254 ("mm: workingset: use folio_lruvec() in workingset_refault()")                   | 0           |
>>> | a272ef87d5e7 ("mm: rename unlock_page_lruvec_irq and its variants")                           | 0           |
>>> | d57d548a3d6b ("mm: vmscan: prepare for the refactoring the move_folios_to_lru()")             | 0           |
>>> | 9b02a45b6fc8 ("mm: vmscan: refactor move_folios_to_lru()")                                    | 1           |
>>> | 057fca991b78 ("mm: memcontrol: allocate object cgroup for non-kmem case")                     | 0           |
>>> | 7c4110a3d8b6 ("mm: memcontrol: return root object cgroup for root memory cgroup")             | 0           |
>>> | 8479f2eef536 ("mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()") | 1           |
>>> | c10b7e11fc09 ("buffer: prevent memory cgroup release in folio_alloc_buffers()")               | 0           |
>>> | 65610d739afc ("writeback: prevent memory cgroup release in writeback module")                 | 2           |
>>> | f9b3cc3aed9f ("mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()")  | 0           |
>>> | 91e4b3924291 ("mm: page_io: prevent memory cgroup release in page_io module")                 | 0           |
>>> | bb45e352bb34 ("mm: migrate: prevent memory cgroup release in folio_migrate_mapping()")        | 0           |
>>> | a1189dd21a56 ("mm: mglru: prevent memory cgroup release in mglru")                            | 1           |
>>> | 4f41e0db1fd8 ("mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()")      | 0           |
>>> | de63e2b7a03e ("mm: workingset: prevent memory cgroup release in lru_gen_eviction()")          | 0           |
>>> | c0cce04fd4dc ("mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()") | 1           |
>>> | 555a447cb5f1 ("mm: zswap: prevent memory cgroup release in zswap_compress()")                 | 0           |
>>> | 80bbd804adde ("mm: workingset: prevent lruvec release in workingset_refault()")               | 1           |
>>> | 9d232388a8e3 ("mm: zswap: prevent lruvec release in zswap_folio_swapin()")                    | 0           |
>>> | d7cb66b9350d ("mm: swap: prevent lruvec release in lru_gen_clear_refs()")                     | 0           |
>>> | 3e71e5543c8f ("mm: workingset: prevent lruvec release in workingset_activation()")            | 0           |
>>> | e765ff303f13 ("mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")            | 0           |
>>> | d04921029e6d ("mm: vmscan: prepare for reparenting traditional LRU folios")                   | 1           |
>>> | dc3b4a08ad35 ("mm: vmscan: prepare for reparenting MGLRU folios")                             | 1           |
>>> | 18791acfe86b ("mm: memcontrol: refactor memcg_reparent_objcgs()")                             | 0           |
>>> | 19f79273a812 ("mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios")  | 2           |
>>> | 44b8e8bca06e ("mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers")                | 0           |

Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 1 week ago
On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
> 
> 
[...]
> > > 
> > > Thank you for running the AI review for this patchset, but please do not
> > > directly send the raw data from the AI review to the community, as this
> > > is no different from automated review by a robot.
> > 
> > Hi Qi,
> > 
> > I don't know why you're so negative towards it. It's been great at
> 
> No, I don't object to having a dedicated robot to do this.
> 
> > finding pretty tricky bugs often missed by human reviewers. In no way
> > is it a replacement for human reviews, but if a robot can find real
> > issues and make the kernel more reliable and safe, I'm in.
> 
> I just think you should do a preliminary review of the AI review results
> instead of sending them out directly. Otherwise, if everyone does this,
> the community will be full of bots.
> 
> No?
> 

We don't want too many bots but we definitely want at least one AI
review bot. Now we have the precedent of the BPF and networking subsystems,
and the results I have seen are really good. I think the MM community needs
to come together and decide on the formalities of an AI review process, and
I see Roman is doing some early experimentation and the results look great.

Shakeel
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Zi Yan 1 month, 1 week ago
On 29 Dec 2025, at 23:48, Shakeel Butt wrote:

> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>
>>
> [...]
>>>>
>>>> Thank you for running the AI review for this patchset, but please do not
>>>> directly send the raw data from the AI review to the community, as this
>>>> is no different from automated review by a robot.
>>>
>>> Hi Qi,
>>>
>>> I don't know why you're so negative towards it. It's been great at
>>
>> No, I don't object to having a dedicated robot to do this.
>>
>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>> is it a replacement for human reviews, but if a robot can find real
>>> issues and make the kernel more reliable and safe, I'm in.
>>
>> I just think you should do a preliminary review of the AI review results
>> instead of sending them out directly. Otherwise, if everyone does this,
>> the community will be full of bots.
>>
>> No?
>>
>
> We don't want too many bots but we definitely want at least one AI
> review bot. Now we have the precedent of the BPF and networking subsystems,
> and the results I have seen are really good. I think the MM community needs
> to come together and decide on the formalities of an AI review process, and
> I see Roman is doing some early experimentation and the results look great.

Do you mind explaining why the result looks great? Does it mean you agree
with the regressions pointed out by the AI review?

If we want to do AI reviews, the process should be improved instead of
just pasting the output from AI. In the initial stage, I think some human
intervention is needed; at least adding some comments to the AI reviews would
be helpful. Otherwise, it looks like you agree completely with AI reviews.
In addition, if “50% of the reported issues are real”, is the AI tossing
a coin when reporting issues?

When I am looking into the prompt part, I have the following questions:

1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
about bpf kfunc parameter validation”)”? I got the actual prompts
from irc: https://github.com/masoncl/review-prompts/tree/main, but it
should be provided along with the review for others to reproduce.

2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
	a. Page/Folio States, Large folios require per-page state tracking for
		Reference counts. I thought we want to get rid of per page refcount.
    b. Migration Invariants, NUMA balancing expects valid PTE combinations.
		PROTNONE PTEs are hardware invalid to trigger fault.
	c. TLB flushes required after PTE modifications. How about spurious fault
		handling?

3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
	but could not find any. What am I missing?



Best Regards,
Yan, Zi
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Zi Yan <ziy@nvidia.com> writes:

> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>
>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>
>>>
>> [...]
>>>>>
>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>> directly send the raw data from the AI review to the community, as this
>>>>> is no different from automated review by a robot.
>>>>
>>>> Hi Qi,
>>>>
>>>> I don't know why you're so negative towards it. It's been great at
>>>
>>> No, I don't object to having a dedicated robot to do this.
>>>
>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>> is it a replacement for human reviews, but if a robot can find real
>>>> issues and make the kernel more reliable and safe, I'm in.
>>>
>>> I just think you should do a preliminary review of the AI review results
>>> instead of sending them out directly. Otherwise, if everyone does this,
>>> the community will be full of bots.
>>>
>>> No?

The problem is that this works only when the AI is obviously wrong,
which is not a large percentage of cases with the latest models.
In my practice with Gemini 3 and Chris Mason's prompts, it is almost
never dead wrong: it's either a real issue or some gray zone.
And you really often need deep expertise and a significant amount
of time to decide if it's real or not, so it's not like you can
assign a single person who can review all AI reviews.

>>>
>>
>> We don't want too many bots but we definitely want at least one AI
>> review bot. Now we have the precedent of the BPF and networking subsystems,
>> and the results I have seen are really good. I think the MM community needs
>> to come together and decide on the formalities of an AI review process, and
>> I see Roman is doing some early experimentation and the results look great.
>
> Do you mind explaining why the result looks great? Does it mean you agree
> with the regressions pointed out by the AI review?
>
> If we want to do AI reviews, the process should be improved instead of
> just pasting the output from AI. In the initial stage, I think some human
> intervention is needed; at least adding some comments to the AI reviews would
> be helpful. Otherwise, it looks like you agree completely with AI reviews.
> In addition, if “50% of the reported issues are real”, is the AI tossing
> a coin when reporting issues?

I said at least 50% in my experience. If there is a 50% chance that
someone is pointing at a real issue in my code, I'd rather look into it
and fix or explain why it's not an issue. Btw, this is exactly how I
learned about this stuff - sent some bpf patches (bpf oom) and got
excited about a number of real issues discovered by ai review.

I agree though that we should not pollute email threads with a number of
AI-generated reports with a similar context.

> When I am looking into the prompt part, I have the following questions:
>
> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
> about bpf kfunc parameter validation”)”? I got the actual prompts
> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
> should be provided along with the review for others to reproduce.

It's a significant amount of text, way too much to include directly in
emails. The SHA from the prompts git should be enough, no?

> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
> 	a. Page/Folio States, Large folios require per-page state tracking for
> 		Reference counts. I thought we want to get rid of per page refcount.
>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
> 		PROTNONE PTEs are hardware invalid to trigger fault.
> 	c. TLB flushes required after PTE modifications. How about spurious fault
> 		handling?
>
> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
> 	but could not find any. What am I missing?

MM and cgroups-specific prompts are definitely in a very early stage.
But to develop/improve them we need data.

Thanks!
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Zi Yan 1 month, 1 week ago
On 30 Dec 2025, at 14:34, Roman Gushchin wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>
>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>> [...]
>>>>>>
>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>> is no different from automated review by a robot.
>>>>>
>>>>> Hi Qi,
>>>>>
>>>>> I don't know why you're so negative towards it. It's been great at
>>>>
>>>> No, I don't object to having a dedicated robot to do this.
>>>>
>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>> is it a replacement for human reviews, but if a robot can find real
>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>
>>>> I just think you should do a preliminary review of the AI review results
>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>> the community will be full of bots.
>>>>
>>>> No?
>
> The problem is that this works only when the AI is obviously wrong,
> which is not a large percentage of cases with the latest models.
> In my practice with Gemini 3 and Chris Mason's prompts, it is almost
> never dead wrong: it's either a real issue or some gray zone.
> And you really often need deep expertise and a significant amount
> of time to decide if it's real or not, so it's not like you can
> assign a single person who can review all AI reviews.
>
>>>>
>>>
>>> We don't want too many bots but we definitely want at least one AI
>>> review bot. Now we have the precedent of the BPF and networking subsystems,
>>> and the results I have seen are really good. I think the MM community needs
>>> to come together and decide on the formalities of an AI review process, and
>>> I see Roman is doing some early experimentation and the results look great.
>>
>> Do you mind explaining why the result looks great? Does it mean you agree
>> with the regressions pointed out by the AI review?
>>
>> If we want to do AI reviews, the process should be improved instead of
>> just pasting the output from AI. In the initial stage, I think some human
>> intervention is needed; at least adding some comments to the AI reviews would
>> be helpful. Otherwise, it looks like you agree completely with AI reviews.
>> In addition, if “50% of the reported issues are real”, is the AI tossing
>> a coin when reporting issues?
>
> I said at least 50% in my experience. If there is a 50% chance that
> someone is pointing at a real issue in my code, I'd rather look into it
> and fix or explain why it's not an issue. Btw, this is exactly how I
> learned about this stuff - sent some bpf patches (bpf oom) and got
> excited about a number of real issues discovered by ai review.
>
> I agree though that we should not pollute email threads with a number of
> AI-generated reports with a similar context.
>
>> When I am looking into the prompt part, I have the following questions:
>>
>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>> about bpf kfunc parameter validation”)”? I got the actual prompts
>> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
>> should be provided along with the review for others to reproduce.
>
> It's a significant amount of text, way too much to include directly in
> emails. The SHA from the prompts git should be enough, no?

I mean that at least the GitHub link should be provided; otherwise, how can
people know the exact prompts?

>
>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
>> 	a. Page/Folio States, Large folios require per-page state tracking for
>> 		Reference counts. I thought we want to get rid of per page refcount.
>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>> 		handling?
>>
>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>> 	but could not find any. What am I missing?
>
> MM and cgroups-specific prompts are definitely in a very early stage.
> But to develop/improve them we need data.

Not just data. You are a maintainer of cgroup, so at least you could add
more cgroup-specific rules to improve the quality of AI reviews.


Best Regards,
Yan, Zi
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 1 week ago
On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
> 
> > On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
> >>
> >>
> > [...]
> >>>>
> >>>> Thank you for running the AI review for this patchset, but please do not
> >>>> directly send the raw data from the AI review to the community, as this
> >>>> is no different from automated review by a robot.
> >>>
> >>> Hi Qi,
> >>>
> >>> I don't know why you're so negative towards it. It's been great at
> >>
> >> No, I don't object to having a dedicated robot to do this.
> >>
> >>> finding pretty tricky bugs often missed by human reviewers. In no way
> >>> is it a replacement for human reviews, but if a robot can find real
> >>> issues and make the kernel more reliable and safe, I'm in.
> >>
> >> I just think you should do a preliminary review of the AI review results
> >> instead of sending them out directly. Otherwise, if everyone does this,
> >> the community will be full of bots.
> >>
> >> No?
> >>
> >
> > We don't want too many bots but we definitely want at least one AI
> > review bot. Now we have the precedent of the BPF and networking subsystems,
> > and the results I have seen are really good. I think the MM community needs
> > to come together and decide on the formalities of an AI review process, and
> > I see Roman is doing some early experimentation and the results look great.
> 
> Do you mind explaining why the result looks great? Does it mean you agree
> with the regressions pointed out by the AI review?

The results look great because the points raised are really
thought-provoking and are things I had not thought about when I reviewed
the series. The lru lock without irq or the possible infinite retry loop in
get_mem_cgroup_from_folio() are two such examples. Are these real
regressions? I am not sure.

> 
> If we want to do AI reviews, the process should be improved instead of
> just pasting the output from AI. In the initial stage, I think some human
> intervention is needed; at least adding some comments to the AI reviews would
> be helpful.

Yes, I agree, and therefore I mentioned that we should discuss how we
(the MM community) should adopt AI reviews.

> Otherwise, it looks like you agree completely with AI reviews.
> In addition, if “50% of the reported issues are real”, is the AI tossing
> a coin when reporting issues?
> 
> When I am looking into the prompt part, I have the following questions:
> 
> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
> about bpf kfunc parameter validation”)”? I got the actual prompts
> from irc: https://github.com/masoncl/review-prompts/tree/main, but it
> should be provided along with the review for others to reproduce.

I agree and I didn't know that Chris's review prompts are used here.

Ccing Chris for your following questions.

> 
> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md, are you sure the patterns are all right?
> 	a. Page/Folio States, Large folios require per-page state tracking for
> 		Reference counts. I thought we want to get rid of per page refcount.
>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
> 		PROTNONE PTEs are hardware invalid to trigger fault.
> 	c. TLB flushes required after PTE modifications. How about spurious fault
> 		handling?
> 
> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
> 	but could not find any. What am I missing?
> 
> 
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Chris Mason 1 month, 1 week ago
On 12/30/25 1:13 PM, Shakeel Butt wrote:
> On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>
>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>
>>>>
>>> [...]
>>>>>>
>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>> is no different from automated review by a robot.
>>>>>
>>>>> Hi Qi,
>>>>>
>>>>> I don't know why you're so negative towards it. It's been great at
>>>>
>>>> No, I don't object to having a dedicated robot to do this.
>>>>
>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>> is it a replacement for human reviews, but if a robot can find real
>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>
>>>> I just think you should do a preliminary review of the AI review results
>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>> the community will be full of bots.

I do think it's awkward to dump the whole review output for the patch
series in a single message.  It looks like there's a sudden jump to XML?
It's better to reply to the individual patches with the comments
inline, which I think is where Roman is trying to go long term.

With BPF, it looks more like this:
https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

>>>>
>>>> No?
>>>>
>>>
>>> We don't want too many bots but we definitely want at least one AI
>>> review bot. Now we have the precedent of the BPF and networking subsystems,
>>> and the results I have seen are really good. I think the MM community needs
>>> to come together and decide on the formalities of an AI review process, and
>>> I see Roman is doing some early experimentation and the results look great.
>>
>> Do you mind explaining why the result looks great? Does it mean you agree
>> with the regressions pointed out by the AI review?
> 
> The results look great because the points raised are really
> thought-provoking and are things I had not thought about when I reviewed
> the series. The lru lock without irq or the possible infinite retry loop in
> get_mem_cgroup_from_folio() are two such examples. Are these real
> regressions? I am not sure.
> 
>>
>> If we want to do AI reviews, the process should be improved instead of
>> just pasting the output from AI. In the initial stage, I think some human
>> intervention is needed; at least adding some comments to the AI reviews would
>> be helpful.
> 
> Yes, I agree, and therefore I mentioned that we should discuss how we
> (the MM community) should adopt AI reviews.

What tends to happen with BPF is the patch author or bpf maintainers
point out problems with the reviews and I fix up the prompts over time.
The false positive rate is ~20% today (measured since late October), and
it's generally declining.

> 
>> Otherwise, it looks like you agree completely with AI reviews.
>> In addition, if “50% of the reported issues are real”, is the AI tossing
>> a coin when reporting issues?
>>
>> When I am looking into the prompt part, I have the following questions:
>>
>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>> about bpf kfunc parameter validation”)”? I got the actual prompts
>> from irc: https://github.com/masoncl/review-prompts/tree/main , but it
>> should be provided along with the review for others to reproduce.
> 
> I agree and I didn't know that Chris's review prompts are used here.
> 
> Ccing Chris for your following questions.
> 
>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>> 	a. Page/Folio States, Large folios require per-page state tracking for
>> 		Reference counts. I thought we want to get rid of per page refcount.

Early in prompt development I hand picked a few hundred patches from
6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
catch the known bugs.  That's where that rule came from, but as you say
there's a risk this information gets old.  Do we want to get rid of per
page refcounts or have we done it?  (more on that at the bottom of the
email).

>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>> 		handling?
>>

AI generally uses them as a starting point and fills in details, but
I agree the MM bits are pretty minimal.

>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>> 	but could not find any. What am I missing?

I think the only cgroup specific information I've needed so far is
explaining css_get() and the section on __GFP_ACCOUNT.  I actively try
to avoid adding details unless we're missing bugs or generating false
positives.

As an example of how I'd fix the prompt if the per page state tracking
were causing problems (and if we didn't want to just remove it), I asked
claude to analyze how it is still used.  The output is below, I'd double
check things as best I could, shorten into prompt form and send to the
list for review.

Per-Page Tracking in Large Folios - Analysis
=============================================

Based on analysis of mm/*.c files and commit history, MM-004's claim is
still partially true - large folios do need per-page tracking for some
bits, though recent work has significantly reduced this.


Bits That Still Require Per-Page Tracking
------------------------------------------

1. PG_hwpoison (include/linux/page-flags.h:118)

   Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
   individual pages within a large folio when hardware memory corruption
   is detected.

   The folio_test_has_hwpoisoned() flag on the second page indicates at
   least one subpage is poisoned, but does not identify which one.

   When splitting a large folio, page_range_has_hwpoisoned() in
   mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
   for each:

       static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
       {
           for (; nr_pages; page++, nr_pages--)
               if (PageHWPoison(page))
                   return true;
           return false;
       }

   Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
   subpages when unmapping or migrating.

2. PG_anon_exclusive (include/linux/page-flags.h:146)

   Per the comment at include/linux/page-flags.h:139-145:

       "Depending on the way an anonymous folio can be mapped into a page
       table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
       THP), PG_anon_exclusive may be set only for the head page or for
       tail pages of an anonymous folio. For now, we only expect it to be
       set on tail pages for PTE-mapped THP."

   Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
   PTE-level mappings, it iterates through each page:

       for (i = 0; i < nr_pages; i++)
           SetPageAnonExclusive(page + i);

   HugeTLB stores this on head page only (see PageAnonExclusive() at
   include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
   per-page tracking.


Recent Changes - Per-Page Mapcount Removed
------------------------------------------

Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
large folios") by David Hildenbrand (March 2025) introduced
CONFIG_NO_PAGE_MAPCOUNT which:

  - Stops maintaining per-page mapcounts in tail pages of large folios
  - Tail page mapcount is now always logically 0 (-1 value)
  - Removed _nr_pages_mapped tracking

This was a significant simplification, but it does not affect the
per-page flag tracking described above.


Flags Stored in Second Page Only (Not Per-Page)
-----------------------------------------------

These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
the entire folio, not individual pages:

  - PG_has_hwpoisoned  - indicates some page in folio is poisoned
  - PG_large_rmappable - folio is rmappable
  - PG_partially_mapped - folio is partially mapped

See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.


Conclusion
----------

While per-page mapcount has been eliminated, PG_hwpoison and
PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
in large folios. MM-004's claim remains valid for these specific bits.

Key source files:
  - include/linux/page-flags.h (flag definitions and accessors)
  - mm/huge_memory.c (folio split handling)
  - mm/rmap.c (reverse mapping with per-page exclusive tracking)
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Zi Yan 1 month, 1 week ago
On 30 Dec 2025, at 14:18, Chris Mason wrote:

> On 12/30/25 1:13 PM, Shakeel Butt wrote:
>> On Tue, Dec 30, 2025 at 11:46:02AM -0500, Zi Yan wrote:
>>> On 29 Dec 2025, at 23:48, Shakeel Butt wrote:
>>>
>>>> On Tue, Dec 30, 2025 at 12:25:31PM +0800, Qi Zheng wrote:
>>>>>
>>>>>
>>>> [...]
>>>>>>>
>>>>>>> Thank you for running the AI review for this patchset, but please do not
>>>>>>> directly send the raw data from the AI review to the community, as this
>>>>>>> is no different from automated review by a robot.
>>>>>>
>>>>>> Hi Qi,
>>>>>>
>>>>>> I don't know why you're so negative towards it. It's been great at
>>>>>
>>>>> No, I don't object to having a dedicated robot to do this.
>>>>>
>>>>>> finding pretty tricky bugs often missed by human reviewers. In no way
>>>>>> is it a replacement for human reviews, but if a robot can find real
>>>>>> issues and make the kernel more reliable and safe, I'm in.
>>>>>
>>>>> I just think you should do a preliminary review of the AI review results
>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>> the community will be full of bots.
>
> I do think it's awkward to dump the whole review output for the patch
> series in a single message.  It looks like there's a sudden jump to XML?
> It's better to reply to the individual patches with the comments
> inline, which I think is where Roman is trying to go long term.
>
> With BPF, it looks more like this:
> https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

These look really good. At least the patch author can easily see the
feedback.

>
>>>>>
>>>>> No?
>>>>>
>>>>
>>>> We don't want too many bots but we definitely want at least one AI
>>>> review bot. Now we have the precedent of the BPF and networking subsystems,
>>>> and the results I have seen are really good. I think the MM community needs
>>>> to come together and decide on the formalities of an AI review process, and
>>>> I see Roman is doing some early experimentation and the results look great.
>>>
>>> Do you mind explaining why the result looks great? Does it mean you agree
>>> with the regressions pointed out by the AI review?
>>
>> The results look great because the points raised are really
>> thought-provoking and are things I had not thought about when I reviewed
>> the series. The lru lock without irq or the possible infinite retry loop in
>> get_mem_cgroup_from_folio() are two such examples. Are these real
>> regressions? I am not sure.
>>
>>>
>>> If we want to do AI reviews, the process should be improved instead of
>>> just pasting the output from AI. In the initial stage, I think some human
>>> intervention is needed; at least adding some comments to the AI reviews would
>>> be helpful.
>>
>> Yes, I agree, and therefore I mentioned that we should discuss how we
>> (the MM community) should adopt AI reviews.
>
> What tends to happen with BPF is the patch author or bpf maintainers
> point out problems with the reviews and I fix up the prompts over time.
> The false positive rate is ~20% today (measured since late October), and
> it's generally declining.

Yeah, I can see bpf.md contains more detailed rules compared to mm.md.

>
>>
>>> Otherwise, it looks like you agree completely with AI reviews.
>>> In addition, if “50% of the reported issues are real”, is the AI tossing
>>> a coin when reporting issues?
>>>
>>> When I am looking into the prompt part, I have the following questions:
>>>
>>> 1. What is “Prompts SHA: 192922ae6bf4 ("bpf.md: adjust the documentation
>>> about bpf kfunc parameter validation”)”? I got the actual prompts
>>> from irc: https://github.com/masoncl/review-prompts/tree/main , but it
>>> should be provided along with the review for others to reproduce.
>>
>> I agree and I didn't know that Chris's review prompts are used here.
>>
>> Ccing Chris for your following questions.
>>
>>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>> 		Reference counts. I thought we want to get rid of per page refcount.
>
> Early in prompt development I hand picked a few hundred patches from
> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
> catch the known bugs.  That's where that rule came from, but as you say
> there's a risk this information gets old.  Do we want to get rid of per
> page refcounts or have we done it?  (more on that at the bottom of the
> email).

willy has covered this part in another email.

>
>>>     b. Migration Invariants, NUMA balancing expects valid PTE combinations.
>>> 		PROTNONE PTEs are hardware invalid to trigger fault.
>>> 	c. TLB flushes required after PTE modifications. How about spurious fault
>>> 		handling?
>>>
>
> AI generally uses them as a starting point and fills in details, but
> I agree the MM bits are pretty minimal.
>
>>> 3. For a cgroup patchset, I was expecting some cgroup specific prompt rules,
>>> 	but could not find any. What am I missing?
>
> I think the only cgroup specific information I've needed so far is
> explaining css_get() and the section on __GFP_ACCOUNT.  I actively try
> to avoid adding details unless we're missing bugs or generating false
> positives.

I assume your review prompts are mainly used for BPF code, so it is understandable
that there are not many MM rules. My concerns above are mainly that the
prompts are used directly on MM patches without adding more MM-specific rules.

Ideally, each subsystem could maintain its own rules in the corresponding
file to get a better outcome. :)

Best Regards,
Yan, Zi
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Matthew Wilcox 1 month, 1 week ago
On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
> >>>> I just think you should do a preliminary review of the AI review results
> >>>> instead of sending them out directly. Otherwise, if everyone does this,
> >>>> the community will be full of bots.
> 
> I do think it's awkward to dump the whole review output for the patch
> series in a single message.  It looks like there's a sudden jump to XML?
> It's better to reply to the individual patches with the comments
> inline, which I think is where Roman is trying to go long term.

I don't know what Roman's trying to do long-term, but his email
that started this thread was so badly written that it was offensive.
Had it been sent to me, I would have responded in the style of Arkell
v Pressdram.

> With BPF, it looks more like this:
> https://lore.kernel.org/bpf/?q=AI+reviewed+your+patch

That's actually useful.

> >>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
> >> 	a. Page/Folio States, Large folios require per-page state tracking for
> >> 		Reference counts. I thought we want to get rid of per page refcount.
> 
> Early in prompt development I hand picked a few hundred patches from
> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
> catch the known bugs.  That's where that rule came from, but as you say
> there's a risk this information gets old.  Do we want to get rid of per
> page refcounts or have we done it?  (more on that at the bottom of the
> email).

There is no such thing as a per-page reference count.  Any attempt to
access the page reference count redirects to the folio refcount.  This
has been the case since 2016 (four years before folios existed).  See
commit ddc58f27f9ee.

We do want to get rid of calls to get_page() and put_page() for a
variety of reasons that will be long and painful to write out.

> As an example of how I'd fix the prompt if the per page state tracking
> were causing problems (and if we didn't want to just remove it), I asked
> claude to analyze how it is still used.  The output is below, I'd double
> check things as best I could, shorten into prompt form and send to the
> list for review.
> 
> Per-Page Tracking in Large Folios - Analysis
> =============================================
> 
> Based on analysis of mm/*.c files and commit history, MM-004's claim is
> still partially true - large folios do need per-page tracking for some
> bits, though recent work has significantly reduced this.
> 
> 
> Bits That Still Require Per-Page Tracking
> ------------------------------------------
> 
> 1. PG_hwpoison (include/linux/page-flags.h:118)
> 
>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>    individual pages within a large folio when hardware memory corruption
>    is detected.
> 
>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>    least one subpage is poisoned, but does not identify which one.
> 
>    When splitting a large folio, page_range_has_hwpoisoned() in
>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>    for each:
> 
>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>        {
>            for (; nr_pages; page++, nr_pages--)
>                if (PageHWPoison(page))
>                    return true;
>            return false;
>        }
> 
>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>    subpages when unmapping or migrating.
> 
> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
> 
>    Per the comment at include/linux/page-flags.h:139-145:
> 
>        "Depending on the way an anonymous folio can be mapped into a page
>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>        THP), PG_anon_exclusive may be set only for the head page or for
>        tail pages of an anonymous folio. For now, we only expect it to be
>        set on tail pages for PTE-mapped THP."
> 
>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>    PTE-level mappings, it iterates through each page:
> 
>        for (i = 0; i < nr_pages; i++)
>            SetPageAnonExclusive(page + i);
> 
>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>    per-page tracking.
> 
> 
> Recent Changes - Per-Page Mapcount Removed
> ------------------------------------------
> 
> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
> large folios") by David Hildenbrand (March 2025) introduced
> CONFIG_NO_PAGE_MAPCOUNT which:
> 
>   - Stops maintaining per-page mapcounts in tail pages of large folios
>   - Tail page mapcount is now always logically 0 (-1 value)
>   - Removed _nr_pages_mapped tracking
> 
> This was a significant simplification, but it does not affect the
> per-page flag tracking described above.
> 
> 
> Flags Stored in Second Page Only (Not Per-Page)
> -----------------------------------------------
> 
> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
> the entire folio, not individual pages:
> 
>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>   - PG_large_rmappable - folio is rmappable
>   - PG_partially_mapped - folio is partially mapped
> 
> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
> 
> 
> Conclusion
> ----------
> 
> While per-page mapcount has been eliminated, PG_hwpoison and
> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
> in large folios. MM-004's claim remains valid for these specific bits.
> 
> Key source files:
>   - include/linux/page-flags.h (flag definitions and accessors)
>   - mm/huge_memory.c (folio split handling)
>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)

This is pretty good and yet dangerously wrong in some missed nuances.
Which probably summarises the state of the art nicely ;-)

To start with, all flags marked as PF_ANY are set on individual pages
rather than only the folio.  So that's currently:

PAGEFLAG(Private, private, PF_ANY)
PAGEFLAG(HWPoison, hwpoison, PF_ANY)
PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
__SETPAGEFLAG(Head, head, PF_ANY)
        return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);

Now, PG_private is a flag we're trying to get rid of -- it should be
identical to (folio->private != NULL), so I haven't made any effort
to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
doesn't talk about PG_private, but a fuller answer would include
mention of this.

PG_hwpoison and PG_anon_exclusive will remain per-page state in a
memdesc world, and there's a plan to handle those, so there's no need to
eliminate them.

PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
know about it.

PG_head has to remain per-page state for now for obvious reasons ;-)
In a memdesc word, there will be no way to ask if a page is the first
page of an allocation, so this flag will not be needed.

I believe there are some subtleties around PG_hwpoison and hugetlb that
are not fully captured above, but I'm not convinced of my ability to
state definitively what they currently are, so I'll leave that for somebody
else to do.

---

Looking through your prompts, there are definitely some conditions that
could be profitably added.  For example, pages which are mapped into
page tables must be PG_uptodate (we have various assertions in the MM
code that this is true and they occasionally trigger).
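
For instance, the invariant is the one that assertions of roughly this
shape enforce at map time (a sketch; the exact sites vary):

	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);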
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Matthew Wilcox <willy@infradead.org> writes:

> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>> >>>> I just think you should do a preliminary review of the AI review results
>> >>>> instead of sending them out directly. Otherwise, if everyone does this,
>> >>>> the community will be full of bots.
>> 
>> I do think it's awkward to dump the whole review output for the patch
>> series in a single message.  It looks like there's a sudden jump to
>> XML?
>> It's better to reply to the individual patches with the comments
>> inline, which I think is where Roman is trying to go long term.
>
> I don't know what Roman's trying to do long-term, but his email
> that started this thread was so badly written that it was offensive.
> Had it been sent to me, I would have responded in the style of Arkell
> v Pressdram.

It felt awkward to send a bunch of emails from myself, all beginning
with the "I ran the ai review and here is the output" header.
Once we have a bot, obviously it's better to answer individual emails,
as the bpf subsystem does.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Chris Mason 1 month, 1 week ago
On 12/30/25 3:51 PM, Matthew Wilcox wrote:
> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>>>>>> I just think you should do a preliminary review of the AI review results
>>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>>> the community will be full of bots.
>>>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>>> 		Reference counts. I thought we want to get rid of per page refcount.
>>

[ ... ]

>> Early in prompt development I hand picked a few hundred patches from
>> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
>> catch the known bugs.  That's where that rule came from, but as you say
>> there's a risk this information gets old.  Do we want to get rid of per
>> page refcounts or have we done it?  (more on that at the bottom of the
>> email).
> 
> There is no such thing as a per-page reference count.  Any attempt to
> access the page reference count redirects to the folio refcount.  This
> has been the case since 2016 (four years before folios existed).  See
> commit ddc58f27f9ee.
> 
Ok, I'm half out the door to vacation, but I'll fix up the mm.md to
better reflect reality when I get back.

> We do want to get rid of calls to get_page() and put_page() for a
> variety of reasons that will be long and painful to write out.
> 
>> As an example of how I'd fix the prompt if the per page state tracking
>> were causing problems (and if we didn't want to just remove it), I asked
>> claude to analyze how it is still used.  The output is below, I'd double
>> check things as best I could, shorten into prompt form and send to the
>> list for review.
>>
>> Per-Page Tracking in Large Folios - Analysis
>> =============================================
>>
>> Based on analysis of mm/*.c files and commit history, MM-004's claim is
>> still partially true - large folios do need per-page tracking for some
>> bits, though recent work has significantly reduced this.
>>
>>
>> Bits That Still Require Per-Page Tracking
>> ------------------------------------------
>>
>> 1. PG_hwpoison (include/linux/page-flags.h:118)
>>
>>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>>    individual pages within a large folio when hardware memory corruption
>>    is detected.
>>
>>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>>    least one subpage is poisoned, but does not identify which one.
>>
>>    When splitting a large folio, page_range_has_hwpoisoned() in
>>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>>    for each:
>>
>>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>        {
>>            for (; nr_pages; page++, nr_pages--)
>>                if (PageHWPoison(page))
>>                    return true;
>>            return false;
>>        }
>>
>>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>>    subpages when unmapping or migrating.
>>
>> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
>>
>>    Per the comment at include/linux/page-flags.h:139-145:
>>
>>        "Depending on the way an anonymous folio can be mapped into a page
>>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>>        THP), PG_anon_exclusive may be set only for the head page or for
>>        tail pages of an anonymous folio. For now, we only expect it to be
>>        set on tail pages for PTE-mapped THP."
>>
>>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>>    PTE-level mappings, it iterates through each page:
>>
>>        for (i = 0; i < nr_pages; i++)
>>            SetPageAnonExclusive(page + i);
>>
>>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>>    per-page tracking.
>>
>>
>> Recent Changes - Per-Page Mapcount Removed
>> ------------------------------------------
>>
>> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
>> large folios") by David Hildenbrand (March 2025) introduced
>> CONFIG_NO_PAGE_MAPCOUNT which:
>>
>>   - Stops maintaining per-page mapcounts in tail pages of large folios
>>   - Tail page mapcount is now always logically 0 (-1 value)
>>   - Removed _nr_pages_mapped tracking
>>
>> This was a significant simplification, but it does not affect the
>> per-page flag tracking described above.
>>
>>
>> Flags Stored in Second Page Only (Not Per-Page)
>> -----------------------------------------------
>>
>> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
>> the entire folio, not individual pages:
>>
>>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>>   - PG_large_rmappable - folio is rmappable
>>   - PG_partially_mapped - folio is partially mapped
>>
>> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
>>
>>
>> Conclusion
>> ----------
>>
>> While per-page mapcount has been eliminated, PG_hwpoison and
>> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
>> in large folios. MM-004's claim remains valid for these specific bits.
>>
>> Key source files:
>>   - include/linux/page-flags.h (flag definitions and accessors)
>>   - mm/huge_memory.c (folio split handling)
>>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)
> 
> This is pretty good and yet dangerously wrong in some missed nuances.
> Which probably summarises the state of the art nicely ;-)
> 

Yeah, that's generally how it goes.  It's a good starting point, but the
details have to get verified.

> To start with, all flags marked as PF_ANY are set on individual pages
> rather than only the folio.  So that's currently:
> 
> PAGEFLAG(Private, private, PF_ANY)
> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
> PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
> __SETPAGEFLAG(Head, head, PF_ANY)
>         return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
> 
> Now, PG_private is a flag we're trying to get rid of -- it should be
> identical to (folio->private != NULL), so I haven't made any effort
> to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
> doesn't talk about PG_private, but a fuller answer would include
> mention of this.
> 
> PG_hwpoison and PG_anon_exclusive will remain per-page state in a
> memdesc world, and there's a plan to handle those, so there's no need to
> eliminate them.
> 
> PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
> know about it.
> 
> PG_head has to remain per-page state for now for obvious reasons ;-)
> In a memdesc world, there will be no way to ask if a page is the first
> page of an allocation, so this flag will not be needed.
> 
> I believe there are some subtleties around PG_hwpoison and hugetlb that
> are not fully captured above, but I'm not convinced of my ability to
> state definitively what they currently are, so I'll leave that for somebody
> else to do.

Thanks for taking the time to debug the output.  I think before trying
to put this into the prompt, I'd step back and ask:

- What bugs do we want AI to catch?  I can see knowing these large folio
details really helping find bugs in the transition, or debug bug reports
down the line, so it feels like an important detail to record.  It's
definitely something AI won't know all by itself.

- What details is AI getting wrong in current reviews?  We don't really
have this answer yet, but if AI isn't getting it wrong, there's no
reason to try and teach it more.

-chris

Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Roman Gushchin 1 month, 1 week ago
Chris Mason <clm@meta.com> writes:

> On 12/30/25 3:51 PM, Matthew Wilcox wrote:
>> On Tue, Dec 30, 2025 at 02:18:51PM -0500, Chris Mason wrote:
>>>>>>> I just think you should do a preliminary review of the AI review results
>>>>>>> instead of sending them out directly. Otherwise, if everyone does this,
>>>>>>> the community will be full of bots.
>>>>>> 2. Looking at the mm prompt: https://github.com/masoncl/review-prompts/blob/main/mm.md , are you sure the patterns are all right?
>>>>> 	a. Page/Folio States, Large folios require per-page state tracking for
>>>>> 		Reference counts. I thought we want to get rid of per page refcount.
>>>
>
> [ ... ]
>
>>> Early in prompt development I hand picked a few hundred patches from
>>> 6.16 fixing bugs, and I iterated on these adding subsystem knowledge to
>>> catch the known bugs.  That's where that rule came from, but as you say
>>> there's a risk this information gets old.  Do we want to get rid of per
>>> page refcounts or have we done it?  (more on that at the bottom of the
>>> email).
>> 
>> There is no such thing as a per-page reference count.  Any attempt to
>> access the page reference count redirects to the folio refcount.  This
>> has been the case since 2016 (four years before folios existed).  See
>> commit ddc58f27f9ee.
>> 
> Ok, I'm half out the door to vacation, but I'll fix up the mm.md to
> better reflect reality when I get back.
>
>> We do want to get rid of calls to get_page() and put_page() for a
>> variety of reasons that will be long and painful to write out.
>> 
>>> As an example of how I'd fix the prompt if the per page state tracking
>>> were causing problems (and if we didn't want to just remove it), I asked
>>> claude to analyze how it is still used.  The output is below, I'd double
>>> check things as best I could, shorten into prompt form and send to the
>>> list for review.
>>>
>>> Per-Page Tracking in Large Folios - Analysis
>>> =============================================
>>>
>>> Based on analysis of mm/*.c files and commit history, MM-004's claim is
>>> still partially true - large folios do need per-page tracking for some
>>> bits, though recent work has significantly reduced this.
>>>
>>>
>>> Bits That Still Require Per-Page Tracking
>>> ------------------------------------------
>>>
>>> 1. PG_hwpoison (include/linux/page-flags.h:118)
>>>
>>>    Defined as PAGEFLAG(HWPoison, hwpoison, PF_ANY), this flag is set on
>>>    individual pages within a large folio when hardware memory corruption
>>>    is detected.
>>>
>>>    The folio_test_has_hwpoisoned() flag on the second page indicates at
>>>    least one subpage is poisoned, but does not identify which one.
>>>
>>>    When splitting a large folio, page_range_has_hwpoisoned() in
>>>    mm/huge_memory.c:3467 iterates through pages checking PageHWPoison()
>>>    for each:
>>>
>>>        static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>        {
>>>            for (; nr_pages; page++, nr_pages--)
>>>                if (PageHWPoison(page))
>>>                    return true;
>>>            return false;
>>>        }
>>>
>>>    Used in rmap code (mm/rmap.c:1990, 2070, 2473) to check individual
>>>    subpages when unmapping or migrating.
>>>
>>> 2. PG_anon_exclusive (include/linux/page-flags.h:146)
>>>
>>>    Per the comment at include/linux/page-flags.h:139-145:
>>>
>>>        "Depending on the way an anonymous folio can be mapped into a page
>>>        table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
>>>        THP), PG_anon_exclusive may be set only for the head page or for
>>>        tail pages of an anonymous folio. For now, we only expect it to be
>>>        set on tail pages for PTE-mapped THP."
>>>
>>>    Used at mm/rmap.c:1408-1416: when RMAP_EXCLUSIVE flag is set for
>>>    PTE-level mappings, it iterates through each page:
>>>
>>>        for (i = 0; i < nr_pages; i++)
>>>            SetPageAnonExclusive(page + i);
>>>
>>>    HugeTLB stores this on head page only (see PageAnonExclusive() at
>>>    include/linux/page-flags.h:1153-1162), but PTE-mapped THP needs
>>>    per-page tracking.
>>>
>>>
>>> Recent Changes - Per-Page Mapcount Removed
>>> ------------------------------------------
>>>
>>> Commit 749492229e3bd ("mm: stop maintaining the per-page mapcount of
>>> large folios") by David Hildenbrand (March 2025) introduced
>>> CONFIG_NO_PAGE_MAPCOUNT which:
>>>
>>>   - Stops maintaining per-page mapcounts in tail pages of large folios
>>>   - Tail page mapcount is now always logically 0 (-1 value)
>>>   - Removed _nr_pages_mapped tracking
>>>
>>> This was a significant simplification, but it does not affect the
>>> per-page flag tracking described above.
>>>
>>>
>>> Flags Stored in Second Page Only (Not Per-Page)
>>> -----------------------------------------------
>>>
>>> These are stored in the first tail page (FOLIO_SECOND_PAGE) and apply to
>>> the entire folio, not individual pages:
>>>
>>>   - PG_has_hwpoisoned  - indicates some page in folio is poisoned
>>>   - PG_large_rmappable - folio is rmappable
>>>   - PG_partially_mapped - folio is partially mapped
>>>
>>> See PAGE_FLAGS_SECOND definition at include/linux/page-flags.h:1218-1220.
>>>
>>>
>>> Conclusion
>>> ----------
>>>
>>> While per-page mapcount has been eliminated, PG_hwpoison and
>>> PG_anon_exclusive (for PTE-mapped THP) still require per-page tracking
>>> in large folios. MM-004's claim remains valid for these specific bits.
>>>
>>> Key source files:
>>>   - include/linux/page-flags.h (flag definitions and accessors)
>>>   - mm/huge_memory.c (folio split handling)
>>>   - mm/rmap.c (reverse mapping with per-page exclusive tracking)
>> 
>> This is pretty good and yet dangerously wrong in some missed nuances.
>> Which probably summarises the state of the art nicely ;-)
>> 
>
> Yeah, that's generally how it goes.  It's a good starting point, but the
> details have to get verified.
>
>> To start with, all flags marked as PF_ANY are set on individual pages
>> rather than only the folio.  So that's currently:
>> 
>> PAGEFLAG(Private, private, PF_ANY)
>> PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>> PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY)
>> __SETPAGEFLAG(Head, head, PF_ANY)
>>         return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags.f);
>> 
>> Now, PG_private is a flag we're trying to get rid of -- it should be
>> identical to (folio->private != NULL), so I haven't made any effort
>> to convert that from being PF_ANY.  I'm not too unhappy that your chatbot
>> doesn't talk about PG_private, but a more full answer would include
>> mention of this.
>> 
>> PG_hwpoison and PG_anon_exclusive will remain per-page state in a
>> memdesc world, and there's a plan to handle those, so there's no need to
>> eliminate them.
>> 
>> PG_vmemmap_self_hosted is a very, very internal flag.  It's OK to not
>> know about it.
>> 
>> PG_head has to remain per-page state for now for obvious reasons ;-)
>> In a memdesc world, there will be no way to ask if a page is the first
>> page of an allocation, so this flag will not be needed.
>> 
>> I believe there are some subtleties around PG_hwpoison and hugetlb that
>> are not fully captured above, but I'm not convinced of my ability to
>> state definitively what they currently are, so I'll leave that for somebody
>> else to do.
>
> Thanks for taking the time to debug the output.  I think before trying
> to put this into the prompt, I'd step back and ask:
>
> - What bugs do we want AI to catch?  I can see knowing these large folio
> details really helping find bugs in the transition, or debug bug reports
> down the line, so it feels like an important detail to record.  It's
> definitely something AI won't know all by itself.
>
> - What details is AI getting wrong in current reviews?  We don't really
> have this answer yet, but if AI isn't getting it wrong, there's no
> reason to try and teach it more.

Also, we probably don't want to hard-code the current state of the art,
as ideally we should be able to review old patches as well (e.g. on top
of LTS or custom private trees). So in the perfect world we want to provide
some meta-ideas on how the LLM can decode the rules from the code itself.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Michal Hocko 3 weeks, 5 days ago
I can see that Johannes, Shakeel and others have done a review and also
suggested some minor improvements/modifications. Are you planning to
consolidate those and repost anytime soon?

Thanks
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 weeks, 4 days ago
Hi Michal,

On 1/12/26 10:21 PM, Michal Hocko wrote:
> I can see that Johannes, Shakeel and others have done a review and also
> suggested some minor improvements/modifications. Are you planning to

Yes. I'm working on them (mainly non-hierarchical stats issue reported
by Yosry Ahmed).

> consolidate those and repost anytime soon?

I'm testing it locally and will try to release v3 this week.

Thanks,
Qi

> 
> Thanks
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 3 weeks, 3 days ago
Hi Qi,

On Tue, Jan 13, 2026 at 10:34:08AM +0800, Qi Zheng wrote:
> Hi Michal,
> 
> On 1/12/26 10:21 PM, Michal Hocko wrote:
> > I can see that Johannes, Shakeel and others have done a review and also
> > suggested some minor improvements/modifications. Are you planning to
> 
> Yes. I'm working on them (mainly non-hierarchical stats issue reported
> by Yosry Ahmed).
> 
> > consolidate those and repost anytime soon?
> 
> I'm testing it locally and will try to release v3 this week.

Please take a look at the AI review posted by Roman. IMO it does have very
interesting comments, so at least go through them and check if those
make sense. Please point out if something doesn't make sense, which will
help us to further improve the AI review process.

thanks,
Shakeel
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 weeks, 3 days ago

On 1/14/26 2:02 AM, Shakeel Butt wrote:
> Hi Qi,
> 
> On Tue, Jan 13, 2026 at 10:34:08AM +0800, Qi Zheng wrote:
>> Hi Michal,
>>
>> On 1/12/26 10:21 PM, Michal Hocko wrote:
>>> I can see that Johannes, Shakeel and others have done a review and also
>>> suggested some minor improvements/modifications. Are you planning to
>>
>> Yes. I'm working on them (mainly non-hierarchical stats issue reported
>> by Yosry Ahmed).
>>
>>> consolidate those and repost anytime soon?
>>
>> I'm testing it locally and will try to release v3 this week.
> 
> Please take a look at the AI review posted by Roman. IMO it does have very
> interesting comments, so at least go through them and check if those

To be honest, until an official AI bot is developed, I don't want to see
any raw AI comment results unless the sender has already reviewed them.

> makes sense. Please point out if something doesn't make sense which we
> help us to further improve the AI review process.

That makes sense. The v3 fixed some issues, so let Roman scan it again
with the AI bot based on v3.

Thanks,
Qi

> 
> thanks,
> Shakeel
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Michal Hocko 3 weeks, 4 days ago
On Tue 13-01-26 10:34:08, Qi Zheng wrote:
> Hi Michal,
> 
> On 1/12/26 10:21 PM, Michal Hocko wrote:
> > I can see that Johannes, Shakeel and others have done a review and also
> > suggested some minor improvements/modifications. Are you planning to
> 
> Yes. I'm working on them (mainly non-hierarchical stats issue reported
> by Yosry Ahmed).
> 
> > consolidate those and repost anytime soon?
> 
> I'm testing it locally and will try to release v3 this week.

Great. I will wait for the v3 then.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Yosry Ahmed 1 month, 2 weeks ago
On Wed, Dec 17, 2025 at 03:27:24PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Changes in v2:
>  - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
>    (pointed by Harry Yoo)
>  - recheck objcg in [PATCH v2 07/28] (pointed by Harry Yoo)
>  - modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
>    (pointed by Harry Yoo)
>  - use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed by Harry Yoo)
>  - fix bad unlock balance warning in [PATCH v2 23/28]
>  - change nr_pages type to long in [PATCH v2 25/28] (pointed by Harry Yoo)
>  - incease mm_state->seq during reparenting to make mm walker work properly in
>    [PATCH v2 25/28] (pointed by Harry Yoo)
>  - add [PATCH v2 18/28] to fix WARNING in folio_memcg() (pointed by Harry Yoo)
>  - collect Reviewed-bys
>  - rebase onto the next-20251216
> 
> Changes in v1:
>  - drop [PATCH RFC 02/28]
>  - drop THP split queue related part, which has been merged as a separate
>    patchset[2]
>  - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
>    [PATCH v1 16/26]
>  - Separate the reparenting function of traditional LRU folios to [PATCH v1 22/26]
>  - adapted to the MGLRU scenarios in [PATCH v1 23/26]
>  - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
>  - collect Acked-bys and Reviewed-bys
>  - rebase onto the next-20251028
> 
> Hi all,
> 
> Introduction
> ============
> 
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
> 
> Background
> ==========
> 
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
> 
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continues to hold a reference to the original memory cgroup
> until reclaimed.
> 
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
> 
> Fundamentals
> ============
> 
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
> 
> 1. The first step  to be taken is to identify all users of both functions
>    (folio_memcg() and folio_lruvec()) who are not concerned about binding
>    stability and implement appropriate measures (such as holding a RCU read
>    lock or temporarily obtaining a reference to the memory cgroup for a
>    brief period) to prevent the release of the memory cgroup.
> 
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>    how to ensure the binding stability from the user's perspective of
>    folio_lruvec().
> 
>    struct lruvec *folio_lruvec_lock(struct folio *folio)
>    {
>            struct lruvec *lruvec;
> 
>            rcu_read_lock();
>    retry:
>            lruvec = folio_lruvec(folio);
>            spin_lock(&lruvec->lru_lock);
>            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>                    spin_unlock(&lruvec->lru_lock);
>                    goto retry;
>            }
> 
>            return lruvec;
>    }
> 
>    From the perspective of memory cgroup removal, the entire reparenting
>    process (altering the binding relationship between folio and its memory
>    cgroup and moving the LRU lists to its parental memory cgroup) should be
>    carried out under both the lruvec lock of the memory cgroup being removed
>    and the lruvec lock of its parent.
> 
> 3. Finally, transfer the LRU pages to the object cgroup without holding a
>    reference to the original memory cgroup.

I think there might be a problem with non-hierarchical stats on cgroup
v1; I brought it up previously [*]. I am not sure if this was addressed,
but I couldn't immediately find anything.

In short, if memory is charged to a dying cgroup at the time of
reparenting, when the memory gets uncharged the stats updates will occur
at the parent. This will update both hierarchical and non-hierarchical
stats of the parent, which would corrupt the parent's non-hierarchical
stats (because those counters were never incremented when the memory was
charged).

I didn't track down which stats are affected by this, but off the top of
my head I think all stats tracking anon, file, etc.

The obvious solution is to flush and reparent the stats of a dying memcg
during reparenting, but I don't think this entirely fixes the problem
because the dying memcg stats can still be updated after its reparenting
(e.g. if a ref to the memcg has been held since before reparenting).

AFAICT, the stats of the dying memcg are only stable at release time,
but reparenting the stats at that point means that we have a potentially
large window (between reparenting and release) where the parent
non-hierarchical stats will be wrong and could even underflow.

[*]https://lore.kernel.org/all/CAJD7tkazvC+kZgGaV3idapQp-zPFaWBxoHwnrqTFoodHZGQcPA@mail.gmail.com/
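
To make the corruption concrete, here is a toy userspace model of the two
kinds of counters (all names are illustrative; this is not the kernel's
mm/memcontrol.c code): a charge at the child followed, after reparenting,
by an uncharge at the parent.

    #include <stdio.h>

    /* Toy model: state_local is the non-hierarchical counter, state the
     * hierarchical one that also covers descendants. */
    struct memcg {
            long state_local;
            long state;
            struct memcg *parent;
    };

    static void mod_state(struct memcg *m, long nr)
    {
            m->state_local += nr;           /* this memcg only */
            for (; m; m = m->parent)
                    m->state += nr;         /* propagates to ancestors */
    }

    int main(void)
    {
            struct memcg parent = { 0, 0, NULL };
            struct memcg child = { 0, 0, &parent };

            mod_state(&child, 1);   /* file page charged while child is alive */
            /* child is rmdir'ed: its folios get reparented, its stats do not */
            mod_state(&parent, -1); /* the uncharge now lands on the parent */

            printf("parent state_local = %ld\n", parent.state_local); /* -1 */
            printf("parent state       = %ld\n", parent.state);       /*  0 */
            return 0;
    }

The hierarchical counter stays correct because it was already propagated to
the parent at charge time; only the parent's non-hierarchical counter
underflows.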

> 
> Effect
> ======
> 
> Finally, it can be observed that the quantity of dying memory cgroups will
> not experience a significant increase if the following test script is
> executed to reproduce the issue.
> 
> ```bash
> #!/bin/bash
> 
> # Create a temporary file 'temp' filled with zero bytes
> dd if=/dev/zero of=temp bs=4096 count=1
> 
> # Display memory-cgroup info from /proc/cgroups
> cat /proc/cgroups | grep memory
> 
> for i in {0..2000}
> do
>     mkdir /sys/fs/cgroup/memory/test$i
>     echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> 
>     # Append 'temp' file content to 'log'
>     cat temp >> log
> 
>     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> 
>     # Potentially create a dying memory cgroup
>     rmdir /sys/fs/cgroup/memory/test$i
> done
> 
> # Display memory-cgroup info after test
> cat /proc/cgroups | grep memory
> 
> rm -f temp log
> ```
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1].https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> [2].https://lwn.net/Articles/895431/
> [3].https://github.com/systemd/systemd/pull/36827
> 
> Muchun Song (22):
>   mm: memcontrol: remove dead code of checking parent memory cgroup
>   mm: workingset: use folio_lruvec() in workingset_refault()
>   mm: rename unlock_page_lruvec_irq and its variants
>   mm: vmscan: refactor move_folios_to_lru()
>   mm: memcontrol: allocate object cgroup for non-kmem case
>   mm: memcontrol: return root object cgroup for root memory cgroup
>   mm: memcontrol: prevent memory cgroup release in
>     get_mem_cgroup_from_folio()
>   buffer: prevent memory cgroup release in folio_alloc_buffers()
>   writeback: prevent memory cgroup release in writeback module
>   mm: memcontrol: prevent memory cgroup release in
>     count_memcg_folio_events()
>   mm: page_io: prevent memory cgroup release in page_io module
>   mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>   mm: mglru: prevent memory cgroup release in mglru
>   mm: memcontrol: prevent memory cgroup release in
>     mem_cgroup_swap_full()
>   mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>   mm: workingset: prevent lruvec release in workingset_refault()
>   mm: zswap: prevent lruvec release in zswap_folio_swapin()
>   mm: swap: prevent lruvec release in lru_gen_clear_refs()
>   mm: workingset: prevent lruvec release in workingset_activation()
>   mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>   mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
>     folios
>   mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
> 
> Qi Zheng (6):
>   mm: vmscan: prepare for the refactoring the move_folios_to_lru()
>   mm: thp: prevent memory cgroup release in
>     folio_split_queue_lock{_irqsave}()
>   mm: zswap: prevent memory cgroup release in zswap_compress()
>   mm: vmscan: prepare for reparenting traditional LRU folios
>   mm: vmscan: prepare for reparenting MGLRU folios
>   mm: memcontrol: refactor memcg_reparent_objcgs()
> 
>  fs/buffer.c                      |   4 +-
>  fs/fs-writeback.c                |  22 +-
>  include/linux/memcontrol.h       | 159 ++++++------
>  include/linux/mm_inline.h        |   6 +
>  include/linux/mmzone.h           |  20 ++
>  include/trace/events/writeback.h |   3 +
>  mm/compaction.c                  |  43 +++-
>  mm/huge_memory.c                 |  18 +-
>  mm/memcontrol-v1.c               |  15 +-
>  mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
>  mm/migrate.c                     |   2 +
>  mm/mlock.c                       |   2 +-
>  mm/page_io.c                     |   8 +-
>  mm/percpu.c                      |   2 +-
>  mm/shrinker.c                    |   6 +-
>  mm/swap.c                        |  20 +-
>  mm/vmscan.c                      | 267 ++++++++++++++++----
>  mm/workingset.c                  |  26 +-
>  mm/zswap.c                       |   5 +
>  19 files changed, 677 insertions(+), 356 deletions(-)
> 
> -- 
> 2.20.1
> 
>
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Harry Yoo 1 month, 2 weeks ago
On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> I think there might be a problem with non-hierarchical stats on cgroup
> v1, I brought it up previously [*]. I am not sure if this was addressed
> but I couldn't immediately find anything.

Hi, Yosry. Thanks for bringing this up!

> In short, if memory is charged to a dying cgroup at the time of
> reparenting, when the memory gets uncharged the stats updates will occur
> at the parent. This will update both hierarchical and non-hierarchical
> stats of the parent, which would corrupt the parent's non-hierarchical
> stats (because those counters were never incremented when the memory was
> charged).

Hmm, I wonder if this only applies to LRU pages.

In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
because we already reparent objcgs, or am I missing something?

> [*]https://lore.kernel.org/all/CAJD7tkazvC+kZgGaV3idapQp-zPFaWBxoHwnrqTFoodHZGQcPA@mail.gmail.com/
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Yosry Ahmed 1 month, 2 weeks ago
December 24, 2025 at 12:43 AM, "Harry Yoo" <harry.yoo@oracle.com> wrote:


> 
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> 
> > 
> > I think there might be a problem with non-hierarchical stats on cgroup
> >  v1, I brought it up previously [*]. I am not sure if this was addressed
> >  but I couldn't immediately find anything.
> > 
> Hi, Yosry. Thanks for bringing this up!
> 
> > 
> > In short, if memory is charged to a dying cgroup at the time of
> >  reparenting, when the memory gets uncharged the stats updates will occur
> >  at the parent. This will update both hierarchical and non-hierarchical
> >  stats of the parent, which would corrupt the parent's non-hierarchical
> >  stats (because those counters were never incremented when the memory was
> >  charged).
> > 
> Hmm, I wonder if this only applies to LRU pages.
> 
> In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
> because we already reparent objcgs, or am I missing something?

We do, but we don't expose these stats in cgroup v1, and we don't expose non-hierarchical stats in cgroup v2.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Harry Yoo 1 month, 1 week ago
On Wed, Dec 24, 2025 at 02:51:12PM +0000, Yosry Ahmed wrote:
> December 24, 2025 at 12:43 AM, "Harry Yoo" <harry.yoo@oracle.com> wrote:
> 
> 
> > 
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > 
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > >  v1, I brought it up previously [*]. I am not sure if this was addressed
> > >  but I couldn't immediately find anything.
> > > 
> > Hi, Yosry. Thanks for bringing this up!
> > 
> > > 
> > > In short, if memory is charged to a dying cgroup at the time of
> > >  reparenting, when the memory gets uncharged the stats updates will occur
> > >  at the parent. This will update both hierarchical and non-hierarchical
> > >  stats of the parent, which would corrupt the parent's non-hierarchical
> > >  stats (because those counters were never incremented when the memory was
> > >  charged).
> > > 
> > Hmm, I wonder if this only applies to LRU pages.
> > 
> > In theory we should have this problem for NR_SLAB{UN,}RECLAIMABLE_B
> > because we already reparent objcgs, or am I missing something?
> 
> We do, but we don't expose these stats in cgroup v1, and we don't expose non-hierarchical stats in cgroup v2.

Oops, right.
I was missing that. Thanks!

-- 
Cheers,
Harry / Hyeonggon
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 2 weeks ago
On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
[...]
> 
> I think there might be a problem with non-hierarchical stats on cgroup
> v1, I brought it up previously [*]. I am not sure if this was addressed
> but I couldn't immediately find anything.

Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.

> 
> In short, if memory is charged to a dying cgroup 

Not sure why stats updates for a dying cgroup are related. Isn't it simply
that a stat increase at the child memcg followed by a stat decrease at the
parent memcg could show a negative stat_local at the parent?

> at the time of
> reparenting, when the memory gets uncharged the stats updates will occur
> at the parent. This will update both hierarchical and non-hierarchical
> stats of the parent, which would corrupt the parent's non-hierarchical
> stats (because those counters were never incremented when the memory was
> charged).
> 
> I didn't track down which stats are affected by this, but off the top of
> my head I think all stats tracking anon, file, etc.

Let's start with what specific stats might be affected. First, the stats
which are monotonically increasing should be fine, like
WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.

So, the following ones are the interesting ones:

NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.

> 
> The obvious solution is to flush and reparent the stats of a dying memcg
> during reparenting,

Again, not sure how flushing will help here, and what do you mean by
'reparent the stats'? Do you mean something like:

parent->vmstats->state_local += child->vmstats->state_local;

Hmm this seems fine and I think it should work.

> but I don't think this entirely fixes the problem
> because the dying memcg stats can still be updated after its reparenting
> (e.g. if a ref to the memcg has been held since before reparenting).

How can dying memcg stats still be updated after reparenting? The
stats which we care about are the anon & file memory, and this series is
reparenting them, so a dying memcg will not see stats updates unless there
is a concurrent update happening, and I think it is very easy to avoid
such a situation by putting a grace period between reparenting the
file/anon folios and reparenting the dying child's stats_local. Am I missing
something?
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 1 week ago

On 12/24/25 7:20 AM, Shakeel Butt wrote:
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> [...]
>>
>> I think there might be a problem with non-hierarchical stats on cgroup
>> v1, I brought it up previously [*]. I am not sure if this was addressed
>> but I couldn't immediately find anything.
> 
> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.

The memcg-v1 was originally planned to be removed; could we skip
supporting v1?

>
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Harry Yoo 1 month, 1 week ago
On Mon, Dec 29, 2025 at 03:48:26PM +0800, Qi Zheng wrote:
> 
> 
> On 12/24/25 7:20 AM, Shakeel Butt wrote:
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > [...]
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > but I couldn't immediately find anything.
> > 
> > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> 
> The memcg-v1 was originally planned to be removed, could we skip
> supporting v1?

You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?

That may work, but IMHO given that there is no clear timeline for removal
yet (some v1-specific features have been officially deprecated,
but memcg v1 as a whole hasn't), implementing Shakeel's suggestion [1]
may be a good option (it doesn't seem to add much complexity).

But it can be argued that this is not a good enough reason to
delay the series; adding support for v1 later sounds sensible to me.

Anyway I'm fine either way and I'm not a memcg maintainer myself,
so this is just my two cents.

[1] https://lore.kernel.org/linux-mm/wvj4w7ifmrifnh5bvftdziudsj52fdnwlhbt2oifwmxmi4eore@ob3mrfahhnm5/

-- 
Cheers,
Harry / Hyeonggon
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Michal Koutný 1 month, 1 week ago
On Mon, Dec 29, 2025 at 06:35:22PM +0900, Harry Yoo <harry.yoo@oracle.com> wrote:
> > The memcg-v1 was originally planned to be removed, could we skip
> > supporting v1?
> 
> You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?

More precisely it should be dynamic
	!cgroup_subsys_on_dfl(memory_cgrp_subsys)

> That may work,

But would it make the code actually simpler? (Keeping two versions at
multiple places vs transferring local stats at the right moment.)

Thanks,
Michal
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 1 week ago

On 12/29/25 5:35 PM, Harry Yoo wrote:
> On Mon, Dec 29, 2025 at 03:48:26PM +0800, Qi Zheng wrote:
>>
>>
>> On 12/24/25 7:20 AM, Shakeel Butt wrote:
>>> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
>>> [...]
>>>>
>>>> I think there might be a problem with non-hierarchical stats on cgroup
>>>> v1, I brought it up previously [*]. I am not sure if this was addressed
>>>> but I couldn't immediately find anything.
>>>
>>> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
>>
>> The memcg-v1 was originally planned to be removed, could we skip
>> supporting v1?
> 
> You mean not reparenting LRU pages if CONFIG_MEMCG_V1 is set?
> 
> That may work, but IMHO given that there is no clear timeline for removal
> yet (some v1-specific features have been officially deprecated,
> but memcg v1 as a whole hasn't), implementing Shakeel's suggestion [1]
> may be a good option (it doesn't seem to add much complexity)
> 
> But it can be argued that it's not good enough reason to
> delay the series; adding support for V1 later sounds sensible to me.

Yeah, I will continue to take a closer look at how to support memcg-v1.

> 
> Anyway I'm fine either way and I'm not a memcg maintainer myself,
> so this is just my two cents.
> 
> [1] https://lore.kernel.org/linux-mm/wvj4w7ifmrifnh5bvftdziudsj52fdnwlhbt2oifwmxmi4eore@ob3mrfahhnm5/
>
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Yosry Ahmed 1 month, 2 weeks ago
On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> [...]
> > 
> > I think there might be a problem with non-hierarchical stats on cgroup
> > v1, I brought it up previously [*]. I am not sure if this was addressed
> > but I couldn't immediately find anything.
> 
> Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> 
> > 
> > In short, if memory is charged to a dying cgroup 
> 
> Not sure why stats updates for dying cgroup is related. Isn't it simply
> stat increase at the child memcg and then stat decrease at the parent
> memcg would possibly show negative stat_local of the parent.

Hmm not sure I understand what you mean here. Normally an update to the
child memcg should not update state_local of the parent. So outside the
context of dying cgroups and reparenting I don't see how state_local of
the parent can become negative.

> 
> > at the time of
> > reparenting, when the memory gets uncharged the stats updates will occur
> > at the parent. This will update both hierarchical and non-hierarchical
> > stats of the parent, which would corrupt the parent's non-hierarchical
> > stats (because those counters were never incremented when the memory was
> > charged).
> > 
> > I didn't track down which stats are affected by this, but off the top of
> > my head I think all stats tracking anon, file, etc.
> 
> Let's start with what specific stats might be effected. First the stats
> which are monotonically increasing should be fine, like
> WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> 
> So, the following ones are the interesting ones:
> 
> NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> 
> > 
> > The obvious solution is to flush and reparent the stats of a dying memcg
> > during reparenting,
> 
> Again not sure how flushing will help here and what do you mean by
> 'reparent the stats'? Do you mean something like:

Oh I meant we just need to do an rstat flush to aggregate per-CPU
counters before moving the stats from child to parent.

> 
> parent->vmstats->state_local += child->vmstats->state_local;
> 
> Hmm this seems fine and I think it should work.

Something like that, I didn't look too closely if there's anything else
that needs to be reparented.

> 
> > but I don't think this entirely fixes the problem
> > because the dying memcg stats can still be updated after its reparenting
> > (e.g. if a ref to the memcg has been held since before reparenting).
> 
> How can dying memcg stats can still be updated after reparenting? The
> stats which we care about are the anon & file memory and this series is
> reparenting them, so dying memcg will not see stats updates unless there
> is a concurrent update happening and I think it is very easy to avoid
> such situation by putting a grace period between reparenting the
> file/anon folios and reparenting dying chils'd stats_local. Am I missing
> something?

What prevents the code from obtaining a ref to a parent's memcg before
reparenting, and using it to update the stats after reparenting? A grace
period only works if the entire scope of using the memcg is within the
RCU critical section.

For example, __mem_cgroup_try_charge_swap() currently does this when
incrementing MEMCG_SWAP. While this specific example isn't problematic
because the reference won't be dropped until MEMCG_SWAP is decremented
again, the pattern of grabbing a ref to the memcg then updating a stat
could generally cause the problem.

Most stats are updated using lruvec_stat_mod_folio(), which updates the
stats in the same RCU critical section as obtaining the memcg pointer
from the folio, so it can be fixed with a grace period. However, I think
it can be easily missed in the future if other code paths update memcg
stats in a different way. We should try to enforce that stat updates
can only happen from the same RCU critical section where the memcg
pointer is acquired.
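
For reference, the pattern described above looks roughly like this (a
simplified sketch; the helpers are existing kernel ones, but the wrapper
name is made up and error handling and irq details are elided):

    /* Simplified sketch of the lruvec_stat_mod_folio() pattern: look up
     * the memcg and update the stat inside one RCU read-side section. */
    static void stat_mod_folio_sketch(struct folio *folio,
                                      enum node_stat_item idx, int val)
    {
            struct mem_cgroup *memcg;
            struct lruvec *lruvec;

            rcu_read_lock();
            memcg = folio_memcg(folio);   /* binding only stable under RCU */
            lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
            __mod_lruvec_state(lruvec, idx, val);
            rcu_read_unlock();
    }

A grace period after reparenting then guarantees that any updater which saw
the old folio->memcg binding has finished before the stats are moved.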
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 2 weeks ago
On Wed, Dec 24, 2025 at 12:07:50AM +0000, Yosry Ahmed wrote:
> On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > [...]
> > > 
> > > I think there might be a problem with non-hierarchical stats on cgroup
> > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > but I couldn't immediately find anything.
> > 
> > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> > 
> > > 
> > > In short, if memory is charged to a dying cgroup 
> > 
> > Not sure why stats updates for dying cgroup is related. Isn't it simply
> > stat increase at the child memcg and then stat decrease at the parent
> > memcg would possibly show negative stat_local of the parent.
> 
> Hmm not sure I understand what you mean here. Normally an update to the
> child memcg should not update state_local of the parent. So outside the
> context of dying cgroups and reparenting I don't see how state_local of
> the parent can become negative.

We might be talking about the same thing. When you said "memory is charged
to a dying cgroup", it does not have to be a dying cgroup; it can be an
alive child which dies later. So to give an example, let's say a process in
the child allocates a file page: NR_FILE_PAGES is increased at the
child, and next the child is rmdir'ed, so when that specific file page
is freed, NR_FILE_PAGES will be decreased at the parent after this
series.

> 
> > 
> > > at the time of
> > > reparenting, when the memory gets uncharged the stats updates will occur
> > > at the parent. This will update both hierarchical and non-hierarchical
> > > stats of the parent, which would corrupt the parent's non-hierarchical
> > > stats (because those counters were never incremented when the memory was
> > > charged).
> > > 
> > > I didn't track down which stats are affected by this, but off the top of
> > > my head I think all stats tracking anon, file, etc.
> > 
> > Let's start with what specific stats might be effected. First the stats
> > which are monotonically increasing should be fine, like
> > WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> > 
> > So, the following ones are the interesting ones:
> > 
> > NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> > NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> > 
> > > 
> > > The obvious solution is to flush and reparent the stats of a dying memcg
> > > during reparenting,
> > 
> > Again not sure how flushing will help here and what do you mean by
> > 'reparent the stats'? Do you mean something like:
> 
> Oh I meant we just need to do an rstat flush to aggregate per-CPU
> counters before moving the stats from child to parent.
> 
> > 
> > parent->vmstats->state_local += child->vmstats->state_local;
> > 
> > Hmm this seems fine and I think it should work.
> 
> Something like that, I didn't look too closely if there's anything else
> that needs to be reparented.
> 
> > 
> > > but I don't think this entirely fixes the problem
> > > because the dying memcg stats can still be updated after its reparenting
> > > (e.g. if a ref to the memcg has been held since before reparenting).
> > 
> > How can dying memcg stats can still be updated after reparenting? The
> > stats which we care about are the anon & file memory and this series is
> > reparenting them, so dying memcg will not see stats updates unless there
> > is a concurrent update happening and I think it is very easy to avoid
> > such situation by putting a grace period between reparenting the
> > file/anon folios and reparenting dying chils'd stats_local. Am I missing
> > something?
> 
> What prevents the code from obtaining a ref to a parent's memcg

I think you meant child's memcg here.

> before
> reparenting, and using it to update the stats after reparenting? A grace
> period only works if the entire scope of using the memcg is within the
> RCU critical section.

Yeah this is an issue.

> 
> For example, __mem_cgroup_try_charge_swap() currently does this when
> incrementing MEMCG_SWAP. While this specific example isn't problematic
> because the reference won't be dropped until MEMCG_SWAP is decremented
> again, the pattern of grabbing a ref to the memcg then updating a stat
> could generally cause the problem.
> 
> Most stats are updated using lruvec_stat_mod_folio(), which updates the
> stats in the same RCU critical section as obtaining the memcg pointer
> from the folio, so it can be fixed with a grace period. However, I think
> it can be easily missed in the future if other code paths update memcg
> stats in a different way. We should try to enforce that stat updates
> cannot only happen from the same RCU critical section where the memcg
> pointer is acquired.

The core stats update functions are mod_memcg_state() and
mod_memcg_lruvec_state(). If, for v1 only, we add an additional check for
CSS_DYING and go to the parent when it is set, then shouldn't we avoid
this issue?
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Yosry Ahmed 1 month, 2 weeks ago
On Tue, Dec 23, 2025 at 04:36:18PM -0800, Shakeel Butt wrote:
> On Wed, Dec 24, 2025 at 12:07:50AM +0000, Yosry Ahmed wrote:
> > On Tue, Dec 23, 2025 at 03:20:47PM -0800, Shakeel Butt wrote:
> > > On Tue, Dec 23, 2025 at 08:04:50PM +0000, Yosry Ahmed wrote:
> > > [...]
> > > > 
> > > > I think there might be a problem with non-hierarchical stats on cgroup
> > > > v1, I brought it up previously [*]. I am not sure if this was addressed
> > > > but I couldn't immediately find anything.
> > > 
> > > Sigh, the curse of memcg-v1. Let's see what we can do to not break v1.
> > > 
> > > > 
> > > > In short, if memory is charged to a dying cgroup 
> > > 
> > > Not sure why stats updates for dying cgroup is related. Isn't it simply
> > > stat increase at the child memcg and then stat decrease at the parent
> > > memcg would possibly show negative stat_local of the parent.
> > 
> > Hmm not sure I understand what you mean here. Normally an update to the
> > child memcg should not update state_local of the parent. So outside the
> > context of dying cgroups and reparenting I don't see how state_local of
> > the parent can become negative.
> 
> We might be talking about same thing. When you said "memory is charged
> to a dying cgroup", it does not have to be dying cgroup, it can be alive
> child which died later. So to give an example, let's say a process in
> the child allocates a file page, NR_FILE_PAGES is increased at the
> child and next child has been rmdir'ed, so when that specific file page
> is freed, the NR_FILE_PAGES will be decreased at the parent after this
> series.

Yes this is exactly what I mean. Specifically, an update happens after
the cgroup becomes "dying".

> 
> > 
> > > 
> > > > at the time of
> > > > reparenting, when the memory gets uncharged the stats updates will occur
> > > > at the parent. This will update both hierarchical and non-hierarchical
> > > > stats of the parent, which would corrupt the parent's non-hierarchical
> > > > stats (because those counters were never incremented when the memory was
> > > > charged).
> > > > 
> > > > I didn't track down which stats are affected by this, but off the top of
> > > > my head I think all stats tracking anon, file, etc.
> > > 
> > > Let's start with what specific stats might be effected. First the stats
> > > which are monotonically increasing should be fine, like
> > > WORKINGSET_REFAULT_[ANON|FILE], PGPG[IN|OUT], PG[MAJ]FAULT.
> > > 
> > > So, the following ones are the interesting ones:
> > > 
> > > NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS, NR_SHMEM, NR_FILE_MAPPED,
> > > NR_FILE_DIRTY, NR_WRITEBACK, MEMCG_SWAP, NR_SWAPCACHE.
> > > 
> > > > 
> > > > The obvious solution is to flush and reparent the stats of a dying memcg
> > > > during reparenting,
> > > 
> > > Again not sure how flushing will help here and what do you mean by
> > > 'reparent the stats'? Do you mean something like:
> > 
> > Oh I meant we just need to do an rstat flush to aggregate per-CPU
> > counters before moving the stats from child to parent.
> > 
> > > 
> > > parent->vmstats->state_local += child->vmstats->state_local;
> > > 
> > > Hmm this seems fine and I think it should work.
> > 
> > Something like that, I didn't look too closely if there's anything else
> > that needs to be reparented.
> > 
> > > 
> > > > but I don't think this entirely fixes the problem
> > > > because the dying memcg stats can still be updated after its reparenting
> > > > (e.g. if a ref to the memcg has been held since before reparenting).
> > > 
> > > How can dying memcg stats can still be updated after reparenting? The
> > > stats which we care about are the anon & file memory and this series is
> > > reparenting them, so dying memcg will not see stats updates unless there
> > > is a concurrent update happening and I think it is very easy to avoid
> > > such situation by putting a grace period between reparenting the
> > > file/anon folios and reparenting dying chils'd stats_local. Am I missing
> > > something?
> > 
> > What prevents the code from obtaining a ref to a parent's memcg
> 
> I think you meant child's memcg here.

Yes, sorry.

> 
> > before
> > reparenting, and using it to update the stats after reparenting? A grace
> > period only works if the entire scope of using the memcg is within the
> > RCU critical section.
> 
> Yeah this is an issue.
> 
> > 
> > For example, __mem_cgroup_try_charge_swap() currently does this when
> > incrementing MEMCG_SWAP. While this specific example isn't problematic
> > because the reference won't be dropped until MEMCG_SWAP is decremented
> > again, the pattern of grabbing a ref to the memcg then updating a stat
> > could generally cause the problem.
> > 
> > Most stats are updated using lruvec_stat_mod_folio(), which updates the
> > stats in the same RCU critical section as obtaining the memcg pointer
> > from the folio, so it can be fixed with a grace period. However, I think
> > it can be easily missed in the future if other code paths update memcg
> > stats in a different way. We should try to enforce that stat updates
> > cannot only happen from the same RCU critical section where the memcg
> > pointer is acquired.
> 
> The core stats update functions are mod_memcg_state() and
> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> this issue?

But this is still racy, right? The cgroup could become dying right after
we check CSS_DYING, no?
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Shakeel Butt 1 month, 2 weeks ago
On Wed, Dec 24, 2025 at 12:43:00AM +0000, Yosry Ahmed wrote:
[...]
> > 
> > I think you meant child's memcg here.
> 
> Yes, sorry.
> 
> > 
> > > before
> > > reparenting, and using it to update the stats after reparenting? A grace
> > > period only works if the entire scope of using the memcg is within the
> > > RCU critical section.
> > 
> > Yeah this is an issue.
> > 
> > > 
> > > For example, __mem_cgroup_try_charge_swap() currently does this when
> > > incrementing MEMCG_SWAP. While this specific example isn't problematic
> > > because the reference won't be dropped until MEMCG_SWAP is decremented
> > > again, the pattern of grabbing a ref to the memcg then updating a stat
> > > could generally cause the problem.
> > > 
> > > Most stats are updated using lruvec_stat_mod_folio(), which updates the
> > > stats in the same RCU critical section as obtaining the memcg pointer
> > > from the folio, so it can be fixed with a grace period. However, I think
> > > it can be easily missed in the future if other code paths update memcg
> > > stats in a different way. We should try to enforce that stat updates
> > > cannot only happen from the same RCU critical section where the memcg
> > > pointer is acquired.
> > 
> > The core stats update functions are mod_memcg_state() and
> > mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> > CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> > this issue?
> 
> But this is still racy, right? The cgroup could become dying right after
> we check CSS_DYING, no?

We do reparenting in the css_offline() callback, and cgroup offlining
happens somewhat like this:

1. Set CSS_DYING
2. Trigger percpu ref kill
3. The kernel makes sure the css ref kill is seen by all CPUs and then
   triggers the css_offline callback.

So, if the stats update function checks the CSS_DYING flag and does the
actual stats update within an RCU read-side section, I think we are good.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 1 month, 1 week ago

On 12/24/25 8:58 AM, Shakeel Butt wrote:
> On Wed, Dec 24, 2025 at 12:43:00AM +0000, Yosry Ahmed wrote:
> [...]
>>>
>>> I think you meant child's memcg here.
>>
>> Yes, sorry.
>>
>>>
>>>> before
>>>> reparenting, and using it to update the stats after reparenting? A grace
>>>> period only works if the entire scope of using the memcg is within the
>>>> RCU critical section.
>>>
>>> Yeah this is an issue.
>>>
>>>>
>>>> For example, __mem_cgroup_try_charge_swap() currently does this when
>>>> incrementing MEMCG_SWAP. While this specific example isn't problematic
>>>> because the reference won't be dropped until MEMCG_SWAP is decremented
>>>> again, the pattern of grabbing a ref to the memcg then updating a stat
>>>> could generally cause the problem.
>>>>
>>>> Most stats are updated using lruvec_stat_mod_folio(), which updates the
>>>> stats in the same RCU critical section as obtaining the memcg pointer
>>>> from the folio, so it can be fixed with a grace period. However, I think
>>>> it can be easily missed in the future if other code paths update memcg
>>>> stats in a different way. We should try to enforce that stat updates
>>>> cannot only happen from the same RCU critical section where the memcg
>>>> pointer is acquired.
>>>
>>> The core stats update functions are mod_memcg_state() and
>>> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
>>> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
>>> this issue?
>>
>> But this is still racy, right? The cgroup could become dying right after
>> we check CSS_DYING, no?
> 
> We do reparenting in css_offline() callback and cgroup offlining
> happen somewhat like this:
> 
> 1. Set CSS_DYING
> 2. Trigger percpu ref kill
> 3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
>     css_offline callback.

It seems that we can add the following to
mem_cgroup_css_free():

parent->vmstats->state_local += child->vmstats->state_local;

Right? I will continue to take a closer look.

Thanks,
Qi

> 
> So, if in the stats update function we check CSS_DYING flag and the
> actual stats update within rcu, I think we are good.
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Michal Koutný 1 month, 1 week ago
On Tue, Dec 23, 2025 at 04:36:18PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
...
> The core stats update functions are mod_memcg_state() and
> mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> this issue?

...and go to first !CSS_DYING ancestor :-/ (as the whole chain of memcgs
can be offlined)

IIUC, thanks to the reparenting, charging (modifying state) to an offlined
memcg should be an exception...
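
A minimal sketch of that combined idea (the wrapper name is hypothetical;
css_is_dying(), mem_cgroup_is_root(), parent_mem_cgroup() and
__mod_memcg_state() are existing helpers) could look like:

    /* Hypothetical v1-only wrapper: redirect updates for dying memcgs to
     * the first non-dying ancestor, under the RCU read lock so the check
     * is stable against concurrent offlining (see the CSS_DYING vs
     * css_offline() ordering described above). */
    static void mod_memcg_state_v1(struct mem_cgroup *memcg,
                                   enum memcg_stat_item idx, int nr)
    {
            rcu_read_lock();
            while (!mem_cgroup_is_root(memcg) && css_is_dying(&memcg->css))
                    memcg = parent_mem_cgroup(memcg);
            __mod_memcg_state(memcg, idx, nr);
            rcu_read_unlock();
    }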


On Mon, Dec 29, 2025 at 05:42:43PM +0800, Qi Zheng <qi.zheng@linux.dev> wrote:

> > We do reparenting in css_offline() callback and cgroup offlining
> > happen somewhat like this:
> > 
> > 1. Set CSS_DYING
> > 2. Trigger percpu ref kill
> > 3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
> >     css_offline callback.
> 
> it seems that we can add the following to
> mem_cgroup_css_free():
> 
> parent->vmstats->state_local += child->vmstats->state_local;
> 
> Right? I will continue to take a closer look.

...and the time between offlining and freeing a memcg should not be
arbitrarily long anymore (right? that is the crux of the series).
So only transferring local stats in mem_cgroup_css_free() should yield a
correct result after a limited time range (with possible underflows in
between), with no special precaution for CSS_DYING on the charging side.

0.02€,
Michal
Re: [PATCH v2 00/28] Eliminate Dying Memory Cgroup
Posted by Yosry Ahmed 1 month ago
On Mon, Dec 29, 2025 at 11:52:52AM +0100, Michal Koutný wrote:
> On Tue, Dec 23, 2025 at 04:36:18PM -0800, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> ...
> > The core stats update functions are mod_memcg_state() and
> > mod_memcg_lruvec_state(). If for v1 only, we add additional check for
> > CSS_DYING and go to parent if CSS_DYING is set then shouldn't we avoid
> > this issue?
> 
> ...and go to first !CSS_DYING ancestor :-/ (as the whole chain of memcgs
> can be offlined)
> 
> IIUC thanks to the reparenting charging (modifying state) to an offlined
> memcg should be an exception...
> 
> 
> On Mon, Dec 29, 2025 at 05:42:43PM +0800, Qi Zheng <qi.zheng@linux.dev> wrote:
> 
> > > We do reparenting in css_offline() callback and cgroup offlining
> > > happen somewhat like this:
> > > 
> > > 1. Set CSS_DYING
> > > 2. Trigger percpu ref kill
> > > 3. Kernel makes sure css ref killed is seen by all CPUs and then trigger
> > >     css_offline callback.
> > 
> > it seems that we can add the following to
> > mem_cgroup_css_free():
> > 
> > parent->vmstats->state_local += child->vmstats->state_local;
> > 
> > Right? I will continue to take a closer look.
> 
> ...and the time between offlining and free'ing a memcg should not be
> arbitrarily long anymore (right?, the crux of the series).
> So only transferring local stats in mem_cgroup_css_free should yield a
> correct result after limited time range (with possible underflows
> between) with no special precaution for CSS_DYING on charging side.

I don't think this works, unfortunately. Even with refs from folios to
memcgs dropped at offlining, there could still be long-living refs (e.g.
from swapped out entries). So we cannot wait until the memcg is released
or freed to do the reparenting of the stats.

I think the right thing to do is, as discussed with Shakeel, move the
stats at the time of offlining after reparenting the LRUs, and forward
further updates to the first non-dying parent.

We'll need to be careful with synchronization. We'll probably need an
RCU sync after reparenting the LRUs before moving the stats to the
parent, and we'll need to make sure stat updaters get the memcg and
update the stats within the same RCU section. Ideally with guards
against breaking this in the future.
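
Put together, the offline-time handling being discussed might look
something like the sketch below (schematic only; synchronize_rcu() and
mem_cgroup_flush_stats() are existing kernel APIs, while the stats-moving
loop and the MEMCG_VMSTAT_SIZE indexing are assumptions about the internal
layout):

    /* Schematic: called from the css_offline() path after the LRU
     * folios have been reparented. */
    static void memcg_reparent_local_stats(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *parent = parent_mem_cgroup(memcg);
            int i;

            /* Let updaters that found the old folio->memcg binding in
             * an RCU section finish before the local stats are moved. */
            synchronize_rcu();

            /* Fold per-CPU deltas so state_local is stable. */
            mem_cgroup_flush_stats(memcg);

            for (i = 0; i < MEMCG_VMSTAT_SIZE; i++)
                    parent->vmstats->state_local[i] +=
                                    memcg->vmstats->state_local[i];
    }

As noted above, updaters that hold a longer-lived memcg reference would
still need the CSS_DYING redirect on the update path to avoid touching the
dying memcg after this point.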

> 
> 0.02€,
> Michal