[PATCH v1 00/26] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 months, 1 week ago
From: Qi Zheng <zhengqi.arch@bytedance.com>

Hi all,

This series aims to eliminate the problem of dying memory cgroup. It completes
the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1].

The adaptation method was discussed with Harry Yoo. For more details, please
refer to the commit message of [PATCH v1 23/26].

The changes are as follows:
 - drop [PATCH RFC 02/28]
 - drop THP split queue related part, which has been merged as a separate
   patchset[2]
 - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
   [PATCH v1 16/26]
 - separate the reparenting of traditional LRU folios into [PATCH v1 22/26]
 - adapt to the MGLRU scenarios in [PATCH v1 23/26]
 - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20251028

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/all/20250415024532.26632-1-songmuchun@bytedance.com/
[2]. https://lore.kernel.org/all/cover.1760509767.git.zhengqi.arch@bytedance.com

Muchun Song (22):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in swap module
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

Qi Zheng (4):
  mm: thp: prevent memory cgroup release in
    folio_split_queue_lock{_irqsave}()
  mm: vmscan: prepare for reparenting traditional LRU folios
  mm: vmscan: prepare for reparenting MGLRU folios
  mm: memcontrol: refactor memcg_reparent_objcgs()

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 159 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/linux/mmzone.h           |  20 ++
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 +++-
 mm/huge_memory.c                 |  18 +-
 mm/memcontrol-v1.c               |  15 +-
 mm/memcontrol.c                  | 405 ++++++++++++++++++-------------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  20 +-
 mm/vmscan.c                      | 198 ++++++++++++---
 mm/workingset.c                  |  26 +-
 mm/zswap.c                       |   2 +
 19 files changed, 612 insertions(+), 349 deletions(-)

-- 
2.20.1
Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
Posted by Michal Hocko 3 months, 1 week ago
On Tue 28-10-25 21:58:13, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Hi all,
> 
> This series aims to eliminate the problem of dying memory cgroup. It completes
> the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1]

A high level summary and the main design decisions should be described in the
cover letter.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 months, 1 week ago
Hi Michal,

On 10/29/25 3:53 PM, Michal Hocko wrote:
> On Tue 28-10-25 21:58:13, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> Hi all,
>>
>> This series aims to eliminate the problem of dying memory cgroup. It completes
>> the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1]
> 
> A high level summary and the main design decisions should be described in the
> cover letter.

Got it. Will add it in the next version.

I've pasted the contents of Muchun Song's cover letter below:

```
## Introduction

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

## Background

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer being used by users, but memory (the metadata
associated with memory cgroups) remains allocated to it. This situation
may potentially result in memory leaks or inefficiencies in memory
reclamation and has persisted as an issue for several years. Any memory
allocation that endures longer than the lifespan (from the users'
perspective) of a memory cgroup can lead to the issue of dying memory
cgroup. We have exerted greater efforts to tackle this problem by
introducing the infrastructure of object cgroup [2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The final allocations
for LRU pages (anonymous pages and file pages) are charged at allocation
time and continue to hold a reference to the original memory cgroup
until reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.

## Fundamentals

A folio will no longer pin its corresponding memory cgroup. It is necessary
to ensure that the memory cgroup or the lruvec associated with the memory
cgroup is not released when a user obtains a pointer to the memory cgroup
or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
to hold the RCU read lock or acquire a reference to the memory cgroup
associated with the folio to prevent its release if they are not concerned
about the binding stability between the folio and its corresponding memory
cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
desire a stable binding between the folio and its corresponding memory
cgroup. An approach is needed to ensure the stability of the binding while
the lruvec lock is held, and to detect the situation of holding the
incorrect lruvec lock when there is a race condition during memory cgroup
reparenting. The following four steps are taken to achieve these goals.

1. The first step is to identify all users of both functions
    (folio_memcg() and folio_lruvec()) that are not concerned about binding
    stability and implement appropriate measures (such as holding an RCU
    read lock or temporarily obtaining a reference to the memory cgroup for
    a brief period) to prevent the release of the memory cgroup; a minimal
    sketch of this pattern follows this list.

2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
    how to ensure binding stability from the perspective of folio_lruvec()
    users.

    struct lruvec *folio_lruvec_lock(struct folio *folio)
    {
            struct lruvec *lruvec;

            rcu_read_lock();
    retry:
            lruvec = folio_lruvec(folio);
            spin_lock(&lruvec->lru_lock);
            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                    spin_unlock(&lruvec->lru_lock);
                    goto retry;
            }

            return lruvec;
    }

    From the perspective of memory cgroup removal, the entire reparenting
    process (altering the binding relationship between folio and its memory
    cgroup and moving the LRU lists to its parental memory cgroup) should be
    carried out under both the lruvec lock of the memory cgroup being removed
    and the lruvec lock of its parent.

3. Thirdly, another lock that requires the same approach is the split-queue
    lock of THP.

4. Finally, transfer the LRU pages to the object cgroup without holding a
    reference to the original memory cgroup.
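
As an illustration of step 1, a minimal sketch of this pattern; the
statistics update shown here is just an illustrative user, and the exact
call sites vary from patch to patch:

    rcu_read_lock();
    memcg = folio_memcg(folio);     /* no longer pins the memcg */
    if (memcg)
            count_memcg_events(memcg, PGACTIVATE, 1);
    rcu_read_unlock();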
```

And the details of the adaptation are below:

```
Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when
the memcg is offlined.

However, there are the following challenges:

1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, so the
    number of generations of the parent and child memcg may differ; we
    cannot simply transfer MGLRU folios in the child memcg to the
    parent memcg as we did for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
    traverse these folios while holding the lru lock; otherwise it may
    cause a softlockup.
3. In walk_update_folio(), the gen of a folio and the corresponding lru
    size may be updated, but the folio is not immediately moved to the
    corresponding lru list. Therefore, there may be folios of different
    generations on an LRU list.
4. In lru_gen_del_folio(), the generation to which the folio belongs is
    found based on the generation information in folio->flags, and the
    corresponding LRU size will be updated. Therefore, we need to update
    the lru size correctly during reparenting; otherwise the lru size may
    be updated incorrectly in lru_gen_del_folio().

Finally, this patch chose a compromise method: splice each lru list in the
child memcg onto the lru list of the same generation in the parent memcg
during reparenting. And in order to ensure that the parent memcg has the
same generations, we need to increase the number of generations in the
parent memcg to MAX_NR_GENS before reparenting; a rough sketch follows
this block.

Of course, the same generation has different meanings in the parent and
child memcg, this will cause confusion in the hot and cold information of
folios. But other than that, this method is simple enough, the lru size
is correct, and there is no need to consider some concurrency issues (such
as lru_gen_del_folio()).
```
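
To make the splice step concrete, here is a rough, hypothetical sketch (the
helper name is invented for illustration; the real patch runs under both
lruvec locks and only after the parent has been grown to MAX_NR_GENS):

```
/*
 * Hypothetical sketch, not the actual patch: splice each child MGLRU
 * list onto the parent's list of the same generation and migrate the
 * per-list page counts along with it.
 */
static void reparent_mglru_lists_sketch(struct lruvec *child,
					struct lruvec *parent)
{
	int gen, type, zone;
	struct lru_gen_folio *src = &child->lrugen;
	struct lru_gen_folio *dst = &parent->lrugen;

	for (gen = 0; gen < MAX_NR_GENS; gen++)
		for (type = 0; type < ANON_AND_FILE; type++)
			for (zone = 0; zone < MAX_NR_ZONES; zone++) {
				list_splice_tail_init(&src->folios[gen][type][zone],
						      &dst->folios[gen][type][zone]);
				WRITE_ONCE(dst->nr_pages[gen][type][zone],
					   dst->nr_pages[gen][type][zone] +
					   src->nr_pages[gen][type][zone]);
				WRITE_ONCE(src->nr_pages[gen][type][zone], 0);
			}
}
```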

Thanks,
Qi

> 
> Thanks!
Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
Posted by Michal Hocko 3 months, 1 week ago
On Wed 29-10-25 16:05:16, Qi Zheng wrote:
> Hi Michal,
> 
> On 10/29/25 3:53 PM, Michal Hocko wrote:
> > On Tue 28-10-25 21:58:13, Qi Zheng wrote:
> > > From: Qi Zheng <zhengqi.arch@bytedance.com>
> > > 
> > > Hi all,
> > > 
> > > This series aims to eliminate the problem of dying memory cgroup. It completes
> > > the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1]
> > 
> > A high level summary and the main design decisions should be described in the
> > cover letter.
> 
> Got it. Will add it in the next version.
> 
> I've pasted the contents of Muchun Song's cover letter below:
> 
> ```
> ## Introduction
> 
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].

Could you add those referenced links as well please?

> ## Background
> 
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
> 
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continue to hold a reference to the original memory cgroup
> until reclaimed.
> 
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.

Very useful introduction to the problem. Thanks!

> ## Fundamentals
> 
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
> 
> 1. The first step is to identify all users of both functions
>    (folio_memcg() and folio_lruvec()) who are not concerned about binding
>    stability and implement appropriate measures (such as holding a RCU read
>    lock or temporarily obtaining a reference to the memory cgroup for a
>    brief period) to prevent the release of the memory cgroup.
> 
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>    how to ensure the binding stability from the user's perspective of
>    folio_lruvec().
> 
>    struct lruvec *folio_lruvec_lock(struct folio *folio)
>    {
>            struct lruvec *lruvec;
> 
>            rcu_read_lock();
>    retry:
>            lruvec = folio_lruvec(folio);
>            spin_lock(&lruvec->lru_lock);
>            if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>                    spin_unlock(&lruvec->lru_lock);
>                    goto retry;
>            }
> 
>            return lruvec;
>    }
> 
>    From the perspective of memory cgroup removal, the entire reparenting
>    process (altering the binding relationship between folio and its memory
>    cgroup and moving the LRU lists to its parental memory cgroup) should be
>    carried out under both the lruvec lock of the memory cgroup being removed
>    and the lruvec lock of its parent.
> 
> 3. Thirdly, another lock that requires the same approach is the split-queue
>    lock of THP.
> 
> 4. Finally, transfer the LRU pages to the object cgroup without holding a
>    reference to the original memory cgroup.
> ```
> 
> And the details of the adaptation are below:
> 
> ```
> Similar to traditional LRU folios, in order to solve the dying memcg
> problem, we also need to reparent MGLRU folios to the parent memcg when
> the memcg is offlined.
> 
> However, there are the following challenges:
> 
> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
>    number of generations of the parent and child memcg may be different,
>    so we cannot simply transfer MGLRU folios in the child memcg to the
>    parent memcg as we did for traditional LRU folios.
> 2. The generation information is stored in folio->flags, but we cannot
>    traverse these folios while holding the lru lock, otherwise it may
>    cause softlockup.
> 3. In walk_update_folio(), the gen of folio and corresponding lru size
>    may be updated, but the folio is not immediately moved to the
>    corresponding lru list. Therefore, there may be folios of different
>    generations on an LRU list.
> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
>    found based on the generation information in folio->flags, and the
>    corresponding LRU size will be updated. Therefore, we need to update
>    the lru size correctly during reparenting, otherwise the lru size may
>    be updated incorrectly in lru_gen_del_folio().
> 
> Finally, this patch chose a compromise method, which is to splice the lru
> list in the child memcg to the lru list of the same generation in the
> parent memcg during reparenting. And in order to ensure that the parent
> memcg has the same generation, we need to increase the generations in the
> parent memcg to the MAX_NR_GENS before reparenting.
> 
> Of course, the same generation has different meanings in the parent and
> child memcg, which will cause confusion in the hot and cold information of
> folios. But other than that, this method is simple enough, the lru size
> is correct, and there is no need to consider some concurrency issues (such
> as lru_gen_del_folio()).
> ```

Thank you, this is very useful.

A high level overview of how the patch series (of this size) is structured
would be appreciated as well.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 months ago
Hi Michal,

On 10/31/25 6:35 PM, Michal Hocko wrote:
> On Wed 29-10-25 16:05:16, Qi Zheng wrote:
>> Hi Michal,
>>
>> On 10/29/25 3:53 PM, Michal Hocko wrote:
>>> On Tue 28-10-25 21:58:13, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> Hi all,
>>>>
>>>> This series aims to eliminate the problem of dying memory cgroup. It completes
>>>> the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1]
>>>
>>> A high level summary and the main design decisions should be described in the
>>> cover letter.
>>
>> Got it. Will add it in the next version.
>>
>> I've pasted the contents of Muchun Song's cover letter below:
>>
>> ```
>> ## Introduction
>>
>> This patchset is intended to transfer the LRU pages to the object cgroup
>> without holding a reference to the original memory cgroup in order to
>> address the issue of the dying memory cgroup. A consensus has already been
>> reached regarding this approach recently [1].
> 
> Could you add those referenced links as well please?

Oh, I missed that.

[1]. https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/

> 
>> ## Background
>>
>> The issue of a dying memory cgroup refers to a situation where a memory
>> cgroup is no longer being used by users, but memory (the metadata
>> associated with memory cgroups) remains allocated to it. This situation
>> may potentially result in memory leaks or inefficiencies in memory
>> reclamation and has persisted as an issue for several years. Any memory
>> allocation that endures longer than the lifespan (from the users'
>> perspective) of a memory cgroup can lead to the issue of dying memory
>> cgroup. We have exerted greater efforts to tackle this problem by
>> introducing the infrastructure of object cgroup [2].

[2]. https://lwn.net/Articles/895431/

>>
>> Presently, numerous types of objects (slab objects, non-slab kernel
>> allocations, per-CPU objects) are charged to the object cgroup without
>> holding a reference to the original memory cgroup. The final allocations
>> for LRU pages (anonymous pages and file pages) are charged at allocation
>> time and continue to hold a reference to the original memory cgroup
>> until reclaimed.
>>
>> File pages are more complex than anonymous pages as they can be shared
>> among different memory cgroups and may persist beyond the lifespan of
>> the memory cgroup. The long-term pinning of file pages to memory cgroups
>> is a widespread issue that causes recurring problems in practical
>> scenarios [3]. File pages remain unreclaimed for extended periods.

[3]. https://github.com/systemd/systemd/pull/36827

>> Additionally, they are accessed by successive instances (second, third,
>> fourth, etc.) of the same job, which is restarted into a new cgroup each
>> time. As a result, unreclaimable dying memory cgroups accumulate,
>> leading to memory wastage and significantly reducing the efficiency
>> of page reclamation.
> 
> Very useful introduction to the problem. Thanks!
> 
>> ## Fundamentals
>>
>> A folio will no longer pin its corresponding memory cgroup. It is necessary
>> to ensure that the memory cgroup or the lruvec associated with the memory
>> cgroup is not released when a user obtains a pointer to the memory cgroup
>> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
>> to hold the RCU read lock or acquire a reference to the memory cgroup
>> associated with the folio to prevent its release if they are not concerned
>> about the binding stability between the folio and its corresponding memory
>> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
>> desire a stable binding between the folio and its corresponding memory
>> cgroup. An approach is needed to ensure the stability of the binding while
>> the lruvec lock is held, and to detect the situation of holding the
>> incorrect lruvec lock when there is a race condition during memory cgroup
>> reparenting. The following four steps are taken to achieve these goals.
>>
>> 1. The first step is to identify all users of both functions
>>     (folio_memcg() and folio_lruvec()) who are not concerned about binding
>>     stability and implement appropriate measures (such as holding a RCU read
>>     lock or temporarily obtaining a reference to the memory cgroup for a
>>     brief period) to prevent the release of the memory cgroup.
>>
>> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>>     how to ensure the binding stability from the user's perspective of
>>     folio_lruvec().
>>
>>     struct lruvec *folio_lruvec_lock(struct folio *folio)
>>     {
>>             struct lruvec *lruvec;
>>
>>             rcu_read_lock();
>>     retry:
>>             lruvec = folio_lruvec(folio);
>>             spin_lock(&lruvec->lru_lock);
>>             if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>>                     spin_unlock(&lruvec->lru_lock);
>>                     goto retry;
>>             }
>>
>>             return lruvec;
>>     }
>>
>>     From the perspective of memory cgroup removal, the entire reparenting
>>     process (altering the binding relationship between folio and its memory
>>     cgroup and moving the LRU lists to its parental memory cgroup) should be
>>     carried out under both the lruvec lock of the memory cgroup being removed
>>     and the lruvec lock of its parent.
>>
>> 3. Thirdly, another lock that requires the same approach is the split-queue
>>     lock of THP.
>>
>> 4. Finally, transfer the LRU pages to the object cgroup without holding a
>>     reference to the original memory cgroup.
>> ```
>>
>> And the details of the adaptation are below:
>>
>> ```
>> Similar to traditional LRU folios, in order to solve the dying memcg
>> problem, we also need to reparent MGLRU folios to the parent memcg when
>> the memcg is offlined.
>>
>> However, there are the following challenges:
>>
>> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, the
>>     number of generations of the parent and child memcg may be different,
>>     so we cannot simply transfer MGLRU folios in the child memcg to the
>>     parent memcg as we did for traditional LRU folios.
>> 2. The generation information is stored in folio->flags, but we cannot
>>     traverse these folios while holding the lru lock, otherwise it may
>>     cause softlockup.
>> 3. In walk_update_folio(), the gen of folio and corresponding lru size
>>     may be updated, but the folio is not immediately moved to the
>>     corresponding lru list. Therefore, there may be folios of different
>>     generations on an LRU list.
>> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
>>     found based on the generation information in folio->flags, and the
>>     corresponding LRU size will be updated. Therefore, we need to update
>>     the lru size correctly during reparenting, otherwise the lru size may
>>     be updated incorrectly in lru_gen_del_folio().
>>
>> Finally, this patch chose a compromise method, which is to splice the lru
>> list in the child memcg to the lru list of the same generation in the
>> parent memcg during reparenting. And in order to ensure that the parent
>> memcg has the same generation, we need to increase the generations in the
>> parent memcg to the MAX_NR_GENS before reparenting.
>>
>> Of course, the same generation has different meanings in the parent and
>> child memcg, which will cause confusion in the hot and cold information of
>> folios. But other than that, this method is simple enough, the lru size
>> is correct, and there is no need to consider some concurrency issues (such
>> as lru_gen_del_folio()).
>> ```
> 
> Thank you, this is very useful.
> 
> A high level overview of how the patch series (of this size) is structured
> would be appreciated as well.

OK. Will add this to the cover letter in the next version.

Thanks,
Qi
[syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by syzbot ci 3 months, 1 week ago
syzbot ci has tested the following series

[v1] Eliminate Dying Memory Cgroup
https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
* [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
* [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
* [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
* [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
* [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
* [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
* [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
* [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
* [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
* [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
* [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
* [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
* [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
* [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
* [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
* [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
* [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
* [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
* [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
* [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
* [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
* [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
* [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
* [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
* [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
* [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

and found the following issue:
WARNING in folio_memcg

Full report is available here:
https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650

***

WARNING in folio_memcg

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      b227c04932039bccc21a0a89cd6df50fa57e4716
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
C repro:   https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro

exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
exFAT-fs (loop0): failed to load alloc-bitmap
exFAT-fs (loop0): failed to recognize exfat type
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
Modules linked in:
CPU: 1 UID: 0 PID: 5965 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
RIP: 0010:folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
Code: 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 5f c8 06 00 48 8b 03 5b 41 5c 41 5e 41 5f 5d e9 cf 89 2a 09 cc e8 a9 bb a0 ff 90 <0f> 0b 90 eb ca 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c ef fe ff ff
RSP: 0018:ffffc90003ec66b0 EFLAGS: 00010293
RAX: ffffffff821f4b57 RBX: ffff888108b31480 RCX: ffff88816be91d00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff88816be91d00 R09: 0000000000000002
R10: 00000000fffffff0 R11: 0000000000000000 R12: dffffc0000000000
R13: 00000000ffffffe4 R14: ffffea0006d5f840 R15: ffffea0006d5f870
FS:  000055555db87500(0000) GS:ffff8882a9f35000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ee63fff CR3: 000000010c308000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 zswap_compress mm/zswap.c:900 [inline]
 zswap_store_page mm/zswap.c:1430 [inline]
 zswap_store+0xfa2/0x1f80 mm/zswap.c:1541
 swap_writeout+0x6e8/0xf20 mm/page_io.c:275
 writeout mm/vmscan.c:651 [inline]
 pageout mm/vmscan.c:699 [inline]
 shrink_folio_list+0x34ec/0x4c40 mm/vmscan.c:1418
 reclaim_folio_list+0xeb/0x500 mm/vmscan.c:2196
 reclaim_pages+0x454/0x520 mm/vmscan.c:2233
 madvise_cold_or_pageout_pte_range+0x1974/0x1d00 mm/madvise.c:565
 walk_pmd_range mm/pagewalk.c:130 [inline]
 walk_pud_range mm/pagewalk.c:224 [inline]
 walk_p4d_range mm/pagewalk.c:262 [inline]
 walk_pgd_range+0xfe9/0x1d40 mm/pagewalk.c:303
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:410
 walk_page_range_vma+0x393/0x440 mm/pagewalk.c:717
 madvise_pageout_page_range mm/madvise.c:624 [inline]
 madvise_pageout mm/madvise.c:649 [inline]
 madvise_vma_behavior+0x311f/0x3a10 mm/madvise.c:1352
 madvise_walk_vmas+0x51c/0xa30 mm/madvise.c:1669
 madvise_do_behavior+0x38e/0x550 mm/madvise.c:1885
 do_madvise+0x1bc/0x270 mm/madvise.c:1978
 __do_sys_madvise mm/madvise.c:1987 [inline]
 __se_sys_madvise mm/madvise.c:1985 [inline]
 __x64_sys_madvise+0xa7/0xc0 mm/madvise.c:1985
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fccac38efc9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd9cc58708 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fccac5e5fa0 RCX: 00007fccac38efc9
RDX: 0000000000000015 RSI: 7fffffffffffffff RDI: 0000200000000000
RBP: 00007fccac411f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fccac5e5fa0 R14: 00007fccac5e5fa0 R15: 0000000000000003
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by Harry Yoo 3 months, 1 week ago
On Tue, Oct 28, 2025 at 01:58:33PM -0700, syzbot ci wrote:
> syzbot ci has tested the following series
> 
> [v1] Eliminate Dying Memory Cgroup
> https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
> * [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
> * [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
> * [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
> * [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
> * [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
> * [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
> * [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
> * [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
> * [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
> * [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
> * [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
> * [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
> * [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
> * [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
> * [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
> * [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
> * [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
> * [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
> * [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
> * [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
> * [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> * [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
> * [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
> * [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
> * [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
> * [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
> 
> and found the following issue:
> WARNING in folio_memcg
> 
> Full report is available here:
> https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650
> 
> ***
> 
> WARNING in folio_memcg
> 
> tree:      mm-new
> URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
> base:      b227c04932039bccc21a0a89cd6df50fa57e4716
> arch:      amd64
> compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
> config:    https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
> C repro:   https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
> syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro
> 
> exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
> exFAT-fs (loop0): failed to load alloc-bitmap
> exFAT-fs (loop0): failed to recognize exfat type
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434

This is understandable as the code snippet was added fairly recently
and is easy to miss during rebasing.

#syz test

diff --git a/mm/zswap.c b/mm/zswap.c
index a341814468b9..738d914e5354 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -896,11 +896,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * to the active LRU list in the case.
 	 */
 	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
+		rcu_read_lock();
 		if (!mem_cgroup_zswap_writeback_enabled(
 					folio_memcg(page_folio(page)))) {
+			rcu_read_unlock();
 			comp_ret = comp_ret ? comp_ret : -EINVAL;
 			goto unlock;
 		}
+		rcu_read_unlock();
 		comp_ret = 0;
 		dlen = PAGE_SIZE;
 		dst = kmap_local_page(page);
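
For what it's worth, the plain RCU read lock should be sufficient here:
mem_cgroup_zswap_writeback_enabled() only reads a flag from the memcg, so
zswap does not need a stable binding between the folio and its memcg at
this point; this matches step 1 of the series' scheme (prevent release
rather than pin the binding).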
Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 months, 1 week ago
Hi Harry,

On 10/29/25 8:22 AM, Harry Yoo wrote:
> On Tue, Oct 28, 2025 at 01:58:33PM -0700, syzbot ci wrote:
>> syzbot ci has tested the following series
>>
>> [v1] Eliminate Dying Memory Cgroup
>> https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
>> * [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
>> * [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
>> * [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
>> * [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
>> * [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
>> * [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
>> * [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
>> * [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
>> * [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
>> * [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
>> * [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
>> * [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>> * [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
>> * [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
>> * [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>> * [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
>> * [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
>> * [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
>> * [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
>> * [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
>> * [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>> * [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
>> * [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
>> * [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
>> * [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
>> * [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>>
>> and found the following issue:
>> WARNING in folio_memcg
>>
>> Full report is available here:
>> https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650
>>
>> ***
>>
>> WARNING in folio_memcg
>>
>> tree:      mm-new
>> URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
>> base:      b227c04932039bccc21a0a89cd6df50fa57e4716
>> arch:      amd64
>> compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
>> config:    https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
>> C repro:   https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
>> syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro
>>
>> exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
>> exFAT-fs (loop0): failed to load alloc-bitmap
>> exFAT-fs (loop0): failed to recognize exfat type
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
>> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
> 
> This is understandable as the code snippet was added fairly recently
> and is easy to miss during rebasing.

My mistake, I should recheck it.

> 
> #syz test
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a341814468b9..738d914e5354 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -896,11 +896,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>   	 * to the active LRU list in the case.
>   	 */
>   	if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> +		rcu_read_lock();
>   		if (!mem_cgroup_zswap_writeback_enabled(
>   					folio_memcg(page_folio(page)))) {
> +			rcu_read_unlock();
>   			comp_ret = comp_ret ? comp_ret : -EINVAL;
>   			goto unlock;
>   		}
> +		rcu_read_unlock();
>   		comp_ret = 0;
>   		dlen = PAGE_SIZE;
>   		dst = kmap_local_page(page);

LGTM, will do in the next version.

Thanks!

>
Re: Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by syzbot ci 3 months, 1 week ago
Unknown command