[PATCH v3 00/30] Eliminate Dying Memory Cgroup
There is a newer version of this series.
Posted by Qi Zheng 3 weeks, 3 days ago
From: Qi Zheng <zhengqi.arch@bytedance.com>

Changes in v3:
 - modify the commit message in [PATCH v2 04/28], [PATCH v2 06/28],
   [PATCH v2 13/28], [PATCH v2 24/28] and [PATCH v2 27/28]
   (suggested by David Hildenbrand, Chen Ridong and Johannes Weiner)
 - change code style in [PATCH v3 8/30], [PATCH v3 15/30] and [PATCH v3 27/30]
   (suggested by Johannes Weiner and Shakeel Butt)
 - use get_mem_cgroup_from_folio() + mem_cgroup_put() to replace holding the
   rcu lock in [PATCH v3 14/30] and [PATCH v3 19/30]
   (pointed out by Johannes Weiner)
 - add a comment to folio_split_queue_lock() in [PATCH v3 17/30]
   (suggested by Shakeel Butt)
 - modify the comment above folio_lruvec() in [PATCH v3 24/30]
   (suggested by Johannes Weiner)
 - fix rcu lock issue in lru_note_cost_refault()
   (pointed out by Shakeel Butt)
 - add [PATCH v3 28/30] to fix non-hierarchical memcg1_stats issues
   (pointed out by Yosry Ahmed)
 - fix lru_zone_size issue in [PATCH v2 24/28] and [PATCH v2 25/28]
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20260114

Changes in v2:
 - add [PATCH v2 04/28] and remove local_irq_disable() in evict_folios()
   (pointed out by Harry Yoo)
 - recheck objcg in [PATCH v2 07/28] (pointed out by Harry Yoo)
 - modify the commit message in [PATCH v2 12/28] and [PATCH v2 21/28]
   (pointed out by Harry Yoo)
 - use rcu lock to protect mm_state in [PATCH v2 14/28] (pointed out by Harry Yoo)
 - fix bad unlock balance warning in [PATCH v2 23/28]
 - change nr_pages type to long in [PATCH v2 25/28] (pointed out by Harry Yoo)
 - increase mm_state->seq during reparenting to make the mm walker work
   properly in [PATCH v2 25/28] (pointed out by Harry Yoo)
 - add [PATCH v2 18/28] to fix WARNING in folio_memcg() (pointed out by Harry Yoo)
 - collect Reviewed-bys
 - rebase onto the next-20251216

Changes in v1:
 - drop [PATCH RFC 02/28]
 - drop THP split queue related part, which has been merged as a separate
   patchset[2]
 - prevent memory cgroup release in folio_split_queue_lock{_irqsave}() in
   [PATCH v1 16/26]
 - separate the reparenting of traditional LRU folios into [PATCH v1 22/26]
 - adapt to the MGLRU scenarios in [PATCH v1 23/26]
 - refactor memcg_reparent_objcgs() in [PATCH v1 24/26]
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20251028

Hi all,

Introduction
============

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

Background
==========

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer used by users, but the memory (the metadata
associated with the memory cgroup) allocated to it remains. This can
result in memory leaks and inefficient memory reclamation, and has
persisted as a problem for several years. Any memory allocation that
outlives the lifespan (from the users' perspective) of a memory cgroup
can lead to a dying memory cgroup. We have already made significant
efforts to tackle this problem by introducing the object cgroup
infrastructure [2].

Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The remaining case is
LRU pages (anonymous pages and file pages), which are charged to the
memory cgroup at allocation time and continue to hold a reference to it
until they are reclaimed.

File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.

Fundamentals
============

A folio will no longer pin its corresponding memory cgroup. It is therefore
necessary to ensure that the memory cgroup, or the lruvec associated with
it, is not released while a user holds a pointer returned by folio_memcg()
or folio_lruvec(). Users who do not care about the stability of the binding
between the folio and its memory cgroup only need to hold the RCU read
lock, or acquire a reference to the memory cgroup, to prevent its release.
However, some users of folio_lruvec() (i.e., the lruvec lock) require a
stable binding between the folio and its memory cgroup. An approach is
needed to ensure the stability of the binding while the lruvec lock is
held, and to detect the case where the wrong lruvec lock is held due to a
race with memory cgroup reparenting. The following three steps achieve
these goals.

1. First, identify all users of both functions (folio_memcg() and
   folio_lruvec()) that do not care about binding stability, and apply
   appropriate measures (such as holding the RCU read lock or briefly
   taking a reference to the memory cgroup) to prevent the memory cgroup
   from being released.
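
   As a hypothetical illustration (not code from the series), a
   stability-agnostic user from step 1 could look like the sketch below.
   folio_memcg(), count_memcg_events(), get_mem_cgroup_from_folio() and
   mem_cgroup_put() are existing kernel APIs; the function itself and
   its name are made up:

   /* Sketch only: a folio_memcg() user that needs no binding stability. */
   static void folio_memcg_user_sketch(struct folio *folio)
   {
           struct mem_cgroup *memcg;

           /* Short, non-sleeping access: the RCU read lock is enough
            * to keep the memcg from being freed under us. */
           rcu_read_lock();
           memcg = folio_memcg(folio);
           if (memcg)
                   count_memcg_events(memcg, PGACTIVATE, 1);
           rcu_read_unlock();

           /* Access that may sleep: pin the memcg with a reference. */
           memcg = get_mem_cgroup_from_folio(folio);
           if (memcg) {
                   /* ... possibly sleeping work using memcg ... */
                   mem_cgroup_put(memcg);
           }
   }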

2. Second, the following refactoring of folio_lruvec_lock() shows how
   binding stability can be guaranteed from the perspective of a
   folio_lruvec() user:

   struct lruvec *folio_lruvec_lock(struct folio *folio)
   {
           struct lruvec *lruvec;

           rcu_read_lock();
   retry:
           lruvec = folio_lruvec(folio);
           spin_lock(&lruvec->lru_lock);
           if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                   /* Raced with reparenting: wrong lruvec lock, retry. */
                   spin_unlock(&lruvec->lru_lock);
                   goto retry;
           }
           /* The lru_lock now pins the binding; RCU is no longer needed. */
           rcu_read_unlock();

           return lruvec;
   }

   From the perspective of memory cgroup removal, the entire reparenting
   process (changing the binding between a folio and its memory cgroup and
   moving the LRU lists to the parent memory cgroup) must be carried out
   while holding both the lruvec lock of the memory cgroup being removed
   and the lruvec lock of its parent.
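
   The reparenting side can be sketched as follows. This is a
   hypothetical illustration of the locking order only, not the actual
   implementation; reparent_lru_sketch() is a made-up name, while
   parent_mem_cgroup(), mem_cgroup_lruvec() and spin_lock_nested() are
   existing kernel APIs:

   /* Sketch only: hold both lruvec locks while reparenting, so the
    * retry loop in folio_lruvec_lock() never observes a half-updated
    * binding between a folio and its memory cgroup. */
   static void reparent_lru_sketch(struct mem_cgroup *memcg, int nid)
   {
           struct mem_cgroup *parent = parent_mem_cgroup(memcg);
           struct lruvec *child_lv = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
           struct lruvec *parent_lv = mem_cgroup_lruvec(parent, NODE_DATA(nid));

           spin_lock_irq(&child_lv->lru_lock);
           spin_lock_nested(&parent_lv->lru_lock, SINGLE_DEPTH_NESTING);

           /* 1. repoint the folios' objcg binding to the parent, then
            * 2. splice the child's LRU lists onto the parent's lruvec. */

           spin_unlock(&parent_lv->lru_lock);
           spin_unlock_irq(&child_lv->lru_lock);
   }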

3. Finally, transfer the LRU pages to the object cgroup without holding a
   reference to the original memory cgroup.

Effect
======

With this series applied, the number of dying memory cgroups no longer
grows significantly when the following script is executed to reproduce
the issue.

```bash
#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
    mkdir /sys/fs/cgroup/memory/test$i
    echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

    # Append 'temp' file content to 'log'
    cat temp >> log

    echo $$ > /sys/fs/cgroup/memory/cgroup.procs

    # Potentially create a dying memory cgroup
    rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log
```
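
The dying-cgroup growth can be read off the num_cgroups column in the
/proc/cgroups output the script prints before and after the loop. As a
minimal sketch, the column can be extracted like this (the snapshot line
below uses made-up values for illustration):

```shell
# Print the memory controller's num_cgroups from a /proc/cgroups
# snapshot; on a live system, pipe /proc/cgroups in instead.
awk '$1 == "memory" { print $3 }' <<'EOF'
#subsys_name	hierarchy	num_cgroups	enabled
memory	0	2001	1
EOF
```

If the dying-cgroup issue is present, this count stays high long after
the rmdir calls, because each test$i cgroup lingers until its file pages
are reclaimed.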

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2]. https://lwn.net/Articles/895431/
[3]. https://github.com/systemd/systemd/pull/36827

Muchun Song (22):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in lru_gen_clear_refs()
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

Qi Zheng (8):
  mm: vmscan: prepare for the refactoring the move_folios_to_lru()
  mm: thp: prevent memory cgroup release in
    folio_split_queue_lock{_irqsave}()
  mm: zswap: prevent memory cgroup release in zswap_compress()
  mm: do not open-code lruvec lock
  mm: vmscan: prepare for reparenting traditional LRU folios
  mm: vmscan: prepare for reparenting MGLRU folios
  mm: memcontrol: refactor memcg_reparent_objcgs()
  mm: memcontrol: prepare for reparenting state_local

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 187 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/linux/mmzone.h           |  20 ++
 include/linux/swap.h             |  20 ++
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 ++-
 mm/huge_memory.c                 |  22 +-
 mm/memcontrol-v1.c               |  31 +-
 mm/memcontrol-v1.h               |   3 +
 mm/memcontrol.c                  | 472 ++++++++++++++++++++-----------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  61 +++-
 mm/vmscan.c                      | 280 +++++++++++++-----
 mm/workingset.c                  |  25 +-
 mm/zswap.c                       |   5 +
 21 files changed, 835 insertions(+), 389 deletions(-)

-- 
2.20.1
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Andrew Morton 3 weeks, 3 days ago
On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:

> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. 

Thanks.  I'll add this to mm.git for testing.  A patchset of this
magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
so let's see.

I'll suppress the usual added-to-mm email spray.
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Lorenzo Stoakes 3 weeks, 2 days ago
On Wed, Jan 14, 2026 at 09:58:39AM -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
>
> > This patchset is intended to transfer the LRU pages to the object cgroup
> > without holding a reference to the original memory cgroup in order to
> > address the issue of the dying memory cgroup.
>
> Thanks.  I'll add this to mm.git for testing.  A patchset of this
> magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> so let's see.
>
> I'll suppress the usual added-to-mm email spray.

Since this is so large and we are late on in the cycle, can I in this case
explicitly ask for at least 1 sub-M tag on each commit before
queueing for Linus please?

We are seeing kernel bot reports here so let's obviously stabilise this for
a while also.

Thanks, Lorenzo
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Andrew Morton 3 weeks, 1 day ago
On Thu, 15 Jan 2026 12:40:12 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Wed, Jan 14, 2026 at 09:58:39AM -0800, Andrew Morton wrote:
> > On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> >
> > > This patchset is intended to transfer the LRU pages to the object cgroup
> > > without holding a reference to the original memory cgroup in order to
> > > address the issue of the dying memory cgroup.
> >
> > Thanks.  I'll add this to mm.git for testing.  A patchset of this
> > magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> > so let's see.
> >
> > I'll suppress the usual added-to-mm email spray.
> 
> Since this is so large and we are late on in the cycle, can I in this case
> explicitly ask for at least 1 sub-M tag on each commit before
> queueing for Linus please?

Well, kinda.

fs/buffer.c
fs/fs-writeback.c
include/linux/memcontrol.h
include/linux/mm_inline.h
include/linux/mmzone.h
include/linux/swap.h
include/trace/events/writeback.h
mm/compaction.c
mm/huge_memory.c
mm/memcontrol.c
mm/memcontrol-v1.c
mm/memcontrol-v1.h
mm/migrate.c
mm/mlock.c
mm/page_io.c
mm/percpu.c
mm/shrinker.c
mm/swap.c
mm/vmscan.c
mm/workingset.c
mm/zswap.c

That's a lot of reviewers to round up!  And there are far worse cases -
MM patchsets are often splattered elsewhere.  We can't have MM
patchsets getting stalled because some video driver developer is on
leave or got laid off.  Not suggesting that you were really suggesting
that!

As this is officially a memcg patch, I'd be looking to memcg
maintainers for guidance while viewing acks from others as
nice-to-have, rather than must-have.

> We are seeing kernel bot reports here so let's obviously stabilise this for
> a while also.

Yeah, I'm not feeling optimistic about getting all this into the next
merge window.  But just one day in mm-new led to David's secret ci-bot
discovering a missed rcu_unlock due to a cross-tree integration thing.

I'll keep the series around for at least a few days, see how things
progress.
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Lorenzo Stoakes 3 weeks, 1 day ago
On Thu, Jan 15, 2026 at 04:43:06PM -0800, Andrew Morton wrote:
> On Thu, 15 Jan 2026 12:40:12 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Wed, Jan 14, 2026 at 09:58:39AM -0800, Andrew Morton wrote:
> > > On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> > >
> > > > This patchset is intended to transfer the LRU pages to the object cgroup
> > > > without holding a reference to the original memory cgroup in order to
> > > > address the issue of the dying memory cgroup.
> > >
> > > Thanks.  I'll add this to mm.git for testing.  A patchset of this
> > > magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> > > so let's see.
> > >
> > > I'll suppress the usual added-to-mm email spray.
> >
> > Since this is so large and we are late on in the cycle, can I in this case
> > explicitly ask for at least 1 sub-M tag on each commit before
> > queueing for Linus please?
>
> Well, kinda.
>
> fs/buffer.c
> fs/fs-writeback.c
> include/linux/memcontrol.h
> include/linux/mm_inline.h
> include/linux/mmzone.h
> include/linux/swap.h
> include/trace/events/writeback.h
> mm/compaction.c
> mm/huge_memory.c
> mm/memcontrol.c
> mm/memcontrol-v1.c
> mm/memcontrol-v1.h
> mm/migrate.c
> mm/mlock.c
> mm/page_io.c
> mm/percpu.c
> mm/shrinker.c
> mm/swap.c
> mm/vmscan.c
> mm/workingset.c
> mm/zswap.c
>
> That's a lot of reviewers to round up!  And there are far worse cases -
> MM patchsets are often splattered elsewhere.  We can't have MM
> patchsets getting stalled because some video driver developer is on
> leave or got laid off.  Not suggesting that you were really suggesting
> that!

Yeah, obviously judgment needs to be applied in these situations - an 'M'
implies the community trusts that person to make sensible decisions. Since
this is really about the cgroup behaviour, I'd say simply requiring at
least one M per patch from any of:

M:	Johannes Weiner <hannes@cmpxchg.org>
M:	Michal Hocko <mhocko@kernel.org>
M:	Roman Gushchin <roman.gushchin@linux.dev>
M:	Shakeel Butt <shakeel.butt@linux.dev>

suffices.

I am obviously not suggesting that we require sign-off from _all_ sub-M's
for _all_ affected files, and in some cases the boundaries may be blurry.

For the most part I think it's usually _fairly_ obvious which part of
MAINTAINERS applies, and in cases where it doesn't obviously people can be
pinged for opinions.

>
> As this is officially a memcg patch, I'd be looking to memcg
> maintainers for guidance while viewing acks from others as
> nice-to-have, rather than must-have.

Yeah agreed.

>
> > We are seeing kernel bot reports here so let's obviously stabilise this for
> > a while also.
>
> Yeah, I'm not feeling optimistic about getting all this into the next
> merge window.  But just one day in mm-new led to David's secret ci-bot
> discovering a missed rcu_unlock due to a cross-tree integration thing.

Yeah and that's not a big deal, things can wait a little while esp. the
bigger changes!

Stabilising it is more important :)

>
> I'll keep the series around for at least a few days, see how things
> progress.
>

Sounds sensible!

Cheers, Lorenzo
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Michal Hocko 3 weeks, 1 day ago
On Fri 16-01-26 08:33:44, Lorenzo Stoakes wrote:
> On Thu, Jan 15, 2026 at 04:43:06PM -0800, Andrew Morton wrote:
> > On Thu, 15 Jan 2026 12:40:12 +0000 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > On Wed, Jan 14, 2026 at 09:58:39AM -0800, Andrew Morton wrote:
> > > > On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> > > >
> > > > > This patchset is intended to transfer the LRU pages to the object cgroup
> > > > > without holding a reference to the original memory cgroup in order to
> > > > > address the issue of the dying memory cgroup.
> > > >
> > > > Thanks.  I'll add this to mm.git for testing.  A patchset of this
> > > > magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> > > > so let's see.
> > > >
> > > > I'll suppress the usual added-to-mm email spray.
> > >
> > > Since this is so large and we are late on in the cycle, can I in this case
> > > explicitly ask for at least 1 sub-M tag on each commit before
> > > queueing for Linus please?
> >
> > Well, kinda.
> >
> > fs/buffer.c
> > fs/fs-writeback.c
> > include/linux/memcontrol.h
> > include/linux/mm_inline.h
> > include/linux/mmzone.h
> > include/linux/swap.h
> > include/trace/events/writeback.h
> > mm/compaction.c
> > mm/huge_memory.c
> > mm/memcontrol.c
> > mm/memcontrol-v1.c
> > mm/memcontrol-v1.h
> > mm/migrate.c
> > mm/mlock.c
> > mm/page_io.c
> > mm/percpu.c
> > mm/shrinker.c
> > mm/swap.c
> > mm/vmscan.c
> > mm/workingset.c
> > mm/zswap.c
> >
> > That's a lot of reviewers to round up!  And there are far worse cases -
> > MM patchsets are often splattered elsewhere.  We can't have MM
> > patchsets getting stalled because some video driver developer is on
> > leave or got laid off.  Not suggesting that you were really suggesting
> > that!
> 
> Yeah, obviously judgment needs to be applied in these situations - an 'M'
> implies community trusts sensible decisions, so since this is really about
> the cgroup behaviour, I'd say simply requiring at least 1 M per-patch from
> any of:
> 
> M:	Johannes Weiner <hannes@cmpxchg.org>
> M:	Michal Hocko <mhocko@kernel.org>
> M:	Roman Gushchin <roman.gushchin@linux.dev>
> M:	Shakeel Butt <shakeel.butt@linux.dev>
> 
> Suffices.

I have seen a good deal of review feedback from Johannes, Roman and
Shakeel (thx!). I have it on my todo list as well, but the series is
really large and it is not that easy to find time to do a proper
review. Anyway, unlike before xmas, when there was barely any review
and I asked to slow down, I feel much more confident just by seeing acks
from other memcg maintainers.

That being said, if I fail to find proper time to review myself, I am
fully confident relying on the other memcg maintainers here. So this
should not be blocked waiting for me.

Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 weeks, 2 days ago

On 1/15/26 1:58 AM, Andrew Morton wrote:
> On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> 
>> This patchset is intended to transfer the LRU pages to the object cgroup
>> without holding a reference to the original memory cgroup in order to
>> address the issue of the dying memory cgroup.
> 
> Thanks.  I'll add this to mm.git for testing.  A patchset of this
> magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> so let's see.
> 
> I'll suppress the usual added-to-mm email spray.

Hi Andrew,

The issue reported by syzbot needs to be addressed. If you want to test
this patchset, would you like me to provide a fix patch, or would you
prefer me to update to v4?

Thanks,
Qi
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Andrew Morton 3 weeks, 2 days ago
On Thu, 15 Jan 2026 11:52:04 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:

> 
> 
> On 1/15/26 1:58 AM, Andrew Morton wrote:
> > On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> > 
> >> This patchset is intended to transfer the LRU pages to the object cgroup
> >> without holding a reference to the original memory cgroup in order to
> >> address the issue of the dying memory cgroup.
> > 
> > Thanks.  I'll add this to mm.git for testing.  A patchset of this
> > magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
> > so let's see.
> > 
> > I'll suppress the usual added-to-mm email spray.
> 
> Hi Andrew,
> 
> The issue reported by syzbot needs to be addressed. If you want to test
> this patchset, would you like me to provide a fix patch, or would you
> prefer me to update to v4?

A fix would be preferred if that's reasonable - it's a lot of patches
to be resending!
Re: [PATCH v3 00/30] Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 weeks, 2 days ago

On 1/15/26 1:59 PM, Andrew Morton wrote:
> On Thu, 15 Jan 2026 11:52:04 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
> 
>>
>>
>> On 1/15/26 1:58 AM, Andrew Morton wrote:
>>> On Wed, 14 Jan 2026 19:26:43 +0800 Qi Zheng <qi.zheng@linux.dev> wrote:
>>>
>>>> This patchset is intended to transfer the LRU pages to the object cgroup
>>>> without holding a reference to the original memory cgroup in order to
>>>> address the issue of the dying memory cgroup.
>>>
>>> Thanks.  I'll add this to mm.git for testing.  A patchset of this
>>> magnitude at -rc5 is a little ambitious, but Linus is giving us an rc8
>>> so let's see.
>>>
>>> I'll suppress the usual added-to-mm email spray.
>>
>> Hi Andrew,
>>
>> The issue reported by syzbot needs to be addressed. If you want to test
>> this patchset, would you like me to provide a fix patch, or would you
>> prefer me to update to v4?
> 
> A fix would be preferred if that's reasonable - it's a lot of patches
> be resending!

OK, I'll send the fix ASAP.
[syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by syzbot ci 3 weeks, 3 days ago
syzbot ci has tested the following series

[v3] Eliminate Dying Memory Cgroup
https://lore.kernel.org/all/cover.1768389889.git.zhengqi.arch@bytedance.com
* [PATCH v3 01/30] mm: memcontrol: remove dead code of checking parent memory cgroup
* [PATCH v3 02/30] mm: workingset: use folio_lruvec() in workingset_refault()
* [PATCH v3 03/30] mm: rename unlock_page_lruvec_irq and its variants
* [PATCH v3 04/30] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
* [PATCH v3 05/30] mm: vmscan: refactor move_folios_to_lru()
* [PATCH v3 06/30] mm: memcontrol: allocate object cgroup for non-kmem case
* [PATCH v3 07/30] mm: memcontrol: return root object cgroup for root memory cgroup
* [PATCH v3 08/30] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
* [PATCH v3 09/30] buffer: prevent memory cgroup release in folio_alloc_buffers()
* [PATCH v3 10/30] writeback: prevent memory cgroup release in writeback module
* [PATCH v3 11/30] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
* [PATCH v3 12/30] mm: page_io: prevent memory cgroup release in page_io module
* [PATCH v3 13/30] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
* [PATCH v3 14/30] mm: mglru: prevent memory cgroup release in mglru
* [PATCH v3 15/30] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
* [PATCH v3 16/30] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
* [PATCH v3 17/30] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
* [PATCH v3 18/30] mm: zswap: prevent memory cgroup release in zswap_compress()
* [PATCH v3 19/30] mm: workingset: prevent lruvec release in workingset_refault()
* [PATCH v3 20/30] mm: zswap: prevent lruvec release in zswap_folio_swapin()
* [PATCH v3 21/30] mm: swap: prevent lruvec release in lru_gen_clear_refs()
* [PATCH v3 22/30] mm: workingset: prevent lruvec release in workingset_activation()
* [PATCH v3 23/30] mm: do not open-code lruvec lock
* [PATCH v3 24/30] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
* [PATCH v3 25/30] mm: vmscan: prepare for reparenting traditional LRU folios
* [PATCH v3 26/30] mm: vmscan: prepare for reparenting MGLRU folios
* [PATCH v3 27/30] mm: memcontrol: refactor memcg_reparent_objcgs()
* [PATCH v3 28/30] mm: memcontrol: prepare for reparenting state_local
* [PATCH v3 29/30] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
* [PATCH v3 30/30] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

and found the following issue:
UBSAN: array-index-out-of-bounds in reparent_memcg_lruvec_state_local

Full report is available here:
https://ci.syzbot.org/series/45c0b58d-255a-4579-9880-497bdbd4fb99

***

UBSAN: array-index-out-of-bounds in reparent_memcg_lruvec_state_local

tree:      linux-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base:      b775e489bec70895b7ef6b66927886bbac79598f
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/4d8819ab-0f94-42e8-bd70-87c7e83c37d2/config
syz repro: https://ci.syzbot.org/findings/7850f5dd-4ac7-4b74-85ff-a75ddddebbee/syz_repro

------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in mm/memcontrol.c:530:3
index 33 is out of range for type 'long[33]'
CPU: 1 UID: 0 PID: 31 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: cgroup_offline css_killed_work_fn
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
 __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
 reparent_memcg_lruvec_state_local+0x34f/0x460 mm/memcontrol.c:530
 reparent_memcg1_lruvec_state_local+0xa7/0xc0 mm/memcontrol-v1.c:1917
 reparent_state_local mm/memcontrol.c:242 [inline]
 memcg_reparent_objcgs mm/memcontrol.c:299 [inline]
 mem_cgroup_css_offline+0xc7c/0xc90 mm/memcontrol.c:4054
 offline_css kernel/cgroup/cgroup.c:5760 [inline]
 css_killed_work_fn+0x12f/0x570 kernel/cgroup/cgroup.c:6055
 process_one_work+0x949/0x15a0 kernel/workqueue.c:3279
 process_scheduled_works kernel/workqueue.c:3362 [inline]
 worker_thread+0x9af/0xee0 kernel/workqueue.c:3443
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
 </TASK>
---[ end trace ]---
Kernel panic - not syncing: UBSAN: panic_on_warn set ...
CPU: 1 UID: 0 PID: 31 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: cgroup_offline css_killed_work_fn
Call Trace:
 <TASK>
 vpanic+0x1e0/0x670 kernel/panic.c:490
 panic+0xc5/0xd0 kernel/panic.c:627
 check_panic_on_warn+0x89/0xb0 kernel/panic.c:377
 __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
 reparent_memcg_lruvec_state_local+0x34f/0x460 mm/memcontrol.c:530
 reparent_memcg1_lruvec_state_local+0xa7/0xc0 mm/memcontrol-v1.c:1917
 reparent_state_local mm/memcontrol.c:242 [inline]
 memcg_reparent_objcgs mm/memcontrol.c:299 [inline]
 mem_cgroup_css_offline+0xc7c/0xc90 mm/memcontrol.c:4054
 offline_css kernel/cgroup/cgroup.c:5760 [inline]
 css_killed_work_fn+0x12f/0x570 kernel/cgroup/cgroup.c:6055
 process_one_work+0x949/0x15a0 kernel/workqueue.c:3279
 process_scheduled_works kernel/workqueue.c:3362 [inline]
 worker_thread+0x9af/0xee0 kernel/workqueue.c:3443
 kthread+0x388/0x470 kernel/kthread.c:467
 ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
 </TASK>
Kernel Offset: disabled
Rebooting in 86400 seconds..


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
Posted by Qi Zheng 3 weeks, 2 days ago

On 1/15/26 1:07 AM, syzbot ci wrote:
> syzbot ci has tested the following series
> 
> [v3] Eliminate Dying Memory Cgroup
> https://lore.kernel.org/all/cover.1768389889.git.zhengqi.arch@bytedance.com
> * [PATCH v3 01/30] mm: memcontrol: remove dead code of checking parent memory cgroup
> * [PATCH v3 02/30] mm: workingset: use folio_lruvec() in workingset_refault()
> * [PATCH v3 03/30] mm: rename unlock_page_lruvec_irq and its variants
> * [PATCH v3 04/30] mm: vmscan: prepare for the refactoring the move_folios_to_lru()
> * [PATCH v3 05/30] mm: vmscan: refactor move_folios_to_lru()
> * [PATCH v3 06/30] mm: memcontrol: allocate object cgroup for non-kmem case
> * [PATCH v3 07/30] mm: memcontrol: return root object cgroup for root memory cgroup
> * [PATCH v3 08/30] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
> * [PATCH v3 09/30] buffer: prevent memory cgroup release in folio_alloc_buffers()
> * [PATCH v3 10/30] writeback: prevent memory cgroup release in writeback module
> * [PATCH v3 11/30] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
> * [PATCH v3 12/30] mm: page_io: prevent memory cgroup release in page_io module
> * [PATCH v3 13/30] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
> * [PATCH v3 14/30] mm: mglru: prevent memory cgroup release in mglru
> * [PATCH v3 15/30] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
> * [PATCH v3 16/30] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
> * [PATCH v3 17/30] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
> * [PATCH v3 18/30] mm: zswap: prevent memory cgroup release in zswap_compress()
> * [PATCH v3 19/30] mm: workingset: prevent lruvec release in workingset_refault()
> * [PATCH v3 20/30] mm: zswap: prevent lruvec release in zswap_folio_swapin()
> * [PATCH v3 21/30] mm: swap: prevent lruvec release in lru_gen_clear_refs()
> * [PATCH v3 22/30] mm: workingset: prevent lruvec release in workingset_activation()
> * [PATCH v3 23/30] mm: do not open-code lruvec lock
> * [PATCH v3 24/30] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> * [PATCH v3 25/30] mm: vmscan: prepare for reparenting traditional LRU folios
> * [PATCH v3 26/30] mm: vmscan: prepare for reparenting MGLRU folios
> * [PATCH v3 27/30] mm: memcontrol: refactor memcg_reparent_objcgs()
> * [PATCH v3 28/30] mm: memcontrol: prepare for reparenting state_local
> * [PATCH v3 29/30] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
> * [PATCH v3 30/30] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
> 
> and found the following issue:
> UBSAN: array-index-out-of-bounds in reparent_memcg_lruvec_state_local
> 
> Full report is available here:
> https://ci.syzbot.org/series/45c0b58d-255a-4579-9880-497bdbd4fb99
> 
> ***
> 
> UBSAN: array-index-out-of-bounds in reparent_memcg_lruvec_state_local
> 
> tree:      linux-next
> URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
> base:      b775e489bec70895b7ef6b66927886bbac79598f
> arch:      amd64
> compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> config:    https://ci.syzbot.org/builds/4d8819ab-0f94-42e8-bd70-87c7e83c37d2/config
> syz repro: https://ci.syzbot.org/findings/7850f5dd-4ac7-4b74-85ff-a75ddddebbee/syz_repro
> 
> ------------[ cut here ]------------
> UBSAN: array-index-out-of-bounds in mm/memcontrol.c:530:3
> index 33 is out of range for type 'long[33]'

Oh, the size of lruvec_stats->state_local is NR_MEMCG_NODE_STAT_ITEMS,
but memcg1_stats contains MEMCG_SWAP, which lies outside that array
range.

It seems that only the following items need to be reparented:

1). NR_LRU_LISTS
2). NR_SLAB_RECLAIMABLE_B + NR_SLAB_UNRECLAIMABLE_B

But for 2), since slab pages have been reparented for a long time
already, that problem (if any) has always existed. So this patchset
will only handle 1).


> CPU: 1 UID: 0 PID: 31 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Workqueue: cgroup_offline css_killed_work_fn
> Call Trace:
>   <TASK>
>   dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
>   ubsan_epilogue+0xa/0x30 lib/ubsan.c:233
>   __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
>   reparent_memcg_lruvec_state_local+0x34f/0x460 mm/memcontrol.c:530
>   reparent_memcg1_lruvec_state_local+0xa7/0xc0 mm/memcontrol-v1.c:1917
>   reparent_state_local mm/memcontrol.c:242 [inline]
>   memcg_reparent_objcgs mm/memcontrol.c:299 [inline]
>   mem_cgroup_css_offline+0xc7c/0xc90 mm/memcontrol.c:4054
>   offline_css kernel/cgroup/cgroup.c:5760 [inline]
>   css_killed_work_fn+0x12f/0x570 kernel/cgroup/cgroup.c:6055
>   process_one_work+0x949/0x15a0 kernel/workqueue.c:3279
>   process_scheduled_works kernel/workqueue.c:3362 [inline]
>   worker_thread+0x9af/0xee0 kernel/workqueue.c:3443
>   kthread+0x388/0x470 kernel/kthread.c:467
>   ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
>   </TASK>
> ---[ end trace ]---
> Kernel panic - not syncing: UBSAN: panic_on_warn set ...
> CPU: 1 UID: 0 PID: 31 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT(full)
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> Workqueue: cgroup_offline css_killed_work_fn
> Call Trace:
>   <TASK>
>   vpanic+0x1e0/0x670 kernel/panic.c:490
>   panic+0xc5/0xd0 kernel/panic.c:627
>   check_panic_on_warn+0x89/0xb0 kernel/panic.c:377
>   __ubsan_handle_out_of_bounds+0xe8/0xf0 lib/ubsan.c:455
>   reparent_memcg_lruvec_state_local+0x34f/0x460 mm/memcontrol.c:530
>   reparent_memcg1_lruvec_state_local+0xa7/0xc0 mm/memcontrol-v1.c:1917
>   reparent_state_local mm/memcontrol.c:242 [inline]
>   memcg_reparent_objcgs mm/memcontrol.c:299 [inline]
>   mem_cgroup_css_offline+0xc7c/0xc90 mm/memcontrol.c:4054
>   offline_css kernel/cgroup/cgroup.c:5760 [inline]
>   css_killed_work_fn+0x12f/0x570 kernel/cgroup/cgroup.c:6055
>   process_one_work+0x949/0x15a0 kernel/workqueue.c:3279
>   process_scheduled_works kernel/workqueue.c:3362 [inline]
>   worker_thread+0x9af/0xee0 kernel/workqueue.c:3443
>   kthread+0x388/0x470 kernel/kthread.c:467
>   ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
>   </TASK>
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> 
> ***
> 
> If these findings have caused you to resend the series or submit a
> separate fix, please add the following tag to your commit message:
>    Tested-by: syzbot@syzkaller.appspotmail.com
> 
> ---
> This report is generated by a bot. It may contain errors.
> syzbot ci engineers can be reached at syzkaller@googlegroups.com.