[v5] mm: switch THP shrinker to list_lru

[PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Johannes Weiner 1 week, 4 days ago

This is version 5 of switching the THP shrinker to list_lru.

Core of the new version is the list_lru/set_shrinker_bit fix up front,
which minimally affects later patches; and a rebase onto the latest
mm-unstable - replaced alloc_swap_folio() with __swap_cache_alloc().

The changes seemed small enough that *I chose to keep the review tags
from v4*. Please shout if you object to this!

Changes in v5:
- patch 1 is a new fix for a very old, pre-existing set_shrinker_bit()
  problem in list_lru, where the bit can be set on a dying child memcg
  instead of the ancestor that actually received the item. Pointed out
  by Usama Arif and Sashiko; fix it first to make it minimally
  backportable and so the conversion is safe.
- patches 6 and 9 adapt to that fix's new memcg-by-reference
  lock_list_lru_of_memcg() signature
- collapse_huge_page(): propagate folio_memcg_alloc_deferred() failure
  as SCAN_ALLOC_HUGE_PAGE_FAIL instead of leaking SCAN_SUCCEED and
  falsely reporting a successful MADV_COLLAPSE (Usama Arif, Sashiko)
- deferred_split_isolate(): fix a UAF by reading folio state before
  list_lru_isolate(); once removed, a racing folio_put() frees the
  folio via the lockless list_empty() check while we still touch its
  flags and stats (Sashiko)
- rebased to mm-unstable of 2026-05-27, which simplifies the flatten
  prep patch (now anon-only, as alloc_swap_folio() was folded into the
  new __swap_cache_alloc()) and moves the swap-side
  folio_memcg_alloc_deferred() hook into __swap_cache_alloc(). Kairui,
  I would appreciate an eyeball on that.

Changes in v4:
- guard folio_memcg_alloc_deferred() with mem_cgroup_disabled() to fix
  NULL deref in __memcg_list_lru_alloc() when booting with
  cgroup_disable=memory (e.g., kdump capture kernel) -- reported and
  tested by Mikhail Zaslonko on s390 and x86
- flatten if (folio) branches in alloc_swap_folio() and alloc_anon_folio()
  in a prep patch so the list_lru allocation additions are a clean minimal
  diff (Lorenzo)
- folio_memcg_alloc_deferred() moved out of alloc_charge_folio() into the
  anon-only collapse_huge_page() path; collapse_file() shares that helper
  but its pages don't go on the THP shrinker queue (David)
- guard folio_memcg_alloc_deferred() with order > 1; mTHPs below order-2
  can't be queued on the deferred split list (David)
- make deferred_split_lru static, hide behind folio_memcg_alloc_deferred()
  wrapper with GFP_KERNEL (Lorenzo)
- rename l -> lru throughout huge_memory.c (Lorenzo)
- kdoc for folio_memcg_list_lru_alloc() (Lorenzo)
- list_lru_lock_irq()/unlock_irq()/add_irq() irq-disabling variants;
  use list_lru_add_irq() in deferred_split_scan() (Lorenzo)
- reorder shrinker_free() before list_lru_destroy() (Lorenzo)

Changes in v3:
- dedicated lockdep_key for irqsafe deferred_split_lru.lock (syzbot)
- conditional list_lru ops in __folio_freeze_and_split_unmapped() (syzbot)
- annotate runs of inscrutable false, NULL, false function arguments (David)
- rename to folio_memcg_list_lru_alloc() (David)

Changes in v2:
- explicit rcu_read_lock() in __folio_freeze_and_split_unmapped() (Usama)
- split out list_lru prep bits (Dave)

The open-coded deferred split queue has issues. It's not NUMA-aware
(when cgroup is enabled), and it's more complicated in the callsites
interacting with it. Switching to list_lru fixes the NUMA problem and
streamlines things. It also simplifies planned shrinker work.

Patch 1 fixes a pre-existing list_lru bug where the shrinker bit is
set on the caller's memcg rather than the ancestor whose sublist the
item actually lands on after a walk-up. Standalone, backportable; the
rest of the series depends on it.

Patches 2-5 are cleanups and small refactors in list_lru code. They're
basically independent, but make the THP shrinker conversion easier.

Patch 6 extends the list_lru API to allow the caller to control the
locking scope. The THP shrinker has private state it needs to keep
synchronized with the LRU state.

Patch 7 extends the list_lru API with a convenience helper to do
list_lru head allocation (memcg_list_lru_alloc) when coming from a
folio. Anon THPs are instantiated in several places, and with the
folio reparenting patches pending, folio_memcg() access is now a more
delicate dance. This avoids having to replicate that dance everywhere.

Patch 8 flattens the alloc_anon_folio() retry loop so the next patch's
list_lru hook lands as a clean addition rather than nested deep inside
an if (folio) block.

Patch 9 finally switches the deferred_split_queue to list_lru.

Based on mm-unstable.

 include/linux/huge_mm.h    |   7 +-
 include/linux/list_lru.h   |  70 +++++++++
 include/linux/memcontrol.h |   4 -
 include/linux/mmzone.h     |  12 --
 mm/huge_memory.c           | 364 +++++++++++++++------------------------------
 mm/internal.h              |   2 +-
 mm/khugepaged.c            |   5 +
 mm/list_lru.c              | 238 +++++++++++++++++++----------
 mm/memcontrol.c            |  12 +-
 mm/memory.c                |  38 ++---
 mm/mm_init.c               |  15 --
 mm/swap_state.c            |  10 ++
 12 files changed, 399 insertions(+), 378 deletions(-)

The base moved substantially since v4 (the swap allocation rework in
particular reshuffled the alloc_swap_folio() landing spot), so the
patch-level diff between v4 and v5 is non-obvious from a tree diff
alone. For ease of review, here is the range-diff:

 -:  ------------ >  1:  f4f3933599b9 mm: list_lru: set shrinker bit on the memcg that owns the locked sublist
 1:  846dafe02e8b !  2:  e7b8f8bce2ec mm: list_lru: lock_list_lru_of_memcg() cannot return NULL if !skip_empty
    @@ mm/list_lru.c
     @@ mm/list_lru.c: bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
      	struct list_lru_one *l;
      
    - 	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    + 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!l)
     -		return false;
      	if (list_empty(item)) {
      		list_add_tail(item, &l->list);
    - 		/* Set shrinker bit if the first element was added */
    + 		/*
     @@ mm/list_lru.c: bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
      {
      	struct list_lru_node *nlru = &lru->node[nid];
      	struct list_lru_one *l;
     +
    - 	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    + 	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!l)
     -		return false;
      	if (!list_empty(item)) {
 2:  afe28e645aff !  3:  f1e34640dff9 mm: list_lru: deduplicate unlock_list_lru()
    @@ mm/list_lru.c: static inline bool lock_list_lru(struct list_lru_one *l, bool irq
      		return false;
      	}
      	return true;
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    - 	memcg = parent_mem_cgroup(memcg);
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    + 	*memcg = parent_mem_cgroup(*memcg);
      	goto again;
      }
     -
    @@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_
      #else
      static void list_lru_register(struct list_lru *lru)
      {
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      
      	return l;
      }
 3:  9e5499facfb1 !  4:  2612b71187ea mm: list_lru: move list dead check to lock_list_lru_of_memcg()
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      }
      
      static inline struct list_lru_one *
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      	rcu_read_lock();
      again:
    - 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
    + 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
     -	if (likely(l) && lock_list_lru(l, irq)) {
     -		rcu_read_unlock();
     -		return l;
 4:  855b908bfb82 !  5:  cc2819362f07 mm: list_lru: deduplicate lock_list_lru()
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     -}
     -
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    - 		       bool irq, bool skip_empty)
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    + 		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      {
      	struct list_lru_one *l = &lru->node[nid].lru;
      
 5:  b8a70f1016f3 !  6:  08c4561616df mm: list_lru: introduce caller locking for additions and deletions
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     + * list_lru_lock: lock the sublist for the given node and memcg
     + * @lru: the lru pointer
     + * @nid: the node id of the sublist to lock.
    -+ * @memcg: the cgroup of the sublist to lock.
    ++ * @memcg: pointer to the cgroup of the sublist to lock. On return,
    ++ *         updated to the cgroup whose sublist was actually locked,
    ++ *         which may be an ancestor if the original memcg was dying.
     + *
     + * Returns the locked list_lru_one sublist. The caller must call
     + * list_lru_unlock() when done.
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     + * Return: the locked list_lru_one, or NULL on failure
     + */
     +struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg);
    ++		struct mem_cgroup **memcg);
     +
     +/**
     + * list_lru_unlock: unlock a sublist locked by list_lru_lock()
    @@ include/linux/list_lru.h: int memcg_list_lru_alloc(struct mem_cgroup *memcg, str
     +void list_lru_unlock(struct list_lru_one *l);
     +
     +struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg);
    ++		struct mem_cgroup **memcg);
     +void list_lru_unlock_irq(struct list_lru_one *l);
     +
     +struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
    -+		struct mem_cgroup *memcg, unsigned long *irq_flags);
    ++		struct mem_cgroup **memcg, unsigned long *irq_flags);
     +void list_lru_unlock_irqrestore(struct list_lru_one *l,
     +		unsigned long *irq_flags);
     +
    @@ mm/list_lru.c
     @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    --		       bool irq, bool skip_empty)
    -+		       bool irq, unsigned long *irq_flags, bool skip_empty)
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    +-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    ++		       struct mem_cgroup **memcg, bool irq,
    ++		       unsigned long *irq_flags, bool skip_empty)
      {
      	struct list_lru_one *l;
      
    -@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    +@@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid,
      again:
    - 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
    + 	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(*memcg));
      	if (likely(l)) {
     -		lock_list_lru(l, irq);
     +		lock_list_lru(l, irq, irq_flags);
    @@ mm/list_lru.c: lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_
     @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
      
      static inline struct list_lru_one *
    - lock_list_lru_of_memcg(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
    --		       bool irq, bool skip_empty)
    -+		       bool irq, unsigned long *irq_flags, bool skip_empty)
    + lock_list_lru_of_memcg(struct list_lru *lru, int nid,
    +-		       struct mem_cgroup **memcg, bool irq, bool skip_empty)
    ++		       struct mem_cgroup **memcg, bool irq,
    ++		       unsigned long *irq_flags, bool skip_empty)
      {
      	struct list_lru_one *l = &lru->node[nid].lru;
      
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     -bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
     -		  struct mem_cgroup *memcg)
     +struct list_lru_one *list_lru_lock(struct list_lru *lru, int nid,
    -+				   struct mem_cgroup *memcg)
    ++				   struct mem_cgroup **memcg)
      {
     -	struct list_lru_node *nlru = &lru->node[nid];
     -	struct list_lru_one *l;
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +}
     +
     +struct list_lru_one *list_lru_lock_irq(struct list_lru *lru, int nid,
    -+				       struct mem_cgroup *memcg)
    ++				       struct mem_cgroup **memcg)
     +{
     +	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
     +				      /*irq_flags=*/NULL, /*skip_empty=*/false);
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	unlock_list_lru(l, /*irq_off=*/true, /*irq_flags=*/NULL);
     +}
      
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     +struct list_lru_one *list_lru_lock_irqsave(struct list_lru *lru, int nid,
    -+					   struct mem_cgroup *memcg,
    ++					   struct mem_cgroup **memcg,
     +					   unsigned long *flags)
     +{
     +	return lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/true,
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +{
      	if (list_empty(item)) {
      		list_add_tail(item, &l->list);
    - 		/* Set shrinker bit if the first element was added */
    + 		/*
    +@@ mm/list_lru.c: bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
    + 		 */
      		if (!l->nr_items++)
      			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
     -		unlock_list_lru(l, false);
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	struct list_lru_one *l;
     +	bool ret;
     +
    -+	l = list_lru_lock(lru, nid, memcg);
    ++	l = list_lru_lock(lru, nid, &memcg);
     +	ret = __list_lru_add(lru, l, item, nid, memcg);
     +	list_lru_unlock(l);
     +	return ret;
    @@ mm/list_lru.c: list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
     +	struct list_lru_one *l;
     +	bool ret;
     +
    -+	l = list_lru_lock_irq(lru, nid, memcg);
    ++	l = list_lru_lock_irq(lru, nid, &memcg);
     +	ret = __list_lru_add(lru, l, item, nid, memcg);
     +	list_lru_unlock_irq(l);
     +	return ret;
    @@ mm/list_lru.c: EXPORT_SYMBOL_GPL(list_lru_add_obj);
      	struct list_lru_one *l;
     +	bool ret;
      
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, false, false);
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, false, false);
     -	if (!list_empty(item)) {
     -		list_del_init(item);
     -		l->nr_items--;
    @@ mm/list_lru.c: EXPORT_SYMBOL_GPL(list_lru_add_obj);
     -	}
     -	unlock_list_lru(l, false);
     -	return false;
    -+	l = list_lru_lock(lru, nid, memcg);
    ++	l = list_lru_lock(lru, nid, &memcg);
     +	ret = __list_lru_del(lru, l, item, nid);
     +	list_lru_unlock(l);
     +	return ret;
    @@ mm/list_lru.c: __list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgr
      	unsigned long isolated = 0;
      
      restart:
    --	l = lock_list_lru_of_memcg(lru, nid, memcg, irq_off, true);
    -+	l = lock_list_lru_of_memcg(lru, nid, memcg, /*irq=*/irq_off,
    +-	l = lock_list_lru_of_memcg(lru, nid, &memcg, irq_off, true);
    ++	l = lock_list_lru_of_memcg(lru, nid, &memcg, /*irq=*/irq_off,
     +				   /*irq_flags=*/NULL, /*skip_empty=*/true);
      	if (!l)
      		return isolated;
 6:  0bf8cd5bc205 =  7:  9b1b9ab5e749 mm: list_lru: introduce folio_memcg_list_lru_alloc()
 7:  a26656c1c0a5 !  8:  fd4e1d364dc2 mm: memory: flatten folio allocation retry loops
    @@ Metadata
     Author: Johannes Weiner <hannes@cmpxchg.org>
     
      ## Commit message ##
    -    mm: memory: flatten folio allocation retry loops
    +    mm: memory: flatten alloc_anon_folio() retry loop
     
    -    alloc_swap_folio() and alloc_anon_folio() use a top-level if (folio)
    -    that buries the success path four levels deep. This makes for awkward
    -    long lines and wrapping. The next patch will add more code here, so
    -    flatten this now to keep things clean and simple.
    +    alloc_anon_folio() uses a top-level if (folio) that buries the success
    +    path four levels deep. This makes for awkward long lines and wrapping.
    +    The next patch will add more code here, so flatten this now to keep
    +    things clean and simple.
     
    -    alloc_anon_folio() already has a next label, use it for !folio. Add
    -    the equivalent to alloc_swap_folio().
    +    The next label is already there, use it for !folio.
     
         No functional change intended.
     
    @@ Commit message
         Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
     
      ## mm/memory.c ##
    -@@ mm/memory.c: static struct folio *alloc_swap_folio(struct vm_fault *vmf)
    - 	while (orders) {
    - 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
    - 		folio = vma_alloc_folio(gfp, order, vma, addr);
    --		if (folio) {
    --			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
    --							    gfp, entry))
    --				return folio;
    -+		if (!folio)
    -+			goto next;
    -+		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) {
    - 			count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
    - 			folio_put(folio);
    -+			goto next;
    - 		}
    -+		return folio;
    -+next:
    - 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
    - 		order = next_order(&orders, order);
    - 	}
     @@ mm/memory.c: static struct folio *alloc_anon_folio(struct vm_fault *vmf)
      	while (orders) {
      		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 8:  e454696ab1b7 !  9:  70fe768450de mm: switch deferred split shrinker to list_lru
    @@ mm/huge_memory.c: static int __folio_freeze_and_split_unmapped(struct folio *fol
     +	 */
     +	dequeue_deferred = folio_test_anon(folio) && old_order > 1;
     +	if (dequeue_deferred) {
    ++		struct mem_cgroup *memcg;
    ++
     +		rcu_read_lock();
    ++		memcg = folio_memcg(folio);
     +		lru = list_lru_lock(&deferred_split_lru,
    -+				    folio_nid(folio), folio_memcg(folio));
    ++				    folio_nid(folio), &memcg);
     +	}
      	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
      		struct swap_cluster_info *ci = NULL;
    @@ mm/huge_memory.c: int split_folio_to_list(struct folio *folio, struct list_head
      bool __folio_unqueue_deferred_split(struct folio *folio)
      {
     -	struct deferred_split *ds_queue;
    ++	struct mem_cgroup *memcg;
     +	struct list_lru_one *lru;
     +	int nid = folio_nid(folio);
      	unsigned long flags;
    @@ mm/huge_memory.c: int split_folio_to_list(struct folio *folio, struct list_head
     -	if (!list_empty(&folio->_deferred_list)) {
     -		ds_queue->split_queue_len--;
     +	rcu_read_lock();
    -+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, folio_memcg(folio), &flags);
    ++	memcg = folio_memcg(folio);
    ++	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
     +	if (__list_lru_del(&deferred_split_lru, lru, &folio->_deferred_list, nid)) {
      		if (folio_test_partially_mapped(folio)) {
      			folio_clear_partially_mapped(folio);
    @@ mm/huge_memory.c: void deferred_split_folio(struct folio *folio, bool partially_
     +
     +	rcu_read_lock();
     +	memcg = folio_memcg(folio);
    -+	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags);
    ++	lru = list_lru_lock_irqsave(&deferred_split_lru, nid, &memcg, &flags);
      	if (partially_mapped) {
      		if (!folio_test_partially_mapped(folio)) {
      			folio_set_partially_mapped(folio);
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
     +		return LRU_REMOVED;
     +	}
     +
    -+	/* We lost race with folio_put() */
    -+	list_lru_isolate(lru, item);
    ++	/*
    ++	 * We lost race with folio_put(). Read folio state before the
    ++	 * isolate: folio_unqueue_deferred_split() checks list_empty()
    ++	 * locklessly, so once removed the folio can be freed any time.
    ++	 */
     +	if (folio_test_partially_mapped(folio)) {
     +		folio_clear_partially_mapped(folio);
     +		mod_mthp_stat(folio_order(folio),
     +			      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
     +	}
    ++	list_lru_isolate(lru, item);
     +	return LRU_REMOVED;
     +}
     +
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
      	struct folio *folio, *next;
     -	int split = 0, i;
     -	struct folio_batch fbatch;
    +-
    +-	folio_batch_init(&fbatch);
     +	int split = 0;
     +	unsigned long isolated;
      
    --	folio_batch_init(&fbatch);
    -+	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
    -+					    deferred_split_isolate, &dispose);
    - 
     -retry:
     -	ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
     -	/* Take pin on all head pages to avoid freeing them under us */
    @@ mm/huge_memory.c: static bool thp_underused(struct folio *folio)
     -			break;
     -	}
     -	split_queue_unlock_irqrestore(ds_queue, flags);
    --
    ++	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
    ++					    deferred_split_isolate, &dispose);
    + 
     -	for (i = 0; i < folio_batch_count(&fbatch); i++) {
     +	list_for_each_entry_safe(folio, next, &dispose, _deferred_list) {
      		bool did_split = false;
    @@ mm/khugepaged.c: static enum scan_result collapse_huge_page(struct mm_struct *mm
      	if (result != SCAN_SUCCEED)
      		goto out_nolock;
      
    -+	if (folio_memcg_alloc_deferred(folio))
    ++	if (folio_memcg_alloc_deferred(folio)) {
    ++		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
     +		goto out_nolock;
    ++	}
     +
      	mmap_read_lock(mm);
      	result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
    @@ mm/memcontrol.c: static void mem_cgroup_css_offline(struct cgroup_subsys_state *
      	reparent_shrinker_deferred(memcg);
     
      ## mm/memory.c ##
    -@@ mm/memory.c: static struct folio *alloc_swap_folio(struct vm_fault *vmf)
    - 			folio_put(folio);
    - 			goto next;
    - 		}
    -+		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
    -+			folio_put(folio);
    -+			goto fallback;
    -+		}
    - 		return folio;
    - next:
    - 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
     @@ mm/memory.c: static struct folio *alloc_anon_folio(struct vm_fault *vmf)
      			folio_put(folio);
      			goto next;
    @@ mm/mm_init.c: static void __meminit pgdat_init_internals(struct pglist_data *pgd
      	pgdat_init_kcompactd(pgdat);
      
      	init_waitqueue_head(&pgdat->kswapd_wait);
    +
    + ## mm/swap_state.c ##
    +@@ mm/swap_state.c: static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
    + 		return ERR_PTR(-ENOMEM);
    + 	}
    + 
    ++	if (order > 1 && folio_memcg_alloc_deferred(folio)) {
    ++		spin_lock(&ci->lock);
    ++		__swap_cache_do_del_folio(ci, folio, entry, shadow);
    ++		spin_unlock(&ci->lock);
    ++		folio_unlock(folio);
    ++		/* nr_pages refs from swap cache, 1 from allocation */
    ++		folio_put_refs(folio, nr_pages + 1);
    ++		return ERR_PTR(-ENOMEM);
    ++	}
    ++
    + 	/* memsw uncharges swap when folio is added to swap cache */
    + 	memcg1_swapin(folio);
    + 	if (shadow)

Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Lance Yang 1 week ago

Hi Johannes,

On Wed, May 27, 2026 at 04:45:07PM -0400, Johannes Weiner wrote:
>This is version 5 of switching the THP shrinker to list_lru.
>
>Core of the new version is the list_lru/set_shrinker_bit fix up front,
>which minimally affects later patches; and a rebase onto the latest
>mm-unstable - replaced alloc_swap_folio() with __swap_cache_alloc().
>
>The changes seemed small enough that *I chose to keep the review tags
>from v4*. Please shout if you object to this!
>
>Changes in v5:
>- patch 1 is a new fix for a very old, pre-existing set_shrinker_bit()
>  problem in list_lru, where the bit can be set on a dying child memcg
>  instead of the ancestor that actually received the item. Pointed out
>  by Usama Arif and Sashiko; fix it first to make it minimally
>  backportable and so the conversion is safe.
>- patches 6 and 9 adapt to that fix's new memcg-by-reference
>  lock_list_lru_of_memcg() signature
>- collapse_huge_page(): propagate folio_memcg_alloc_deferred() failure
>  as SCAN_ALLOC_HUGE_PAGE_FAIL instead of leaking SCAN_SUCCEED and
>  falsely reporting a successful MADV_COLLAPSE (Usama Arif, Sashiko)
>- deferred_split_isolate(): fix a UAF by reading folio state before
>  list_lru_isolate(); once removed, a racing folio_put() frees the
>  folio via the lockless list_empty() check while we still touch its
>  flags and stats (Sashiko)
>- rebased to mm-unstable of 2026-05-27, which simplifies the flatten
>  prep patch (now anon-only, as alloc_swap_folio() was folded into the
>  new __swap_cache_alloc()) and moves the swap-side
>  folio_memcg_alloc_deferred() hook into __swap_cache_alloc(). Kairui,
>  I would appreciate an eyeball on that.
>
>Changes in v4:
>- guard folio_memcg_alloc_deferred() with mem_cgroup_disabled() to fix
>  NULL deref in __memcg_list_lru_alloc() when booting with
>  cgroup_disable=memory (e.g., kdump capture kernel) -- reported and
>  tested by Mikhail Zaslonko on s390 and x86
>- flatten if (folio) branches in alloc_swap_folio() and alloc_anon_folio()
>  in a prep patch so the list_lru allocation additions are a clean minimal
>  diff (Lorenzo)
>- folio_memcg_alloc_deferred() moved out of alloc_charge_folio() into the
>  anon-only collapse_huge_page() path; collapse_file() shares that helper
>  but its pages don't go on the THP shrinker queue (David)
>- guard folio_memcg_alloc_deferred() with order > 1; mTHPs below order-2
>  can't be queued on the deferred split list (David)
>- make deferred_split_lru static, hide behind folio_memcg_alloc_deferred()
>  wrapper with GFP_KERNEL (Lorenzo)
>- rename l -> lru throughout huge_memory.c (Lorenzo)
>- kdoc for folio_memcg_list_lru_alloc() (Lorenzo)
>- list_lru_lock_irq()/unlock_irq()/add_irq() irq-disabling variants;
>  use list_lru_add_irq() in deferred_split_scan() (Lorenzo)
>- reorder shrinker_free() before list_lru_destroy() (Lorenzo)
>
>Changes in v3:
>- dedicated lockdep_key for irqsafe deferred_split_lru.lock (syzbot)
>- conditional list_lru ops in __folio_freeze_and_split_unmapped() (syzbot)
>- annotate runs of inscrutable false, NULL, false function arguments (David)
>- rename to folio_memcg_list_lru_alloc() (David)
>
>Changes in v2:
>- explicit rcu_read_lock() in __folio_freeze_and_split_unmapped() (Usama)
>- split out list_lru prep bits (Dave)
>
>The open-coded deferred split queue has issues. It's not NUMA-aware
>(when cgroup is enabled), and it's more complicated in the callsites
>interacting with it. Switching to list_lru fixes the NUMA problem and
>streamlines things. It also simplifies planned shrinker work.
>
>Patch 1 fixes a pre-existing list_lru bug where the shrinker bit is
>set on the caller's memcg rather than the ancestor whose sublist the
>item actually lands on after a walk-up. Standalone, backportable; the
>rest of the series depends on it.
>
>Patches 2-5 are cleanups and small refactors in list_lru code. They're
>basically independent, but make the THP shrinker conversion easier.
>
>Patch 6 extends the list_lru API to allow the caller to control the
>locking scope. The THP shrinker has private state it needs to keep
>synchronized with the LRU state.
>
>Patch 7 extends the list_lru API with a convenience helper to do
>list_lru head allocation (memcg_list_lru_alloc) when coming from a
>folio. Anon THPs are instantiated in several places, and with the
>folio reparenting patches pending, folio_memcg() access is now a more
>delicate dance. This avoids having to replicate that dance everywhere.
>
>Patch 8 flattens the alloc_anon_folio() retry loop so the next patch's
>list_lru hook lands as a clean addition rather than nested deep inside
>an if (folio) block.
>
>Patch 9 finally switches the deferred_split_queue to list_lru.

As the changelog above says, the old queue is per-memcg only, rather
than per-memcg-per-node. So reclaim on one node can still walk the whole
memcg queue and split underused THPs from other nodes in the same memcg.

But I think the new one can lose reclaim in the cgroup.memory=nokmem
case ...

With nokmem, the deferred shrinker can still run from memcg reclaim,
because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:

__list_lru_init() clears memcg_aware,

	if (mem_cgroup_kmem_disabled())
		memcg_aware = false;

so list_lru_from_memcg_idx() falls back to the shared node list:

static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
{
	if (list_lru_memcg_aware(lru) && idx >= 0) {
[...]
	}
	return &lru->node[nid].lru;
}

That makes the shrinker bit unreliable. __list_lru_add() still sets the
bit on the memcg passed in, but only when the list goes from empty to
non-empty:

bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
		    struct list_head *item, int nid,
		    struct mem_cgroup *memcg)
{
	if (list_empty(item)) {
[...]
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
[...]
		return true;
	}
	return false;
}

If memcg A adds the first folio, A gets the bit. If memcg B later adds a
folio to the same shared list, B does not get a bit, because the list
was already non-empty.

So in the A-first/B-later case, reclaim from B may not call the deferred
shrinker at all. The shared list is scanned from memcg reclaim only if
reclaim runs from the memcg that has the bit, such as A here, or from
global reclaim :)

Anyway, only after the shared list is emptied does the next memcg to add
a folio get to be the one with the bit, IIUC :)

Hopefully I didn't miss somthing important ...

Cheers, Lance

Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Johannes Weiner 5 days, 19 hours ago

On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
> As the changelog above says, the old queue is per-memcg only, rather
> than per-memcg-per-node. So reclaim on one node can still walk the whole
> memcg queue and split underused THPs from other nodes in the same memcg.
> 
> But I think the new one can lose reclaim in the cgroup.memory=nokmem
> case ...
> 
> With nokmem, the deferred shrinker can still run from memcg reclaim,
> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
> 
> __list_lru_init() clears memcg_aware,
> 
> 	if (mem_cgroup_kmem_disabled())
> 		memcg_aware = false;
> 
> so list_lru_from_memcg_idx() falls back to the shared node list:
> 
> static inline struct list_lru_one *
> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> {
> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
> [...]
> 	}
> 	return &lru->node[nid].lru;
> }
> 
> That makes the shrinker bit unreliable. __list_lru_add() still sets the
> bit on the memcg passed in, but only when the list goes from empty to
> non-empty:
> 
> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
> 		    struct list_head *item, int nid,
> 		    struct mem_cgroup *memcg)
> {
> 	if (list_empty(item)) {
> [...]
> 		if (!l->nr_items++)
> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> [...]
> 		return true;
> 	}
> 	return false;
> }
> 
> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
> folio to the same shared list, B does not get a bit, because the list
> was already non-empty.
> 
> So in the A-first/B-later case, reclaim from B may not call the deferred
> shrinker at all. The shared list is scanned from memcg reclaim only if
> reclaim runs from the memcg that has the bit, such as A here, or from
> global reclaim :)
> 
> Anyway, only after the shared list is emptied does the next memcg to add
> a folio get to be the one with the bit, IIUC :)

Sorry for the delay, this took me a bit to think about. The shrinker
code is a mess.

I read it the same way you do. And this is true for all list_lru users
when nokmem is set: we just set random nonsense shrinker bits.

HOWEVER, the generic shrinker code fixes that up by IGNORING random
shrinker bits like this when !memcg_kmem_online(). And shrinking
correctly happens only against the shared root queue when the reclaim
iterator walks root_mem_cgroup.

HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
turn overrides the previous override. So yes there is a weirdness: we
get the root cgroup invocation against the shared queue, and then one
more time triggered by that random memcg bit.

The most direct fix is to just drop SHRINKER_NONSLAB. It declares
independence from kmem, which is no longer true.

Cleaning up the shrinker code is left for another day.

Andrew, if there are no objections, can you please fold this?

---

From 6787efabb9584824c196bf01c517d93aae3764c3 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 2 Jun 2026 17:11:46 -0400
Subject: [PATCH] mm: switch deferred split shrinker to list_lru fix

Lance Yang points out a weirdness in the list_lru code with
cgroup.memory=nokmem: in this mode, list_lru collapses to a shared
per-node list that holds the folios, but __list_lru_add() still sets
the shrinker bit on the owning memcg.

Usually this is fine, because the generic shrinker code ignores these
random bits when !memcg_kmem_online(). But the THP shrinker still has
the SHRINKER_NONSLAB flag set, which specifically declares an
independence from kmem. As a result, the shrinker fires twice per
reclaim cycle: one during the regular root cgroup scan, and then one
more time triggered from whichever memcg got the shrinker bit.

Drop the flag, since it's no longer true. The deferred_split shrinker
then behaves like every other list_lru-backed shrinker under nokmem,
including the non-kmem ones (zswap, workingset shadow_nodes): skipped
from memcg-internal reclaim, driven by global reclaim only.

This needs proper cleaning up on the shrinker and list_lru side, but
that's scope for a follow-up series. Just make it consistent now.

Reported-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/huge_memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72f6caf0fec6..aef495891f8c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -956,8 +956,7 @@ int folio_memcg_alloc_deferred(struct folio *folio)
 static int __init thp_shrinker_init(void)
 {
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
-						 SHRINKER_MEMCG_AWARE |
-						 SHRINKER_NONSLAB,
+						 SHRINKER_MEMCG_AWARE,
 						 "thp-deferred_split");
 	if (!deferred_split_shrinker)
 		return -ENOMEM;
-- 
2.54.0

Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Lance Yang 5 days, 12 hours ago

On Tue, Jun 02, 2026 at 05:46:02PM -0400, Johannes Weiner wrote:
>On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
>> As the changelog above says, the old queue is per-memcg only, rather
>> than per-memcg-per-node. So reclaim on one node can still walk the whole
>> memcg queue and split underused THPs from other nodes in the same memcg.
>> 
>> But I think the new one can lose reclaim in the cgroup.memory=nokmem
>> case ...
>> 
>> With nokmem, the deferred shrinker can still run from memcg reclaim,
>> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
>> 
>> __list_lru_init() clears memcg_aware,
>> 
>> 	if (mem_cgroup_kmem_disabled())
>> 		memcg_aware = false;
>> 
>> so list_lru_from_memcg_idx() falls back to the shared node list:
>> 
>> static inline struct list_lru_one *
>> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>> {
>> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
>> [...]
>> 	}
>> 	return &lru->node[nid].lru;
>> }
>> 
>> That makes the shrinker bit unreliable. __list_lru_add() still sets the
>> bit on the memcg passed in, but only when the list goes from empty to
>> non-empty:
>> 
>> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
>> 		    struct list_head *item, int nid,
>> 		    struct mem_cgroup *memcg)
>> {
>> 	if (list_empty(item)) {
>> [...]
>> 		if (!l->nr_items++)
>> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
>> [...]
>> 		return true;
>> 	}
>> 	return false;
>> }
>> 
>> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
>> folio to the same shared list, B does not get a bit, because the list
>> was already non-empty.
>> 
>> So in the A-first/B-later case, reclaim from B may not call the deferred
>> shrinker at all. The shared list is scanned from memcg reclaim only if
>> reclaim runs from the memcg that has the bit, such as A here, or from
>> global reclaim :)
>> 
>> Anyway, only after the shared list is emptied does the next memcg to add
>> a folio get to be the one with the bit, IIUC :)
>
>Sorry for the delay, this took me a bit to think about. The shrinker
>code is a mess.
>
>I read it the same way you do. And this is true for all list_lru users
>when nokmem is set: we just set random nonsense shrinker bits.
>
>HOWEVER, the generic shrinker code fixes that up by IGNORING random
>shrinker bits like this when !memcg_kmem_online(). And shrinking
>correctly happens only against the shared root queue when the reclaim
>iterator walks root_mem_cgroup.
>
>HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
>turn overrides the previous override. So yes there is a weirdness: we
>get the root cgroup invocation against the shared queue, and then one
>more time triggered by that random memcg bit.
>
>The most direct fix is to just drop SHRINKER_NONSLAB. It declares
>independence from kmem, which is no longer true.
>
>Cleaning up the shrinker code is left for another day.

Thanks for working on this!

Wondering if this fix trades one problem for another, though ...

Before this series, the deferred split shrinker had a real per-memcg
queue. Even with cgroup.memory=nokmem, memcg reclaim could still scan
that memcg's own deferred_split_queue:

memcg reclaim -> deferred split shrinker -> sc->memcg->deferred_split_queue

With the fix, nokmem + w/o SHRINKER_NONSLAB falls back to a
non-memcg-aware shrinker:

memcg reclaim -> skip deferred split shrinker

root/global reclaim -> deferred split shrinker -> shared list_lru

Is that expected? There woud be no memcg-driven deferred split reclaim
under nokmem, IIUC ...

Not sure what the right fix is, as I am not a memcg expert ...

Cheers, Lance

>Andrew, if there are no objections, can you please fold this?
>
>---
>
>>From 6787efabb9584824c196bf01c517d93aae3764c3 Mon Sep 17 00:00:00 2001
>From: Johannes Weiner <hannes@cmpxchg.org>
>Date: Tue, 2 Jun 2026 17:11:46 -0400
>Subject: [PATCH] mm: switch deferred split shrinker to list_lru fix
>
>Lance Yang points out a weirdness in the list_lru code with
>cgroup.memory=nokmem: in this mode, list_lru collapses to a shared
>per-node list that holds the folios, but __list_lru_add() still sets
>the shrinker bit on the owning memcg.
>
>Usually this is fine, because the generic shrinker code ignores these
>random bits when !memcg_kmem_online(). But the THP shrinker still has
>the SHRINKER_NONSLAB flag set, which specifically declares an
>independence from kmem. As a result, the shrinker fires twice per
>reclaim cycle: one during the regular root cgroup scan, and then one
>more time triggered from whichever memcg got the shrinker bit.
>
>Drop the flag, since it's no longer true. The deferred_split shrinker
>then behaves like every other list_lru-backed shrinker under nokmem,
>including the non-kmem ones (zswap, workingset shadow_nodes): skipped
>from memcg-internal reclaim, driven by global reclaim only.
>
>This needs proper cleaning up on the shrinker and list_lru side, but
>that's scope for a follow-up series. Just make it consistent now.
>
>Reported-by: Lance Yang <lance.yang@linux.dev>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>---
> mm/huge_memory.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 72f6caf0fec6..aef495891f8c 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -956,8 +956,7 @@ int folio_memcg_alloc_deferred(struct folio *folio)
> static int __init thp_shrinker_init(void)
> {
> 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
>-						 SHRINKER_MEMCG_AWARE |
>-						 SHRINKER_NONSLAB,
>+						 SHRINKER_MEMCG_AWARE,
> 						 "thp-deferred_split");
> 	if (!deferred_split_shrinker)
> 		return -ENOMEM;
>-- 
>2.54.0
>
>

Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Johannes Weiner 5 days, 5 hours ago

On Wed, Jun 03, 2026 at 12:44:26PM +0800, Lance Yang wrote:
> 
> On Tue, Jun 02, 2026 at 05:46:02PM -0400, Johannes Weiner wrote:
> >On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
> >> As the changelog above says, the old queue is per-memcg only, rather
> >> than per-memcg-per-node. So reclaim on one node can still walk the whole
> >> memcg queue and split underused THPs from other nodes in the same memcg.
> >> 
> >> But I think the new one can lose reclaim in the cgroup.memory=nokmem
> >> case ...
> >> 
> >> With nokmem, the deferred shrinker can still run from memcg reclaim,
> >> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
> >> 
> >> __list_lru_init() clears memcg_aware,
> >> 
> >> 	if (mem_cgroup_kmem_disabled())
> >> 		memcg_aware = false;
> >> 
> >> so list_lru_from_memcg_idx() falls back to the shared node list:
> >> 
> >> static inline struct list_lru_one *
> >> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
> >> {
> >> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
> >> [...]
> >> 	}
> >> 	return &lru->node[nid].lru;
> >> }
> >> 
> >> That makes the shrinker bit unreliable. __list_lru_add() still sets the
> >> bit on the memcg passed in, but only when the list goes from empty to
> >> non-empty:
> >> 
> >> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
> >> 		    struct list_head *item, int nid,
> >> 		    struct mem_cgroup *memcg)
> >> {
> >> 	if (list_empty(item)) {
> >> [...]
> >> 		if (!l->nr_items++)
> >> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
> >> [...]
> >> 		return true;
> >> 	}
> >> 	return false;
> >> }
> >> 
> >> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
> >> folio to the same shared list, B does not get a bit, because the list
> >> was already non-empty.
> >> 
> >> So in the A-first/B-later case, reclaim from B may not call the deferred
> >> shrinker at all. The shared list is scanned from memcg reclaim only if
> >> reclaim runs from the memcg that has the bit, such as A here, or from
> >> global reclaim :)
> >> 
> >> Anyway, only after the shared list is emptied does the next memcg to add
> >> a folio get to be the one with the bit, IIUC :)
> >
> >Sorry for the delay, this took me a bit to think about. The shrinker
> >code is a mess.
> >
> >I read it the same way you do. And this is true for all list_lru users
> >when nokmem is set: we just set random nonsense shrinker bits.
> >
> >HOWEVER, the generic shrinker code fixes that up by IGNORING random
> >shrinker bits like this when !memcg_kmem_online(). And shrinking
> >correctly happens only against the shared root queue when the reclaim
> >iterator walks root_mem_cgroup.
> >
> >HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
> >turn overrides the previous override. So yes there is a weirdness: we
> >get the root cgroup invocation against the shared queue, and then one
> >more time triggered by that random memcg bit.
> >
> >The most direct fix is to just drop SHRINKER_NONSLAB. It declares
> >independence from kmem, which is no longer true.
> >
> >Cleaning up the shrinker code is left for another day.
> 
> Thanks for working on this!
> 
> Wondering if this fix trades one problem for another, though ...
> 
> Before this series, the deferred split shrinker had a real per-memcg
> queue. Even with cgroup.memory=nokmem, memcg reclaim could still scan
> that memcg's own deferred_split_queue:
> 
> memcg reclaim -> deferred split shrinker -> sc->memcg->deferred_split_queue
> 
> With the fix, nokmem + w/o SHRINKER_NONSLAB falls back to a
> non-memcg-aware shrinker:
> 
> memcg reclaim -> skip deferred split shrinker
> 
> root/global reclaim -> deferred split shrinker -> shared list_lru
> 
> Is that expected? There woud be no memcg-driven deferred split reclaim
> under nokmem, IIUC ...

Yes, this is all correct. list_lru is still inherently tied to the
kmem component of memcg (memcg_kmem_id()).

So without kmem, no isolation. But without kmem, no isolation *for a
lot of stuff*. It's a legacy knob when slab accounting was new and
expensive. But so many things depend on it now, disabling it just
punches a nassive hole into memcg functionality and isolation
coverage. It's not a sanctioned production use flag.

This change is negligible from a memcg semantics POV.

> Not sure what the right fix is, as I am not a memcg expert ...

Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru

Posted by Lance Yang 5 days, 4 hours ago


On 2026/6/3 19:41, Johannes Weiner wrote:
> On Wed, Jun 03, 2026 at 12:44:26PM +0800, Lance Yang wrote:
>>
>> On Tue, Jun 02, 2026 at 05:46:02PM -0400, Johannes Weiner wrote:
>>> On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
>>>> As the changelog above says, the old queue is per-memcg only, rather
>>>> than per-memcg-per-node. So reclaim on one node can still walk the whole
>>>> memcg queue and split underused THPs from other nodes in the same memcg.
>>>>
>>>> But I think the new one can lose reclaim in the cgroup.memory=nokmem
>>>> case ...
>>>>
>>>> With nokmem, the deferred shrinker can still run from memcg reclaim,
>>>> because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
>>>>
>>>> __list_lru_init() clears memcg_aware,
>>>>
>>>> 	if (mem_cgroup_kmem_disabled())
>>>> 		memcg_aware = false;
>>>>
>>>> so list_lru_from_memcg_idx() falls back to the shared node list:
>>>>
>>>> static inline struct list_lru_one *
>>>> list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
>>>> {
>>>> 	if (list_lru_memcg_aware(lru) && idx >= 0) {
>>>> [...]
>>>> 	}
>>>> 	return &lru->node[nid].lru;
>>>> }
>>>>
>>>> That makes the shrinker bit unreliable. __list_lru_add() still sets the
>>>> bit on the memcg passed in, but only when the list goes from empty to
>>>> non-empty:
>>>>
>>>> bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
>>>> 		    struct list_head *item, int nid,
>>>> 		    struct mem_cgroup *memcg)
>>>> {
>>>> 	if (list_empty(item)) {
>>>> [...]
>>>> 		if (!l->nr_items++)
>>>> 			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
>>>> [...]
>>>> 		return true;
>>>> 	}
>>>> 	return false;
>>>> }
>>>>
>>>> If memcg A adds the first folio, A gets the bit. If memcg B later adds a
>>>> folio to the same shared list, B does not get a bit, because the list
>>>> was already non-empty.
>>>>
>>>> So in the A-first/B-later case, reclaim from B may not call the deferred
>>>> shrinker at all. The shared list is scanned from memcg reclaim only if
>>>> reclaim runs from the memcg that has the bit, such as A here, or from
>>>> global reclaim :)
>>>>
>>>> Anyway, only after the shared list is emptied does the next memcg to add
>>>> a folio get to be the one with the bit, IIUC :)
>>>
>>> Sorry for the delay, this took me a bit to think about. The shrinker
>>> code is a mess.
>>>
>>> I read it the same way you do. And this is true for all list_lru users
>>> when nokmem is set: we just set random nonsense shrinker bits.
>>>
>>> HOWEVER, the generic shrinker code fixes that up by IGNORING random
>>> shrinker bits like this when !memcg_kmem_online(). And shrinking
>>> correctly happens only against the shared root queue when the reclaim
>>> iterator walks root_mem_cgroup.
>>>
>>> HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
>>> turn overrides the previous override. So yes there is a weirdness: we
>>> get the root cgroup invocation against the shared queue, and then one
>>> more time triggered by that random memcg bit.
>>>
>>> The most direct fix is to just drop SHRINKER_NONSLAB. It declares
>>> independence from kmem, which is no longer true.
>>>
>>> Cleaning up the shrinker code is left for another day.
>>
>> Thanks for working on this!
>>
>> Wondering if this fix trades one problem for another, though ...
>>
>> Before this series, the deferred split shrinker had a real per-memcg
>> queue. Even with cgroup.memory=nokmem, memcg reclaim could still scan
>> that memcg's own deferred_split_queue:
>>
>> memcg reclaim -> deferred split shrinker -> sc->memcg->deferred_split_queue
>>
>> With the fix, nokmem + w/o SHRINKER_NONSLAB falls back to a
>> non-memcg-aware shrinker:
>>
>> memcg reclaim -> skip deferred split shrinker
>>
>> root/global reclaim -> deferred split shrinker -> shared list_lru
>>
>> Is that expected? There woud be no memcg-driven deferred split reclaim
>> under nokmem, IIUC ...
> 
> Yes, this is all correct. list_lru is still inherently tied to the
> kmem component of memcg (memcg_kmem_id()).
> 
> So without kmem, no isolation. But without kmem, no isolation *for a
> lot of stuff*. It's a legacy knob when slab accounting was new and
> expensive. But so many things depend on it now, disabling it just
> punches a nassive hole into memcg functionality and isolation
> coverage. It's not a sanctioned production use flag.
> 
> This change is negligible from a memcg semantics POV.

Thanks for clarifying!

No strong objection from me. Just wanted to call out the nokmem
behavior change and hear what folks think :D

>> Not sure what the right fix is, as I am not a memcg expert ...