From: Johannes Weiner
To: Andrew Morton
Cc: David Hildenbrand, Shakeel Butt, Yosry Ahmed, Zi Yan,
	"Liam R. Howlett", Usama Arif, Kiryl Shutsemau, Dave Chinner,
	Roman Gushchin, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 7/7] mm: switch deferred split shrinker to list_lru
Date: Thu, 12 Mar 2026 16:51:55 -0400
Message-ID: <20260312205321.638053-8-hannes@cmpxchg.org>
In-Reply-To: <20260312205321.638053-1-hannes@cmpxchg.org>
References: <20260312205321.638053-1-hannes@cmpxchg.org>

The deferred split queue handles cgroups in a suboptimal fashion. The
queue is per-NUMA node or per-cgroup, not the intersection. That means
on a cgrouped system, a node-restricted allocation entering reclaim
can end up splitting large pages on other nodes:

alloc/unmap
  deferred_split_folio()
    list_add_tail(memcg->split_queue)
    set_shrinker_bit(memcg, node, deferred_shrinker_id)

for_each_zone_zonelist_nodemask(restricted_nodes)
  mem_cgroup_iter()
    shrink_slab(node, memcg)
      shrink_slab_memcg(node, memcg)
        if test_shrinker_bit(memcg, node, deferred_shrinker_id)
          deferred_split_scan() walks memcg->split_queue

The shrinker bit adds an imperfect guard rail.
As soon as the cgroup has a single large page on the node of interest,
all large pages owned by that memcg, including those on other nodes,
will be split.

list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
streamlines a lot of the list operations and reclaim walks. It's used
widely by other major shrinkers already. Convert the deferred split
queue as well.

The list_lru per-memcg heads are instantiated on demand when the first
object of interest is allocated for a cgroup, by calling
memcg_list_lru_alloc_folio(). Add calls to where splittable pages are
created: anon faults, swapin faults, khugepaged collapse. These calls
create all possible node heads for the cgroup at once, so the
migration code (between nodes) doesn't need any special care.

Signed-off-by: Johannes Weiner
---
 include/linux/huge_mm.h    |   6 +-
 include/linux/memcontrol.h |   4 -
 include/linux/mmzone.h     |  12 --
 mm/huge_memory.c           | 330 +++++++++++--------------------------
 mm/internal.h              |   2 +-
 mm/khugepaged.c            |   7 +
 mm/memcontrol.c            |  12 +-
 mm/memory.c                |  52 +++---
 mm/mm_init.c               |  15 --
 9 files changed, 140 insertions(+), 300 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfde..2d0d0c797dd8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -414,10 +414,9 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list_to_order(page, NULL, 0);
 }
+
+extern struct list_lru deferred_split_lru;
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
-#ifdef CONFIG_MEMCG
-void reparent_deferred_split_queue(struct mem_cgroup *memcg);
-#endif
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
@@ -650,7 +649,6 @@ static inline int try_folio_split_to_order(struct folio *folio,
 }
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
-static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 086158969529..0782c72a1997 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -277,10 +277,6 @@ struct mem_cgroup {
 	struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct deferred_split deferred_split_queue;
-#endif
-
 #ifdef CONFIG_LRU_GEN_WALKS_MMU
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7bd0134c241c..232b7a71fd69 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1429,14 +1429,6 @@ struct zonelist {
  */
 extern struct page *mem_map;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-struct deferred_split {
-	spinlock_t split_queue_lock;
-	struct list_head split_queue;
-	unsigned long split_queue_len;
-};
-#endif
-
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Per NUMA node memory failure handling statistics.
@@ -1562,10 +1554,6 @@ typedef struct pglist_data {
 	unsigned long first_deferred_pfn;
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct deferred_split deferred_split_queue;
-#endif
-
 #ifdef CONFIG_NUMA_BALANCING
 	/* start time in ms of current promote rate limit period */
 	unsigned int nbp_rl_start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7d0a64033b18..ed9b98e2e166 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -67,6 +68,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
 	(1<count_objects = deferred_split_count;
 	deferred_split_shrinker->scan_objects = deferred_split_scan;
 	shrinker_register(deferred_split_shrinker);
@@ -886,6 +893,7 @@
 
 	huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero");
 	if (!huge_zero_folio_shrinker) {
+		list_lru_destroy(&deferred_split_lru);
 		shrinker_free(deferred_split_shrinker);
 		return -ENOMEM;
 	}
@@ -900,6 +908,7 @@ static void __init thp_shrinker_exit(void)
 {
 	shrinker_free(huge_zero_folio_shrinker);
+	list_lru_destroy(&deferred_split_lru);
 	shrinker_free(deferred_split_shrinker);
 }
 
@@ -1080,119 +1089,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static struct deferred_split *split_queue_node(int nid)
-{
-	struct pglist_data *pgdata = NODE_DATA(nid);
-
-	return &pgdata->deferred_split_queue;
-}
-
-#ifdef CONFIG_MEMCG
-static inline
-struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
-					   struct deferred_split *queue)
-{
-	if (mem_cgroup_disabled())
-		return NULL;
-	if (split_queue_node(folio_nid(folio)) == queue)
-		return NULL;
-	return container_of(queue, struct mem_cgroup, deferred_split_queue);
-}
-
-static struct deferred_split *memcg_split_queue(int nid, struct mem_cgroup *memcg)
-{
-	return memcg ? &memcg->deferred_split_queue : split_queue_node(nid);
-}
-#else
-static inline
-struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
-					   struct deferred_split *queue)
-{
-	return NULL;
-}
-
-static struct deferred_split *memcg_split_queue(int nid, struct mem_cgroup *memcg)
-{
-	return split_queue_node(nid);
-}
-#endif
-
-static struct deferred_split *split_queue_lock(int nid, struct mem_cgroup *memcg)
-{
-	struct deferred_split *queue;
-
-retry:
-	queue = memcg_split_queue(nid, memcg);
-	spin_lock(&queue->split_queue_lock);
-	/*
-	 * There is a period between setting memcg to dying and reparenting
-	 * deferred split queue, and during this period the THPs in the deferred
-	 * split queue will be hidden from the shrinker side.
-	 */
-	if (unlikely(memcg_is_dying(memcg))) {
-		spin_unlock(&queue->split_queue_lock);
-		memcg = parent_mem_cgroup(memcg);
-		goto retry;
-	}
-
-	return queue;
-}
-
-static struct deferred_split *
-split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags)
-{
-	struct deferred_split *queue;
-
-retry:
-	queue = memcg_split_queue(nid, memcg);
-	spin_lock_irqsave(&queue->split_queue_lock, *flags);
-	if (unlikely(memcg_is_dying(memcg))) {
-		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
-		memcg = parent_mem_cgroup(memcg);
-		goto retry;
-	}
-
-	return queue;
-}
-
-static struct deferred_split *folio_split_queue_lock(struct folio *folio)
-{
-	struct deferred_split *queue;
-
-	rcu_read_lock();
-	queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
-	/*
-	 * The memcg destruction path is acquiring the split queue lock for
-	 * reparenting. Once you have it locked, it's safe to drop the rcu lock.
-	 */
-	rcu_read_unlock();
-
-	return queue;
-}
-
-static struct deferred_split *
-folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
-{
-	struct deferred_split *queue;
-
-	rcu_read_lock();
-	queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
-	rcu_read_unlock();
-
-	return queue;
-}
-
-static inline void split_queue_unlock(struct deferred_split *queue)
-{
-	spin_unlock(&queue->split_queue_lock);
-}
-
-static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
-						 unsigned long flags)
-{
-	spin_unlock_irqrestore(&queue->split_queue_lock, flags);
-}
-
 static inline bool is_transparent_hugepage(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
@@ -1293,6 +1189,14 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
 		return NULL;
 	}
+
+	if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
+		folio_put(folio);
+		count_vm_event(THP_FAULT_FALLBACK);
+		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
+		return NULL;
+	}
+
 	folio_throttle_swaprate(folio, gfp);
 
 	/*
@@ -3802,33 +3706,28 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 	struct folio *new_folio, *next;
 	int old_order = folio_order(folio);
 	int ret = 0;
-	struct deferred_split *ds_queue;
+	struct list_lru_one *l;
 
 	VM_WARN_ON_ONCE(!mapping && end);
 	/* Prevent deferred_split_scan() touching ->_refcount */
-	ds_queue = folio_split_queue_lock(folio);
+	rcu_read_lock();
+	l = list_lru_lock(&deferred_split_lru, folio_nid(folio), folio_memcg(folio));
 	if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
 		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
 
 		if (old_order > 1) {
-			if (!list_empty(&folio->_deferred_list)) {
-				ds_queue->split_queue_len--;
-				/*
-				 * Reinitialize page_deferred_list after removing the
-				 * page from the split_queue, otherwise a subsequent
-				 * split will see list corruption when checking the
-				 * page_deferred_list.
-				 */
-				list_del_init(&folio->_deferred_list);
-			}
+			__list_lru_del(&deferred_split_lru, l,
+				       &folio->_deferred_list, folio_nid(folio));
 			if (folio_test_partially_mapped(folio)) {
 				folio_clear_partially_mapped(folio);
 				mod_mthp_stat(old_order,
 					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 			}
 		}
-		split_queue_unlock(ds_queue);
+		list_lru_unlock(l);
+		rcu_read_unlock();
+
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -3929,7 +3828,8 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
 		if (ci)
 			swap_cluster_unlock(ci);
 	} else {
-		split_queue_unlock(ds_queue);
+		list_lru_unlock(l);
+		rcu_read_unlock();
 		return -EAGAIN;
 	}
 
@@ -4296,33 +4196,35 @@ int split_folio_to_list(struct folio *folio, struct list_head *list)
  * queueing THP splits, and that list is (racily observed to be) non-empty.
  *
  * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
- * zero: because even when split_queue_lock is held, a non-empty _deferred_list
- * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ * zero: because even when the list_lru lock is held, a non-empty
+ * _deferred_list might be in use on deferred_split_scan()'s unlocked
+ * on-stack list.
  *
- * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
- * therefore important to unqueue deferred split before changing folio memcg.
+ * The list_lru sublist is determined by folio's memcg: it is therefore
+ * important to unqueue deferred split before changing folio memcg.
  */
 bool __folio_unqueue_deferred_split(struct folio *folio)
 {
-	struct deferred_split *ds_queue;
+	struct list_lru_one *l;
+	int nid = folio_nid(folio);
 	unsigned long flags;
 	bool unqueued = false;
 
 	WARN_ON_ONCE(folio_ref_count(folio));
 	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
-	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
-	if (!list_empty(&folio->_deferred_list)) {
-		ds_queue->split_queue_len--;
+	rcu_read_lock();
+	l = list_lru_lock_irqsave(&deferred_split_lru, nid, folio_memcg(folio), &flags);
+	if (__list_lru_del(&deferred_split_lru, l, &folio->_deferred_list, nid)) {
 		if (folio_test_partially_mapped(folio)) {
 			folio_clear_partially_mapped(folio);
 			mod_mthp_stat(folio_order(folio),
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
-		list_del_init(&folio->_deferred_list);
 		unqueued = true;
 	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	list_lru_unlock_irqrestore(l, &flags);
+	rcu_read_unlock();
 
 	return unqueued;	/* useful for debug warnings */
 }
@@ -4330,7 +4232,9 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
 void deferred_split_folio(struct folio *folio, bool partially_mapped)
 {
-	struct deferred_split *ds_queue;
+	struct list_lru_one *l;
+	int nid;
+	struct mem_cgroup *memcg;
 	unsigned long flags;
 
 	/*
@@ -4353,7 +4257,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (folio_test_swapcache(folio))
 		return;
 
-	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
+	nid = folio_nid(folio);
+
+	rcu_read_lock();
+	memcg = folio_memcg(folio);
+	l = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
 			folio_set_partially_mapped(folio);
@@ -4361,36 +4269,20 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 			count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
 			mod_mthp_stat(folio_order(folio),
 				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, 1);
-		}
 	} else {
 		/* partially mapped folios cannot become non-partially mapped */
 		VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
 	}
-	if (list_empty(&folio->_deferred_list)) {
-		struct mem_cgroup *memcg;
-
-		memcg = folio_split_queue_memcg(folio, ds_queue);
-		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
-		ds_queue->split_queue_len++;
-		if (memcg)
-			set_shrinker_bit(memcg, folio_nid(folio),
-					 shrinker_id(deferred_split_shrinker));
-	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	__list_lru_add(&deferred_split_lru, l, &folio->_deferred_list, nid, memcg);
+	list_lru_unlock_irqrestore(l, &flags);
+	rcu_read_unlock();
 }
 
 static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
-	struct pglist_data *pgdata = NODE_DATA(sc->nid);
-	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
-
-#ifdef CONFIG_MEMCG
-	if (sc->memcg)
-		ds_queue = &sc->memcg->deferred_split_queue;
-#endif
-	return READ_ONCE(ds_queue->split_queue_len);
+	return list_lru_shrink_count(&deferred_split_lru, sc);
 }
 
 static bool thp_underused(struct folio *folio)
@@ -4420,45 +4312,47 @@ static bool thp_underused(struct folio *folio)
 	return false;
 }
 
+static enum lru_status deferred_split_isolate(struct list_head *item,
+					      struct list_lru_one *lru,
+					      void *cb_arg)
+{
+	struct folio *folio = container_of(item, struct folio, _deferred_list);
+	struct list_head *freeable = cb_arg;
+
+	if (folio_try_get(folio)) {
+		list_lru_isolate_move(lru, item, freeable);
+		return LRU_REMOVED;
+	}
+
+	/* We lost race with folio_put() */
+	list_lru_isolate(lru, item);
+	if (folio_test_partially_mapped(folio)) {
+		folio_clear_partially_mapped(folio);
+		mod_mthp_stat(folio_order(folio),
+			      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+	}
+	return LRU_REMOVED;
+}
+
 static unsigned long
 deferred_split_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
-	struct deferred_split *ds_queue;
-	unsigned long flags;
+	LIST_HEAD(dispose);
 	struct folio *folio, *next;
-	int split = 0, i;
-	struct folio_batch fbatch;
-
-	folio_batch_init(&fbatch);
+	int split = 0;
+	unsigned long isolated;
 
-retry:
-	ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
-	/* Take pin on all head pages to avoid freeing them under us */
-	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
-				 _deferred_list) {
-		if (folio_try_get(folio)) {
-			folio_batch_add(&fbatch, folio);
-		} else if (folio_test_partially_mapped(folio)) {
-			/* We lost race with folio_put() */
-			folio_clear_partially_mapped(folio);
-			mod_mthp_stat(folio_order(folio),
-				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
-		}
-		list_del_init(&folio->_deferred_list);
-		ds_queue->split_queue_len--;
-		if (!--sc->nr_to_scan)
-			break;
-		if (!folio_batch_space(&fbatch))
-			break;
-	}
-	split_queue_unlock_irqrestore(ds_queue, flags);
+	isolated = list_lru_shrink_walk_irq(&deferred_split_lru, sc,
+					    deferred_split_isolate, &dispose);
 
-	for (i = 0; i < folio_batch_count(&fbatch); i++) {
+	list_for_each_entry_safe(folio, next, &dispose, _deferred_list) {
 		bool did_split = false;
 		bool underused = false;
-		struct deferred_split *fqueue;
+		struct list_lru_one *l;
+		unsigned long flags;
+
+		list_del_init(&folio->_deferred_list);
 
-		folio = fbatch.folios[i];
 		if (!folio_test_partially_mapped(folio)) {
 			/*
 			 * See try_to_map_unused_to_zeropage(): we cannot
@@ -4481,64 +4375,32 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		}
 		folio_unlock(folio);
 next:
-		if (did_split || !folio_test_partially_mapped(folio))
-			continue;
 		/*
 		 * Only add back to the queue if folio is partially mapped.
 		 * If thp_underused returns false, or if split_folio fails
 		 * in the case it was underused, then consider it used and
 		 * don't add it back to split_queue.
 		 */
-		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
-		if (list_empty(&folio->_deferred_list)) {
-			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
-			fqueue->split_queue_len++;
+		if (!did_split && folio_test_partially_mapped(folio)) {
+			rcu_read_lock();
+			l = list_lru_lock_irqsave(&deferred_split_lru,
+						  folio_nid(folio),
+						  folio_memcg(folio),
+						  &flags);
+			__list_lru_add(&deferred_split_lru, l,
+				       &folio->_deferred_list,
+				       folio_nid(folio), folio_memcg(folio));
+			list_lru_unlock_irqrestore(l, &flags);
+			rcu_read_unlock();
 		}
-		split_queue_unlock_irqrestore(fqueue, flags);
-	}
-	folios_put(&fbatch);
-
-	if (sc->nr_to_scan && !list_empty(&ds_queue->split_queue)) {
-		cond_resched();
-		goto retry;
+		folio_put(folio);
 	}
 
-	/*
-	 * Stop shrinker if we didn't split any page, but the queue is empty.
-	 * This can happen if pages were freed under us.
-	 */
-	if (!split && list_empty(&ds_queue->split_queue))
+	if (!split && !isolated)
 		return SHRINK_STOP;
 	return split;
 }
 
-#ifdef CONFIG_MEMCG
-void reparent_deferred_split_queue(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
-	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
-	int nid;
-
-	spin_lock_irq(&ds_queue->split_queue_lock);
-	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
-
-	if (!ds_queue->split_queue_len)
-		goto unlock;
-
-	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
-	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
-	ds_queue->split_queue_len = 0;
-
-	for_each_node(nid)
-		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
-
-unlock:
-	spin_unlock(&parent_ds_queue->split_queue_lock);
-	spin_unlock_irq(&ds_queue->split_queue_lock);
-}
-#endif
-
 #ifdef CONFIG_DEBUG_FS
 static void split_huge_pages_all(void)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 95b583e7e4f7..71d2605f8040 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -857,7 +857,7 @@ static inline bool folio_unqueue_deferred_split(struct folio *folio)
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
-	 * to check without acquiring the split_queue_lock.
+	 * to check without acquiring the list_lru lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
 		return false;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b7b4680d27ab..01fd3d5933c5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1076,6 +1076,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 	}
 
 	count_vm_event(THP_COLLAPSE_ALLOC);
+
 	if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
 		folio_put(folio);
 		*foliop = NULL;
@@ -1084,6 +1085,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
 
 	count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
 
+	if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
+		folio_put(folio);
+		*foliop = NULL;
+		return SCAN_CGROUP_CHARGE_FAIL;
+	}
+
 	*foliop = folio;
 	return SCAN_SUCCEED;
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a47fb68dd65f..f381cb6bdff1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4015,11 +4015,6 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
 		memcg->cgwb_frn[i].done = __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
-#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
-	INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
-	memcg->deferred_split_queue.split_queue_len = 0;
 #endif
 	lru_gen_init_memcg(memcg);
 	return memcg;
@@ -4167,11 +4162,10 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	zswap_memcg_offline_cleanup(memcg);
 
 	memcg_offline_kmem(memcg);
-	reparent_deferred_split_queue(memcg);
 	/*
-	 * The reparenting of objcg must be after the reparenting of the
-	 * list_lru and deferred_split_queue above, which ensures that they will
-	 * not mistakenly get the parent list_lru and deferred_split_queue.
+	 * The reparenting of objcg must be after the reparenting of
+	 * the list_lru in memcg_offline_kmem(), which ensures that
+	 * they will not mistakenly get the parent list_lru.
 	 */
 	memcg_reparent_objcgs(memcg);
 	reparent_shrinker_deferred(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 38062f8e1165..4dad1a7890aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4651,13 +4651,19 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr);
-		if (folio) {
-			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
-							    gfp, entry))
-				return folio;
+		if (!folio)
+			goto next;
+		if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) {
 			count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
 			folio_put(folio);
+			goto next;
 		}
+		if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
+			folio_put(folio);
+			goto fallback;
+		}
+		return folio;
+next:
 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
 		order = next_order(&orders, order);
 	}
@@ -5168,24 +5174,28 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	while (orders) {
 		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
 		folio = vma_alloc_folio(gfp, order, vma, addr);
-		if (folio) {
-			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
-				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
-				folio_put(folio);
-				goto next;
-			}
-			folio_throttle_swaprate(folio, gfp);
-			/*
-			 * When a folio is not zeroed during allocation
-			 * (__GFP_ZERO not used) or user folios require special
-			 * handling, folio_zero_user() is used to make sure
-			 * that the page corresponding to the faulting address
-			 * will be hot in the cache after zeroing.
-			 */
-			if (user_alloc_needs_zeroing())
-				folio_zero_user(folio, vmf->address);
-			return folio;
+		if (!folio)
+			goto next;
+		if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+			count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+			folio_put(folio);
+			goto next;
 		}
+		if (memcg_list_lru_alloc_folio(folio, &deferred_split_lru, gfp)) {
+			folio_put(folio);
+			goto fallback;
+		}
+		folio_throttle_swaprate(folio, gfp);
+		/*
+		 * When a folio is not zeroed during allocation
+		 * (__GFP_ZERO not used) or user folios require special
+		 * handling, folio_zero_user() is used to make sure
+		 * that the page corresponding to the faulting address
+		 * will be hot in the cache after zeroing.
+		 */
+		if (user_alloc_needs_zeroing())
+			folio_zero_user(folio, vmf->address);
+		return folio;
 next:
 		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
 		order = next_order(&orders, order);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index cec7bb758bdd..f293a62e652a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1388,19 +1388,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
 	pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void pgdat_init_split_queue(struct pglist_data *pgdat)
-{
-	struct deferred_split *ds_queue = &pgdat->deferred_split_queue;
-
-	spin_lock_init(&ds_queue->split_queue_lock);
-	INIT_LIST_HEAD(&ds_queue->split_queue);
-	ds_queue->split_queue_len = 0;
-}
-#else
-static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
-#endif
-
 #ifdef CONFIG_COMPACTION
 static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 {
@@ -1416,8 +1403,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_resize_init(pgdat);
 	pgdat_kswapd_lock_init(pgdat);
-
-	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
-- 
2.53.0