From nobody Tue Apr 7 05:44:53 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 81FD233689A for ; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773640609; cv=none; b=dXDMDbh/NHQkA67KT7x/FOgkjzc+ayeJgTLdqMyTxZOA6K2xFuibRPPApAy5YxiscElLcmITAgmbjGXVBHw78eJRC4jwDnts6OUQMZLu+fhF2u5sSAgTsnaZKhX5NqiDxhq3uvpWT5D1lQGwSY3vg+NZ8ED2E8YbNuBCwtFBFxg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773640609; c=relaxed/simple; bh=TlMsf3YX4/n1Rdmd1W9Puqt05NoyrFspFfZYJm+GiSk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=BE/+WP/5VYqQl57X9X3BADr0ZfegHxo8UHZO70u60gH4M6i28gSEmIfJ1TLT/Fq58ZzRJpXgtPOfl17M+aHiMo+OMoZuijQwLGxOLUJZgq0m9XE4FvlJOSSXhe7vu7DqXddVrZBgTupaDBY7dQrjYhvWdXR+DfvHjdVccMYmL40= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=CXDBG/Vp; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="CXDBG/Vp" Received: by smtp.kernel.org (Postfix) with ESMTPS id 3B215C2BC9E; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1773640609; bh=TlMsf3YX4/n1Rdmd1W9Puqt05NoyrFspFfZYJm+GiSk=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=CXDBG/VpOQ7/5e6+al5KRqxqyApvmdbbjuLoQyBm6fCj/tKn3D40M4b+Z0bNe5Wl9 NTDknIAoZ/tp+BEv9deBUK5MsJGRj+g2eDLbT8mI/lgytWl4tGsXRyaHEZGaTKV3Nc SCFpF08Gn1cMsE1dagNg6wPjC6Q1VgbiYq68mEFiGm+F9NLUHeWM638CvM01aMRcZO 
3A8pp71xiupJRI7EtKMnpQ8iZnLHIdwKE5YSnCHeUL8ey/4S/a0egcU6Y1TBUyKUu/ xpfB9k6SlloQTLdD2k1dZrZzVGCAnCsnlgwTtrCQ5wcry/PNaeUN7sNMsZDuyqIZxN 2i3r5nP0OnaKA== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 26375D58B37; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) From: Leno Hou via B4 Relay Date: Mon, 16 Mar 2026 02:18:28 +0800 Subject: [PATCH v3 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260316-b4-switch-mglru-v2-v3-1-c846ce9a2321@gmail.com> References: <20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com> In-Reply-To: <20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com> To: Andrew Morton , Axel Rasmussen , Yuanchu Xie , Wei Xu , Jialing Wang , Yafang Shao , Yu Zhao , Kairui Song , Bingfang Guo , Barry Song Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Leno Hou X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1773598759; l=11348; i=lenohou@gmail.com; s=20260311; h=from:subject:message-id; bh=kyuspqsl57CQHhG9arU0ApaAlBWfHd4kjkO+N+OYlnQ=; b=lJZQoqkROXtQJhx/pEQq7aliD0NS8u0fPJ1wpQyQwBo9/YPQsBfBhdCAFEeVPzqdPIsBP/NDB 5JrJ1B8hIlTAtZBmWasqTgSBzV0uMKjPPFp7/YxZuRrvgduiYgtTHf7 X-Developer-Key: i=lenohou@gmail.com; a=ed25519; pk=8AVHXYurzu1kOGjk9rwvxovwSCynBkv2QAcOvSIe1rw= X-Endpoint-Received: by B4 Relay for lenohou@gmail.com/20260311 with auth_id=674 X-Original-From: Leno Hou Reply-To: lenohou@gmail.com From: Leno Hou When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race condition exists between the state switching and the memory reclaim path. This can lead to unexpected cgroup OOM kills, even when plenty of reclaimable memory is available. 
Problem Description
==================
The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack of
   synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.

Solution
========
Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
When transitioning, the system enters this intermediate state where the
reclaimer is forced to attempt both MGLRU and traditional reclaim paths
sequentially. This ensures that folios remain visible to at least one
reclaim mechanism until the transition is fully materialized across all
CPUs.

Changes
=======
v3:
- Rebase onto mm-new branch for queue testing
- Don't look around while draining
- Address Barry Song's review comment

v2:
- Switch to a static branch `lru_drain_core` to track the transition
  state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions
- Fix build error when CONFIG_LRU_GEN is disabled.

v1:
- Use smp_store_release() and smp_load_acquire() to ensure the
  visibility of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period.
  If an lruvec is in the 'draining' state, the reclaimer will attempt to
  scan MGLRU lists first, and then fall through to traditional LRU lists
  instead of returning early. This ensures that folios are visible to at
  least one reclaim path at any given time.

Race & Mitigation
================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.

However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages()). If a reclaim pass
fails to find eligible folios due to a state transition race, subsequent
retries in the loop will observe the updated state and correctly direct
the scan to the appropriate LRU lists. This ensures the transient
inconsistency does not escalate into a terminal OOM kill.

This effectively reduces the race window that previously triggered OOMs
under high memory pressure.
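For illustration only (this is not part of the patch): the dispatch change
described above can be sketched as a userspace toy model. Every name below
(lruvec_model, state_model, scan_gen, scan_lru, shrink_old, shrink_new,
drain) is hypothetical; folios are reduced to counters, and there is no
locking, no per-CPU visibility, and no retry loop. It only shows why
falling through to the traditional path while draining avoids the empty
scan.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, counters instead of folio lists. */
struct lruvec_model {
	unsigned long nr_gen;	/* folios still on MGLRU lists */
	unsigned long nr_lru;	/* folios on traditional LRU lists */
};

struct state_model {
	bool enabled;		/* stands in for lru_gen_enabled() */
	bool draining;		/* stands in for lru_gen_draining() */
};

/* Reclaim up to nr folios from the MGLRU side; returns how many. */
static unsigned long scan_gen(struct lruvec_model *v, unsigned long nr)
{
	unsigned long got = nr < v->nr_gen ? nr : v->nr_gen;

	v->nr_gen -= got;
	return got;
}

/* Reclaim up to nr folios from the traditional side; returns how many. */
static unsigned long scan_lru(struct lruvec_model *v, unsigned long nr)
{
	unsigned long got = nr < v->nr_lru ? nr : v->nr_lru;

	v->nr_lru -= got;
	return got;
}

/* Models the drain step: move up to nr folios from MGLRU to traditional. */
static void drain(struct lruvec_model *v, unsigned long nr)
{
	unsigned long total = v->nr_gen + v->nr_lru;
	unsigned long moved = nr < v->nr_gen ? nr : v->nr_gen;

	v->nr_gen -= moved;
	v->nr_lru += moved;
	assert(v->nr_gen + v->nr_lru == total);	/* draining loses no folios */
}

/* Pre-patch dispatch: exactly one path, chosen by 'enabled' alone. */
static unsigned long shrink_old(const struct state_model *s,
				struct lruvec_model *v, unsigned long nr)
{
	if (s->enabled)
		return scan_gen(v, nr);
	return scan_lru(v, nr);
}

/* Patched dispatch: MGLRU first, then fall through while draining. */
static unsigned long shrink_new(const struct state_model *s,
				struct lruvec_model *v, unsigned long nr)
{
	unsigned long got = 0;

	if (s->enabled || s->draining) {
		got = scan_gen(v, nr);
		if (!s->draining)
			return got;
	}
	/* Both paths run sequentially, each with the full budget nr. */
	return got + scan_lru(v, nr);
}
```

With eight folios still on the MGLRU side and 'enabled' already false,
shrink_old() reclaims nothing, which is the "reclaim vacuum" that leads to
the OOM kill. shrink_new() with 'draining' set reclaims from the MGLRU
side first and then from whatever drain() has already moved to the
traditional side.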
To: Andrew Morton To: Axel Rasmussen To: Yuanchu Xie To: Wei Xu To: Barry Song <21cnbao@gmail.com> To: Jialing Wang To: Yafang Shao To: Yu Zhao To: Kairui Song To: Bingfang Guo Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Leno Hou --- include/linux/mm_inline.h | 16 ++++++++++++++++ mm/rmap.c | 2 +- mm/swap.c | 15 +++++++++------ mm/vmscan.c | 38 +++++++++++++++++++++++++++++--------- 4 files changed, 55 insertions(+), 16 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index ad50688d89db..16ac700dac9c 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(co= nst struct folio *folio) =20 #ifdef CONFIG_LRU_GEN =20 +static inline bool lru_gen_draining(void) +{ + DECLARE_STATIC_KEY_FALSE(lru_drain_core); + + return static_branch_unlikely(&lru_drain_core); +} #ifdef CONFIG_LRU_GEN_ENABLED static inline bool lru_gen_enabled(void) { @@ -316,11 +322,21 @@ static inline bool lru_gen_enabled(void) return false; } =20 +static inline bool lru_gen_draining(void) +{ + return false; +} + static inline bool lru_gen_in_fault(void) { return false; } =20 +static inline int folio_lru_gen(const struct folio *folio) +{ + return -1; +} + static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *= folio, bool reclaiming) { return false; diff --git a/mm/rmap.c b/mm/rmap.c index 6398d7eef393..0b5f663f3062 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio, nr =3D folio_pte_batch(folio, pvmw.pte, pteval, max_nr); } =20 - if (lru_gen_enabled() && pvmw.pte) { + if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) { if (lru_gen_look_around(&pvmw, nr)) referenced++; } else if (pvmw.pte) { diff --git a/mm/swap.c b/mm/swap.c index 5cc44f0de987..ecb192c02d2e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -462,7 +462,7 @@ void folio_mark_accessed(struct folio *folio) { if 
(folio_test_dropbehind(folio)) return; - if (lru_gen_enabled()) { + if (folio_lru_gen(folio) !=3D -1) { lru_gen_inc_refs(folio); return; } @@ -559,7 +559,7 @@ void folio_add_lru_vma(struct folio *folio, struct vm_a= rea_struct *vma) */ static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio) { - bool active =3D folio_test_active(folio) || lru_gen_enabled(); + bool active =3D folio_test_active(folio) || (folio_lru_gen(folio) !=3D -1= ); long nr_pages =3D folio_nr_pages(folio); =20 if (folio_test_unevictable(folio)) @@ -602,7 +602,9 @@ static void lru_deactivate(struct lruvec *lruvec, struc= t folio *folio) { long nr_pages =3D folio_nr_pages(folio); =20 - if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_ge= n_enabled())) + if (folio_test_unevictable(folio) || + !(folio_test_active(folio) || + (folio_lru_gen(folio) !=3D -1))) return; =20 lruvec_del_folio(lruvec, folio); @@ -617,6 +619,7 @@ static void lru_deactivate(struct lruvec *lruvec, struc= t folio *folio) static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio) { long nr_pages =3D folio_nr_pages(folio); + int gen =3D folio_lru_gen(folio); =20 if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) || folio_test_swapcache(folio) || folio_test_unevictable(folio)) @@ -624,7 +627,7 @@ static void lru_lazyfree(struct lruvec *lruvec, struct = folio *folio) =20 lruvec_del_folio(lruvec, folio); folio_clear_active(folio); - if (lru_gen_enabled()) + if (gen !=3D -1) lru_gen_clear_refs(folio); else folio_clear_referenced(folio); @@ -695,7 +698,7 @@ void deactivate_file_folio(struct folio *folio) if (folio_test_unevictable(folio) || !folio_test_lru(folio)) return; =20 - if (lru_gen_enabled() && lru_gen_clear_refs(folio)) + if ((folio_lru_gen(folio) !=3D -1) && lru_gen_clear_refs(folio)) return; =20 folio_batch_add_and_move(folio, lru_deactivate_file); @@ -714,7 +717,7 @@ void folio_deactivate(struct folio *folio) if (folio_test_unevictable(folio) || 
!folio_test_lru(folio)) return; =20 - if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(fo= lio)) + if ((folio_lru_gen(folio) !=3D -1) ? lru_gen_clear_refs(folio) : !folio_t= est_active(folio)) return; =20 folio_batch_add_and_move(folio, lru_deactivate); diff --git a/mm/vmscan.c b/mm/vmscan.c index 33287ba4a500..bcefd8db9c03 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(str= uct folio *folio, if (referenced_ptes =3D=3D -1) return FOLIOREF_KEEP; =20 - if (lru_gen_enabled()) { + if (lru_gen_enabled() && !lru_gen_draining()) { if (!referenced_ptes) return FOLIOREF_RECLAIM; =20 @@ -2286,7 +2286,7 @@ static void prepare_scan_control(pg_data_t *pgdat, st= ruct scan_control *sc) unsigned long file; struct lruvec *target_lruvec; =20 - if (lru_gen_enabled()) + if (lru_gen_enabled() && !lru_gen_draining()) return; =20 target_lruvec =3D mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -2625,6 +2625,7 @@ static bool can_age_anon_pages(struct lruvec *lruvec, =20 #ifdef CONFIG_LRU_GEN =20 +DEFINE_STATIC_KEY_FALSE(lru_drain_core); #ifdef CONFIG_LRU_GEN_ENABLED DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS); #define get_cap(cap) static_branch_likely(&lru_gen_caps[cap]) @@ -5318,6 +5319,8 @@ static void lru_gen_change_state(bool enabled) if (enabled =3D=3D lru_gen_enabled()) goto unlock; =20 + static_branch_enable_cpuslocked(&lru_drain_core); + if (enabled) static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]); else @@ -5348,6 +5351,9 @@ static void lru_gen_change_state(bool enabled) =20 cond_resched(); } while ((memcg =3D mem_cgroup_iter(NULL, memcg, NULL))); + + static_branch_disable_cpuslocked(&lru_drain_core); + unlock: mutex_unlock(&state_mutex); put_online_mems(); @@ -5920,9 +5926,12 @@ static void shrink_lruvec(struct lruvec *lruvec, str= uct scan_control *sc) bool proportional_reclaim; struct blk_plug plug; =20 - if (lru_gen_enabled() && !root_reclaim(sc)) { + if 
((lru_gen_enabled() || lru_gen_draining()) && !root_reclaim(sc)) { lru_gen_shrink_lruvec(lruvec, sc); - return; + + if (!lru_gen_draining()) + return; + } =20 get_scan_count(lruvec, sc, nr); @@ -6181,11 +6190,17 @@ static void shrink_node(pg_data_t *pgdat, struct sc= an_control *sc) unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed; struct lruvec *target_lruvec; bool reclaimable =3D false; + s8 priority =3D sc->priority; =20 - if (lru_gen_enabled() && root_reclaim(sc)) { + if ((lru_gen_enabled() || lru_gen_draining()) && root_reclaim(sc)) { memset(&sc->nr, 0, sizeof(sc->nr)); lru_gen_shrink_node(pgdat, sc); - return; + + if (!lru_gen_draining()) + return; + + sc->priority =3D priority; + } =20 target_lruvec =3D mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat); @@ -6455,7 +6470,7 @@ static void snapshot_refaults(struct mem_cgroup *targ= et_memcg, pg_data_t *pgdat) struct lruvec *target_lruvec; unsigned long refaults; =20 - if (lru_gen_enabled()) + if (lru_gen_enabled() && !lru_gen_draining()) return; =20 target_lruvec =3D mem_cgroup_lruvec(target_memcg, pgdat); @@ -6844,10 +6859,15 @@ static void kswapd_age_node(struct pglist_data *pgd= at, struct scan_control *sc) { struct mem_cgroup *memcg; struct lruvec *lruvec; + s8 priority =3D sc->priority; =20 - if (lru_gen_enabled()) { + if (lru_gen_enabled() || lru_gen_draining()) { lru_gen_age_node(pgdat, sc); - return; + + if (!lru_gen_draining()) + return; + + sc->priority =3D priority; } =20 lruvec =3D mem_cgroup_lruvec(NULL, pgdat); --=20 2.52.0 From nobody Tue Apr 7 05:44:53 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 81EA1235063 for ; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1773640609; cv=none; b=pEMGHakmmaMoW8PFUIyrWHrV2uidrSD8wMQjSHvY2hl6lAvcxDOwgJUopB+Bw0cIgAagCRBMxp9gyxFuCUsmuboJ7+9DSf/CW1VPwgZePzhiWrE7DHsUfWxx+p2qKYGuu7dvnknkkrW+peGyM7Nmpy69vzxxcoi/gkhngcKhV8o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1773640609; c=relaxed/simple; bh=DennSjkJyBmTgQaR8MaO2ZHfX3G1V2Hdi1AKWWCWBSk=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=GV1x08bBbHaC/2QSuwo6LXmxvOV640YRlxGgh3k7vtBEEi3z6CHr9WJ/95iSg3cG85YYys+g+lOLuZVw0QkWYvzUuPqK700oE01rhWCUU05TaN6yFw7sNYpMi2/y86PyURixOMN8RrHizzxuq/ZsowTFwynmEi2KMBhOWzlGXoY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=HhQM2dOV; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="HhQM2dOV" Received: by smtp.kernel.org (Postfix) with ESMTPS id 481D2C2BC87; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1773640609; bh=DennSjkJyBmTgQaR8MaO2ZHfX3G1V2Hdi1AKWWCWBSk=; h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From; b=HhQM2dOV2rXzA/OqJwrBkMTUURJlvCal670U2dI8nE6VGD6MJJkvxh9JzWZme3f2R uHffRpotJJQ7tdk8EJ0nnOc5D7CGJzWd0ly+FXLZhEjkTUmWjx8hMRObg+OUqqgI7E 0pmCyTzM1aEx7y0DHtS3PZV/q2rGuG2grla1m1Pbd4E6/rxEYVF1b3jHKQ1JZiCe6P rgLcxv5e1HxwN4BkpZz9LiHIEXTZIVWz9LGo20Fm/tRR+6jpThVaURUWzlpO+gvp8H 57Ui2AiXmwvzle5xAjAKjr3dKYCKjRNOC2t+8KpT0oi6Ssn5BlAeTmHOez5vtALXNw S95xH3QPMCE9Q== Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org (localhost.localdomain [127.0.0.1]) by smtp.lore.kernel.org (Postfix) with ESMTP id 33DFAD58B21; Mon, 16 Mar 2026 05:56:49 +0000 (UTC) From: Leno Hou via B4 Relay Date: Mon, 16 Mar 2026 02:18:29 +0800 Subject: [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions 
Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Message-Id: <20260316-b4-switch-mglru-v2-v3-2-c846ce9a2321@gmail.com> References: <20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com> In-Reply-To: <20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com> To: Andrew Morton , Axel Rasmussen , Yuanchu Xie , Wei Xu , Jialing Wang , Yafang Shao , Yu Zhao , Kairui Song , Bingfang Guo , Barry Song Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Leno Hou X-Mailer: b4 0.14.3 X-Developer-Signature: v=1; a=ed25519-sha256; t=1773598759; l=9959; i=lenohou@gmail.com; s=20260311; h=from:subject:message-id; bh=BzLymly7UqWV2k0JhEp7lvZpm6LCjjIAK/A6bt/7X98=; b=grGmb05RZk/p4O8A1anhDQfMEpjTfSb0d2+QR/k5NgLHzs5Ev198xlQlLCliJgPkPW8wfRcKq Az/W/3QAmaVCldO0npLWohcLNzylGLvSBOWmC8MsSvDJz6JQb4wZEzu X-Developer-Key: i=lenohou@gmail.com; a=ed25519; pk=8AVHXYurzu1kOGjk9rwvxovwSCynBkv2QAcOvSIe1rw= X-Endpoint-Received: by B4 Relay for lenohou@gmail.com/20260311 with auth_id=674 X-Original-From: Leno Hou Reply-To: lenohou@gmail.com From: Leno Hou When MGLRU state is toggled dynamically, existing shadow entries (eviction tokens) lose their context. Traditional LRU and MGLRU handle workingset refaults using different logic. Without context, shadow entries re-activated by the "wrong" reclaim logic trigger excessive page activations (pgactivate) and system thrashing, as the kernel cannot correctly distinguish if a refaulted page was originally managed by MGLRU or the traditional LRU. This patch introduces shadow entry context tracking: - Encode MGLRU origin: Introduce WORKINGSET_MGLRU_SHIFT into the shadow entry (eviction token) encoding. This adds an 'is_mglru' bit to shadow entries, allowing the kernel to correctly identify the originating reclaim logic for a page even after the global MGLRU state has been toggled. 
- Refault logic dispatch: Use this 'is_mglru' bit in workingset_refault() and workingset_test_recent() to dispatch refault events to the correct handler (lru_gen_refault vs. traditional workingset refault). This ensures that refaulted pages are handled by the appropriate reclaim logic regardless of the current MGLRU enabled state, preventing unnecessary thrashing and state-inconsistent refault activations during state transitions. To: Andrew Morton To: Axel Rasmussen To: Yuanchu Xie To: Wei Xu To: Barry Song <21cnbao@gmail.com> To: Jialing Wang To: Yafang Shao To: Yu Zhao To: Kairui Song To: Bingfang Guo Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Leno Hou --- include/linux/swap.h | 2 +- mm/vmscan.c | 17 ++++++++++++----- mm/workingset.c | 22 +++++++++++++++------- 3 files changed, 28 insertions(+), 13 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7a09df6977a5..5f7d3f08d840 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -297,7 +297,7 @@ static inline swp_entry_t page_swap_entry(struct page *= page) bool workingset_test_recent(void *shadow, bool file, bool *workingset, bool flush); void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pa= ges); -void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_m= emcg); +void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_m= emcg, bool lru_gen); void workingset_refault(struct folio *folio, void *shadow); void workingset_activation(struct folio *folio); =20 diff --git a/mm/vmscan.c b/mm/vmscan.c index bcefd8db9c03..de21343b5cd2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -180,6 +180,9 @@ struct scan_control { =20 /* for recording the reclaimed slab by now */ struct reclaim_state reclaim_state; + + /* whether in lru gen scan context */ + unsigned int lru_gen:1; }; =20 #ifdef ARCH_HAS_PREFETCHW @@ -685,7 +688,7 @@ static pageout_t pageout(struct folio *folio, struct ad= dress_space *mapping, * gets 
returned with a refcount of 0. */ static int __remove_mapping(struct address_space *mapping, struct folio *f= olio, - bool reclaimed, struct mem_cgroup *target_memcg) + bool reclaimed, struct mem_cgroup *target_memcg, struct scan_control *sc) { int refcount; void *shadow =3D NULL; @@ -739,7 +742,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, swp_entry_t swap =3D folio->swap; =20 if (reclaimed && !mapping_exiting(mapping)) - shadow =3D workingset_eviction(folio, target_memcg); + shadow =3D workingset_eviction(folio, target_memcg, sc->lru_gen); memcg1_swapout(folio, swap); __swap_cache_del_folio(ci, folio, swap, shadow); swap_cluster_unlock_irq(ci); @@ -765,7 +768,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, */ if (reclaimed && folio_is_file_lru(folio) && !mapping_exiting(mapping) && !dax_mapping(mapping)) - shadow =3D workingset_eviction(folio, target_memcg); + shadow =3D workingset_eviction(folio, target_memcg, sc->lru_gen); __filemap_remove_folio(folio, shadow); xa_unlock_irq(&mapping->i_pages); if (mapping_shrinkable(mapping)) @@ -802,7 +805,7 @@ static int __remove_mapping(struct address_space *mappi= ng, struct folio *folio, */ long remove_mapping(struct address_space *mapping, struct folio *folio) { - if (__remove_mapping(mapping, folio, false, NULL)) { + if (__remove_mapping(mapping, folio, false, NULL, NULL)) { /* * Unfreezing the refcount with 1 effectively * drops the pagecache ref for us without requiring another @@ -1499,7 +1502,7 @@ static unsigned int shrink_folio_list(struct list_hea= d *folio_list, count_vm_events(PGLAZYFREED, nr_pages); count_memcg_folio_events(folio, PGLAZYFREED, nr_pages); } else if (!mapping || !__remove_mapping(mapping, folio, true, - sc->target_mem_cgroup)) + sc->target_mem_cgroup, sc)) goto keep_locked; =20 folio_unlock(folio); @@ -1599,6 +1602,7 @@ unsigned int reclaim_clean_pages_from_list(struct zon= e *zone, struct scan_control sc =3D { 
.gfp_mask =3D GFP_KERNEL, .may_unmap =3D 1, + .lru_gen =3D lru_gen_enabled(), }; struct reclaim_stat stat; unsigned int nr_reclaimed; @@ -1993,6 +1997,7 @@ static unsigned long shrink_inactive_list(unsigned lo= ng nr_to_scan, if (nr_taken =3D=3D 0) return 0; =20 + sc->lru_gen =3D 0; nr_reclaimed =3D shrink_folio_list(&folio_list, pgdat, sc, &stat, false, lruvec_memcg(lruvec)); =20 @@ -2167,6 +2172,7 @@ static unsigned int reclaim_folio_list(struct list_he= ad *folio_list, .may_unmap =3D 1, .may_swap =3D 1, .no_demotion =3D 1, + .lru_gen =3D lru_gen_enabled(), }; =20 nr_reclaimed =3D shrink_folio_list(folio_list, pgdat, &sc, &stat, true, N= ULL); @@ -4864,6 +4870,7 @@ static int evict_folios(unsigned long nr_to_scan, str= uct lruvec *lruvec, if (list_empty(&list)) return scanned; retry: + sc->lru_gen =3D 1; reclaimed =3D shrink_folio_list(&list, pgdat, sc, &stat, false, memcg); sc->nr.unqueued_dirty +=3D stat.nr_unqueued_dirty; sc->nr_reclaimed +=3D reclaimed; diff --git a/mm/workingset.c b/mm/workingset.c index 07e6836d0502..3764a4a68c2c 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -181,8 +181,10 @@ * refault distance will immediately activate the refaulting page. */ =20 +#define WORKINGSET_MGLRU_SHIFT 1 #define WORKINGSET_SHIFT 1 #define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ + WORKINGSET_MGLRU_SHIFT + \ WORKINGSET_SHIFT + NODES_SHIFT + \ MEM_CGROUP_ID_SHIFT) #define EVICTION_SHIFT_ANON (EVICTION_SHIFT + SWAP_COUNT_SHIFT) @@ -200,12 +202,13 @@ static unsigned int bucket_order[ANON_AND_FILE] __read_mostly; =20 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long evic= tion, - bool workingset, bool file) + bool workingset, bool file, bool is_mglru) { eviction &=3D file ? 
EVICTION_MASK : EVICTION_MASK_ANON; eviction =3D (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction =3D (eviction << NODES_SHIFT) | pgdat->node_id; eviction =3D (eviction << WORKINGSET_SHIFT) | workingset; + eviction =3D (eviction << WORKINGSET_MGLRU_SHIFT) | is_mglru; =20 return xa_mk_value(eviction); } @@ -217,6 +220,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, = pg_data_t **pgdat, int memcgid, nid; bool workingset; =20 + entry >>=3D WORKINGSET_MGLRU_SHIFT; workingset =3D entry & ((1UL << WORKINGSET_SHIFT) - 1); entry >>=3D WORKINGSET_SHIFT; nid =3D entry & ((1UL << NODES_SHIFT) - 1); @@ -263,7 +267,7 @@ static void *lru_gen_eviction(struct folio *folio) memcg_id =3D mem_cgroup_private_id(memcg); rcu_read_unlock(); =20 - return pack_shadow(memcg_id, pgdat, token, workingset, type); + return pack_shadow(memcg_id, pgdat, token, workingset, type, true); } =20 /* @@ -387,7 +391,8 @@ void workingset_age_nonresident(struct lruvec *lruvec, = unsigned long nr_pages) * Return: a shadow entry to be stored in @folio->mapping->i_pages in place * of the evicted @folio so that a later refault can be detected. 
*/ -void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_m= emcg) +void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_m= emcg, + bool lru_gen) { struct pglist_data *pgdat =3D folio_pgdat(folio); int file =3D folio_is_file_lru(folio); @@ -400,7 +405,7 @@ void *workingset_eviction(struct folio *folio, struct m= em_cgroup *target_memcg) VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); =20 - if (lru_gen_enabled()) + if (lru_gen) return lru_gen_eviction(folio); =20 lruvec =3D mem_cgroup_lruvec(target_memcg, pgdat); @@ -410,7 +415,7 @@ void *workingset_eviction(struct folio *folio, struct m= em_cgroup *target_memcg) eviction >>=3D bucket_order[file]; workingset_age_nonresident(lruvec, folio_nr_pages(folio)); return pack_shadow(memcgid, pgdat, eviction, - folio_test_workingset(folio), file); + folio_test_workingset(folio), file, false); } =20 /** @@ -436,8 +441,10 @@ bool workingset_test_recent(void *shadow, bool file, b= ool *workingset, int memcgid; struct pglist_data *pgdat; unsigned long eviction; + unsigned long entry =3D xa_to_value(shadow); + bool is_mglru =3D entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1); =20 - if (lru_gen_enabled()) { + if (is_mglru) { bool recent; =20 rcu_read_lock(); @@ -550,10 +557,11 @@ void workingset_refault(struct folio *folio, void *sh= adow) struct lruvec *lruvec; bool workingset; long nr; + unsigned long entry =3D xa_to_value(shadow); =20 VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); =20 - if (lru_gen_enabled()) { + if (entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1)) { lru_gen_refault(folio, shadow); return; } --=20 2.52.0