Date: Mon, 30 Dec 2024 21:35:37 -0700
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yu Zhao, Kairui Song,
    Kalesh Singh
Subject: [PATCH mm-unstable v4 6/7] mm/mglru: rework workingset protection
Message-ID: <20241231043538.4075764-7-yuzhao@google.com>
In-Reply-To: <20241231043538.4075764-1-yuzhao@google.com>
References: <20241231043538.4075764-1-yuzhao@google.com>
X-Mailer: git-send-email 2.47.1.613.gc27f4b7a9f-goog

With the aging feedback no longer considering the distribution of folios
in each generation, rework workingset protection to better distribute
folios across MAX_NR_GENS. This is achieved by reusing PG_workingset and
PG_referenced/LRU_REFS_FLAGS in a slightly different way.

For folios accessed multiple times through file descriptors, make
lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in folio->flags
after PG_referenced, then PG_workingset after LRU_REFS_WIDTH. After all
its bits are set, i.e., LRU_REFS_FLAGS|BIT(PG_workingset), a folio is
lazily promoted into the second oldest generation in the eviction path.
When folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is only
valid when PG_referenced is set.

For folios accessed multiple times through page tables, folio_update_gen()
from a page table walk or lru_gen_set_refs() from a rmap walk sets
PG_referenced after the accessed bit is cleared for the first time.
Thereafter, those two paths set PG_workingset and promote folios to the
youngest generation. Like folio_inc_gen(), folio_update_gen() also clears
PG_referenced when it does so. For this case, LRU_REFS_MASK is not used.

In both cases, once PG_workingset is set on a folio, it remains set until
the folio is either reclaimed or "deactivated" by lru_gen_clear_refs(); it
can be set again if lru_gen_test_recent() returns true upon a refault.

When adding folios to the LRU lists, lru_gen_folio_seq() distributes them
as follows:

  +---------------------------------+---------------------------------+
  |    Accessed thru page tables    | Accessed thru file descriptors  |
  +---------------------------------+---------------------------------+
  | PG_active (set while isolated)  |                                 |
  +----------------+----------------+----------------+----------------+
  |  PG_workingset |  PG_referenced |  PG_workingset | LRU_REFS_FLAGS |
  +---------------------------------+---------------------------------+
  |<--------- MIN_NR_GENS --------->|                                 |
  |<-------------------------- MAX_NR_GENS -------------------------->|
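
As a minimal illustration of the tier mapping above (a userspace sketch,
not kernel code: it assumes LRU_REFS_WIDTH == 2 and MAX_NR_TIERS == 4, and
mirrors the reworked lru_tier_from_refs()/folio_lru_refs() with plain ints
standing in for folio->flags):

#include <stdbool.h>
#include <stdio.h>

#define LRU_REFS_WIDTH	2	/* assumed; the real value is config-dependent */
#define MAX_NR_TIERS	4U

/* smallest order such that (1 << order) >= n; 0 for n <= 1 */
static int order_base_2(unsigned int n)
{
	int order = 0;

	while ((1U << order) < n)
		order++;
	return order;
}

/*
 * refs is the total number of accesses through file descriptors, including
 * the one tracked by PG_referenced; refs == 0 means PG_referenced is clear.
 * The last tier is reserved for folios that also have PG_workingset set.
 */
static int tier_from_refs(int refs, bool workingset)
{
	return workingset ? MAX_NR_TIERS - 1 : order_base_2(refs);
}

int main(void)
{
	int refs;

	for (refs = 0; refs <= 1 << LRU_REFS_WIDTH; refs++)
		printf("refs=%d workingset=0 -> tier %d\n",
		       refs, tier_from_refs(refs, false));
	printf("any refs, workingset=1 -> tier %u\n", MAX_NR_TIERS - 1);
	return 0;
}

With these assumed values, refs 0 and 1 land in tier 0, refs 2 in tier 1,
refs 3 and 4 in tier 2, and PG_workingset puts a folio in the last tier
regardless of refs.
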
After this patch, some typical client and server workloads showed
improvements under heavy memory pressure. For example, Python TPC-C, which
was used to benchmark a different approach [1] to better detect refault
distances, showed a significant decrease in total refaults:

                            Before          After           Change
  Time (seconds)            10801           10801           0%
  Executed (transactions)   41472           43663           +5%
  workingset_nodes          109070          120244          +10%
  workingset_refault_anon   5019627         7281831         +45%
  workingset_refault_file   1294678786      554855564       -57%
  workingset_refault_total  1299698413      562137395       -57%

[1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@gmail.com/

Reported-by: Kairui Song
Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@mail.gmail.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Kalesh Singh
---
 include/linux/mm_inline.h |  88 +++++++++++------------
 include/linux/mmzone.h    |  82 +++++++++++++--------
 mm/swap.c                 |  24 +++----
 mm/vmscan.c               | 147 ++++++++++++++++++++++----------------
 mm/workingset.c           |  29 ++++----
 5 files changed, 204 insertions(+), 166 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 34e5097182a0..f9157a0c42a5 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -133,31 +133,25 @@ static inline int lru_hist_from_seq(unsigned long seq)
 	return seq % NR_HIST_GENS;
 }
 
-static inline int lru_tier_from_refs(int refs)
+static inline int lru_tier_from_refs(int refs, bool workingset)
 {
 	VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
 
-	/* see the comment in folio_lru_refs() */
-	return order_base_2(refs + 1);
+	/* see the comment on MAX_NR_TIERS */
+	return workingset ? MAX_NR_TIERS - 1 : order_base_2(refs);
 }
 
 static inline int folio_lru_refs(struct folio *folio)
 {
 	unsigned long flags = READ_ONCE(folio->flags);
-	bool workingset = flags & BIT(PG_workingset);
 
+	if (!(flags & BIT(PG_referenced)))
+		return 0;
 	/*
-	 * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
-	 * total number of accesses is N>1, since N=0,1 both map to the first
-	 * tier. lru_tier_from_refs() will account for this off-by-one. Also see
-	 * the comment on MAX_NR_TIERS.
+	 * Return the total number of accesses including PG_referenced. Also see
+	 * the comment on LRU_REFS_FLAGS.
	 */
-	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
-}
-
-static inline void folio_clear_lru_refs(struct folio *folio)
-{
-	set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+	return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + 1;
 }
 
 static inline int folio_lru_gen(struct folio *folio)
@@ -223,11 +217,43 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
 	VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
 }
 
+static inline unsigned long lru_gen_folio_seq(struct lruvec *lruvec, struct folio *folio,
+					      bool reclaiming)
+{
+	int gen;
+	int type = folio_is_file_lru(folio);
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
+
+	/*
+	 * +-----------------------------------+-----------------------------------+
+	 * | Accessed through page tables and  | Accessed through file descriptors |
+	 * | promoted by folio_update_gen()    | and protected by folio_inc_gen()  |
+	 * +-----------------------------------+-----------------------------------+
+	 * | PG_active (set while isolated)    |                                   |
+	 * +-----------------+-----------------+-----------------+-----------------+
+	 * |  PG_workingset  |  PG_referenced  |  PG_workingset  | LRU_REFS_FLAGS  |
+	 * +-----------------------------------+-----------------------------------+
+	 * |<---------- MIN_NR_GENS ---------->|                                   |
+	 * |<---------------------------- MAX_NR_GENS ---------------------------->|
+	 */
+	if (folio_test_active(folio))
+		gen = MIN_NR_GENS - folio_test_workingset(folio);
+	else if (reclaiming)
+		gen = MAX_NR_GENS;
+	else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
+		 (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+		gen = MIN_NR_GENS;
+	else
+		gen = MAX_NR_GENS - folio_test_workingset(folio);
+
+	return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
+}
+
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	unsigned long seq;
 	unsigned long flags;
-	unsigned long mask;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -237,40 +263,12 @@ static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio,
 
 	if (folio_test_unevictable(folio) || !lrugen->enabled)
 		return false;
-	/*
-	 * There are four common cases for this page:
-	 * 1. If it's hot, i.e., freshly faulted in, add it to the youngest
-	 *    generation, and it's protected over the rest below.
-	 * 2. If it can't be evicted immediately, i.e., a dirty page pending
-	 *    writeback, add it to the second youngest generation.
-	 * 3. If it should be evicted first, e.g., cold and clean from
-	 *    folio_rotate_reclaimable(), add it to the oldest generation.
-	 * 4. Everything else falls between 2 & 3 above and is added to the
-	 *    second oldest generation if it's considered inactive, or the
-	 *    oldest generation otherwise. See lru_gen_is_active().
-	 */
-	if (folio_test_active(folio))
-		seq = lrugen->max_seq;
-	else if ((type == LRU_GEN_ANON && !folio_test_swapcache(folio)) ||
-		 (folio_test_reclaim(folio) &&
-		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
-		seq = lrugen->max_seq - 1;
-	else if (reclaiming || lrugen->min_seq[type] + MIN_NR_GENS >= lrugen->max_seq)
-		seq = lrugen->min_seq[type];
-	else
-		seq = lrugen->min_seq[type] + 1;
 
+	seq = lru_gen_folio_seq(lruvec, folio, reclaiming);
 	gen = lru_gen_from_seq(seq);
 	flags = (gen + 1UL) << LRU_GEN_PGOFF;
 	/* see the comment on MIN_NR_GENS about PG_active */
-	mask = LRU_GEN_MASK;
-	/*
-	 * Don't clear PG_workingset here because it can affect PSI accounting
-	 * if the activation is due to workingset refault.
-	 */
-	if (folio_test_active(folio))
-		mask |= LRU_REFS_MASK | BIT(PG_referenced) | BIT(PG_active);
-	set_mask_bits(&folio->flags, mask, flags);
+	set_mask_bits(&folio->flags, LRU_GEN_MASK | BIT(PG_active), flags);
 
 	lru_gen_update_size(lruvec, folio, -1, gen);
 	/* for folio_rotate_reclaimable() */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8245ecb0400b..9540b41894da 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -332,66 +332,88 @@ enum lruvec_flags {
 #endif /* !__GENERATING_BOUNDS_H */
 
 /*
- * Evictable pages are divided into multiple generations. The youngest and the
+ * Evictable folios are divided into multiple generations. The youngest and the
  * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
  * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
  * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
  * corresponding generation. The gen counter in folio->flags stores gen+1 while
- * a page is on one of lrugen->folios[]. Otherwise it stores 0.
+ * a folio is on one of lrugen->folios[]. Otherwise it stores 0.
  *
- * A page is added to the youngest generation on faulting. The aging needs to
- * check the accessed bit at least twice before handing this page over to the
- * eviction. The first check takes care of the accessed bit set on the initial
- * fault; the second check makes sure this page hasn't been used since then.
- * This process, AKA second chance, requires a minimum of two generations,
- * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
- * LRU, e.g., /proc/vmstat, these two generations are considered active; the
- * rest of generations, if they exist, are considered inactive. See
- * lru_gen_is_active().
+ * After a folio is faulted in, the aging needs to check the accessed bit at
+ * least twice before handing this folio over to the eviction. The first check
+ * clears the accessed bit from the initial fault; the second check makes sure
+ * this folio hasn't been used since then. This process, AKA second chance,
+ * requires a minimum of two generations, hence MIN_NR_GENS. And to maintain ABI
+ * compatibility with the active/inactive LRU, e.g., /proc/vmstat, these two
+ * generations are considered active; the rest of generations, if they exist,
+ * are considered inactive. See lru_gen_is_active().
 *
- * PG_active is always cleared while a page is on one of lrugen->folios[] so
- * that the aging needs not to worry about it. And it's set again when a page
- * considered active is isolated for non-reclaiming purposes, e.g., migration.
- * See lru_gen_add_folio() and lru_gen_del_folio().
+ * PG_active is always cleared while a folio is on one of lrugen->folios[] so
+ * that the sliding window needs not to worry about it. And it's set again when
+ * a folio considered active is isolated for non-reclaiming purposes, e.g.,
+ * migration. See lru_gen_add_folio() and lru_gen_del_folio().
 *
 * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
 * number of categories of the active/inactive LRU when keeping track of
 * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
- * in folio->flags.
+ * in folio->flags, masked by LRU_GEN_MASK.
 */
 #define MIN_NR_GENS		2U
 #define MAX_NR_GENS		4U
 
 /*
- * Each generation is divided into multiple tiers. A page accessed N times
- * through file descriptors is in tier order_base_2(N). A page in the first tier
- * (N=0,1) is marked by PG_referenced unless it was faulted in through page
- * tables or read ahead. A page in any other tier (N>1) is marked by
- * PG_referenced and PG_workingset. This implies a minimum of two tiers is
- * supported without using additional bits in folio->flags.
+ * Each generation is divided into multiple tiers. A folio accessed N times
+ * through file descriptors is in tier order_base_2(N). A folio in the first
+ * tier (N=0,1) is marked by PG_referenced unless it was faulted in through page
+ * tables or read ahead. A folio in the last tier (MAX_NR_TIERS-1) is marked by
+ * PG_workingset. A folio in any other tier (1<N<MAX_NR_TIERS-1) is marked by
+ * PG_referenced and additional bits of LRU_REFS_WIDTH in folio->flags.
 *
 * In contrast to moving across generations which requires the LRU lock, moving
 * across tiers only involves atomic operations on folio->flags and therefore
 * has a negligible cost in the buffered access path. In the eviction path,
- * comparisons of refaulted/(evicted+protected) from the first tier and the
- * rest infer whether pages accessed multiple times through file descriptors
- * are statistically hot and thus worth protecting.
+ * comparisons of refaulted/(evicted+protected) from the first tier and the rest
+ * infer whether folios accessed multiple times through file descriptors are
+ * statistically hot and thus worth protecting.
 *
 * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
 * number of categories of the active/inactive LRU when keeping track of
 * accesses through file descriptors. This uses MAX_NR_TIERS-2 spare bits in
- * folio->flags.
+ * folio->flags, masked by LRU_REFS_MASK.
 */
 #define MAX_NR_TIERS		4U
 
 #ifndef __GENERATING_BOUNDS_H
 
-struct lruvec;
-struct page_vma_mapped_walk;
-
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
 
+/*
+ * For folios accessed multiple times through file descriptors,
+ * lru_gen_inc_refs() sets additional bits of LRU_REFS_WIDTH in folio->flags
+ * after PG_referenced, then PG_workingset after LRU_REFS_WIDTH. After all its
+ * bits are set, i.e., LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily
+ * promoted into the second oldest generation in the eviction path. And when
+ * folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
+ * lru_gen_inc_refs() can start over. Note that for this case, LRU_REFS_MASK is
+ * only valid when PG_referenced is set.
+ *
+ * For folios accessed multiple times through page tables, folio_update_gen()
+ * from a page table walk or lru_gen_set_refs() from a rmap walk sets
+ * PG_referenced after the accessed bit is cleared for the first time.
+ * Thereafter, those two paths set PG_workingset and promote folios to the
+ * youngest generation. Like folio_inc_gen(), folio_update_gen() also clears
+ * PG_referenced. Note that for this case, LRU_REFS_MASK is not used.
+ *
+ * For both cases above, after PG_workingset is set on a folio, it remains until
+ * this folio is either reclaimed, or "deactivated" by lru_gen_clear_refs(). It
+ * can be set again if lru_gen_test_recent() returns true upon a refault.
+ */
+#define LRU_REFS_FLAGS		(LRU_REFS_MASK | BIT(PG_referenced))
+
+struct lruvec;
+struct page_vma_mapped_walk;
+
 #ifdef CONFIG_LRU_GEN
 
 enum {
@@ -406,8 +428,6 @@ enum {
 	NR_LRU_GEN_CAPS
 };
 
-#define LRU_REFS_FLAGS		(BIT(PG_referenced) | BIT(PG_workingset))
-
 #define MIN_LRU_BATCH		BITS_PER_LONG
 #define MAX_LRU_BATCH		(MIN_LRU_BATCH * 64)
 
diff --git a/mm/swap.c b/mm/swap.c
index 649ef7f2b74b..746a5ceba42c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -387,24 +387,20 @@ static void lru_gen_inc_refs(struct folio *folio)
 	if (folio_test_unevictable(folio))
 		return;
 
+	/* see the comment on LRU_REFS_FLAGS */
 	if (!folio_test_referenced(folio)) {
-		folio_set_referenced(folio);
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
 		return;
 	}
 
-	if (!folio_test_workingset(folio)) {
-		folio_set_workingset(folio);
-		return;
-	}
-
-	/* see the comment on MAX_NR_TIERS */
 	do {
-		new_flags = old_flags & LRU_REFS_MASK;
-		if (new_flags == LRU_REFS_MASK)
-			break;
+		if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
+			if (!folio_test_workingset(folio))
+				folio_set_workingset(folio);
+			return;
+		}
 
-		new_flags += BIT(LRU_REFS_PGOFF);
-		new_flags |= old_flags & ~LRU_REFS_MASK;
+		new_flags = old_flags + BIT(LRU_REFS_PGOFF);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 }
 
@@ -417,7 +413,7 @@ static bool lru_gen_clear_refs(struct folio *folio)
 	if (gen < 0)
 		return true;
 
-	set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS, 0);
+	set_mask_bits(&folio->flags, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
 
 	lrugen = &folio_lruvec(folio)->lrugen;
 	/* whether can do without shuffling under the LRU lock */
@@ -499,7 +495,7 @@ void folio_add_lru(struct folio *folio)
 			folio_test_unevictable(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 
-	/* see the comment in lru_gen_add_folio() */
+	/* see the comment in lru_gen_folio_seq() */
 	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
 	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
 		folio_set_active(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a33221298fd0..74bc85fc7cdf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -862,6 +862,31 @@ enum folio_references {
 	FOLIOREF_ACTIVATE,
 };
 
+#ifdef CONFIG_LRU_GEN
+/*
+ * Only used on a mapped folio in the eviction (rmap walk) path, where promotion
+ * needs to be done by taking the folio off the LRU list and then adding it back
+ * with PG_active set. In contrast, the aging (page table walk) path uses
+ * folio_update_gen().
+ */
+static bool lru_gen_set_refs(struct folio *folio)
+{
+	/* see the comment on LRU_REFS_FLAGS */
+	if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
+		return false;
+	}
+
+	set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_workingset));
+	return true;
+}
+#else
+static bool lru_gen_set_refs(struct folio *folio)
+{
+	return false;
+}
+#endif /* CONFIG_LRU_GEN */
+
 static enum folio_references folio_check_references(struct folio *folio,
 						     struct scan_control *sc)
 {
@@ -870,7 +895,6 @@ static enum folio_references folio_check_references(struct folio *folio,
 
 	referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup,
 					   &vm_flags);
-	referenced_folio = folio_test_clear_referenced(folio);
 
 	/*
 	 * The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
@@ -888,6 +912,15 @@ static enum folio_references folio_check_references(struct folio *folio,
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;
 
+	if (lru_gen_enabled()) {
+		if (!referenced_ptes)
+			return FOLIOREF_RECLAIM;
+
+		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
+	}
+
+	referenced_folio = folio_test_clear_referenced(folio);
+
 	if (referenced_ptes) {
 		/*
 		 * All mapped folios start out with page table
@@ -1092,11 +1125,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (!sc->may_unmap && folio_mapped(folio))
 			goto keep_locked;
 
-		/* folio_update_gen() tried to promote this page? */
-		if (lru_gen_enabled() && !ignore_references &&
-		    folio_mapped(folio) && folio_test_referenced(folio))
-			goto keep_locked;
-
 		/*
 		 * The number of dirty pages determines if a node is marked
 		 * reclaim_congested. kswapd will stall and start writing
@@ -3167,16 +3195,19 @@ static int folio_update_gen(struct folio *folio, int gen)
 
 	VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
 
+	/* see the comment on LRU_REFS_FLAGS */
+	if (!folio_test_referenced(folio) && !folio_test_workingset(folio)) {
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, BIT(PG_referenced));
+		return -1;
+	}
+
 	do {
 		/* lru_gen_del_folio() has isolated this page? */
-		if (!(old_flags & LRU_GEN_MASK)) {
-			/* for shrink_folio_list() */
-			new_flags = old_flags | BIT(PG_referenced);
-			continue;
-		}
+		if (!(old_flags & LRU_GEN_MASK))
+			return -1;
 
-		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
-		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
+		new_flags |= ((gen + 1UL) << LRU_GEN_PGOFF) | BIT(PG_workingset);
 	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
 
 	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
@@ -3200,7 +3231,7 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 
 	new_gen = (old_gen + 1) % MAX_NR_GENS;
 
-	new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+	new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
 	new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
 	/* for folio_end_writeback() */
 	if (reclaiming)
@@ -3378,9 +3409,11 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
 static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 				   struct pglist_data *pgdat)
 {
-	struct folio *folio;
+	struct folio *folio = pfn_folio(pfn);
+
+	if (folio_lru_gen(folio) < 0)
+		return NULL;
 
-	folio = pfn_folio(pfn);
 	if (folio_nid(folio) != pgdat->node_id)
 		return NULL;
 
@@ -3757,8 +3790,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 	while (!list_empty(head)) {
 		struct folio *folio = lru_to_folio(head);
 		int refs = folio_lru_refs(folio);
-		int tier = lru_tier_from_refs(refs);
-		int delta = folio_nr_pages(folio);
+		bool workingset = folio_test_workingset(folio);
 
 		VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
@@ -3768,8 +3800,14 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 		new_gen = folio_inc_gen(lruvec, folio, false);
 		list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
-		WRITE_ONCE(lrugen->protected[hist][type][tier],
-			   lrugen->protected[hist][type][tier] + delta);
+		/* don't count the workingset being lazily promoted */
+		if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1) {
+			int tier = lru_tier_from_refs(refs, workingset);
+			int delta = folio_nr_pages(folio);
+
+			WRITE_ONCE(lrugen->protected[hist][type][tier],
+				   lrugen->protected[hist][type][tier] + delta);
+		}
 
 		if (!--remaining)
 			return false;
@@ -4155,16 +4193,10 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 			old_gen = folio_update_gen(folio, new_gen);
 			if (old_gen >= 0 && old_gen != new_gen)
 				update_batch_size(walk, folio, old_gen, new_gen);
-
-			continue;
-		}
-
-		old_gen = folio_lru_gen(folio);
-		if (old_gen < 0)
-			folio_set_referenced(folio);
-		else if (old_gen != new_gen) {
-			folio_clear_lru_refs(folio);
-			folio_activate(folio);
+		} else if (lru_gen_set_refs(folio)) {
+			old_gen = folio_lru_gen(folio);
+			if (old_gen >= 0 && old_gen != new_gen)
+				folio_activate(folio);
 		}
 	}
 
@@ -4325,7 +4357,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	int zone = folio_zonenum(folio);
 	int delta = folio_nr_pages(folio);
 	int refs = folio_lru_refs(folio);
-	int tier = lru_tier_from_refs(refs);
+	bool workingset = folio_test_workingset(folio);
+	int tier = lru_tier_from_refs(refs, workingset);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
 	VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
@@ -4347,14 +4380,17 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* protected */
-	if (tier > tier_idx || refs == BIT(LRU_REFS_WIDTH)) {
-		int hist = lru_hist_from_seq(lrugen->min_seq[type]);
-
+	if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
 		gen = folio_inc_gen(lruvec, folio, false);
-		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 
-		WRITE_ONCE(lrugen->protected[hist][type][tier],
-			   lrugen->protected[hist][type][tier] + delta);
+		/* don't count the workingset being lazily promoted */
+		if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1) {
+			int hist = lru_hist_from_seq(lrugen->min_seq[type]);
+
+			WRITE_ONCE(lrugen->protected[hist][type][tier],
+				   lrugen->protected[hist][type][tier] + delta);
+		}
 		return true;
 	}
 
@@ -4374,8 +4410,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 	}
 
 	/* waiting for writeback */
-	if (folio_test_locked(folio) || writeback ||
-	    (type == LRU_GEN_FILE && dirty)) {
+	if (writeback || (type == LRU_GEN_FILE && dirty)) {
 		gen = folio_inc_gen(lruvec, folio, true);
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
@@ -4404,13 +4439,12 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 		return false;
 	}
 
-	/* see the comment on MAX_NR_TIERS */
+	/* see the comment on LRU_REFS_FLAGS */
 	if (!folio_test_referenced(folio))
-		folio_clear_lru_refs(folio);
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, 0);
 
 	/* for shrink_folio_list() */
 	folio_clear_reclaim(folio);
-	folio_clear_referenced(folio);
 
 	success = lru_gen_del_folio(lruvec, folio, true);
 	VM_WARN_ON_ONCE_FOLIO(!success, folio);
@@ -4600,31 +4634,24 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 					type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
+		DEFINE_MIN_SEQ(lruvec);
+
 		if (!folio_evictable(folio)) {
 			list_del(&folio->lru);
 			folio_putback_lru(folio);
 			continue;
 		}
 
-		if (folio_test_reclaim(folio) &&
-		    (folio_test_dirty(folio) || folio_test_writeback(folio))) {
-			/* restore LRU_REFS_FLAGS cleared by isolate_folio() */
-			if (folio_test_workingset(folio))
-				folio_set_referenced(folio);
-			continue;
-		}
-
-		if (skip_retry || folio_test_active(folio) || folio_test_referenced(folio) ||
-		    folio_mapped(folio) || folio_test_locked(folio) ||
-		    folio_test_dirty(folio) || folio_test_writeback(folio)) {
-			/* don't add rejected folios to the oldest generation */
-			set_mask_bits(&folio->flags, LRU_REFS_MASK | LRU_REFS_FLAGS,
-				      BIT(PG_active));
-			continue;
-		}
-
 		/* retry folios that may have missed folio_rotate_reclaimable() */
-		list_move(&folio->lru, &clean);
+		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
+		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
+			list_move(&folio->lru, &clean);
+			continue;
+		}
+
+		/* don't add rejected folios to the oldest generation */
+		if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
+			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
 	}
 
 	spin_lock_irq(&lruvec->lru_lock);
diff --git a/mm/workingset.c b/mm/workingset.c
index 2c310c29f51e..4841ae8af411 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -239,7 +239,8 @@ static void *lru_gen_eviction(struct folio *folio)
 	int type = folio_is_file_lru(folio);
 	int delta = folio_nr_pages(folio);
 	int refs = folio_lru_refs(folio);
-	int tier = lru_tier_from_refs(refs);
+	bool workingset = folio_test_workingset(folio);
+	int tier = lru_tier_from_refs(refs, workingset);
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
 
@@ -253,7 +254,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
 
-	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
+	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
 }
 
 /*
@@ -304,24 +305,20 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	lrugen = &lruvec->lrugen;
 
 	hist = lru_hist_from_seq(READ_ONCE(lrugen->min_seq[type]));
-	/* see the comment in folio_lru_refs() */
-	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
-	tier = lru_tier_from_refs(refs);
+	refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + 1;
+	tier = lru_tier_from_refs(refs, workingset);
 
 	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
 
-	/*
-	 * Count the following two cases as stalls:
-	 * 1. For pages accessed through page tables, hotter pages pushed out
-	 *    hot pages which refaulted immediately.
-	 * 2. For pages accessed multiple times through file descriptors,
-	 *    they would have been protected by sort_folio().
-	 */
-	if (lru_gen_in_fault() || refs >= BIT(LRU_REFS_WIDTH) - 1) {
-		set_mask_bits(&folio->flags, 0, LRU_REFS_MASK | BIT(PG_workingset));
+	/* see folio_add_lru() where folio_set_active() will be called */
+	if (lru_gen_in_fault())
+		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+
+	if (workingset) {
+		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
-	}
+	} else
+		set_mask_bits(&folio->flags, LRU_REFS_MASK, (refs - 1UL) << LRU_REFS_PGOFF);
 unlock:
 	rcu_read_unlock();
 }
-- 
2.47.1.613.gc27f4b7a9f-goog