From: "Barry Song (Xiaomi)"
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, "Barry Song (Xiaomi)", Lance Yang,
 Xueyuan Chen, Pedro Falcato, Kairui Song, Qi Zheng, Shakeel Butt,
 wangzicheng, Suren Baghdasaryan, Lei Liu, Matthew Wilcox,
 Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon, Kalesh Singh
Subject: [PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active
Date: Mon, 25 May 2026 20:32:05 +0800
Message-Id: <20260525123205.51874-1-baohua@kernel.org>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)

MGLRU gives high priority to folios mapped in page tables. As a result,
folio_set_active() is invoked for all folios read during page faults.
In practice, however, readahead can bring in many folios that are never
accessed via page tables.

A previous attempt by Lei Liu proposed introducing a separate LRU for
readahead[1] to make readahead pages easier to reclaim, but that
approach is likely over-engineered.

Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
folios with PG_active were always placed in the youngest generation,
leading to over-protection and increased refaults. After that commit,
PG_active folios are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In contrast, the
classic active/inactive scheme is more conservative.

This patch switches to using folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs, the folio will
be promoted based on the reference flag set by folio_mark_accessed().
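
To make the fault-path change concrete, here is a minimal user-space
sketch of the flag handling described above. It is not kernel code:
struct folio_flags, enum ref_result and all function names are
hypothetical, and two things are assumptions rather than facts taken
from this patch: that folio_mark_accessed() on a folio with no recorded
reference simply sets the referenced flag, and that the reclaim-time
check behaves as modelled in check_references().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three folio flags discussed above. */
struct folio_flags {
	bool active;     /* models PG_active */
	bool referenced; /* models PG_referenced */
	bool workingset; /* models PG_workingset */
};

/* Outcomes of the reclaim-time reference check (names are illustrative). */
enum ref_result { REF_RECLAIM, REF_KEEP, REF_ACTIVATE };

/* Before the patch: every folio added from the fault path is activated. */
void fault_add_before(struct folio_flags *f)
{
	f->active = true;
}

/*
 * After the patch: only refaulted workingset folios are activated.
 * Assumption: for a folio with no reference recorded yet,
 * folio_mark_accessed() amounts to setting the referenced flag.
 */
void fault_add_after(struct folio_flags *f)
{
	if (f->workingset)
		f->active = true;
	else if (!f->referenced)
		f->referenced = true;
}

/*
 * Assumed, simplified reclaim-time check: without referenced PTEs the
 * folio is reclaimable; with them, an earlier recorded reference (or
 * workingset status) leads to promotion.
 */
enum ref_result check_references(struct folio_flags *f, bool pte_referenced)
{
	if (!pte_referenced)
		return REF_RECLAIM;
	if (!f->referenced && !f->workingset) {
		f->referenced = true;
		return REF_KEEP;
	}
	return REF_ACTIVATE;
}

/* Self-check covering a readahead-only folio and a really mapped one. */
void fault_model_selfcheck(void)
{
	struct folio_flags old = { 0 }, readahead = { 0 }, mapped = { 0 };

	fault_add_before(&old);
	assert(old.active); /* protected immediately, used or not */

	fault_add_after(&readahead);
	assert(!readahead.active && readahead.referenced);
	assert(check_references(&readahead, false) == REF_RECLAIM);

	fault_add_after(&mapped);
	assert(check_references(&mapped, true) == REF_ACTIVATE);
}
```

Calling fault_model_selfcheck() from a small main() exercises both
cases: a folio brought in by readahead but never touched through page
tables stays reclaimable, while one with referenced PTEs is promoted.
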
We also need to adjust WORKINGSET_ACTIVATE accounting and
lru_gen_folio_seq() accordingly; for example, anon folios should no
longer be protected unconditionally.

The following uses a simple model to demonstrate why the current code
is not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
pattern (4KB every 64KB) to simulate prefaulted pages that may not be
accessed.

#!/bin/bash

CG_NAME="mglru_verify_test"
CG_PATH="/sys/fs/cgroup/$CG_NAME"
MEM_LIMIT="400M"
HOT_SIZE="600M"

# 1. Environment Setup
sudo rmdir "$CG_PATH" 2>/dev/null
sudo mkdir -p "$CG_PATH"
sudo chown -R $USER:$USER "$CG_PATH"
echo "$MEM_LIMIT" > "$CG_PATH/memory.max"

# 2. Prepare Data Files
dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
sync
echo 3 > /proc/sys/vm/drop_caches

# 3. Start Workload (Working Set)
(
    echo $BASHPID > "$CG_PATH/cgroup.procs"
    exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
        --zonemode=strided --zonesize=4K --zonerange=64K \
        --time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
        --fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
) &
WORKLOAD_PID=$!

# 4. Wait for the hot data to warm up
sleep 30
BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')

# 5. Run the workload for 60 seconds
sleep 60

# 6. Report refault and IO bandwidth
FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
echo "File Refault Delta is $FINAL_D_FILE"

kill $WORKLOAD_PID 2>/dev/null
sleep 2
grep -E "READ|WRITE" fio.stats \
    | awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
rm -f hot_data.bin fio.stats

Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
positions, continuously pushing out the real working set and causing
incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
increased to 5078 MiB/s.

Running the same test on x86:

w/o patch:
File Refault Delta is 3240029
READ: bw=13.2MiB/s io=1183MiB

w/ patch:
File Refault Delta is 0
READ: bw=7708MiB/s io=676GiB

On x86, we also ran a kernel build inside a memcg with a 1GB memory
limit using 20 threads:

w/o patch:
real 1m50.764s
user 25m32.305s
sys 4m0.012s
pswpin: 1333245
pswpout: 4366443
pgpgin: 6962592
pgpgout: 17780712
swpout_zero: 1019603
swpin_zero: 14764
refault_file: 287794
refault_anon: 1347963

w/ patch:
real 1m48.915s
user 25m31.261s
sys 3m43.685s
pswpin: 915629
pswpout: 3207173
pgpgin: 5249268
pgpgout: 13154492
swpout_zero: 816100
swpin_zero: 15676
refault_file: 257271
refault_anon: 931259

active/inactive LRU:
real 1m49.928s
user 25m28.196s
sys 3m40.740s
pswpin: 463452
pswpout: 2309119
pgpgin: 4438856
pgpgout: 9568628
swpout_zero: 743704
swpin_zero: 7244
refault_file: 562555
refault_anon: 470694

Lance and Xueyuan made a huge contribution to this patch through testing.
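
As a back-of-the-envelope check on the strided fio test above, the
sketch below works out how much of the 600M file is actually read when
4KB is taken from every 64KB zone, and how the reported bandwidths
compare. The function names are hypothetical, and the sketch assumes
that the "M" and "K" suffixes used by the script denote MiB and KiB.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Bytes read when zone_size bytes are taken from each zone_range-sized zone. */
uint64_t strided_touched_bytes(uint64_t file_bytes, uint64_t zone_range,
			       uint64_t zone_size)
{
	return file_bytes / zone_range * zone_size;
}

/* Ratio between two bandwidth figures given in the same unit. */
double bw_ratio(double with_patch, double without_patch)
{
	return with_patch / without_patch;
}

/* Self-check using only numbers quoted in the test description. */
void strided_selfcheck(void)
{
	uint64_t touched = strided_touched_bytes(600 * MIB, 64 * KIB, 4 * KIB);

	/* 9600 zones of 4 KiB each: 37.5 MiB is really accessed. */
	assert(touched == 39321600ULL);
	/* The accessed set fits in the 400M limit; the whole file does not. */
	assert(touched < 400 * MIB && 600 * MIB > 400 * MIB);
	/* 58.5 MiB/s -> 5078 MiB/s is roughly an 87x improvement. */
	assert(bw_ratio(5078.0, 58.5) > 86.0 && bw_ratio(5078.0, 58.5) < 87.0);
}
```

Only 37.5 MiB of the file is ever read, which fits comfortably in the
400M limit, so zero refaults is achievable as long as pages that were
prefaulted but never accessed are the ones reclaimed first.
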
[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/

Signed-off-by: Barry Song (Xiaomi)
Tested-by: Lance Yang
Tested-by: Xueyuan Chen
Cc: Pedro Falcato
Cc: Kairui Song
Cc: Qi Zheng
Cc: Shakeel Butt
Cc: wangzicheng
Cc: Suren Baghdasaryan
Cc: Lei Liu
Cc: Matthew Wilcox (Oracle)
Cc: Axel Rasmussen
Cc: Yuanchu Xie
Cc: Wei Xu
Cc: Will Deacon
Cc: Kalesh Singh
---
-v2:
 * Fix WORKINGSET_ACTIVATE - workingset folios will be set active during
   refault;
 * Avoid unconditionally protecting anon folios in lru_gen_folio_seq();
 * Also adjust the setting of workingset accordingly in
   folio_check_references().
-v1:
 https://lore.kernel.org/linux-mm/20260418120233.7162-1-baohua@kernel.org/
-rfc was:
 [PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
 https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/

 include/linux/mm_inline.h | 7 +++----
 mm/swap.c                 | 8 ++++++--
 mm/vmscan.c               | 3 ++-
 mm/workingset.c           | 8 ++++----
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..c637e679a450 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
 		gen = MIN_NR_GENS - folio_test_workingset(folio);
 	else if (reclaiming)
 		gen = MAX_NR_GENS;
-	else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
-		 (folio_test_reclaim(folio) &&
-		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+	else if (folio_test_reclaim(folio) &&
+		 (folio_test_dirty(folio) || folio_test_writeback(folio)))
 		gen = MIN_NR_GENS;
 	else
-		gen = MAX_NR_GENS - folio_test_workingset(folio);
+		gen = MAX_NR_GENS - (folio_test_workingset(folio) || folio_test_referenced(folio));
 
 	return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
 }
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..f320f93d60df 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
 
 	/* see the comment in lru_gen_folio_seq() */
 	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
-	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
-		folio_set_active(folio);
+	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
+		if (folio_test_workingset(folio))
+			folio_set_active(folio);
+		else if (!folio_test_referenced(folio))
+			folio_mark_accessed(folio);
+	}
 
 	folio_batch_add_and_move(folio, lru_add);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e452cb043d46..48355f10542b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
 		return false;
 	}
 
-	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
+	if (folio_test_active(folio))
+		set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
 	return true;
 }
 #else
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..2f0c08aa8623 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
 
-	/* see folio_add_lru() where folio_set_active() will be called */
-	if (lru_gen_in_fault())
-		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
-
 	if (workingset) {
+		if (lru_gen_in_fault()) {
+			folio_set_active(folio);
+			mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		}
 		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
 	} else
-- 
2.34.1
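
The generation choice touched by the mm_inline.h hunk can be modelled
in user space as follows. This is a sketch under stated assumptions,
not the kernel function: struct folio_state and both function names are
hypothetical, MIN_NR_GENS and MAX_NR_GENS are taken to be 2 and 4, and
the last branch is written with explicit parentheses because "-" binds
tighter than "||" in C. A smaller gen means a younger generation, since
the sequence number is computed as max_seq - gen + 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values of the kernel constants of the same name. */
#define MIN_NR_GENS 2
#define MAX_NR_GENS 4

/* Hypothetical flat view of the folio state that the function reads. */
struct folio_state {
	bool active;            /* PG_active */
	bool workingset;        /* PG_workingset */
	bool referenced;        /* PG_referenced */
	bool reclaim_wb;        /* PG_reclaim and (dirty or writeback) */
	bool anon_no_swapcache; /* anon folio that is not in the swap cache */
};

/* Generation offset before the patch. */
int gen_before(const struct folio_state *s, bool reclaiming)
{
	if (s->active)
		return MIN_NR_GENS - s->workingset;
	if (reclaiming)
		return MAX_NR_GENS;
	if (s->anon_no_swapcache || s->reclaim_wb)
		return MIN_NR_GENS;
	return MAX_NR_GENS - s->workingset;
}

/* Generation offset after the patch: anon folios get no special case. */
int gen_after(const struct folio_state *s, bool reclaiming)
{
	if (s->active)
		return MIN_NR_GENS - s->workingset;
	if (reclaiming)
		return MAX_NR_GENS;
	if (s->reclaim_wb)
		return MIN_NR_GENS;
	return MAX_NR_GENS - (s->workingset || s->referenced);
}

/* Self-check for a fresh anon folio and a referenced file folio. */
void gen_model_selfcheck(void)
{
	struct folio_state anon = { .anon_no_swapcache = true };
	struct folio_state file_ref = { .referenced = true };

	assert(gen_before(&anon, false) == 2 && gen_after(&anon, false) == 4);
	assert(gen_before(&file_ref, false) == 4);
	assert(gen_after(&file_ref, false) == 3);
}
```

In this model a fresh anon folio moves from the second youngest
generation to the oldest one, while a folio that already carries the
referenced flag sits one generation above the oldest.
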