[v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

[PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

Posted by Barry Song (Xiaomi) 2 weeks ago

MGLRU gives high priority to folios mapped in page tables. As a result,
folio_set_active() is invoked for all folios read during page faults. In
practice, however, readahead can bring in many folios that are never
accessed via page tables.

A previous attempt by Lei Liu proposed introducing a separate LRU for
readahead[1] to make readahead pages easier to reclaim, but that approach
is likely over-engineered.

Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
folios with PG_active were always placed in the youngest generation,
leading to over-protection and increased refaults. After that commit,
PG_active folios are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In contrast, the
classic active/inactive scheme is more conservative.

This patch switches to using folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs, the folio
will be promoted based on the reference flag set by
folio_mark_accessed(). We should also adjust
WORKINGSET_ACTIVATE and lru_gen_folio_seq(); for example, we should
not unconditionally protect anon folios accordingly.

The following uses a simple model to demonstrate why the current code is
not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
pattern—4KB every 64KB—to simulate prefaulted pages that may not be
accessed.

 #!/bin/bash

 CG_NAME="mglru_verify_test"
 CG_PATH="/sys/fs/cgroup/$CG_NAME"
 MEM_LIMIT="400M"
 HOT_SIZE="600M"

 # 1. Environment Setup
 sudo rmdir "$CG_PATH" 2>/dev/null
 sudo mkdir -p "$CG_PATH"
 sudo chown -R $USER:$USER "$CG_PATH"
 echo "$MEM_LIMIT" > "$CG_PATH/memory.max"

 # 2. Prepare Data Files
 dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
 sync
 echo 3 > /proc/sys/vm/drop_caches

 # 3. Start Workload (Working Set)
 (
     echo $BASHPID > "$CG_PATH/cgroup.procs"
     exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
          --zonemode=strided --zonesize=4K --zonerange=64K \
         --time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
          --fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
 ) &
 WORKLOAD_PID=$!

 # 4. Waiting for hot data to warm up
 sleep 30
 BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')

 # 5. Running workload for 60second
 sleep 60

 # 6. Report refault and IO bandwidth
 FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
 FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
 echo "File Refault Delta is $FINAL_D_FILE"

 kill $WORKLOAD_PID 2>/dev/null
 sleep 2
 grep -E "READ|WRITE" fio.stats \
 | awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
 rm -f hot_data.bin fio.stats

Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
positions, continuously pushing out the real working set and causing
incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
increased to 5078 MiB/s.

Running the same test on x86:
w/o patch:
File Refault Delta is 3240029
READ: bw=13.2MiB/s io=1183MiB

w/ patch:
File Refault Delta is 0
READ: bw=7708MiB/s io=676GiB

On x86, running a kernel build inside a memcg with a 1GB memory
limit using 20 threads.

w/o patch:
real	1m50.764s
user	25m32.305s
sys	4m0.012s
pswpin: 1333245
pswpout: 4366443
pgpgin: 6962592
pgpgout: 17780712
swpout_zero: 1019603
swpin_zero: 14764
refault_file: 287794
refault_anon: 1347963

w/ patch:
real	1m48.915s
user	25m31.261s
sys	3m43.685s
pswpin: 915629
pswpout: 3207173
pgpgin: 5249268
pgpgout: 13154492
swpout_zero: 816100
swpin_zero: 15676
refault_file: 257271
refault_anon: 931259

active/inactive LRU:

real	1m49.928s
user	25m28.196s
sys	3m40.740s
pswpin: 463452
pswpout: 2309119
pgpgin: 4438856
pgpgout: 9568628
swpout_zero: 743704
swpin_zero: 7244
refault_file: 562555
refault_anon: 470694

Lance and Xueyuan made a huge contribution to this patch through testing.

[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Tested-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Kairui Song <kasong@tencent.com>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: wangzicheng <wangzicheng@honor.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lei Liu <liulei.rjpt@vivo.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
---
 -v2:
 * Fix WORKINGSET_ACTIVATE - workingset will be set to active during refault;
 * Avoid unconditional protecting anon folios in lru_gen_folio_seq();
 * Also adjusted workingset set accordingly in folio_check_references().

 -v1:
 https://lore.kernel.org/linux-mm/20260418120233.7162-1-baohua@kernel.org/

 -rfc was:
 [PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
 https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/

 include/linux/mm_inline.h | 7 +++----
 mm/swap.c                 | 8 ++++++--
 mm/vmscan.c               | 3 ++-
 mm/workingset.c           | 8 ++++----
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..c637e679a450 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
 		gen = MIN_NR_GENS - folio_test_workingset(folio);
 	else if (reclaiming)
 		gen = MAX_NR_GENS;
-	else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
-		 (folio_test_reclaim(folio) &&
-		  (folio_test_dirty(folio) || folio_test_writeback(folio))))
+	else if (folio_test_reclaim(folio) &&
+		  (folio_test_dirty(folio) || folio_test_writeback(folio)))
 		gen = MIN_NR_GENS;
 	else
-		gen = MAX_NR_GENS - folio_test_workingset(folio);
+		gen = MAX_NR_GENS - folio_test_workingset(folio) || folio_test_referenced(folio);
 
 	return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
 }
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..f320f93d60df 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
 
 	/* see the comment in lru_gen_folio_seq() */
 	if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
-	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
-		folio_set_active(folio);
+	    lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
+		if (folio_test_workingset(folio))
+			folio_set_active(folio);
+		else if (!folio_test_referenced(folio))
+			folio_mark_accessed(folio);
+	}
 
 	folio_batch_add_and_move(folio, lru_add);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e452cb043d46..48355f10542b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
 		return false;
 	}
 
-	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
+	if (folio_test_active(folio))
+		set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
 	return true;
 }
 #else
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..2f0c08aa8623 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
 
-	/* see folio_add_lru() where folio_set_active() will be called */
-	if (lru_gen_in_fault())
-		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
-
 	if (workingset) {
+		if (lru_gen_in_fault()) {
+			folio_set_active(folio);
+			mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+		}
 		folio_set_workingset(folio);
 		mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
 	} else
-- 
2.34.1

Re: [PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

Posted by Kairui Song 2 weeks ago

On Mon, May 25, 2026 at 8:32 PM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> MGLRU gives high priority to folios mapped in page tables. As a result,
> folio_set_active() is invoked for all folios read during page faults. In
> practice, however, readahead can bring in many folios that are never
> accessed via page tables.
>
> A previous attempt by Lei Liu proposed introducing a separate LRU for
> readahead[1] to make readahead pages easier to reclaim, but that approach
> is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
> folios with PG_active were always placed in the youngest generation,
> leading to over-protection and increased refaults. After that commit,
> PG_active folios are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In contrast, the
> classic active/inactive scheme is more conservative.
>
> This patch switches to using folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs, the folio
> will be promoted based on the reference flag set by
> folio_mark_accessed(). We should also adjust
> WORKINGSET_ACTIVATE and lru_gen_folio_seq(); for example, we should
> not unconditionally protect anon folios accordingly.
>
> The following uses a simple model to demonstrate why the current code is
> not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
> pattern—4KB every 64KB—to simulate prefaulted pages that may not be
> accessed.

Thanks so much for the update! This looks much better than V1 RFC.

> Without the patch, we observed 12883855 file refaults and a very low
> bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
> positions, continuously pushing out the real working set and causing
> incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
> increased to 5078 MiB/s.
>
> Running the same test on x86:
> w/o patch:
> File Refault Delta is 3240029
> READ: bw=13.2MiB/s io=1183MiB
>
> w/ patch:
> File Refault Delta is 0
> READ: bw=7708MiB/s io=676GiB

Really nice result here.

> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index a171070e15f0..c637e679a450 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
>                 gen = MIN_NR_GENS - folio_test_workingset(folio);
>         else if (reclaiming)
>                 gen = MAX_NR_GENS;
> -       else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
> -                (folio_test_reclaim(folio) &&
> -                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
> +       else if (folio_test_reclaim(folio) &&
> +                 (folio_test_dirty(folio) || folio_test_writeback(folio)))

This change is more than just "use folio_mark_accessed to replace
folio_set_active", you are changing the pattern for all anon folios
here? But I think we are mostly working with file folios here
especially in the first test you posted?

I do like your new if condition though, but maybe it's not the right
time, or better add some words about it, how anon are handled now.

>                 gen = MIN_NR_GENS;
>         else
> -               gen = MAX_NR_GENS - folio_test_workingset(folio);
> +               gen = MAX_NR_GENS - folio_test_workingset(folio) || folio_test_referenced(folio);

Nice, this aligns exactly with what I want here :), see lru_gen_folio_seq:
https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-6-913619b014d9@tencent.com/

But I think you got the precedence wrong? Did you mean:
(MAX_NR_GENS - folio_test_workingset(folio)) || folio_test_referenced(folio)

Or?
MAX_NR_GENS - (folio_test_workingset(folio) || folio_test_referenced(folio))

>
>         return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
>  }
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..f320f93d60df 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
>
>         /* see the comment in lru_gen_folio_seq() */
>         if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> -           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> -               folio_set_active(folio);
> +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
> +               if (folio_test_workingset(folio))
> +                       folio_set_active(folio);

What if we also move this set_active to mm/workingset.c? I think that
the only place where a folio with PG_workingset can get here?

> +               else if (!folio_test_referenced(folio))
> +                       folio_mark_accessed(folio);
> +       }

Same here? Also this folio_mark_accessed only means
folio_set_referenced here, which matches the folio_test_referenced
check you just added in lru_gen_folio_seq right?

>
>         folio_batch_add_and_move(folio, lru_add);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e452cb043d46..48355f10542b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
>                 return false;
>         }
>
> -       set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
> +       if (folio_test_active(folio))
> +               set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));

This looks a bit strange, PG_active is always unset for on LRU folios
for MGLRU right?

There is even a comment for that:
"PG_active is always cleared while a folio is on one of lrugen->folios[]"

And lru_gen_set_refs is mostly for on-LRU folios or isolated ones,
either would have this flag. So PG_workingset seems broken by this
point, rmap walk no longer get PG_workingset set, which removes the
"promote on 2nd access" mechanism entirely?

>         return true;
>  }
>  #else
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 07e6836d0502..2f0c08aa8623 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
>
>         atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
>
> -       /* see folio_add_lru() where folio_set_active() will be called */
> -       if (lru_gen_in_fault())
> -               mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> -
>         if (workingset) {
> +               if (lru_gen_in_fault()) {
> +                       folio_set_active(folio);
> +                       mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> +               }

Overall I agree with this directly that file folios shouldn't get over
protected. It will be even better if we actually activate the folio
using things like refault distance... Which can be done later.

BTW, this would break reactivation if we age periodically. That is
less of a problem compared to overprotection, I think.

And do you think we can keep a folio_set_active if folio_test_anon ==
true in swap.c? Proactive activation relies on the assumption that
applications never expect an jitter for ordinary memory access, which
holds true for anon folios, but file folios don't follow that idea.

And I'm not trying to seize the topic but my previous experiment using
refault distance-based reactivation resulted in a 500% performance
gain for some databases with no other regressions observed:
https://lwn.net/Articles/945266/

So for this patch I think it's on the right track, we can definitely
merge some follow up of this patch first, and then refine the pattern
step by step toward refault distance, MGLRU-FG or both plus others.
That will take a long time, so let's fix the low-hanging fruit first.

Really looking forward to a few more updates from this patch :) !

Re: [PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

Posted by Barry Song 2 weeks ago

On Tue, May 26, 2026 at 1:58 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, May 25, 2026 at 8:32 PM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
> >
> > MGLRU gives high priority to folios mapped in page tables. As a result,
> > folio_set_active() is invoked for all folios read during page faults. In
> > practice, however, readahead can bring in many folios that are never
> > accessed via page tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate LRU for
> > readahead[1] to make readahead pages easier to reclaim, but that approach
> > is likely over-engineered.
> >
> > Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
> > folios with PG_active were always placed in the youngest generation,
> > leading to over-protection and increased refaults. After that commit,
> > PG_active folios are placed in the second youngest generation, which is
> > still too optimistic given the presence of readahead. In contrast, the
> > classic active/inactive scheme is more conservative.
> >
> > This patch switches to using folio_mark_accessed(). If
> > folio_check_references() later detects referenced PTEs, the folio
> > will be promoted based on the reference flag set by
> > folio_mark_accessed(). We should also adjust
> > WORKINGSET_ACTIVATE and lru_gen_folio_seq(); for example, we should
> > not unconditionally protect anon folios accordingly.
> >
> > The following uses a simple model to demonstrate why the current code is
> > not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
> > pattern—4KB every 64KB—to simulate prefaulted pages that may not be
> > accessed.
>
> Thanks so much for the update! This looks much better than V1 RFC.
>
> > Without the patch, we observed 12883855 file refaults and a very low
> > bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
> > positions, continuously pushing out the real working set and causing
> > incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
> > increased to 5078 MiB/s.
> >
> > Running the same test on x86:
> > w/o patch:
> > File Refault Delta is 3240029
> > READ: bw=13.2MiB/s io=1183MiB
> >
> > w/ patch:
> > File Refault Delta is 0
> > READ: bw=7708MiB/s io=676GiB
>
> Really nice result here.
>
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index a171070e15f0..c637e679a450 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
> >                 gen = MIN_NR_GENS - folio_test_workingset(folio);
> >         else if (reclaiming)
> >                 gen = MAX_NR_GENS;
> > -       else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
> > -                (folio_test_reclaim(folio) &&
> > -                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
> > +       else if (folio_test_reclaim(folio) &&
> > +                 (folio_test_dirty(folio) || folio_test_writeback(folio)))
>
> This change is more than just "use folio_mark_accessed to replace
> folio_set_active", you are changing the pattern for all anon folios
> here? But I think we are mostly working with file folios here
> especially in the first test you posted?
>
> I do like your new if condition though, but maybe it's not the right
> time, or better add some words about it, how anon are handled now.

Right. For anon folios, the generation is currently hardcoded to
MIN_NR_GENS, so skipping folio_set_active will not help anon pages.
It is hardcoded regardless of the current anon folio state.
This change actually aligns anon behavior with file folios. Otherwise,
skipping folio_set_active() would have no effect on anon folios.

Maybe I can change the subject to " mm/mglru: avoid over-promoting
PF folios" and add some words in v3.

>
> >                 gen = MIN_NR_GENS;
> >         else
> > -               gen = MAX_NR_GENS - folio_test_workingset(folio);
> > +               gen = MAX_NR_GENS - folio_test_workingset(folio) || folio_test_referenced(folio);
>
> Nice, this aligns exactly with what I want here :), see lru_gen_folio_seq:
> https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-6-913619b014d9@tencent.com/
>
> But I think you got the precedence wrong? Did you mean:
> (MAX_NR_GENS - folio_test_workingset(folio)) || folio_test_referenced(folio)
>
> Or?
> MAX_NR_GENS - (folio_test_workingset(folio) || folio_test_referenced(folio))

I mean this one. Sorry for the confusion. will fix it in v3.

>
> >
> >         return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
> >  }
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 5cc44f0de987..f320f93d60df 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
> >
> >         /* see the comment in lru_gen_folio_seq() */
> >         if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > -           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > -               folio_set_active(folio);
> > +           lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
> > +               if (folio_test_workingset(folio))
> > +                       folio_set_active(folio);
>
> What if we also move this set_active to mm/workingset.c? I think that
> the only place where a folio with PG_workingset can get here?

I agree that folio_set_active() is duplicated between
mm/workingset.c and here. Actually, I should have removed
folio_set_active(folio) from mm/workingset.c to centralize the
active/referenced state handling in folio_add_lru(). But somehow I
forgot to do that before sending the patch.

Maybe we should just centralize it for now; that seems better. We
can reconsider it later together with your larger follow-up work.

>
> > +               else if (!folio_test_referenced(folio))
> > +                       folio_mark_accessed(folio);
> > +       }
>
> Same here? Also this folio_mark_accessed only means
> folio_set_referenced here, which matches the folio_test_referenced
> check you just added in lru_gen_folio_seq right?

Right. With folio_test_referenced(), we place folios into the second
oldest generation. Otherwise, putting them into the oldest generation
seems too aggressive for scanning. But putting them into the active
generation also seems too aggressive for promotion.

>
> >
> >         folio_batch_add_and_move(folio, lru_add);
> >  }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e452cb043d46..48355f10542b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
> >                 return false;
> >         }
> >
> > -       set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
> > +       if (folio_test_active(folio))
> > +               set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
>
> This looks a bit strange, PG_active is always unset for on LRU folios
> for MGLRU right?
>
> There is even a comment for that:
> "PG_active is always cleared while a folio is on one of lrugen->folios[]"

There still seem to be some cases where this flag remains set after
I added the printk.

@@ -848,8 +848,10 @@ static bool lru_gen_set_refs(struct folio *folio)
                return false;
        }

-       if (folio_test_active(folio))
+       if (folio_test_active(folio)) {
+               pr_err("%s folio:%lx\n", __func__, folio);
                set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS,
BIT(PG_workingset));
+       }
        return true;
 }

[   78.928932] vmscan: lru_gen_set_refs folio:fffff645c58d99c0
[   81.198056] vmscan: lru_gen_set_refs folio:fffff645c791a200
[   94.071659] vmscan: lru_gen_set_refs folio:fffff645c773c8c0
[   94.116162] vmscan: lru_gen_set_refs folio:fffff645c4f02300
[  110.215317] vmscan: lru_gen_set_refs folio:fffff645c7c8ea40
[  118.679921] vmscan: lru_gen_set_refs folio:fffff645c5727840
[  135.709167] vmscan: lru_gen_set_refs folio:fffff645c51d2ac0
[  139.118206] vmscan: lru_gen_set_refs folio:fffff645c7749fc0

But you are generally right. Those active flags are probably set after folios
are added to the LRU.

>
> And lru_gen_set_refs is mostly for on-LRU folios or isolated ones,
> either would have this flag. So PG_workingset seems broken by this
> point, rmap walk no longer get PG_workingset set, which removes the
> "promote on 2nd access" mechanism entirely?

Yes. I was trying to preserve the “promote on second access”
behavior, so I used the “active” flag. But it is somewhat
misused.

Before this patch:
1st scan sets the referenced flag;
2nd scan promotes based on seeing the referenced flag set by the first access.

Now the flow is:
1st scan already observes the referenced flag;
2nd scan tries to rely on the active flag. However, as you pointed
out, the active flag
is cleared after being set, so it can’t reliably serve this purpose.

Maybe I can add a ref and only set workingset when both PG_referenced and
that ref are present.

@@ -848,8 +848,11 @@ static bool lru_gen_set_refs(struct folio *folio)
                return false;
        }

-       if (folio_test_active(folio))
+       /* Promote on second access */
+       if (folio_lru_refs(folio) > 1)
                set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS,
BIT(PG_workingset));
+       else
+               folio_mark_accessed(folio);
        return true;
 }

It’s a bit ugly, but I can’t find another flag for it. Do you have a
better idea?

>
> >         return true;
> >  }
> >  #else
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 07e6836d0502..2f0c08aa8623 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
> >
> >         atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
> >
> > -       /* see folio_add_lru() where folio_set_active() will be called */
> > -       if (lru_gen_in_fault())
> > -               mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > -
> >         if (workingset) {
> > +               if (lru_gen_in_fault()) {
> > +                       folio_set_active(folio);
> > +                       mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > +               }
>
> Overall I agree with this directly that file folios shouldn't get over
> protected. It will be even better if we actually activate the folio
> using things like refault distance... Which can be done later.
>
> BTW, this would break reactivation if we age periodically. That is
> less of a problem compared to overprotection, I think.
>
> And do you think we can keep a folio_set_active if folio_test_anon ==
> true in swap.c? Proactive activation relies on the assumption that
> applications never expect an jitter for ordinary memory access, which
> holds true for anon folios, but file folios don't follow that idea.

I’m not entirely sure, but I see that removing folio_set_active()
could also help reduce anon refaults. Also, having two separate
branches for file and anon seems to make the code a bit ugly.

>
> And I'm not trying to seize the topic but my previous experiment using
> refault distance-based reactivation resulted in a 500% performance
> gain for some databases with no other regressions observed:
> https://lwn.net/Articles/945266/

Yep, that is really great!

I'd be happy to review and work on your refault
distance-based reactivation when I have some time.

>
> So for this patch I think it's on the right track, we can definitely
> merge some follow up of this patch first, and then refine the pattern
> step by step toward refault distance, MGLRU-FG or both plus others.
> That will take a long time, so let's fix the low-hanging fruit first.
>
> Really looking forward to a few more updates from this patch :) !

Thanks very much for your review.

Best Regards
Barry

Re: [PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

Posted by Barry Song 1 week, 6 days ago

> >
> > And do you think we can keep a folio_set_active if folio_test_anon ==
> > true in swap.c? Proactive activation relies on the assumption that
> > applications never expect an jitter for ordinary memory access, which
> > holds true for anon folios, but file folios don't follow that idea.
>
> I’m not entirely sure, but I see that removing folio_set_active()
> could also help reduce anon refaults. Also, having two separate
> branches for file and anon seems to make the code a bit ugly.

I did a quick test by restoring anon’s behavior as shown below:

--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -242,8 +242,9 @@ static inline unsigned long
lru_gen_folio_seq(const struct lruvec *lruvec,
                gen = MIN_NR_GENS - folio_test_workingset(folio);
        else if (reclaiming)
                gen = MAX_NR_GENS;
-       else if (folio_test_reclaim(folio) &&
-                 (folio_test_dirty(folio) || folio_test_writeback(folio)))
+       else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
+                (folio_test_reclaim(folio) &&
+                 (folio_test_dirty(folio) || folio_test_writeback(folio))))
                gen = MIN_NR_GENS;
        else
                gen = MAX_NR_GENS - (folio_test_workingset(folio) ||
folio_test_referenced(folio));

I can still see significant anon refault reduction, so I agree I can keep the
existing behavior for anon. However, there is no need to set active, since
(!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) is hardcoded there.

I guess it is probably because file behavior is improved, which then indirectly
improves anon as well. So I feel there is no need to change the subject in v3.

Thanks
Barry