[PATCH 0/2] support batched checks of the references for large folios

Baolin Wang posted 2 patches 6 days, 23 hours ago
[PATCH 0/2] support batched checks of the references for large folios
Posted by Baolin Wang 6 days, 23 hours ago
Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
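
[Illustration, not part of the posted patches] The cost difference described above can be sketched with a small user-space model. Everything here is an assumption made for illustration: the `model_*` names, the position of the young bit, and the rule that each young PTE found one at a time triggers its own TLB flush. It is not the kernel's folio_referenced_one() and not the code from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "young" (accessed) bit, used by this model only. */
#define MODEL_PTE_YOUNG (UINT64_C(1) << 10)

/* Counts how many TLB flushes the model would have issued. */
struct model_stats {
	size_t tlb_flushes;
};

/* Test and clear the young bit of a single modelled PTE. */
static bool model_test_and_clear_young(uint64_t *pte)
{
	bool young = (*pte & MODEL_PTE_YOUNG) != 0;

	*pte &= ~MODEL_PTE_YOUNG;
	return young;
}

/* Sequential walk: every young PTE is handled, and flushed, on its own. */
static bool model_referenced_sequential(uint64_t *ptes, size_t nr,
					struct model_stats *st)
{
	bool referenced = false;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++) {
		if (model_test_and_clear_young(&ptes[i])) {
			st->tlb_flushes++;
			referenced = true;
		}
	}
	return referenced;
}

/* Batched walk: clear all young bits first, then flush the range once. */
static bool model_referenced_batched(uint64_t *ptes, size_t nr,
				     struct model_stats *st)
{
	bool referenced = false;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++)
		referenced |= model_test_and_clear_young(&ptes[i]);
	if (referenced)
		st->tlb_flushes++;
	return referenced;
}
```

In this model, a 16-entry range with 8 young entries costs 8 flushes on the sequential path and 1 on the batched path, while both paths report the range as referenced and leave every young bit cleared.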

Moreover, on the Arm architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
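
[Illustration, not part of the posted patches] To see why a folio can exceed one contiguous range, the sketch below counts how many contiguous-PTE blocks a run of PTEs touches. It assumes 16 PTEs per block (the value for a 4K page size, where CONT_PTE_SIZE is 64K); `MODEL_CONT_PTES` and both helper names are hypothetical and exist only in this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed number of PTEs per contiguous block (4K pages, 64K block). */
#define MODEL_CONT_PTES 16

/* Number of contiguous blocks touched by PTE indices [first, first + nr). */
static size_t model_cont_blocks_touched(size_t first, size_t nr)
{
	assert(nr > 0);
	size_t first_block = first / MODEL_CONT_PTES;
	size_t last_block = (first + nr - 1) / MODEL_CONT_PTES;

	return last_block - first_block + 1;
}

/* Widen [first, first + nr) outwards to whole-block boundaries. */
static void model_align_to_cont_blocks(size_t first, size_t nr,
				       size_t *aligned_first,
				       size_t *aligned_nr)
{
	assert(nr > 0);
	size_t start = first - (first % MODEL_CONT_PTES);
	size_t end = first + nr;
	size_t end_up = (end + MODEL_CONT_PTES - 1) / MODEL_CONT_PTES *
			MODEL_CONT_PTES;

	*aligned_first = start;
	*aligned_nr = end_up - start;
}
```

Under these assumptions, a 64-page folio mapped from index 0 spans 4 blocks, so an optimization limited to one contiguous range still needs 4 separate operations, whereas a folio-wide batch can cover all 64 PTEs in one call.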

By supporting batched checking of the young flags and flushing TLB entries,
I observed a 33% performance improvement in my file-backed folio reclaim tests.

BTW, I still noticed a hotspot in try_to_unmap() in my test. I hope Barry can
resend the optimization patch for try_to_unmap() [1].

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Baolin Wang (2):
  arm64: mm: support batch clearing of the young flag for large folios
  mm: rmap: support batched checks of the references for large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
 arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
 include/linux/mmu_notifier.h     |  9 ++++---
 include/linux/pgtable.h          | 19 ++++++++++++++
 mm/rmap.c                        | 22 ++++++++++++++--
 5 files changed, 92 insertions(+), 25 deletions(-)

-- 
2.47.3
Re: [PATCH 0/2] support batched checks of the references for large folios
Posted by David Hildenbrand (Red Hat) 8 hours ago
On 11/25/25 01:56, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
> 
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> 
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folios reclaim tests.

Can you point at the benchmark or briefly explain what it does? What 
exactly are we measuring that improves by 33%?

-- 
Cheers

David
Re: [PATCH 0/2] support batched checks of the references for large folios
Posted by Barry Song 6 days, 14 hours ago
Hi Baolin,

On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folios reclaim tests.

nice!

>
> BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> resend the optimization patch for try_to_unmap() [1].

Thanks for waking me up. Yes, it's still on my list—I've just had a lot of
non-technical issues come up that seriously slowed my progress. Sorry for
the delay.

And I suppose we also need that for try_to_migrate().

>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> Baolin Wang (2):
>   arm64: mm: support batch clearing of the young flag for large folios
>   mm: rmap: support batched checks of the references for large folios
>
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
>  arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
>  include/linux/mmu_notifier.h     |  9 ++++---
>  include/linux/pgtable.h          | 19 ++++++++++++++
>  mm/rmap.c                        | 22 ++++++++++++++--
>  5 files changed, 92 insertions(+), 25 deletions(-)
>

Thanks
Barry
Re: [PATCH 0/2] support batched checks of the references for large folios
Posted by Kairui Song 6 days, 6 hours ago
On Tue, Nov 25, 2025 at 6:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Baolin,
>
> On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Currently, folio_referenced_one() always checks the young flag for each PTE
> > sequentially, which is inefficient for large folios. This inefficiency is
> > especially noticeable when reclaiming clean file-backed large folios, where
> > folio_referenced() is observed as a significant performance hotspot.
> >
> > Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> > an optimization to clear the young flags for PTEs within a contiguous range.
> > However, this is not sufficient. We can extend this to perform batched operations
> > for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
> >
> > By supporting batched checking of the young flags and flushing TLB entries,
> > I observed a 33% performance improvement in my file-backed folios reclaim tests.
>
> nice!
>
> >
> > BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> > resend the optimization patch for try_to_unmap() [1].
>
> Thanks for waking me up. Yes, it's still on my list—I've just had a lot of
> non-technical issues come up that seriously slowed my progress. Sorry for
> the delay.
>
> And I suppose we also need that for try_to_migrate().
>
> >
> > [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Hi Barry, Baolin.

About the try_to_unmap() part: I also noticed that patch and the issue
behind its comment "We only support batched swap_duplicate() for
unmapping". I guess one reason is add_swap_count_continuation(), right?
That limitation will be removed by swap table phase 3.

It can be previewed here:
https://lore.kernel.org/linux-mm/20250514201729.48420-28-ryncsn@gmail.com/

And I think we will be able to handle that much more easily by then. Sorry
that it is taking a while to land upstream though.