Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, the Arm architecture, which supports contiguous PTEs, already has an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend it to perform batched operations
for the entire large folio (which might exceed the contiguous range:
CONT_PTE_SIZE).

By supporting batched checking of the young flags and flushing TLB entries,
I observed a 33% performance improvement in my file-backed folio reclaim tests.

BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
resend the optimization patch for try_to_unmap() [1].

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Baolin Wang (2):
  arm64: mm: support batch clearing of the young flag for large folios
  mm: rmap: support batched checks of the references for large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
 arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
 include/linux/mmu_notifier.h     |  9 ++++---
 include/linux/pgtable.h          | 19 ++++++++++++++
 mm/rmap.c                        | 22 ++++++++++++++--
 5 files changed, 92 insertions(+), 25 deletions(-)

--
2.47.3
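To illustrate the batching idea above, a minimal sketch (not taken from this
series): test and clear the young flag for all PTEs mapping a large folio in
one pass and issue a single TLB flush for the whole range, instead of a
per-PTE clear-and-flush from folio_referenced_one(). The helper name
clear_flush_young_ptes() and its signature are hypothetical; only
ptep_test_and_clear_young() and flush_tlb_range() are existing generic
primitives.

#include <linux/mm.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical illustration only -- not the implementation from the series.
 * An architecture such as arm64 with contiguous PTEs could replace the
 * per-PTE loop with per-contpte-block operations and still cover the
 * whole large folio.
 */
static bool clear_flush_young_ptes(struct vm_area_struct *vma,
				   unsigned long start, pte_t *ptep,
				   unsigned int nr)
{
	unsigned long addr = start;
	bool young = false;
	unsigned int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE, ptep++) {
		/* Generic per-PTE fallback; accumulate "young" for the folio. */
		if (ptep_test_and_clear_young(vma, addr, ptep))
			young = true;
	}

	/* One TLB flush for the whole range instead of one per young PTE. */
	if (young)
		flush_tlb_range(vma, start, start + nr * PAGE_SIZE);

	return young;
}

folio_referenced_one() could then call such a helper once per batch of the
folio's mapped PTEs rather than once per PTE, which is where the win for
clean file-backed large folios comes from.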
On 11/25/25 01:56, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend it to perform batched operations
> for the entire large folio (which might exceed the contiguous range:
> CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folio reclaim tests.

Can you point at the benchmark or briefly explain what it does? What exactly
are we measuring that improves by 33%?

--
Cheers

David
Hi Baolin,

On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend it to perform batched operations
> for the entire large folio (which might exceed the contiguous range:
> CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folio reclaim tests.

nice!

>
> BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> resend the optimization patch for try_to_unmap() [1].

Thanks for waking me up. Yes, it's still on my list; I've just had a lot of
non-technical issues come up that seriously slowed my progress. Sorry for
the delay.

And I suppose we also need that for try_to_migrate().

>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> Baolin Wang (2):
>   arm64: mm: support batch clearing of the young flag for large folios
>   mm: rmap: support batched checks of the references for large folios
>
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
>  arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
>  include/linux/mmu_notifier.h     |  9 ++++---
>  include/linux/pgtable.h          | 19 ++++++++++++++
>  mm/rmap.c                        | 22 ++++++++++++++--
>  5 files changed, 92 insertions(+), 25 deletions(-)
>

Thanks
Barry
On Tue, Nov 25, 2025 at 6:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Baolin,
>
> On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Currently, folio_referenced_one() always checks the young flag for each PTE
> > sequentially, which is inefficient for large folios. This inefficiency is
> > especially noticeable when reclaiming clean file-backed large folios, where
> > folio_referenced() is observed as a significant performance hotspot.
> >
> > Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> > optimization to clear the young flags for PTEs within a contiguous range.
> > However, this is not sufficient. We can extend it to perform batched operations
> > for the entire large folio (which might exceed the contiguous range:
> > CONT_PTE_SIZE).
> >
> > By supporting batched checking of the young flags and flushing TLB entries,
> > I observed a 33% performance improvement in my file-backed folio reclaim tests.
>
> nice!
>
> >
> > BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> > resend the optimization patch for try_to_unmap() [1].
>
> Thanks for waking me up. Yes, it's still on my list; I've just had a lot of
> non-technical issues come up that seriously slowed my progress. Sorry for
> the delay.
>
> And I suppose we also need that for try_to_migrate().
>
> >
> > [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Hi Barry, Baolin.

About the try_to_unmap() part, I also noticed that patch and the limitation
stated in its comment, "We only support batched swap_duplicate() for
unmapping". I guess one reason is add_swap_count_continuation(), right?

That limitation will be killed by swap table phase 3. It can be previewed here:
https://lore.kernel.org/linux-mm/20250514201729.48420-28-ryncsn@gmail.com/

And I think we will be able to handle that much more easily by then. Sorry
that it is taking a while to land upstream though.