Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, the Arm architecture, which supports contiguous PTEs, already has an
optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend it to perform batched operations
for the entire large folio (which might exceed the contiguous range:
CONT_PTE_SIZE).

By supporting batched checking of the young flags and flushing TLB entries,
I observed a 33% performance improvement in my file-backed folio reclaim tests.

BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
resend the optimization patch for try_to_unmap() [1].

[1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Baolin Wang (2):
  arm64: mm: support batch clearing of the young flag for large folios
  mm: rmap: support batched checks of the references for large folios

 arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
 arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
 include/linux/mmu_notifier.h     |  9 ++++---
 include/linux/pgtable.h          | 19 ++++++++++++++
 mm/rmap.c                        | 22 ++++++++++++++--
 5 files changed, 92 insertions(+), 25 deletions(-)

--
2.47.3
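To illustrate the batching idea above, a minimal sketch (not taken from this
series): test and clear the young flag for all PTEs mapping a large folio in
one pass and issue a single TLB flush for the whole range, instead of a
per-PTE clear-and-flush from folio_referenced_one(). The helper name
clear_flush_young_ptes() and its signature are hypothetical; only
ptep_test_and_clear_young() and flush_tlb_range() are existing generic
primitives.

#include <linux/mm.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical illustration only -- not the implementation from the series.
 * An architecture such as arm64 with contiguous PTEs could replace the
 * per-PTE loop with per-contpte-block operations and still cover the
 * whole large folio.
 */
static bool clear_flush_young_ptes(struct vm_area_struct *vma,
				   unsigned long start, pte_t *ptep,
				   unsigned int nr)
{
	unsigned long addr = start;
	bool young = false;
	unsigned int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE, ptep++) {
		/* Generic per-PTE fallback; accumulate "young" for the folio. */
		if (ptep_test_and_clear_young(vma, addr, ptep))
			young = true;
	}

	/* One TLB flush for the whole range instead of one per young PTE. */
	if (young)
		flush_tlb_range(vma, start, start + nr * PAGE_SIZE);

	return young;
}

folio_referenced_one() could then call such a helper once per batch of the
folio's mapped PTEs rather than once per PTE, which is where the win for
clean file-backed large folios comes from.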
On 11/25/25 01:56, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend it to perform batched operations
> for the entire large folio (which might exceed the contiguous range:
> CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folio reclaim tests.

Can you point at the benchmark or briefly explain what it does? What exactly
are we measuring that improves by 33%?

--
Cheers

David
Hi Baolin,

On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend it to perform batched operations
> for the entire large folio (which might exceed the contiguous range:
> CONT_PTE_SIZE).
>
> By supporting batched checking of the young flags and flushing TLB entries,
> I observed a 33% performance improvement in my file-backed folio reclaim tests.

nice!

>
> BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> resend the optimization patch for try_to_unmap() [1].

Thanks for waking me up. Yes, it's still on my list; I've just had a lot of
non-technical issues come up that seriously slowed my progress. Sorry for
the delay.

And I suppose we also need that for try_to_migrate().

>
> [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/
>
> Baolin Wang (2):
>   arm64: mm: support batch clearing of the young flag for large folios
>   mm: rmap: support batched checks of the references for large folios
>
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++++++-----
>  arch/arm64/mm/contpte.c          | 44 ++++++++++++++++++++++----------
>  include/linux/mmu_notifier.h     |  9 ++++---
>  include/linux/pgtable.h          | 19 ++++++++++++++
>  mm/rmap.c                        | 22 ++++++++++++++--
>  5 files changed, 92 insertions(+), 25 deletions(-)
>

Thanks
Barry
On Tue, Nov 25, 2025 at 6:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Baolin,
>
> On Tue, Nov 25, 2025 at 8:57 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> > Currently, folio_referenced_one() always checks the young flag for each PTE
> > sequentially, which is inefficient for large folios. This inefficiency is
> > especially noticeable when reclaiming clean file-backed large folios, where
> > folio_referenced() is observed as a significant performance hotspot.
> >
> > Moreover, the Arm architecture, which supports contiguous PTEs, already has an
> > optimization to clear the young flags for PTEs within a contiguous range.
> > However, this is not sufficient. We can extend it to perform batched operations
> > for the entire large folio (which might exceed the contiguous range:
> > CONT_PTE_SIZE).
> >
> > By supporting batched checking of the young flags and flushing TLB entries,
> > I observed a 33% performance improvement in my file-backed folio reclaim tests.
>
> nice!
>
> >
> > BTW, I still noticed a hotspot in try_to_unmap() in my test. Hope Barry can
> > resend the optimization patch for try_to_unmap() [1].
>
> Thanks for waking me up. Yes, it's still on my list; I've just had a lot of
> non-technical issues come up that seriously slowed my progress. Sorry for
> the delay.
>
> And I suppose we also need that for try_to_migrate().
>
> >
> > [1] https://lore.kernel.org/all/20250513084620.58231-1-21cnbao@gmail.com/

Hi Barry, Baolin.

About the try_to_unmap() part, I also noticed that patch and the limitation
stated in its comment, "We only support batched swap_duplicate() for
unmapping". I guess one reason is add_swap_count_continuation(), right?

That limitation will be killed by swap table phase 3. It can be previewed here:
https://lore.kernel.org/linux-mm/20250514201729.48420-28-ryncsn@gmail.com/

And I think we will be able to handle that much more easily by then. Sorry
that it is taking a while to land upstream though.