Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
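To make the difference concrete, here is a minimal user-space sketch of
per-PTE versus batched young-flag clearing. It is only a model under stated
assumptions, not the code from this series: the MODEL_PTE_YOUNG bit, the
model_* function names and the flush counter are all hypothetical, and the
"one flush per young PTE" versus "one flush per batch" behaviour is a
simplification used to illustrate the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit 0 stands in for the young (accessed) flag. */
#define MODEL_PTE_YOUNG ((uint64_t)1 << 0)

/* Counts simulated TLB flushes so the two strategies can be compared. */
static unsigned int model_tlb_flushes;

/* Per-PTE variant: test-and-clear one entry, flush if it was young. */
static bool model_clear_flush_young(uint64_t *pte)
{
	bool young = *pte & MODEL_PTE_YOUNG;

	*pte &= ~MODEL_PTE_YOUNG;
	if (young)
		model_tlb_flushes++;
	return young;
}

/*
 * Batched variant: test-and-clear nr consecutive entries and issue at
 * most one simulated flush for the whole range.
 */
static bool model_clear_flush_young_batch(uint64_t *ptes, size_t nr)
{
	bool young = false;
	size_t i;

	assert(ptes != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		young |= (ptes[i] & MODEL_PTE_YOUNG) != 0;
		ptes[i] &= ~MODEL_PTE_YOUNG;
	}
	if (young)
		model_tlb_flushes++;
	return young;
}
```

In this model, a 32-entry range where every entry is young costs 32 simulated
flushes with the per-PTE helper and one with the batched helper, and nr is
not tied to the contiguous-PTE block size.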
Similar to folio_referenced_one(), we can also apply batched unmapping for large
file folios to optimize the performance of file folio reclamation. By supporting
batched checking of the young flags, flushing TLB entries, and unmapping, I
observed significant performance improvements in my performance tests for file
folio reclamation. Please check the performance data in the commit message of
each patch.
Ran stress-ng and the mm selftests; no issues were found.
Patch 1: Add a new generic batched PTE helper that supports batched checks of
the references for large folios.
Patch 2 - 3: Preparation patches.
Patch 4: Implement the Arm64 arch-specific clear_flush_young_ptes().
Patch 5: Support batched unmapping for file large folios.
Changes from v4:
- Fix passing the incorrect 'CONT_PTES' for non-batched APIs.
- Rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() (per Ryan).
- Fix some coding style issues (per Ryan).
- Add reviewed tag from Ryan. Thanks.
Changes from v3:
- Fix using an incorrect parameter in ptep_clear_flush_young_notify()
(per Liam).
Changes from v2:
- Rearrange the patch set (per Ryan).
- Add pte_cont() check in clear_flush_young_ptes() (per Ryan).
- Add a helper to do contpte block alignment (per Ryan).
- Fix some coding style issues (per Lorenzo and Ryan).
- Add more comments and update the commit message (per Lorenzo and Ryan).
- Add acked tag from Barry. Thanks.
Changes from v1:
- Add a new patch to support batched unmapping for file large folios.
- Update the cover letter.
Baolin Wang (5):
mm: rmap: support batched checks of the references for large folios
arm64: mm: factor out the address and ptep alignment into a new helper
arm64: mm: support batch clearing of the young flag for large folios
arm64: mm: implement the architecture-specific
clear_flush_young_ptes()
mm: rmap: support batched unmapping for file large folios
arch/arm64/include/asm/pgtable.h | 23 ++++++++----
arch/arm64/mm/contpte.c | 62 ++++++++++++++++++++------------
include/linux/mmu_notifier.h | 9 ++---
include/linux/pgtable.h | 31 ++++++++++++++++
mm/rmap.c | 38 ++++++++++++++++----
5 files changed, 125 insertions(+), 38 deletions(-)
--
2.47.3
Andrew -

I know this has had a lot of attention, but can we hold off on sending this
upstream until either David or I have had a chance to review it?

Also note that Dev has discovered an issue with how this interacts with the
accursed uffd-wp logic (see [0]) so series needs a respin anyway.

Thanks, Lorenzo

[0]: https://lore.kernel.org/linux-mm/20260116082721.275178-1-dev.jain@arm.com/

On Fri, Dec 26, 2025 at 02:07:54PM +0800, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I can
> observed a significant performance improvements in my performance tests for file
> folios reclamation. Please check the performance data in the commit message of
> each patch.
>
> Run stress-ng and mm selftests, no issues were found.
>
> Patch 1: Add a new generic batched PTE helper that supports batched checks of
> the references for large folios.
> Patch 2 - 3: Preparation patches.
> patch 4: Implement the Arm64 arch-specific clear_flush_young_ptes().
> Patch 5: Support batched unmapping for file large folios.
>
> Changes from v4:
> - Fix passing the incorrect 'CONT_PTES' for non-batched APIs.
> - Rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() (per Ryan).
> - Fix some coding style issues (per Ryan).
> - Add reviewed tag from Ryan. Thanks.
>
> Changes from v3:
> - Fix using an incorrect parameter in ptep_clear_flush_young_notify()
> (per Liam).
>
> Changes from v2:
> - Rearrange the patch set (per Ryan).
> - Add pte_cont() check in clear_flush_young_ptes() (per Ryan).
> - Add a helper to do contpte block alignment (per Ryan).
> - Fix some coding style issues (per Lorenzo and Ryan).
> - Add more comments and update the commit message (per Lorenzo and Ryan).
> - Add acked tag from Barry. Thanks.
>
> Changes from v1:
> - Add a new patch to support batched unmapping for file large folios.
> - Update the cover letter
>
> Baolin Wang (5):
> mm: rmap: support batched checks of the references for large folios
> arm64: mm: factor out the address and ptep alignment into a new helper
> arm64: mm: support batch clearing of the young flag for large folios
> arm64: mm: implement the architecture-specific
> clear_flush_young_ptes()
> mm: rmap: support batched unmapping for file large folios
>
> arch/arm64/include/asm/pgtable.h | 23 ++++++++----
> arch/arm64/mm/contpte.c | 62 ++++++++++++++++++++------------
> include/linux/mmu_notifier.h | 9 ++---
> include/linux/pgtable.h | 31 ++++++++++++++++
> mm/rmap.c | 38 ++++++++++++++++----
> 5 files changed, 125 insertions(+), 38 deletions(-)
>
> --
> 2.47.3
>

On 1/16/26 09:41, Lorenzo Stoakes wrote:
> Andrew -
>
> I know this has had a lot of attention, but can we hold off on sending this
> upstream until either David or I have had a chance to review it?

Ah, I didn't read your mail before I sent mine.

+1 ;)

--
Cheers

David

On 12/26/25 07:07, Baolin Wang wrote:
> Currently, folio_referenced_one() always checks the young flag for each PTE
> sequentially, which is inefficient for large folios. This inefficiency is
> especially noticeable when reclaiming clean file-backed large folios, where
> folio_referenced() is observed as a significant performance hotspot.
>
> Moreover, on Arm architecture, which supports contiguous PTEs, there is already
> an optimization to clear the young flags for PTEs within a contiguous range.
> However, this is not sufficient. We can extend this to perform batched operations
> for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
>
> Similar to folio_referenced_one(), we can also apply batched unmapping for large
> file folios to optimize the performance of file folio reclamation. By supporting
> batched checking of the young flags, flushing TLB entries, and unmapping, I can
> observed a significant performance improvements in my performance tests for file
> folios reclamation. Please check the performance data in the commit message of
> each patch.
>
> Run stress-ng and mm selftests, no issues were found.

Baolin, I'm intending to review this, but it might still take me a bit until I
get to it. (PTO/vacation and other fun)

--
Cheers

David