[v3] Optimize anonymous large folio unmapping

[PATCH v3 0/9] Optimize anonymous large folio unmapping

Posted by Dev Jain 1 month, 1 week ago

Speed up unmapping of anonymous large folios by clearing the ptes, and
setting swap ptes, in one go.

The following benchmark (stolen from Barry at [1]) is used to measure the
time taken to swapout 256M worth of memory backed by 64K large folios:

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>
 #include <errno.h>

 #define SIZE_MB 256
 #define SIZE_BYTES (SIZE_MB * 1024 * 1024)

 int main() {
     void *addr = mmap(NULL, SIZE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (addr == MAP_FAILED) {
         perror("mmap failed");
         return 1;
     }

     memset(addr, 0, SIZE_BYTES);

     struct timespec start, end;
     clock_gettime(CLOCK_MONOTONIC, &start);

     if (madvise(addr, SIZE_BYTES, MADV_PAGEOUT) != 0) {
         perror("madvise(MADV_PAGEOUT) failed");
         munmap(addr, SIZE_BYTES);
         return 1;
     }

     clock_gettime(CLOCK_MONOTONIC, &end);

     long duration_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
     printf("madvise(MADV_PAGEOUT) took %ld ns (%.3f ms)\n",
            duration_ns, duration_ns / 1e6);

     munmap(addr, SIZE_BYTES);
     return 0;
 }

Performance as measured on a Linux VM on Apple M3 (arm64):

Vanilla - Mean: 37401913 ns, std dev: 12%
Patched - Mean: 17420282 ns, std dev: 11%

No regression observed on 4K folios.

Performance as measured on bare metal x86:

Vanilla - mean: 54986286 ns, std dev: 1.5%
Patched - mean: 51930795 ns, std dev: 3%

Interestingly, no obvious improvement is observed on x86, hinting that the
benefit lies mainly in the reduction of ptep_get() calls and the reduction
of TLB flushes during contpte-unfolding, on arm64.

No regression is observed on 4K folios on x86 either.

---
Applies on mm-unstable (2d565cbaafd4).

v2->v3:
Mostly a resend after merge window. Some minor changes:

 - Match kerneldoc parameter with function parameter (pte -> ptep)
 - Mention change BUG->WARN in patch description
 - Rename walk_done -> exit_walk in patch 2

v1->v2:
 - Keep nr_pages as unsigned long
 - Add patch 2
 - Rename some functions, make return type bool for functions returning 0/1
 - Drop page_vma_mapped_walk_jump - this is implicitly handled
 - Drop likely()
 - Add folio_dup/put_swap_pages, do subpage -> page
 - Shorten the kerneldoc to remove unnecessary information - keep it
   aligned with analogous functions
 - Put clear_pages_anon_exclusive to mm.h
 - Some more refactoring in last patch with finish_folio_unmap


Dev Jain (9):
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  mm/rmap: refactor hugetlb pte clearing in try_to_unmap_one
  mm/rmap: refactor some code around lazyfree folio unmapping
  mm/memory: Batch set uffd-wp markers during zapping
  mm/rmap: batch unmap folios belonging to uffd-wp VMAs
  mm/swapfile: Add batched version of folio_dup_swap
  mm/swapfile: Add batched version of folio_put_swap
  mm/rmap: Add batched version of folio_try_share_anon_rmap_pte
  mm/rmap: enable batch unmapping of anonymous folios

 include/linux/mm.h        |  11 ++
 include/linux/mm_inline.h |  34 ++--
 include/linux/rmap.h      |  27 ++-
 mm/internal.h             |  26 +++
 mm/memory.c               |  26 +--
 mm/mprotect.c             |  17 --
 mm/rmap.c                 | 405 ++++++++++++++++++++++++--------------
 mm/shmem.c                |   8 +-
 mm/swap.h                 |  23 ++-
 mm/swapfile.c             |  42 ++--
 10 files changed, 383 insertions(+), 236 deletions(-)

-- 
2.34.1

Re: [PATCH v3 0/9] Optimize anonymous large folio unmapping

Posted by Andrew Morton 1 month ago

On Wed,  6 May 2026 15:14:55 +0530 Dev Jain <dev.jain@arm.com> wrote:

> Speed up unmapping of anonymous large folios by clearing the ptes, and
> setting swap ptes, in one go.
> 
> ...
> 
> Performance as measured on a Linux VM on Apple M3 (arm64):
> 
> Vanilla - Mean: 37401913 ns, std dev: 12%
> Patched - Mean: 17420282 ns, std dev: 11%
> 
> No regression observed on 4K folios.
> 
> Performance as measured on bare metal x86:
> 
> Vanilla - mean: 54986286 ns, std dev: 1.5%
> Patched - mean: 51930795 ns, std dev: 3%

That looks nice.

I'll pass at this time, wait for reviewer input.  Most reviewers are
jetlagged and exhausted, so a resend might be needed ;)

Sashiko said a few things:
	https://sashiko.dev/#/patchset/20260506094504.2588857-1-dev.jain@arm.com

Re: [PATCH v3 0/9] Optimize anonymous large folio unmapping

Posted by Dev Jain 1 month ago


On 09/05/26 5:08 am, Andrew Morton wrote:
> On Wed,  6 May 2026 15:14:55 +0530 Dev Jain <dev.jain@arm.com> wrote:
> 
>> Speed up unmapping of anonymous large folios by clearing the ptes, and
>> setting swap ptes, in one go.
>>
>> ...
>>
>> Performance as measured on a Linux VM on Apple M3 (arm64):
>>
>> Vanilla - Mean: 37401913 ns, std dev: 12%
>> Patched - Mean: 17420282 ns, std dev: 11%
>>
>> No regression observed on 4K folios.
>>
>> Performance as measured on bare metal x86:
>>
>> Vanilla - mean: 54986286 ns, std dev: 1.5%
>> Patched - mean: 51930795 ns, std dev: 3%
> 
> That looks nice.
> 
> I'll pass at this time, wait for reviewer input.  Most reviewers are
> jetlagged and exhausted, so a resend might be needed ;)
> 
> Sashiko said a few things:
> 	https://sashiko.dev/#/patchset/20260506094504.2588857-1-dev.jain@arm.com

Patch 2:

"In the original code, failing hugetlb_vma_trylock_write() triggered a
goto walk_abort, leaving ret set to true."

That is wrong.

Patch 9:

"Since __HAVE_ARCH_UNMAP_ONE is typically defined without a value on sparc64,
__is_defined() will evaluate to 0 because it is primarily designed for Kconfig
symbols that explicitly evaluate to 1."

Which is again wrong?

Patch 9:

"What happens to the remaining pages in the batch? Since get_and_clear_ptes()
cleared all of them upfront, and the loop aborts early without restoring them,
it appears the remaining PTEs are left cleared in the page tables and their
references are not released"

Yes this is valid. I did see it on the v2 Sashiko review but misread it : )

When unmap fails for a sub-batch, I need to restore all the cleared ptes,
not only those of the sub-batch.

This should work:

diff --git a/mm/rmap.c b/mm/rmap.c
index fc953f36d4527..e54c15a82c504 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2023,10 +2023,8 @@ static inline bool __unmap_anon_folio_range(struct vm_area_struct *vma, struct f
 	swp_entry_t entry = page_swap_entry(subpage);
 	struct mm_struct *mm = vma->vm_mm;

-	if (folio_dup_swap_pages(folio, subpage, nr_pages) < 0) {
-		set_ptes(mm, address, ptep, pteval, nr_pages);
+	if (folio_dup_swap_pages(folio, subpage, nr_pages) < 0)
 		return false;
-	}

 	/*
 	 * arch_unmap_one() is expected to be a NOP on
@@ -2036,16 +2034,14 @@ static inline bool __unmap_anon_folio_range(struct vm_area_struct *vma, struct f
 	if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 		VM_WARN_ON(nr_pages != 1);
 		folio_put_swap_pages(folio, subpage, nr_pages);
-		set_pte_at(mm, address, ptep, pteval);
 		return false;
 	}

 	/* See folio_try_share_anon_rmap(): clear PTE first. */
 	if (anon_exclusive && folio_try_share_anon_rmap_ptes(folio, subpage, nr_pages)) {
 		folio_put_swap_pages(folio, subpage, nr_pages);
-		set_ptes(mm, address, ptep, pteval, nr_pages);
 		return false;
 	}

 	if (list_empty(&mm->mmlist)) {
 		spin_lock(&mmlist_lock);
@@ -2075,8 +2071,10 @@ static inline bool unmap_anon_folio_range(struct vm_area_struct *vma, struct fol
 						    first_page, expected_anon_exclusive);
 		ret = __unmap_anon_folio_range(vma, folio, first_page + sub_batch_idx,
 					       address, ptep, pteval, len, expected_anon_exclusive);
-		if (!ret)
+		if (!ret) {
+			set_ptes(vma->vm_mm, address, ptep, pteval, nr_pages);
 			return ret;
+		}

 		nr_pages -= len;
 		if (!nr_pages)


