Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
through all 16 entries to collect a/d bits), and we also elide extra TLBIs
through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Mapping 1M of memory with 64K folios, memsetting it, remapping it to
src + 1M, and munmapping it 10,000 times, the average execution time
reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
on Apple M3 (arm64). No regression is observed for small folios.

The patchset is based on mm-unstable (6ebffe676fcf).

Test program for reference:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

#define SIZE (1UL << 20) // 1M

int main(void) {
	void *new_addr, *addr;

	for (int i = 0; i < 10000; ++i) {
		addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(addr, 0xAA, SIZE);

		new_addr = mremap(addr, SIZE, SIZE,
				  MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
		if (new_addr != (addr + SIZE)) {
			perror("mremap");
			return 1;
		}
		munmap(new_addr, SIZE);
	}
}

v3->v4:
 - Remove comment above mremap_folio_pte_batch, improve patch description
   differentiating between folio splitting and pagetable splitting

v2->v3:
 - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
   indentation (Lorenzo), fix cover letter description (512K -> 1M)

v1->v2:
 - Expand patch descriptions, move pte declarations to a new line,
   reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
   fix loop iteration (Lorenzo)
 - Merge patch 2 and 3 (Anshuman, Lorenzo)
 - Fix maybe_contiguous_pte_pfns (Willy)

Dev Jain (2):
  mm: Call pointers to ptes as ptep
  mm: Optimize mremap() by PTE batching

 mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 16 deletions(-)

--
2.30.2
On Tue, Jun 10, 2025 at 09:20:41AM +0530, Dev Jain wrote:
> Currently move_ptes() iterates through ptes one by one. If the underlying
> folio mapped by the ptes is large, we can process those ptes in a batch
> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
> For arm64 specifically, this results in a 16x reduction in the number of
> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
> through get_and_clear_full_ptes, replacing ptep_get_and_clear.

Thanks this is good!

>
> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
> src + 1M, and munmapping it 10,000 times, the average execution time
> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
> on Apple M3 (arm64). No regression is observed for small folios.

Hmm, I thought people were struggling to get M3 to work with Asahi? :) or is
this in a mac-based vm? I've not paid attention to recent developments.

>
> The patchset is based on mm-unstable (6ebffe676fcf).
>
> Test program for reference:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <errno.h>
>
> #define SIZE (1UL << 20) // 1M
>
> int main(void) {
> 	void *new_addr, *addr;
>
> 	for (int i = 0; i < 10000; ++i) {
> 		addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
> 			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 		if (addr == MAP_FAILED) {
> 			perror("mmap");
> 			return 1;
> 		}
> 		memset(addr, 0xAA, SIZE);
>
> 		new_addr = mremap(addr, SIZE, SIZE,
> 				  MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
> 		if (new_addr != (addr + SIZE)) {
> 			perror("mremap");
> 			return 1;
> 		}
> 		munmap(new_addr, SIZE);
> 	}
> }
>

Thanks for including! Very useful.

> v3->v4:
>  - Remove comment above mremap_folio_pte_batch, improve patch description
>    differentiating between folio splitting and pagetable splitting
>
> v2->v3:
>  - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>    indentation (Lorenzo), fix cover letter description (512K -> 1M)

It's nitty but these seem to be getting more and more abbreviated :) not a
massive big deal however ;)

>
> v1->v2:
>  - Expand patch descriptions, move pte declarations to a new line,
>    reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>    fix loop iteration (Lorenzo)
>  - Merge patch 2 and 3 (Anshuman, Lorenzo)
>  - Fix maybe_contiguous_pte_pfns (Willy)
>
> Dev Jain (2):
>   mm: Call pointers to ptes as ptep
>   mm: Optimize mremap() by PTE batching
>
>  mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 42 insertions(+), 16 deletions(-)
>
> --
> 2.30.2
>
On 10/06/25 5:41 pm, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 09:20:41AM +0530, Dev Jain wrote:
>> Currently move_ptes() iterates through ptes one by one. If the underlying
>> folio mapped by the ptes is large, we can process those ptes in a batch
>> using folio_pte_batch(), thus clearing and setting the PTEs in one go.
>> For arm64 specifically, this results in a 16x reduction in the number of
>> ptep_get() calls (since on a contig block, ptep_get() on arm64 will iterate
>> through all 16 entries to collect a/d bits), and we also elide extra TLBIs
>> through get_and_clear_full_ptes, replacing ptep_get_and_clear.
>
> Thanks this is good!
>
>> Mapping 1M of memory with 64K folios, memsetting it, remapping it to
>> src + 1M, and munmapping it 10,000 times, the average execution time
>> reduces from 1.9 to 1.2 seconds, giving a 37% performance optimization,
>> on Apple M3 (arm64). No regression is observed for small folios.
>
> Hmm, I thought people were struggling to get M3 to work with Asahi? :) or is
> this in a mac-based vm? I've not paid attention to recent developments.

I meant a Linux VM on Mac.

>
>> The patchset is based on mm-unstable (6ebffe676fcf).
>>
>> Test program for reference:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <string.h>
>> #include <errno.h>
>>
>> #define SIZE (1UL << 20) // 1M
>>
>> int main(void) {
>> 	void *new_addr, *addr;
>>
>> 	for (int i = 0; i < 10000; ++i) {
>> 		addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>> 			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 		if (addr == MAP_FAILED) {
>> 			perror("mmap");
>> 			return 1;
>> 		}
>> 		memset(addr, 0xAA, SIZE);
>>
>> 		new_addr = mremap(addr, SIZE, SIZE,
>> 				  MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
>> 		if (new_addr != (addr + SIZE)) {
>> 			perror("mremap");
>> 			return 1;
>> 		}
>> 		munmap(new_addr, SIZE);
>> 	}
>> }
>>
> Thanks for including! Very useful.
>
>> v3->v4:
>>  - Remove comment above mremap_folio_pte_batch, improve patch description
>>    differentiating between folio splitting and pagetable splitting
>>
>> v2->v3:
>>  - Refactor mremap_folio_pte_batch, drop maybe_contiguous_pte_pfns, fix
>>    indentation (Lorenzo), fix cover letter description (512K -> 1M)
>
> It's nitty but these seem to be getting more and more abbreviated :) not a
> massive big deal however ;)
>
>> v1->v2:
>>  - Expand patch descriptions, move pte declarations to a new line,
>>    reduce indentation in patch 2 by introducing mremap_folio_pte_batch(),
>>    fix loop iteration (Lorenzo)
>>  - Merge patch 2 and 3 (Anshuman, Lorenzo)
>>  - Fix maybe_contiguous_pte_pfns (Willy)
>>
>> Dev Jain (2):
>>   mm: Call pointers to ptes as ptep
>>   mm: Optimize mremap() by PTE batching
>>
>>  mm/mremap.c | 58 ++++++++++++++++++++++++++++++++++++++---------------
>>  1 file changed, 42 insertions(+), 16 deletions(-)
>>
>> --
>> 2.30.2
>>