This patchset optimizes the mprotect() system call for large folios
by PTE-batching. No issues were observed with mm-selftests, build
tested on x86_64.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:
Before the patchset:
T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds

After the patchset:
T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds

Observing T1/T2 and T3 before the patchset, we also remove the regression
introduced by ptep_get() on a contpte block. And, for large folios we get
an almost 74% performance improvement, albeit the trade-off being a slight
degradation in the small folio case.

Here is the test program:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
        size_t offs;
        int ret = 0;

        /* PTE-map each THP by temporarily splitting the VMAs. */
        for (offs = 0; offs < size; offs += pmdsize) {
                ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
                ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
        }

        if (ret) {
                fprintf(stderr, "ERROR: madvise() failed\n");
                exit(1);
        }
}

int main(int argc, char *argv[])
{
        char *p;
        int ret = 0;

        p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != (void *)(1UL << 30)) {
                perror("mmap");
                return 1;
        }

        memset(p, 0, SIZE);
        if (madvise(p, SIZE, MADV_NOHUGEPAGE))
                perror("madvise");
        explicit_bzero(p, SIZE);
        pte_map_thps(p, SIZE);

        for (int loops = 0; loops < 40; loops++) {
                if (mprotect(p, SIZE, PROT_READ))
                        perror("mprotect"), exit(1);
                if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
                        perror("mprotect"), exit(1);
                explicit_bzero(p, SIZE);
        }
}

---
The patchset is rebased onto Saturday's mm-new.

v3->v4:
 - Refactor skipping logic into a new function, edit patch 1 subject
   to highlight it is only for MM_CP_PROT_NUMA case (David H)
 - Refactor the optimization logic, add more documentation to the generic
   batched functions, do not add clear_flush_ptes, squash patch 4
   and 5 (Ryan)

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches at the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)

Dev Jain (4):
  mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Optimize mprotect() by PTE-batching
  arm64: Add batched versions of ptep_modify_prot_start/commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  28 +++-
 include/linux/pgtable.h          |  83 +++++++++-
 mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
 4 files changed, 315 insertions(+), 75 deletions(-)

--
2.30.2
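[Note: the cover letter does not show how the averages above were collected.
The sketch below is a minimal, illustrative timing harness, not part of the
posted series or test program: it times only the 40 mprotect() RO/RW cycles
with clock_gettime() and omits the per-iteration explicit_bzero() that the
posted program does, so its numbers are not directly comparable to T1/T2/T3.
T2 and T3 presumably differ from T1 only in how the region is faulted in
(e.g. with only the 64K mTHP size enabled, or with THP disabled).]

/*
 * Illustrative timing harness (assumption, not code from the series):
 * fault in a 1G anonymous mapping, then report the average wall-clock
 * cost of an mprotect() read-only + read-write cycle over it.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define SIZE (1024UL * 1024 * 1024)
#define LOOPS 40

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        double start;

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0, SIZE);             /* fault the whole region in */

        start = now_sec();
        for (int i = 0; i < LOOPS; i++) {
                if (mprotect(p, SIZE, PROT_READ) ||
                    mprotect(p, SIZE, PROT_READ | PROT_WRITE)) {
                        perror("mprotect");
                        return 1;
                }
        }
        printf("avg per RO+RW cycle: %.3f ms\n",
               (now_sec() - start) * 1000 / LOOPS);
        return 0;
}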
On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:

> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

um what. Seems to claim that "selftests still compiles after I messed
with stuff", which isn't very impressive ;) Please clarify?

> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds

Well that's tasty.

> Observing T1/T2 and T3 before the patchset, we also remove the regression
> introduced by ptep_get() on a contpte block. And, for large folios we get
> an almost 74% performance improvement, albeit the trade-off being a slight
> degradation in the small folio case.
>
On 30/06/25 4:35 am, Andrew Morton wrote:
> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> um what. Seems to claim that "selftests still compiles after I messed
> with stuff", which isn't very impressive ;) Please clarify?

Sorry I mean to say that the mm-selftests pass.

>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
> Well that's tasty.
>
>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>> introduced by ptep_get() on a contpte block. And, for large folios we get
>> an almost 74% performance improvement, albeit the trade-off being a slight
>> degradation in the small folio case.
>>
On 30/06/2025 04:33, Dev Jain wrote:
>
> On 30/06/25 4:35 am, Andrew Morton wrote:
>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>
>>> This patchset optimizes the mprotect() system call for large folios
>>> by PTE-batching. No issues were observed with mm-selftests, build
>>> tested on x86_64.
>> um what. Seems to claim that "selftests still compiles after I messed
>> with stuff", which isn't very impressive ;) Please clarify?
>
> Sorry I mean to say that the mm-selftests pass.

I think you're saying you both compiled and ran the mm selftests for arm64. And
additionally you compiled for x86_64? (Just trying to help clarify).

>
>>
>>> We use the following test cases to measure performance, mprotect()'ing
>>> the mapped memory to read-only then read-write 40 times:
>>>
>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>> pte-mapping those THPs
>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>> Test case 3: Mapping 1G of memory with 4K pages
>>>
>>> Average execution time on arm64, Apple M3:
>>> Before the patchset:
>>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>>
>>> After the patchset:
>>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>> Well that's tasty.
>>
>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>> degradation in the small folio case.
>>>
On 30/06/25 4:15 pm, Ryan Roberts wrote:
> On 30/06/2025 04:33, Dev Jain wrote:
>> On 30/06/25 4:35 am, Andrew Morton wrote:
>>> On Sat, 28 Jun 2025 17:04:31 +0530 Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>> This patchset optimizes the mprotect() system call for large folios
>>>> by PTE-batching. No issues were observed with mm-selftests, build
>>>> tested on x86_64.
>>> um what. Seems to claim that "selftests still compiles after I messed
>>> with stuff", which isn't very impressive ;) Please clarify?
>> Sorry I mean to say that the mm-selftests pass.
> I think you're saying you both compiled and ran the mm selftests for arm64. And
> additionally you compiled for x86_64? (Just trying to help clarify).

Yes, ran mm-selftests on arm64, and build-tested the patches for x86.

>
>>>> We use the following test cases to measure performance, mprotect()'ing
>>>> the mapped memory to read-only then read-write 40 times:
>>>>
>>>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>>>> pte-mapping those THPs
>>>> Test case 2: Mapping 1G of memory with 64K mTHPs
>>>> Test case 3: Mapping 1G of memory with 4K pages
>>>>
>>>> Average execution time on arm64, Apple M3:
>>>> Before the patchset:
>>>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>>>
>>>> After the patchset:
>>>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>>> Well that's tasty.
>>>
>>>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>>>> introduced by ptep_get() on a contpte block. And, for large folios we get
>>>> an almost 74% performance improvement, albeit the trade-off being a slight
>>>> degradation in the small folio case.
>>>>
>
On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
> This patchset optimizes the mprotect() system call for large folios
> by PTE-batching. No issues were observed with mm-selftests, build
> tested on x86_64.

Should also be tested on x86-64 not only build tested :)

You are still not really giving details here, so same comment as your mremap()
series, please explain why you're doing this, what for, what benefits you expect
to achieve, where etc.

E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
benefits on amd64 also and for intel there should be no impact'.

It's probably also worth actually going and checking to make sure that this is
the case re: other arches. See below on that...

>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>
> After the patchset:
> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>
> Observing T1/T2 and T3 before the patchset, we also remove the regression
> introduced by ptep_get() on a contpte block. And, for large folios we get
> an almost 74% performance improvement, albeit the trade-off being a slight
> degradation in the small folio case.

This is nice, though order-0 is probably going to be your bread and butter no?

Having said that, mprotect() is not a hot path, this delta is small enough to
quite possibly just be noise, and personally I'm not all that bothered.

But let's run this same test on x86-64 too please and get some before/after
numbers just to confirm no major impact.

Thanks for including code.

>
> Here is the test program:
>
> #define _GNU_SOURCE
> #include <sys/mman.h>
> #include <stdlib.h>
> #include <string.h>
> #include <stdio.h>
> #include <unistd.h>
>
> #define SIZE (1024*1024*1024)
>
> unsigned long pmdsize = (1UL << 21);
> unsigned long pagesize = (1UL << 12);
>
> static void pte_map_thps(char *mem, size_t size)
> {
>         size_t offs;
>         int ret = 0;
>
>         /* PTE-map each THP by temporarily splitting the VMAs. */
>         for (offs = 0; offs < size; offs += pmdsize) {
>                 ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
>                 ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
>         }
>
>         if (ret) {
>                 fprintf(stderr, "ERROR: madvise() failed\n");
>                 exit(1);
>         }
> }
>
> int main(int argc, char *argv[])
> {
>         char *p;
>         int ret = 0;
>
>         p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (p != (void *)(1UL << 30)) {
>                 perror("mmap");
>                 return 1;
>         }
>
>         memset(p, 0, SIZE);
>         if (madvise(p, SIZE, MADV_NOHUGEPAGE))
>                 perror("madvise");
>         explicit_bzero(p, SIZE);
>         pte_map_thps(p, SIZE);
>
>         for (int loops = 0; loops < 40; loops++) {
>                 if (mprotect(p, SIZE, PROT_READ))
>                         perror("mprotect"), exit(1);
>                 if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
>                         perror("mprotect"), exit(1);
>                 explicit_bzero(p, SIZE);
>         }
> }
>
> ---
> The patchset is rebased onto Saturday's mm-new.
>
> v3->v4:
>  - Refactor skipping logic into a new function, edit patch 1 subject
>    to highlight it is only for MM_CP_PROT_NUMA case (David H)
>  - Refactor the optimization logic, add more documentation to the generic
>    batched functions, do not add clear_flush_ptes, squash patch 4
>    and 5 (Ryan)
>
> v2->v3:
>  - Add comments for the new APIs (Ryan, Lorenzo)
>  - Instead of refactoring, use a "skip_batch" label
>  - Move arm64 patches at the end (Ryan)
>  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>  - Resolve implicit declaration; tested build on x86 (Lance Yang)
>
> v1->v2:
>  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>  - Abridge the anon-exclusive condition (Lance Yang)
>
> Dev Jain (4):
>   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>   mm: Add batched versions of ptep_modify_prot_start/commit
>   mm: Optimize mprotect() by PTE-batching
>   arm64: Add batched versions of ptep_modify_prot_start/commit
>
>  arch/arm64/include/asm/pgtable.h |  10 ++
>  arch/arm64/mm/mmu.c              |  28 +++-
>  include/linux/pgtable.h          |  83 +++++++++-
>  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>  4 files changed, 315 insertions(+), 75 deletions(-)
>
> --
> 2.30.2
>
On 30/06/25 4:47 pm, Lorenzo Stoakes wrote:
> On Sat, Jun 28, 2025 at 05:04:31PM +0530, Dev Jain wrote:
>> This patchset optimizes the mprotect() system call for large folios
>> by PTE-batching. No issues were observed with mm-selftests, build
>> tested on x86_64.
> Should also be tested on x86-64 not only build tested :)
>
> You are still not really giving details here, so same comment as your mremap()
> series, please explain why you're doing this, what for, what benefits you expect
> to achieve, where etc.
>
> E.g. 'this is designed to optimise mTHP cases on arm64, we expect to see
> benefits on amd64 also and for intel there should be no impact'.

Okay.

>
> It's probably also worth actually going and checking to make sure that this is
> the case re: other arches. See below on that...
>
>> We use the following test cases to measure performance, mprotect()'ing
>> the mapped memory to read-only then read-write 40 times:
>>
>> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
>> pte-mapping those THPs
>> Test case 2: Mapping 1G of memory with 64K mTHPs
>> Test case 3: Mapping 1G of memory with 4K pages
>>
>> Average execution time on arm64, Apple M3:
>> Before the patchset:
>> T1: 7.9 seconds T2: 7.9 seconds T3: 4.2 seconds
>>
>> After the patchset:
>> T1: 2.1 seconds T2: 2.2 seconds T3: 4.3 seconds
>>
>> Observing T1/T2 and T3 before the patchset, we also remove the regression
>> introduced by ptep_get() on a contpte block. And, for large folios we get
>> an almost 74% performance improvement, albeit the trade-off being a slight
>> degradation in the small folio case.
> This is nice, though order-0 is probably going to be your bread and butter no?
>
> Having said that, mprotect() is not a hot path, this delta is small enough to
> quite possibly just be noise, and personally I'm not all that bothered.

It is only the vm_normal_folio() + folio_test_large() overhead. Trying to
avoid this by the horrible maybe_contiguous_pte_pfns() I introduced somewhere
else is not worth it : )

>
> But let's run this same test on x86-64 too please and get some before/after
> numbers just to confirm no major impact.
>
> Thanks for including code.
>
>> Here is the test program:
>>
>> #define _GNU_SOURCE
>> #include <sys/mman.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> #define SIZE (1024*1024*1024)
>>
>> unsigned long pmdsize = (1UL << 21);
>> unsigned long pagesize = (1UL << 12);
>>
>> static void pte_map_thps(char *mem, size_t size)
>> {
>>         size_t offs;
>>         int ret = 0;
>>
>>         /* PTE-map each THP by temporarily splitting the VMAs. */
>>         for (offs = 0; offs < size; offs += pmdsize) {
>>                 ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
>>                 ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
>>         }
>>
>>         if (ret) {
>>                 fprintf(stderr, "ERROR: madvise() failed\n");
>>                 exit(1);
>>         }
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>         char *p;
>>         int ret = 0;
>>
>>         p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
>>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>         if (p != (void *)(1UL << 30)) {
>>                 perror("mmap");
>>                 return 1;
>>         }
>>
>>         memset(p, 0, SIZE);
>>         if (madvise(p, SIZE, MADV_NOHUGEPAGE))
>>                 perror("madvise");
>>         explicit_bzero(p, SIZE);
>>         pte_map_thps(p, SIZE);
>>
>>         for (int loops = 0; loops < 40; loops++) {
>>                 if (mprotect(p, SIZE, PROT_READ))
>>                         perror("mprotect"), exit(1);
>>                 if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
>>                         perror("mprotect"), exit(1);
>>                 explicit_bzero(p, SIZE);
>>         }
>> }
>>
>> ---
>> The patchset is rebased onto Saturday's mm-new.
>>
>> v3->v4:
>>  - Refactor skipping logic into a new function, edit patch 1 subject
>>    to highlight it is only for MM_CP_PROT_NUMA case (David H)
>>  - Refactor the optimization logic, add more documentation to the generic
>>    batched functions, do not add clear_flush_ptes, squash patch 4
>>    and 5 (Ryan)
>>
>> v2->v3:
>>  - Add comments for the new APIs (Ryan, Lorenzo)
>>  - Instead of refactoring, use a "skip_batch" label
>>  - Move arm64 patches at the end (Ryan)
>>  - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
>>  - Resolve implicit declaration; tested build on x86 (Lance Yang)
>>
>> v1->v2:
>>  - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
>>  - Abridge the anon-exclusive condition (Lance Yang)
>>
>> Dev Jain (4):
>>   mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
>>   mm: Add batched versions of ptep_modify_prot_start/commit
>>   mm: Optimize mprotect() by PTE-batching
>>   arm64: Add batched versions of ptep_modify_prot_start/commit
>>
>>  arch/arm64/include/asm/pgtable.h |  10 ++
>>  arch/arm64/mm/mmu.c              |  28 +++-
>>  include/linux/pgtable.h          |  83 +++++++++-
>>  mm/mprotect.c                    | 269 +++++++++++++++++++++++--------
>>  4 files changed, 315 insertions(+), 75 deletions(-)
>>
>> --
>> 2.30.2
>>
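[Note: the following is a purely illustrative userspace toy model of the
batching idea discussed above, not code from the series. "PTEs" are modelled
as pfn+prot pairs, and the loop rewrites a run of entries with consecutive
pfns and identical protections as one batch, the way consecutive PTEs mapping
one large folio can be handled together. The per-entry lookup that decides
whether such a run exists at all is the small extra cost for order-0 mappings
that Dev refers to (vm_normal_folio() + folio_test_large()).]

#include <stdio.h>

struct fake_pte {
        unsigned long pfn;
        unsigned int prot;      /* e.g. 1 = read-only, 3 = read-write */
};

/* Length of the run of consecutive-pfn, same-prot entries starting at i. */
static int batch_len(const struct fake_pte *ptes, int i, int nr)
{
        int len = 1;

        while (i + len < nr &&
               ptes[i + len].pfn == ptes[i].pfn + len &&
               ptes[i + len].prot == ptes[i].prot)
                len++;
        return len;
}

int main(void)
{
        struct fake_pte ptes[8] = {
                { 100, 3 }, { 101, 3 }, { 102, 3 }, { 103, 3 }, /* one "large folio" */
                { 500, 3 }, { 900, 3 }, { 901, 1 }, { 42, 3 },  /* order-0 / mixed */
        };
        int nr = 8;

        for (int i = 0; i < nr; ) {
                int len = batch_len(ptes, i, nr);

                /* One "protection change" covers the whole batch. */
                for (int j = 0; j < len; j++)
                        ptes[i + j].prot = 1;
                printf("changed %d entr%s starting at pfn %lu\n",
                       len, len == 1 ? "y" : "ies", ptes[i].pfn);
                i += len;
        }
        return 0;
}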
To reiterate what I said on 1/4 - overall since this series conflicts with David's changes - can we hold off on any respin please until David's settles and lands in mm-new at least? Thanks.
On 30/06/25 4:57 pm, Lorenzo Stoakes wrote:
> To reiterate what I said on 1/4 - overall since this series conflicts with
> David's changes - can we hold off on any respin please until David's
> settles and lands in mm-new at least?
>
> Thanks.

I agree, David's series should be stable by the time I get ready to post the
next version. @Andrew, could you remove this from mm-new please?