Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all
16 entries to collect a/d bits. Hence this optimization will result in
a 16x reduction in the number of ptep_get() calls. Next,
ptep_modify_prot_start() will eventually call contpte_try_unfold() on
every contig block, thus flushing the TLB for the complete large folio
range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
each contig block, and only do them on the starting and ending
contig block.

For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement
is expected due to a reduction in the number of function calls.

mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
no new tests fail as a result of this patchset.

We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:

Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
             pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages

Average execution time on arm64, Apple M3:

Before the patchset:
T1: 2.1 seconds   T2: 2 seconds    T3: 1 second

After the patchset:
T1: 0.65 seconds  T2: 0.7 seconds  T3: 1.1 seconds

Comparing T1/T2 against T3 before the patchset shows that we also remove
the regression introduced by ptep_get() on a contpte block. And, for large
folios we get an almost 74% performance improvement, albeit with a slight
degradation in the small folio case as the trade-off.

For x86:

Before the patchset:
T1: 3.75 seconds  T2: 3.7 seconds  T3: 3.85 seconds

After the patchset:
T1: 3.7 seconds   T2: 3.7 seconds  T3: 3.9 seconds

So there is a minor improvement due to a reduction in the number of
function calls, and a slight degradation in the small folio case due to
the overhead of vm_normal_folio() + folio_test_large().
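To give an idea of the shape of the change, the core loop in
change_pte_range() roughly becomes the following. This is only a
simplified sketch, not the actual patch: argument lists are trimmed, and
the batched helper names simply mirror the patch titles ("batched versions
of ptep_modify_prot_start/commit") rather than being exact signatures.

	do {
		nr_ptes = 1;
		oldpte = ptep_get(pte);
		if (pte_present(oldpte)) {
			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
			struct folio *folio = vm_normal_folio(vma, addr, oldpte);

			/* Batch only ptes mapping the same large folio. */
			if (folio && folio_test_large(folio))
				nr_ptes = folio_pte_batch(folio, pte, oldpte,
							  max_nr_ptes);

			/*
			 * Batched replacements for ptep_modify_prot_start()/
			 * ptep_modify_prot_commit(); on arm64 these avoid
			 * per-contig-block TLB flushes by using
			 * get_and_clear_full_ptes().
			 */
			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
			ptent = pte_modify(oldpte, newprot);
			/* ... writability/dirty handling as before ... */
			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent,
						nr_ptes);
		}
	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);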
Here is the test program:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);

static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;

	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: madvise() failed\n");
		exit(1);
	}
}

int main(int argc, char *argv[])
{
	char *p;

	p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (void *)(1UL << 30)) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, SIZE);

	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}

	return 0;
}

---
v4->v5:
 - Add patch 4
 - Add patch 1 (Lorenzo)
 - For patch 2, instead of using nr_ptes returned from prot_numa_skip()
   as a dummy for whether to skip or not, make that function return a
   boolean, and then use folio_pte_batch() to determine how much to skip
 - Split can_change_pte_writable() (Lorenzo)
 - Implement patch 6 in a better way

v3->v4:
 - Refactor skipping logic into a new function, edit patch 1 subject to
   highlight it is only for the MM_CP_PROT_NUMA case (David H)
 - Refactor the optimization logic, add more documentation to the generic
   batched functions, do not add clear_flush_ptes(), squash patches 4 and
   5 (Ryan)

v2->v3:
 - Add comments for the new APIs (Ryan, Lorenzo)
 - Instead of refactoring, use a "skip_batch" label
 - Move arm64 patches to the end (Ryan)
 - In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
 - Resolve implicit declaration; tested build on x86 (Lance Yang)

v1->v2:
 - Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header
   more resilient)
 - Abridge the anon-exclusive condition (Lance Yang)

Dev Jain (7):
  mm: Refactor MM_CP_PROT_NUMA skipping case into new function
  mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
  mm: Add batched versions of ptep_modify_prot_start/commit
  mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
  mm: Split can_change_pte_writable() into private and shared parts
  mm: Optimize mprotect() by PTE batching
  arm64: Add batched versions of ptep_modify_prot_start/commit

 arch/arm64/include/asm/pgtable.h |  10 ++
 arch/arm64/mm/mmu.c              |  28 ++-
 include/linux/pgtable.h          |  84 ++++++++-
 mm/internal.h                    |  11 +-
 mm/mprotect.c                    | 295 ++++++++++++++++++++++++-------
 5 files changed, 352 insertions(+), 76 deletions(-)

--
2.30.2
On 18/07/25 2:32 pm, Dev Jain wrote:
> Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all
> 16 entries to collect a/d bits. Hence this optimization will result in
> a 16x reduction in the number of ptep_get() calls. Next,
> ptep_modify_prot_start() will eventually call contpte_try_unfold() on
> every contig block, thus flushing the TLB for the complete large folio
> range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
> each contig block, and only do them on the starting and ending
> contig block.
>
> For split folios, there will be no pte batching; the batch size returned
> by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
> still point to the same large folio; for arm64, this results in the
> optimization described above, and for other arches, a minor improvement
> is expected due to a reduction in the number of function calls.
>
> mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
> no new tests fail as a result of this patchset.
>
> We use the following test cases to measure performance, mprotect()'ing
> the mapped memory to read-only then read-write 40 times:
>
> Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> pte-mapping those THPs
> Test case 2: Mapping 1G of memory with 64K mTHPs
> Test case 3: Mapping 1G of memory with 4K pages
>
> Average execution time on arm64, Apple M3:
> Before the patchset:
> T1: 2.1 seconds T2: 2 seconds T3: 1 second
>
> After the patchset:
> T1: 0.65 seconds T2: 0.7 seconds T3: 1.1 seconds
>

For the note: the numbers are different from the previous versions.
I must have run the test for a larger number of iterations and then
pasted the test program here with 40 iterations, which explains the
mismatch.
On Fri, Jul 18, 2025 at 03:20:16PM +0530, Dev Jain wrote:
>
> On 18/07/25 2:32 pm, Dev Jain wrote:
> > Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
> > are painted with the contig bit, then ptep_get() will iterate through all
> > 16 entries to collect a/d bits. Hence this optimization will result in
> > a 16x reduction in the number of ptep_get() calls. Next,
> > ptep_modify_prot_start() will eventually call contpte_try_unfold() on
> > every contig block, thus flushing the TLB for the complete large folio
> > range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on
> > each contig block, and only do them on the starting and ending
> > contig block.
> >
> > For split folios, there will be no pte batching; the batch size returned
> > by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
> > still point to the same large folio; for arm64, this results in the
> > optimization described above, and for other arches, a minor improvement
> > is expected due to a reduction in the number of function calls.
> >
> > mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
> > no new tests fail as a result of this patchset.
> >
> > We use the following test cases to measure performance, mprotect()'ing
> > the mapped memory to read-only then read-write 40 times:
> >
> > Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
> > pte-mapping those THPs
> > Test case 2: Mapping 1G of memory with 64K mTHPs
> > Test case 3: Mapping 1G of memory with 4K pages
> >
> > Average execution time on arm64, Apple M3:
> > Before the patchset:
> > T1: 2.1 seconds T2: 2 seconds T3: 1 second
> >
> > After the patchset:
> > T1: 0.65 seconds T2: 0.7 seconds T3: 1.1 seconds
> >
>
> For the note: the numbers are different from the previous versions.
> I must have run the test for a larger number of iterations and then
> pasted the test program here with 40 iterations, which explains the
> mismatch.
>

Thanks for this clarification!