Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all
16 entries to collect a/d bits. Hence this optimization will result in
a 16x reduction in the number of ptep_get() calls. Next,
ptep_modify_prot_start() will eventually call contpte_try_unfold() on
every contig block, thus flushing the TLB for the complete large folio
range. Instead, use get_and_clear_full_ptes(), which elides the TLBIs for
the intermediate contig blocks and only issues them for the starting and
ending contig block of the range.
For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For pagetable split folios, the ptes will
still point to the same large folio; for arm64, this results in the
optimization described above, and for other arches, a minor improvement
is expected due to a reduction in the number of function calls.
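To illustrate the shape of the change, the inner loop of change_pte_range()
conceptually becomes something like the sketch below. This is a simplified
illustration only: the names modify_prot_start_ptes()/modify_prot_commit_ptes()
and the folio_pte_batch() signature shown here are schematic and may not match
the patches or a given kernel tree verbatim.

	pte_t oldpte, ptent;
	int nr_ptes;

	do {
		nr_ptes = 1;
		oldpte = ptep_get(pte);

		if (pte_present(oldpte)) {
			int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
			struct folio *folio = vm_normal_folio(vma, addr, oldpte);

			/* How many consecutive ptes map the same large folio? */
			if (folio && folio_test_large(folio) && max_nr_ptes != 1)
				nr_ptes = folio_pte_batch(folio, addr, pte,
							  oldpte, max_nr_ptes);

			/*
			 * One start/commit pair per batch; on arm64 the start
			 * side boils down to get_and_clear_full_ptes(), so the
			 * TLBI is no longer repeated for every contpte block.
			 */
			oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
			ptent = pte_modify(oldpte, newprot);
			modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent,
						nr_ptes);
		}
	} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);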
mm-selftests pass on arm64. I have some failing tests on my x86 VM already;
no new tests fail as a result of this patchset.
We use the following test cases to measure performance, mprotect()'ing
the mapped memory to read-only then read-write 40 times:
Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then
pte-mapping those THPs
Test case 2: Mapping 1G of memory with 64K mTHPs
Test case 3: Mapping 1G of memory with 4K pages
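(For reference, test case 2 presumably relies on the per-size THP sysfs
controls, i.e. setting /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
to "always" while keeping the PMD-size entry at "never"; test case 3 can be
obtained by disabling THP for the range, as the program's MADV_NOHUGEPAGE does.)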
Average execution time on arm64, Apple M3:
Before the patchset:
T1: 2.1 seconds T2: 2 seconds T3: 1 second
After the patchset:
T1: 0.65 seconds T2: 0.7 seconds T3: 1.1 seconds
Comparing T1/T2 against T3 before the patchset shows the regression
introduced by ptep_get() on a contpte block; the patchset removes that
regression. For large folios we get an almost 74% performance improvement,
the trade-off being a slight degradation in the small folio case.
For x86:
Before the patchset:
T1: 3.75 seconds T2: 3.7 seconds T3: 3.85 seconds
After the patchset:
T1: 3.7 seconds T2: 3.7 seconds T3: 3.9 seconds
So there is a minor improvement due to the reduction in the number of
function calls, and a slight degradation in the small folio case due to the
overhead of vm_normal_folio() + folio_test_large().
Here is the test program:
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#define SIZE (1024*1024*1024)
unsigned long pmdsize = (1UL << 21);
unsigned long pagesize = (1UL << 12);
static void pte_map_thps(char *mem, size_t size)
{
	size_t offs;
	int ret = 0;

	/* PTE-map each THP by temporarily splitting the VMAs. */
	for (offs = 0; offs < size; offs += pmdsize) {
		ret |= madvise(mem + offs, pagesize, MADV_DONTFORK);
		ret |= madvise(mem + offs, pagesize, MADV_DOFORK);
	}

	if (ret) {
		fprintf(stderr, "ERROR: madvise() failed\n");
		exit(1);
	}
}
int main(int argc, char *argv[])
{
	char *p;

	/* Map 1G at a fixed hint address; bail out if the hint is not honoured. */
	p = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p != (char *)(1UL << 30)) {
		perror("mmap");
		return 1;
	}

	/* Touch the range to populate it; with THP enabled this faults in PMD-THPs. */
	memset(p, 0, SIZE);

	/* Prevent (re)collapsing into THPs once they are PTE-mapped below. */
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	pte_map_thps(p, SIZE);

	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ | PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}

	return 0;
}
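(As written, this variant exercises test case 1. It can presumably be built
with a plain "gcc -O2" and timed with time(1), with transparent hugepages
enabled so that the initial memset() populates the range with PMD-THPs before
pte_map_thps() splits the mappings.)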
---
v4->v5:
- Add patch 4
- Add patch 1 (Lorenzo)
- For patch 2, instead of using nr_ptes returned from prot_numa_skip()
as a dummy for whether to skip or not, make that function return
boolean, and then use folio_pte_batch() to determine how much to
skip
- Split can_change_pte_writable() (Lorenzo)
- Implement patch 6 in a better way
v3->v4:
- Refactor skipping logic into a new function, edit patch 1 subject
to highlight it is only for MM_CP_PROT_NUMA case (David H)
- Refactor the optimization logic, add more documentation to the generic
batched functions, do not add clear_flush_ptes, squash patch 4
and 5 (Ryan)
v2->v3:
- Add comments for the new APIs (Ryan, Lorenzo)
- Instead of refactoring, use a "skip_batch" label
- Move arm64 patches to the end (Ryan)
- In can_change_pte_writable(), check AnonExclusive page-by-page (David H)
- Resolve implicit declaration; tested build on x86 (Lance Yang)
v1->v2:
- Rebase onto mm-unstable (6ebffe676fcf: util_macros.h: make the header more resilient)
- Abridge the anon-exclusive condition (Lance Yang)
Dev Jain (7):
mm: Refactor MM_CP_PROT_NUMA skipping case into new function
mm: Optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEs
mm: Add batched versions of ptep_modify_prot_start/commit
mm: Introduce FPB_RESPECT_WRITE for PTE batching infrastructure
mm: Split can_change_pte_writable() into private and shared parts
mm: Optimize mprotect() by PTE batching
arm64: Add batched versions of ptep_modify_prot_start/commit
arch/arm64/include/asm/pgtable.h | 10 ++
arch/arm64/mm/mmu.c | 28 ++-
include/linux/pgtable.h | 84 ++++++++-
mm/internal.h | 11 +-
mm/mprotect.c | 295 ++++++++++++++++++++++++-------
5 files changed, 352 insertions(+), 76 deletions(-)
--
2.30.2
On 18/07/25 2:32 pm, Dev Jain wrote:
> [cover letter quoted in full; trimmed]

For the note: the numbers here differ from the previous versions. I must
have run the test for a larger number of iterations and then pasted the
test program here with 40 iterations, hence the mismatch.
On Fri, Jul 18, 2025 at 03:20:16PM +0530, Dev Jain wrote:
> [cover letter quote trimmed]
>
> For the note: the numbers here differ from the previous versions. I must
> have run the test for a larger number of iterations and then pasted the
> test program here with 40 iterations, hence the mismatch.

Thanks for this clarification!