Micro-optimize the change_protection functionality and the
change_pte_range() routine. This set of functions works in an incredibly
tight loop, and even small inefficiencies become very evident when spun
hundreds, thousands or hundreds of thousands of times.

There was an attempt to keep the batching functionality as much as possible,
which introduced some part of the slowness, but not all of it. Removing it
for !arm64 architectures would speed mprotect() up even further, but could
easily pessimize cases where large folios are mapped (which is not as rare
as it seems, particularly when it comes to the page cache these days).

The micro-benchmark used for the tests was [0] (usable with google/benchmark
and g++ -O2 -lbenchmark repro.cpp).

This resulted in the following (first entry is baseline):

---------------------------------------------------------
Benchmark           Time             CPU       Iterations
---------------------------------------------------------
mprotect_bench   85967 ns        85967 ns           6935
mprotect_bench   73374 ns        73373 ns           9602

After the patchset we can observe a 14% speedup in mprotect. Wonderful
for the elusive mprotect-based workloads!

Testing & more ideas welcome. I suspect there is plenty of improvement
possible but it would require more time than what I have on my hands right
now. The entire inlined function (which inlines into change_protection())
is gigantic - I'm not surprised this is so finicky.

Note: per my profiling, the next _big_ bottleneck here is
modify_prot_start_ptes(), exactly on the xchg() done by x86.
ptep_get_and_clear() is _expensive_. I don't think there's a properly safe
way to go about it since we do depend on the D bit quite a lot. This might
not be such an issue on other architectures.
[0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/

Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Luke Yang <luyang@redhat.com>
Cc: jhladky@redhat.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

v2:
- Addressed Sashiko's concerns
- Picked up Lorenzo's R-b's (thank you!)
- Squashed patch 1 and 4 into a single one (David)
- Renamed the softleaf leaf function (David)
- Dropped controversial noinlines & patch 3 (Lorenzo & David)

v1: https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/

Pedro Falcato (2):
  mm/mprotect: move softleaf code out of the main function
  mm/mprotect: special-case small folios when applying write permissions

 mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
 1 file changed, 81 insertions(+), 65 deletions(-)

--
2.53.0
Hi Pedro,

Thanks for working on this. I just wanted to share that we've created a
test kernel with your patches and tested on the following CPUs:

--- aarch64 ---
Ampere Altra
Ampere Altra Max

--- x86_64 ---
AMD EPYC 7713
AMD EPYC 7351
AMD EPYC 7542
AMD EPYC 7573X
AMD EPYC 7702
AMD EPYC 9754
Intel Xeon Gold 6126
Intel Xeon Gold 6330
Intel Xeon Gold 6530
Intel Xeon Platinum 8351N
Intel Core i7-6820HQ

--- ppc64le ---
IBM Power 10

On average, we see improvements ranging from a minimum of 5% to a
maximum of 55%, with most improvements showing around a 25% speed up in
the libmicro/mprot_tw4m micro benchmark.

Thanks,
Luke

On Tue, Mar 24, 2026 at 11:44 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> Micro-optimize the change_protection functionality and the
> change_pte_range() routine. This set of functions works in an incredibly
> tight loop, and even small inefficiencies are incredibly evident when spun
> hundreds, thousands or hundreds of thousands of times.
>
> There was an attempt to keep the batching functionality as much as possible,
> which introduced some part of the slowness, but not all of it. Removing it
> for !arm64 architectures would speed mprotect() up even further, but could
> easily pessimize cases where large folios are mapped (which is not as rare
> as it seems, particularly when it comes to the page cache these days).
>
> The micro-benchmark used for the tests was [0] (usable using google/benchmark
> and g++ -O2 -lbenchmark repro.cpp)
>
> This resulted in the following (first entry is baseline):
>
> ---------------------------------------------------------
> Benchmark           Time             CPU       Iterations
> ---------------------------------------------------------
> mprotect_bench   85967 ns        85967 ns           6935
> mprotect_bench   73374 ns        73373 ns           9602
>
> After the patchset we can observe a 14% speedup in mprotect. Wonderful
> for the elusive mprotect-based workloads!
>
> Testing & more ideas welcome. I suspect there is plenty of improvement possible
> but it would require more time than what I have on my hands right now. The
> entire inlined function (which inlines into change_protection()) is gigantic
> - I'm not surprised this is so finicky.
>
> Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes,
> exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think
> there's a properly safe way to go about it since we do depend on the D bit
> quite a lot. This might not be such an issue on other architectures.
>
> [0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
> Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/
>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Luke Yang <luyang@redhat.com>
> Cc: jhladky@redhat.com
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> v2:
> - Addressed Sashiko's concerns
> - Picked up Lorenzo's R-b's (thank you!)
> - Squashed patch 1 and 4 into a single one (David)
> - Renamed the softleaf leaf function (David)
> - Dropped controversial noinlines & patch 3 (Lorenzo & David)
>
> v1:
> https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/
>
> Pedro Falcato (2):
>   mm/mprotect: move softleaf code out of the main function
>   mm/mprotect: special-case small folios when applying write permissions
>
>  mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 81 insertions(+), 65 deletions(-)
>
> --
> 2.53.0
On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> Hi Pedro,
>
> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
saw 55%? Or was it just inter-run variance?

--
Pedro
On Wed, Apr 1, 2026 at 10:11 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> > Hi Pedro,
> >
> > Thanks for working on this. I just wanted to share that we've created a
> > test kernel with your patches and tested on the following CPUs:
> >
> > --- aarch64 ---
> > Ampere Altra
> > Ampere Altra Max
> >
> > --- x86_64 ---
> > AMD EPYC 7713
> > AMD EPYC 7351
> > AMD EPYC 7542
> > AMD EPYC 7573X
> > AMD EPYC 7702
> > AMD EPYC 9754
> > Intel Xeon Gold 6126
> > Intel Xeon Gold 6330
> > Intel Xeon Gold 6530
> > Intel Xeon Platinum 8351N
> > Intel Core i7-6820HQ
> >
> > --- ppc64le ---
> > IBM Power 10
> >
> > On average, we see improvements ranging from a minimum of 5% to a
> > maximum of 55%, with most improvements showing around a 25% speed up in
> > the libmicro/mprot_tw4m micro benchmark.
>
> Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
> saw 55%? Or was it just inter-run variance?
>
> --
> Pedro

5% -> Ampere Altra Max
55% -> Ampere Altra

Personally, I can't conclude at the moment if this is just inter-run
variance. However, let me re-run the tests a few times on these two
machines to see if this is consistent.

Luke
On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:

> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Thanks, that's nice. I've added some of the above into the changelog
and I took the liberty of adding your Tested-by: to both patches.

fyi, regarding [2/2]: it's unclear to me whether the discussion with
David will result in any alterations. If there's something I need to do,
it always helps to lmk ;)
On 3/30/26 22:06, Andrew Morton wrote:
> On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
>
>> Thanks for working on this. I just wanted to share that we've created a
>> test kernel with your patches and tested on the following CPUs:
>>
>> --- aarch64 ---
>> Ampere Altra
>> Ampere Altra Max
>>
>> --- x86_64 ---
>> AMD EPYC 7713
>> AMD EPYC 7351
>> AMD EPYC 7542
>> AMD EPYC 7573X
>> AMD EPYC 7702
>> AMD EPYC 9754
>> Intel Xeon Gold 6126
>> Intel Xeon Gold 6330
>> Intel Xeon Gold 6530
>> Intel Xeon Platinum 8351N
>> Intel Core i7-6820HQ
>>
>> --- ppc64le ---
>> IBM Power 10
>>
>> On average, we see improvements ranging from a minimum of 5% to a
>> maximum of 55%, with most improvements showing around a 25% speed up in
>> the libmicro/mprot_tw4m micro benchmark.
>
> Thanks, that's nice. I've added some of the above into the changelog
> and I took the liberty of adding your Tested-by: to both patches.
>
> fyi, regarding [2/2]: it's unclear to me whether the discussion with
> David will result in any alterations. If there's something I need to
> it always helps to lmk ;)
I think we want to get a better understanding of which exact __always_inline
is really helpful in patch #2, and where to apply the nr_ptes==1 forced
optimization.
I updated the microbenchmark I use for fork+unmap etc. to measure
mprotect as well:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads
Running some simple tests with order-0 on 1 GiB of memory:
Upstream Linus:
./pte-mapped-folio-benchmarks 0 write-protect 5
0.005779
...
./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.009113
...
With Pedro's patch #2:
$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003941
...
$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.006163
...
With the patch below:
$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003364
$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.005729
So patch #2 might be improved. And the forced inlining of
mprotect_folio_pte_batch() should likely not go into the same patch.
---
From cf1a2a4a6ef95ed541947f2fd9d8351bef664426 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 1 Apr 2026 08:15:44 +0000
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
mm/mprotect.c | 79 +++++++++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 31 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c0571445bef7..8d14c05a11a2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,7 +117,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
}
/* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
int idx, bool set_write, struct mmu_gather *tlb)
{
@@ -143,7 +143,7 @@ static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long add
* !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
* that the ptes point to consecutive pages of the same anon large folio.
*/
-static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
struct page *first_page, bool expected_anon_exclusive)
{
int idx;
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
* pte of the batch. Therefore, we must individually check all pages and
* retrieve sub-batches.
*/
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
{
@@ -188,7 +188,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
}
}
-static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
+static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep,
pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
{
@@ -211,6 +211,41 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
}
+static __always_inline void change_present_ptes(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ pgprot_t newprot, unsigned long cp_flags,
+ struct folio *folio, struct page *page, pte_t *pte,
+ unsigned int nr_ptes)
+{
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ pte_t oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
+ pte_t ptent = pte_modify(oldpte, newprot);
+
+ if (uffd_wp)
+ ptent = pte_mkuffd_wp(ptent);
+ else if (uffd_wp_resolve)
+ ptent = pte_clear_uffd_wp(ptent);
+
+ /*
+ * In some writable, shared mappings, we might want to catch actual
+ * write access -- see vma_wants_writenotify().
+ *
+ * In all writable, private mappings, we have to properly handle COW.
+ *
+ * In both cases, we can sometimes still change PTEs writable and avoid
+ * the write-fault handler, for example, if a PTE is already dirty and
+ * no other COW or special handling is required.
+ */
+ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+ !pte_write(ptent))
+ set_write_prot_commit_flush_ptes(vma, folio, page, addr, pte,
+ oldpte, ptent, nr_ptes, tlb);
+ else
+ prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
+ nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+}
+
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -242,7 +277,6 @@ static long change_pte_range(struct mmu_gather *tlb,
int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
struct folio *folio = NULL;
struct page *page;
- pte_t ptent;
/* Already in the desired state. */
if (prot_numa && pte_protnone(oldpte))
@@ -268,34 +302,17 @@ static long change_pte_range(struct mmu_gather *tlb,
nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
- oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
- ptent = pte_modify(oldpte, newprot);
-
- if (uffd_wp)
- ptent = pte_mkuffd_wp(ptent);
- else if (uffd_wp_resolve)
- ptent = pte_clear_uffd_wp(ptent);
-
/*
- * In some writable, shared mappings, we might want
- * to catch actual write access -- see
- * vma_wants_writenotify().
- *
- * In all writable, private mappings, we have to
- * properly handle COW.
- *
- * In both cases, we can sometimes still change PTEs
- * writable and avoid the write-fault handler, for
- * example, if a PTE is already dirty and no other
- * COW or special handling is required.
+ * Optimize for order-0 folios by optimizing out all
+ * loops.
*/
- if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
- !pte_write(ptent))
- set_write_prot_commit_flush_ptes(vma, folio, page,
- addr, pte, oldpte, ptent, nr_ptes, tlb);
- else
- prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
- nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+ if (nr_ptes == 1) {
+ change_present_ptes(tlb, vma, addr, newprot,
+ cp_flags, folio, page, pte, 1);
+ } else {
+ change_present_ptes(tlb, vma, addr, newprot,
+ cp_flags, folio, page, pte, nr_ptes);
+ }
pages += nr_ptes;
} else if (pte_none(oldpte)) {
/*
--
2.53.0
--
Cheers,
David
On Wed, Apr 01, 2026 at 10:25:40AM +0200, David Hildenbrand (Arm) wrote:
> On 3/30/26 22:06, Andrew Morton wrote:
> > On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
> >
> >> Thanks for working on this. I just wanted to share that we've created a
> >> test kernel with your patches and tested on the following CPUs:
> >>
> >> --- aarch64 ---
> >> Ampere Altra
> >> Ampere Altra Max
> >>
> >> --- x86_64 ---
> >> AMD EPYC 7713
> >> AMD EPYC 7351
> >> AMD EPYC 7542
> >> AMD EPYC 7573X
> >> AMD EPYC 7702
> >> AMD EPYC 9754
> >> Intel Xeon Gold 6126
> >> Intel Xeon Gold 6330
> >> Intel Xeon Gold 6530
> >> Intel Xeon Platinum 8351N
> >> Intel Core i7-6820HQ
> >>
> >> --- ppc64le ---
> >> IBM Power 10
> >>
> >> On average, we see improvements ranging from a minimum of 5% to a
> >> maximum of 55%, with most improvements showing around a 25% speed up in
> >> the libmicro/mprot_tw4m micro benchmark.
> >
> > Thanks, that's nice. I've added some of the above into the changelog
> > and I took the liberty of adding your Tested-by: to both patches.
> >
> > fyi, regarding [2/2]: it's unclear to me whether the discussion with
> > David will result in any alterations. If there's something I need to do,
> > it always helps to lmk ;)
>
> I think we want to get a better understanding of which exact __always_inline
> is really helpful in patch #2, and where to apply the nr_ptes==1 forced
> optimization.
>
> I updated my microbenchmark I use for fork+unmap etc to measure
> mprotect as well
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads
>
> Running some simple tests with order-0 on 1 GiB of memory:
>
> Upstream Linus:
>
> ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.005779
> ...
> ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.009113
> ...
>
> With Pedro's patch #2:
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003941
> ...
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.006163
> ...
>
> With the patch below:
>
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003364
>
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.005729

Hmm. Thanks for the testing. Interesting. I'll give it a shot.

I'll have results and/or a possible v3 by tomorrow, if need be. Apologies
for the slight delay here! :)

--
Pedro