Micro-optimize the change_protection functionality and the
change_pte_range() routine. This set of functions works in an incredibly
tight loop, and even small inefficiencies become very evident when spun
hundreds, thousands or hundreds of thousands of times.

There was an attempt to keep the batching functionality as much as possible,
which introduced some part of the slowness, but not all of it. Removing it
for !arm64 architectures would speed mprotect() up even further, but could
easily pessimize cases where large folios are mapped (which is not as rare
as it seems, particularly when it comes to the page cache these days).

The micro-benchmark used for the tests was [0] (usable with google/benchmark
and g++ -O2 -lbenchmark repro.cpp).

This resulted in the following (first entry is baseline):

---------------------------------------------------------
Benchmark           Time             CPU       Iterations
---------------------------------------------------------
mprotect_bench   85967 ns        85967 ns           6935
mprotect_bench   73374 ns        73373 ns           9602

After the patchset we can observe a 14% speedup in mprotect. Wonderful
for the elusive mprotect-based workloads!

Testing & more ideas welcome. I suspect there is plenty of improvement
possible but it would require more time than what I have on my hands right
now. The entire inlined function (which inlines into change_protection())
is gigantic - I'm not surprised this is so finicky.

Note: per my profiling, the next _big_ bottleneck here is
modify_prot_start_ptes(), exactly on the xchg() done by x86.
ptep_get_and_clear() is _expensive_. I don't think there's a properly safe
way to go about it since we do depend on the D bit quite a lot. This might
not be such an issue on other architectures.
[0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/

Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Luke Yang <luyang@redhat.com>
Cc: jhladky@redhat.com
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

v2:
- Addressed Sashiko's concerns
- Picked up Lorenzo's R-b's (thank you!)
- Squashed patch 1 and 4 into a single one (David)
- Renamed the softleaf leaf function (David)
- Dropped controversial noinlines & patch 3 (Lorenzo & David)

v1: https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/

Pedro Falcato (2):
  mm/mprotect: move softleaf code out of the main function
  mm/mprotect: special-case small folios when applying write permissions

 mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
 1 file changed, 81 insertions(+), 65 deletions(-)

--
2.53.0
Hi Pedro,

Thanks for working on this. I just wanted to share that we've created a
test kernel with your patches and tested on the following CPUs:

--- aarch64 ---
Ampere Altra
Ampere Altra Max

--- x86_64 ---
AMD EPYC 7713
AMD EPYC 7351
AMD EPYC 7542
AMD EPYC 7573X
AMD EPYC 7702
AMD EPYC 9754
Intel Xeon Gold 6126
Intel Xeon Gold 6330
Intel Xeon Gold 6530
Intel Xeon Platinum 8351N
Intel Core i7-6820HQ

--- ppc64le ---
IBM Power 10

On average, we see improvements ranging from a minimum of 5% to a
maximum of 55%, with most improvements showing around a 25% speed up in
the libmicro/mprot_tw4m micro benchmark.

Thanks,
Luke

On Tue, Mar 24, 2026 at 11:44 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> Micro-optimize the change_protection functionality and the
> change_pte_range() routine. This set of functions works in an incredibly
> tight loop, and even small inefficiencies are incredibly evident when spun
> hundreds, thousands or hundreds of thousands of times.
>
> There was an attempt to keep the batching functionality as much as possible,
> which introduced some part of the slowness, but not all of it. Removing it
> for !arm64 architectures would speed mprotect() up even further, but could
> easily pessimize cases where large folios are mapped (which is not as rare
> as it seems, particularly when it comes to the page cache these days).
>
> The micro-benchmark used for the tests was [0] (usable using google/benchmark
> and g++ -O2 -lbenchmark repro.cpp)
>
> This resulted in the following (first entry is baseline):
>
> ---------------------------------------------------------
> Benchmark           Time             CPU       Iterations
> ---------------------------------------------------------
> mprotect_bench   85967 ns        85967 ns           6935
> mprotect_bench   73374 ns        73373 ns           9602
>
> After the patchset we can observe a 14% speedup in mprotect. Wonderful
> for the elusive mprotect-based workloads!
>
> Testing & more ideas welcome. I suspect there is plenty of improvement possible
> but it would require more time than what I have on my hands right now. The
> entire inlined function (which inlines into change_protection()) is gigantic
> - I'm not surprised this is so finicky.
>
> Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes,
> exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think
> there's a properly safe way to go about it since we do depend on the D bit
> quite a lot. This might not be such an issue on other architectures.
>
> [0]: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
> Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/
>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Hildenbrand <david@kernel.org>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Luke Yang <luyang@redhat.com>
> Cc: jhladky@redhat.com
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> v2:
> - Addressed Sashiko's concerns
> - Picked up Lorenzo's R-b's (thank you!)
> - Squashed patch 1 and 4 into a single one (David)
> - Renamed the softleaf leaf function (David)
> - Dropped controversial noinlines & patch 3 (Lorenzo & David)
>
> v1:
> https://lore.kernel.org/linux-mm/20260319183108.1105090-1-pfalcato@suse.de/
>
> Pedro Falcato (2):
>   mm/mprotect: move softleaf code out of the main function
>   mm/mprotect: special-case small folios when applying write permissions
>
>  mm/mprotect.c | 146 ++++++++++++++++++++++++++++----------------------
>  1 file changed, 81 insertions(+), 65 deletions(-)
>
> --
> 2.53.0
On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> Hi Pedro,
>
> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
saw 55%? Or was it just inter-run variance?

--
Pedro
On Wed, Apr 1, 2026 at 10:11 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Mon, Mar 30, 2026 at 03:55:51PM -0400, Luke Yang wrote:
> > Hi Pedro,
> >
> > Thanks for working on this. I just wanted to share that we've created a
> > test kernel with your patches and tested on the following CPUs:
> >
> > --- aarch64 ---
> > Ampere Altra
> > Ampere Altra Max
> >
> > --- x86_64 ---
> > AMD EPYC 7713
> > AMD EPYC 7351
> > AMD EPYC 7542
> > AMD EPYC 7573X
> > AMD EPYC 7702
> > AMD EPYC 9754
> > Intel Xeon Gold 6126
> > Intel Xeon Gold 6330
> > Intel Xeon Gold 6530
> > Intel Xeon Platinum 8351N
> > Intel Core i7-6820HQ
> >
> > --- ppc64le ---
> > IBM Power 10
> >
> > On average, we see improvements ranging from a minimum of 5% to a
> > maximum of 55%, with most improvements showing around a 25% speed up in
> > the libmicro/mprot_tw4m micro benchmark.
>
> Nice! Thanks for the tests. I'm wondering, what CPU saw 5% and what CPU
> saw 55%? Or was it just inter-run variance?
>
> --
> Pedro

5% -> Ampere Altra Max
55% -> Ampere Altra

Personally, I can't conclude at the moment if this is just inter-run
variance. However, let me re-run the tests a few times on these two
machines to see if this is consistent.

Luke
On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:

> Thanks for working on this. I just wanted to share that we've created a
> test kernel with your patches and tested on the following CPUs:
>
> --- aarch64 ---
> Ampere Altra
> Ampere Altra Max
>
> --- x86_64 ---
> AMD EPYC 7713
> AMD EPYC 7351
> AMD EPYC 7542
> AMD EPYC 7573X
> AMD EPYC 7702
> AMD EPYC 9754
> Intel Xeon Gold 6126
> Intel Xeon Gold 6330
> Intel Xeon Gold 6530
> Intel Xeon Platinum 8351N
> Intel Core i7-6820HQ
>
> --- ppc64le ---
> IBM Power 10
>
> On average, we see improvements ranging from a minimum of 5% to a
> maximum of 55%, with most improvements showing around a 25% speed up in
> the libmicro/mprot_tw4m micro benchmark.

Thanks, that's nice. I've added some of the above into the changelog
and I took the liberty of adding your Tested-by: to both patches.

fyi, regarding [2/2]: it's unclear to me whether the discussion with
David will result in any alterations. If there's something I need to do,
it always helps to lmk ;)
On 3/30/26 22:06, Andrew Morton wrote:
> On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
>
>> Thanks for working on this. I just wanted to share that we've created a
>> test kernel with your patches and tested on the following CPUs:
>>
>> --- aarch64 ---
>> Ampere Altra
>> Ampere Altra Max
>>
>> --- x86_64 ---
>> AMD EPYC 7713
>> AMD EPYC 7351
>> AMD EPYC 7542
>> AMD EPYC 7573X
>> AMD EPYC 7702
>> AMD EPYC 9754
>> Intel Xeon Gold 6126
>> Intel Xeon Gold 6330
>> Intel Xeon Gold 6530
>> Intel Xeon Platinum 8351N
>> Intel Core i7-6820HQ
>>
>> --- ppc64le ---
>> IBM Power 10
>>
>> On average, we see improvements ranging from a minimum of 5% to a
>> maximum of 55%, with most improvements showing around a 25% speed up in
>> the libmicro/mprot_tw4m micro benchmark.
>
> Thanks, that's nice. I've added some of the above into the changelog
> and I took the liberty of adding your Tested-by: to both patches.
>
> fyi, regarding [2/2]: it's unclear to me whether the discussion with
> David will result in any alterations. If there's something I need to
> it always helps to lmk ;)
I think we want to get a better understanding of which exact __always_inline
is really helpful in patch #2, and where to apply the nr_ptes==1 forced
optimization.
I updated the microbenchmark I use for fork+unmap etc. to measure
mprotect as well:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads
Running some simple tests with order-0 on 1 GiB of memory:
Upstream Linus:
./pte-mapped-folio-benchmarks 0 write-protect 5
0.005779
...
./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.009113
...
With Pedro's patch #2:
$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003941
...
$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.006163
...
With the patch below:
$ ./pte-mapped-folio-benchmarks 0 write-protect 5
0.003364
$ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
0.005729
So patch #2 might be improved. And the forced inlining of
mprotect_folio_pte_batch() should likely not go into the same patch.
---
From cf1a2a4a6ef95ed541947f2fd9d8351bef664426 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Wed, 1 Apr 2026 08:15:44 +0000
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
mm/mprotect.c | 79 +++++++++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 31 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c0571445bef7..8d14c05a11a2 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,7 +117,7 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
}
/* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
int idx, bool set_write, struct mmu_gather *tlb)
{
@@ -143,7 +143,7 @@ static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long add
* !PageAnonExclusive() pages, starting from start_idx. Caller must enforce
* that the ptes point to consecutive pages of the same anon large folio.
*/
-static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
+static __always_inline int page_anon_exclusive_sub_batch(int start_idx, int max_len,
struct page *first_page, bool expected_anon_exclusive)
{
int idx;
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
* pte of the batch. Therefore, we must individually check all pages and
* retrieve sub-batches.
*/
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
{
@@ -188,7 +188,7 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
}
}
-static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
+static __always_inline void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
struct folio *folio, struct page *page, unsigned long addr, pte_t *ptep,
pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
{
@@ -211,6 +211,41 @@ static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma,
commit_anon_folio_batch(vma, folio, page, addr, ptep, oldpte, ptent, nr_ptes, tlb);
}
+static __always_inline void change_present_ptes(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ pgprot_t newprot, unsigned long cp_flags,
+ struct folio *folio, struct page *page, pte_t *pte,
+ unsigned int nr_ptes)
+{
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ pte_t oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
+ pte_t ptent = pte_modify(oldpte, newprot);
+
+ if (uffd_wp)
+ ptent = pte_mkuffd_wp(ptent);
+ else if (uffd_wp_resolve)
+ ptent = pte_clear_uffd_wp(ptent);
+
+ /*
+ * In some writable, shared mappings, we might want to catch actual
+ * write access -- see vma_wants_writenotify().
+ *
+ * In all writable, private mappings, we have to properly handle COW.
+ *
+ * In both cases, we can sometimes still change PTEs writable and avoid
+ * the write-fault handler, for example, if a PTE is already dirty and
+ * no other COW or special handling is required.
+ */
+ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+ !pte_write(ptent))
+ set_write_prot_commit_flush_ptes(vma, folio, page, addr, pte,
+ oldpte, ptent, nr_ptes, tlb);
+ else
+ prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
+ nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+}
+
static long change_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -242,7 +277,6 @@ static long change_pte_range(struct mmu_gather *tlb,
int max_nr_ptes = (end - addr) >> PAGE_SHIFT;
struct folio *folio = NULL;
struct page *page;
- pte_t ptent;
/* Already in the desired state. */
if (prot_numa && pte_protnone(oldpte))
@@ -268,34 +302,17 @@ static long change_pte_range(struct mmu_gather *tlb,
nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
- oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
- ptent = pte_modify(oldpte, newprot);
-
- if (uffd_wp)
- ptent = pte_mkuffd_wp(ptent);
- else if (uffd_wp_resolve)
- ptent = pte_clear_uffd_wp(ptent);
-
/*
- * In some writable, shared mappings, we might want
- * to catch actual write access -- see
- * vma_wants_writenotify().
- *
- * In all writable, private mappings, we have to
- * properly handle COW.
- *
- * In both cases, we can sometimes still change PTEs
- * writable and avoid the write-fault handler, for
- * example, if a PTE is already dirty and no other
- * COW or special handling is required.
+ * Optimize for order-0 folios by optimizing out all
+ * loops.
*/
- if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
- !pte_write(ptent))
- set_write_prot_commit_flush_ptes(vma, folio, page,
- addr, pte, oldpte, ptent, nr_ptes, tlb);
- else
- prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent,
- nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb);
+ if (nr_ptes == 1) {
+ change_present_ptes(tlb, vma, addr, newprot,
+ cp_flags, folio, page, pte, 1);
+ } else {
+ change_present_ptes(tlb, vma, addr, newprot,
+ cp_flags, folio, page, pte, nr_ptes);
+ }
pages += nr_ptes;
} else if (pte_none(oldpte)) {
/*
--
2.53.0
--
Cheers,
David
On Wed, Apr 01, 2026 at 10:25:40AM +0200, David Hildenbrand (Arm) wrote:
> On 3/30/26 22:06, Andrew Morton wrote:
> > On Mon, 30 Mar 2026 15:55:51 -0400 Luke Yang <luyang@redhat.com> wrote:
> >
> >> Thanks for working on this. I just wanted to share that we've created a
> >> test kernel with your patches and tested on the following CPUs:
> >>
> >> --- aarch64 ---
> >> Ampere Altra
> >> Ampere Altra Max
> >>
> >> --- x86_64 ---
> >> AMD EPYC 7713
> >> AMD EPYC 7351
> >> AMD EPYC 7542
> >> AMD EPYC 7573X
> >> AMD EPYC 7702
> >> AMD EPYC 9754
> >> Intel Xeon Gold 6126
> >> Intel Xeon Gold 6330
> >> Intel Xeon Gold 6530
> >> Intel Xeon Platinum 8351N
> >> Intel Core i7-6820HQ
> >>
> >> --- ppc64le ---
> >> IBM Power 10
> >>
> >> On average, we see improvements ranging from a minimum of 5% to a
> >> maximum of 55%, with most improvements showing around a 25% speed up in
> >> the libmicro/mprot_tw4m micro benchmark.
> >
> > Thanks, that's nice. I've added some of the above into the changelog
> > and I took the liberty of adding your Tested-by: to both patches.
> >
> > fyi, regarding [2/2]: it's unclear to me whether the discussion with
> > David will result in any alterations. If there's something I need to do,
> > it always helps to lmk ;)
>
> I think we want to get a better understanding of which exact __always_inline
> is really helpful in patch #2, and where to apply the nr_ptes==1 forced
> optimization.
>
> I updated my microbenchmark I use for fork+unmap etc to measure
> mprotect as well
>
> https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads
>
> Running some simple tests with order-0 on 1 GiB of memory:
>
> Upstream Linus:
>
> ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.005779
> ...
> ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.009113
> ...
>
> With Pedro's patch #2:
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003941
> ...
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.006163
> ...
>
> With the patch below:
>
> $ ./pte-mapped-folio-benchmarks 0 write-protect 5
> 0.003364
>
> $ ./pte-mapped-folio-benchmarks 0 write-unprotect 5
> 0.005729

Hmm. Thanks for the testing. Interesting. I'll give it a shot.

I'll have results and/or a possible v3 by tomorrow, if need be. Apologies
for the slight delay here! :)

--
Pedro