[v2] mm/mprotect: micro-optimization work

[PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by Pedro Falcato 1 week, 2 days ago

The common order-0 case is important enough to want its own branch, and
avoids the hairy, large loop logic that the CPU does not seem to handle
particularly well.

While at it, encourage the compiler to inline batch PTE logic and resolve
constant branches by adding __always_inline strategically.

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
 mm/mprotect.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2eaf862e5734..2fda26107066 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 	return can_change_shared_pte_writable(vma, pte);
 }
 
-static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 				    pte_t pte, int max_nr_ptes, fpb_t flags)
 {
 	/* No underlying folio, so cannot batch */
@@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 }
 
 /* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
-		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
-		int idx, bool set_write, struct mmu_gather *tlb)
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
+		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
 {
 	/*
 	 * Advance the position in the batch by idx; note that if idx > 0,
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
  * pte of the batch. Therefore, we must individually check all pages and
  * retrieve sub-batches.
  */
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
 		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
 		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
 {
@@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
 	int sub_batch_idx = 0;
 	int len;
 
+	/* Optimize for the common order-0 case. */
+	if (likely(nr_ptes == 1)) {
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
+				       0, PageAnonExclusive(first_page), tlb);
+		return;
+	}
+
 	while (nr_ptes) {
 		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
 		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
-- 
2.53.0

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by David Hildenbrand (Arm) 1 week, 2 days ago

On 3/24/26 16:43, Pedro Falcato wrote:
> The common order-0 case is important enough to want its own branch, and
> avoids the hairy, large loop logic that the CPU does not seem to handle
> particularly well.
> 
> While at it, encourage the compiler to inline batch PTE logic and resolve
> constant branches by adding __always_inline strategically.
> 
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
>  mm/mprotect.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 2eaf862e5734..2fda26107066 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  	return can_change_shared_pte_writable(vma, pte);
>  }
>  
> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>  {
>  	/* No underlying folio, so cannot batch */
> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>  }
>  
>  /* Set nr_ptes number of ptes, starting from idx */
> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> -		int idx, bool set_write, struct mmu_gather *tlb)
> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
>  {
>  	/*
>  	 * Advance the position in the batch by idx; note that if idx > 0,
> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>   * pte of the batch. Therefore, we must individually check all pages and
>   * retrieve sub-batches.
>   */
> -static void commit_anon_folio_batch(struct vm_area_struct *vma,
> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>  {
> @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
>  	int sub_batch_idx = 0;
>  	int len;
>  
> +	/* Optimize for the common order-0 case. */
> +	if (likely(nr_ptes == 1)) {
> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
> +				       0, PageAnonExclusive(first_page), tlb);

To optimize that one, inlining prot_commit_flush_ptes() would be
sufficient. Does inlining the other two really help? I don't think we
can optimize out loops etc. for them?

I would have thought that specializing on nr_ptes==0 on an even higher
level--where we call
set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow
for optimizing the loops entirely for nr_ptes==0?

-- 
Cheers,

David

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by Pedro Falcato 1 week, 1 day ago

On Tue, Mar 24, 2026 at 09:18:42PM +0100, David Hildenbrand (Arm) wrote:
> On 3/24/26 16:43, Pedro Falcato wrote:
> > The common order-0 case is important enough to want its own branch, and
> > avoids the hairy, large loop logic that the CPU does not seem to handle
> > particularly well.
> > 
> > While at it, encourage the compiler to inline batch PTE logic and resolve
> > constant branches by adding __always_inline strategically.
> > 
> > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> >  mm/mprotect.c | 17 ++++++++++++-----
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 2eaf862e5734..2fda26107066 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> >  	return can_change_shared_pte_writable(vma, pte);
> >  }
> >  
> > -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >  				    pte_t pte, int max_nr_ptes, fpb_t flags)
> >  {
> >  	/* No underlying folio, so cannot batch */
> > @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >  }
> >  
> >  /* Set nr_ptes number of ptes, starting from idx */
> > -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
> > -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
> > -		int idx, bool set_write, struct mmu_gather *tlb)
> > +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
> > +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
> > +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
> >  {
> >  	/*
> >  	 * Advance the position in the batch by idx; note that if idx > 0,
> > @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
> >   * pte of the batch. Therefore, we must individually check all pages and
> >   * retrieve sub-batches.
> >   */
> > -static void commit_anon_folio_batch(struct vm_area_struct *vma,
> > +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
> >  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
> >  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
> >  {
> > @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
> >  	int sub_batch_idx = 0;
> >  	int len;
> >  
> > +	/* Optimize for the common order-0 case. */
> > +	if (likely(nr_ptes == 1)) {
> > +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
> > +				       0, PageAnonExclusive(first_page), tlb);
> 
> To optimize that one, inlining prot_commit_flush_ptes() would be
> sufficient. Does inlining the other two really help? I don't think we
> can optimize out loops etc. for them?

Well, I'm getting meaningful (smaller) wins from adding those
__always_inline's. (and I also get a small win for __always_inline on
set_write_prot_commit_flush_ptes, but I didn't realize that until now).

> 
> I would have thought that specializing on nr_ptes==0 on an even higher
> level--where we call
> set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow
> for optimizing the loops entirely for nr_ptes==0?

That could also work, but then set_write_prot_commit_flush_ptes (holy cow
what a long name) would definitely need inlining. And might be a little uglier
overall.

This is the part where having data points other than my giga-fast-giga-powerful
zen5 could prove handy :/

-- 
Pedro

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by David Hildenbrand (Arm) 3 days, 19 hours ago

On 3/25/26 12:37, Pedro Falcato wrote:
> On Tue, Mar 24, 2026 at 09:18:42PM +0100, David Hildenbrand (Arm) wrote:
>> On 3/24/26 16:43, Pedro Falcato wrote:
>>> The common order-0 case is important enough to want its own branch, and
>>> avoids the hairy, large loop logic that the CPU does not seem to handle
>>> particularly well.
>>>
>>> While at it, encourage the compiler to inline batch PTE logic and resolve
>>> constant branches by adding __always_inline strategically.
>>>
>>> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>>> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
>>> ---
>>>  mm/mprotect.c | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 2eaf862e5734..2fda26107066 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>>>  	return can_change_shared_pte_writable(vma, pte);
>>>  }
>>>  
>>> -static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>> +static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>  				    pte_t pte, int max_nr_ptes, fpb_t flags)
>>>  {
>>>  	/* No underlying folio, so cannot batch */
>>> @@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>>>  }
>>>  
>>>  /* Set nr_ptes number of ptes, starting from idx */
>>> -static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
>>> -		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
>>> -		int idx, bool set_write, struct mmu_gather *tlb)
>>> +static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
>>> +		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
>>> +		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
>>>  {
>>>  	/*
>>>  	 * Advance the position in the batch by idx; note that if idx > 0,
>>> @@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch(int start_idx, int max_len,
>>>   * pte of the batch. Therefore, we must individually check all pages and
>>>   * retrieve sub-batches.
>>>   */
>>> -static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>> +static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>  		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
>>>  		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
>>>  {
>>> @@ -177,6 +177,13 @@ static void commit_anon_folio_batch(struct vm_area_struct *vma,
>>>  	int sub_batch_idx = 0;
>>>  	int len;
>>>  
>>> +	/* Optimize for the common order-0 case. */
>>> +	if (likely(nr_ptes == 1)) {
>>> +		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
>>> +				       0, PageAnonExclusive(first_page), tlb);
>>
>> To optimize that one, inlining prot_commit_flush_ptes() would be
>> sufficient. Does inlining the other two really help? I don't think we
>> can optimize out loops etc. for them?
> 
> Well, I'm getting meaningful (smaller) wins from adding those
> __always_inline's. (and I also get a small win for __always_inline on
> set_write_prot_commit_flush_ptes, but I didn't realize that until now).
> 
>>
>> I would have thought that specializing on nr_ptes==0 on an even higher
>> level--where we call
>> set_write_prot_commit_flush_ptes/prot_commit_flush_ptes() would allow
>> for optimizing the loops entirely for nr_ptes==0?
> 
> That could also work, but then set_write_prot_commit_flush_ptes (holy cow
> what a long name) would definitely need inlining. And might be a little uglier
> overall.

Right. The idea is that you __always__inline any code that has PTE
loops, such that all loops for nr_pages == 1 gets optimized out.

We do that for zap and fork logic.

> 
> This is the part where having data points other than my giga-fast-giga-powerful
> zen5 could prove handy :/
I just recently lost access to my reliably, well tunes, system ...

Is it just the following benchmark?

	https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933

?


I can easily extending

	https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c

to have an "mprotect" mode. I had that in the past bit discarded it.

Then, we can easily measure the effect on various folio sizes when
mprotect'ing a larger memory area.

With order-0 we can then benchmark small folios exclusively.



-- 
Cheers,

David

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by Andrew Morton 1 day, 10 hours ago

On Mon, 30 Mar 2026 17:16:51 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:

> > That could also work, but then set_write_prot_commit_flush_ptes (holy cow
> > what a long name) would definitely need inlining. And might be a little uglier
> > overall.
> 
> Right. The idea is that you __always__inline any code that has PTE
> loops, such that all loops for nr_pages == 1 gets optimized out.
> 
> We do that for zap and fork logic.
> 
> > 
> > This is the part where having data points other than my giga-fast-giga-powerful
> > zen5 could prove handy :/
> I just recently lost access to my reliably, well tunes, system ...
> 
> Is it just the following benchmark?
> 
> 	https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933
> 
> ?
> 
> 
> I can easily extending
> 
> 	https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c
> 
> to have an "mprotect" mode. I had that in the past bit discarded it.
> 
> Then, we can easily measure the effect on various folio sizes when
> mprotect'ing a larger memory area.
> 
> With order-0 we can then benchmark small folios exclusively.

It sounds like this is all possible future work?

We have Lorenzo's R-b on this [2/2].  I'm reading this discussion as
"upstream both"?

--- a/mm/mprotect.c~mm-mprotect-special-case-small-folios-when-applying-write-permissions
+++ a/mm/mprotect.c
@@ -103,7 +103,7 @@ bool can_change_pte_writable(struct vm_a
 	return can_change_shared_pte_writable(vma, pte);
 }
 
-static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
 				    pte_t pte, int max_nr_ptes, fpb_t flags)
 {
 	/* No underlying folio, so cannot batch */
@@ -117,9 +117,9 @@ static int mprotect_folio_pte_batch(stru
 }
 
 /* Set nr_ptes number of ptes, starting from idx */
-static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr,
-		pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes,
-		int idx, bool set_write, struct mmu_gather *tlb)
+static __always_inline void prot_commit_flush_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, pte_t oldpte, pte_t ptent,
+		int nr_ptes, int idx, bool set_write, struct mmu_gather *tlb)
 {
 	/*
 	 * Advance the position in the batch by idx; note that if idx > 0,
@@ -169,7 +169,7 @@ static int page_anon_exclusive_sub_batch
  * pte of the batch. Therefore, we must individually check all pages and
  * retrieve sub-batches.
  */
-static void commit_anon_folio_batch(struct vm_area_struct *vma,
+static __always_inline void commit_anon_folio_batch(struct vm_area_struct *vma,
 		struct folio *folio, struct page *first_page, unsigned long addr, pte_t *ptep,
 		pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb)
 {
@@ -177,6 +177,13 @@ static void commit_anon_folio_batch(stru
 	int sub_batch_idx = 0;
 	int len;
 
+	/* Optimize for the common order-0 case. */
+	if (likely(nr_ptes == 1)) {
+		prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, 1,
+				       0, PageAnonExclusive(first_page), tlb);
+		return;
+	}
+
 	while (nr_ptes) {
 		expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx);
 		len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes,
_

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by Andrew Morton 1 day, 6 hours ago

On Wed, 1 Apr 2026 17:09:31 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> > I can easily extending
> > 
> > 	https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c
> > 
> > to have an "mprotect" mode. I had that in the past bit discarded it.
> > 
> > Then, we can easily measure the effect on various folio sizes when
> > mprotect'ing a larger memory area.
> > 
> > With order-0 we can then benchmark small folios exclusively.
> 
> It sounds like this is all possible future work?
> 
> We have Lorenzo's R-b on this [2/2].  I'm reading this discussion as
> "upstream both"?

Sorry, please ignore.  Reading skills.

Re: [PATCH v2 2/2] mm/mprotect: special-case small folios when applying write permissions

Posted by David Hildenbrand (Arm) 1 day, 3 hours ago

On 4/2/26 05:44, Andrew Morton wrote:
> On Wed, 1 Apr 2026 17:09:31 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
>>> I can easily extending
>>>
>>> 	https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c
>>>
>>> to have an "mprotect" mode. I had that in the past bit discarded it.
>>>
>>> Then, we can easily measure the effect on various folio sizes when
>>> mprotect'ing a larger memory area.
>>>
>>> With order-0 we can then benchmark small folios exclusively.
>>
>> It sounds like this is all possible future work?
>>
>> We have Lorenzo's R-b on this [2/2].  I'm reading this discussion as
>> "upstream both"?
> 
> Sorry, please ignore.  Reading skills.

Maybe writing skills on my end. I think we should do #2 properly.

#1 is good to go.

-- 
Cheers,

David