Hoist some important checks, which previously needed to be inlined, out to
the main loop and noinline the entirety of folio_pte_batch_flags(). The
loop itself is quite large and whether it is inlined or not should not
matter, as we are ideally dealing with larger orders.
Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---
mm/mprotect.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8d4fa38a8a26..aa845f5bf14d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
return can_change_shared_pte_writable(vma, pte);
}
-static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
+static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
pte_t pte, int max_nr_ptes, fpb_t flags)
{
- /* No underlying folio, so cannot batch */
- if (!folio)
- return 1;
-
- if (!folio_test_large(folio))
- return 1;
-
return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
}
@@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
continue;
}
- nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
+ /* No underlying folio (or not large), so cannot batch */
+ if (likely(!folio || !folio_test_large(folio)))
+ nr_ptes = 1;
+ else
+ nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
+ max_nr_ptes, flags);
oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
ptent = pte_modify(oldpte, newprot);
--
2.53.0
On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> Hoist some previous, important to be inlined checks to the main loop
> and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> quite large and whether it is inlined or not should not matter, as we
> are ideally dealing with larger orders.
>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
> mm/mprotect.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8d4fa38a8a26..aa845f5bf14d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> return can_change_shared_pte_writable(vma, pte);
> }
>
> -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
Hmm, I'm iffy about noinline-ing a 1-line function now?
And now you're noinlining something that contains an inline (but not
guaranteed to be inlined) function, it's a bit strange overall?
> pte_t pte, int max_nr_ptes, fpb_t flags)
> {
> - /* No underlying folio, so cannot batch */
> - if (!folio)
> - return 1;
> -
> - if (!folio_test_large(folio))
> - return 1;
> -
> return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> }
>
> @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> continue;
> }
>
^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
which now won't be doing:
if (!folio)
return 1;
if (!folio_test_large(folio))
return 1;
At all, so isn't that potentially a pessimisation in itself?
> - nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> + /* No underlying folio (or not large), so cannot batch */
> + if (likely(!folio || !folio_test_large(folio)))
Are we actually sure of this likely()? Is there data to support it? I'm not a
fan of using likely()/unlikely() without something to back it.
> + nr_ptes = 1;
> + else
> + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> + max_nr_ptes, flags);
It's also pretty gross to throw this out into a massive function.
>
> oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> ptent = pte_modify(oldpte, newprot);
> --
> 2.53.0
>
This all seems VERY delicate, and subject to somebody else coming along and
breaking it/causing some of these noinline/__always_inline invocations to make
things far worse.
I also reserve the right to seriously rework this pile of crap software.
I'd rather we try to find less fragile ways to optimise!
Maybe there's some steps that are bigger wins than others?
Thanks, Lorenzo
On Thu, Mar 19, 2026 at 07:14:38PM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> > Hoist some previous, important to be inlined checks to the main loop
> > and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> > quite large and whether it is inlined or not should not matter, as we
> > are ideally dealing with larger orders.
> >
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> > mm/mprotect.c | 16 +++++++---------
> > 1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 8d4fa38a8a26..aa845f5bf14d 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > return can_change_shared_pte_writable(vma, pte);
> > }
> >
> > -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
>
> Hmm I'm iffy about noinline-ing a 1 line function now?
Well, it's a 1 line function that itself (probably) has an inlined
folio_pte_batch_flags() with a bunch of inlined functions, calls, etc. But yes, I
see your point.
>
> And now yu're noinlining something that contains an inline (but not guaranteed
> to be function, it's a bit strange overall?
>
> > pte_t pte, int max_nr_ptes, fpb_t flags)
> > {
> > - /* No underlying folio, so cannot batch */
> > - if (!folio)
> > - return 1;
> > -
> > - if (!folio_test_large(folio))
> > - return 1;
> > -
> > return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> > }
> >
> > @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> > continue;
> > }
> >
>
> ^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
> which now won't be doing:
Yep, this is a bug (that sashiko also caught!)
>
> if (!folio)
> return 1;
>
> if (!folio_test_large(folio))
> return 1;
>
> At all, so isn't that potentially a pessimisation in itself?
>
>
> > - nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > + /* No underlying folio (or not large), so cannot batch */
> > + if (likely(!folio || !folio_test_large(folio)))
>
> We actually sure of this likely()? Is there data to support it? Am not a fan of
> using likely()/unlikey() without something to back it.
I did a small test yesterday with bpftrace + a kernel build +
filemap_get_folio. I got a staggering majority of order-0 folios, with some
smaller folios spread out across a bunch of orders (up to 8, iirc). Of course
I'm on x86 so I don't have mTHP, etc enabled, so those will all be order-0 or
PMD_ORDER folios (which we won't see here).
>
> > + nr_ptes = 1;
> > + else
> > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > + max_nr_ptes, flags);
>
> It's also pretty gross to throw this out into a massive function.
>
> >
> > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> > ptent = pte_modify(oldpte, newprot);
> > --
> > 2.53.0
> >
>
> This is all seems VERY delicate, and subject to somebody else coming along and
> breaking it/causing some of these noinline/__always_inline invocations to make
> things far worse.
Who told you optimization isn't delicate?!
>
> I also reserve the right to seriously rework this pile of crap software.
>
> I'd rather we try to find less fragile ways to optimise!
But yes, I understand your point. Hm. I'll need to think about this some more.
There's nothing I would love more than to simply slap a
if (pte_batch_hint(ptep, pte) == 1)
nr_ptes = 1;
on !contpte archs. I don't see where most of the wins for those architectures
would exist, but apparently they do. Confusing :/
>
> Maybe there's some steps that are bigger wins than others?
Definitely :) So yeah I'll probably be dropping this patch, or at least
reworking this one a good bit.
--
Pedro
On Fri, Mar 20, 2026 at 10:34:29AM +0000, Pedro Falcato wrote:
> On Thu, Mar 19, 2026 at 07:14:38PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 06:31:07PM +0000, Pedro Falcato wrote:
> > > Hoist some previous, important to be inlined checks to the main loop
> > > and noinline the entirety of folio_pte_batch_flags(). The loop itself is
> > > quite large and whether it is inlined or not should not matter, as we
> > > are ideally dealing with larger orders.
> > >
> > > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > > ---
> > > mm/mprotect.c | 16 +++++++---------
> > > 1 file changed, 7 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > > index 8d4fa38a8a26..aa845f5bf14d 100644
> > > --- a/mm/mprotect.c
> > > +++ b/mm/mprotect.c
> > > @@ -103,16 +103,9 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> > > return can_change_shared_pte_writable(vma, pte);
> > > }
> > >
> > > -static __always_inline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> > > +static noinline int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep,
> >
> > Hmm I'm iffy about noinline-ing a 1 line function now?
>
> Well, it's a 1 line function that itself (probably) has an inlined
> folio_pte_batch_flags() with a bunch of inlined functions, calls, etc. But yes, I
> see your point.
I mean again the weird thing is folio_pte_batch_flags() being inline in a header
file in the first place. It even says:
* This function will be inlined to optimize based on the input parameters;
* consider using folio_pte_batch() instead if applicable.
In the comment.
noinline'ing to not inline an inline function that explicitly says it's inline
for performance seems... silly.
>
> >
> > And now yu're noinlining something that contains an inline (but not guaranteed
> > to be function, it's a bit strange overall?
> >
> > > pte_t pte, int max_nr_ptes, fpb_t flags)
> > > {
> > > - /* No underlying folio, so cannot batch */
> > > - if (!folio)
> > > - return 1;
> > > -
> > > - if (!folio_test_large(folio))
> > > - return 1;
> > > -
> > > return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags);
> > > }
> > >
> > > @@ -333,7 +326,12 @@ static long change_pte_range(struct mmu_gather *tlb,
> > > continue;
> > > }
> > >
> >
> > ^^^ up here is a mprotect_folio_pte_batch() invocation for the prot_numa case,
> > which now won't be doing:
>
> Yep, this is a bug (that sashiko also caught!)
>
> >
> > if (!folio)
> > return 1;
> >
> > if (!folio_test_large(folio))
> > return 1;
> >
> > At all, so isn't that potentially a pessimisation in itself?
> >
> >
> > > - nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags);
> > > + /* No underlying folio (or not large), so cannot batch */
> > > + if (likely(!folio || !folio_test_large(folio)))
> >
> > We actually sure of this likely()? Is there data to support it? Am not a fan of
> > using likely()/unlikey() without something to back it.
>
> I did a small little test yesterday with bpftrace + a kernel build +
> filemap_get_folio. I got a staggering majority of order-0 folios, with some
> smaller folios spread out across a bunch of orders (up to 8, iirc). Of course
> I'm on x86 so I don't have mTHP, etc enabled, so those will all be order-0 or
> PMD_ORDER folios (which we won't see here).
>
> >
> > > + nr_ptes = 1;
> > > + else
> > > + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > + max_nr_ptes, flags);
> >
> > It's also pretty gross to throw this out into a massive function.
> >
> > >
> > > oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes);
> > > ptent = pte_modify(oldpte, newprot);
> > > --
> > > 2.53.0
> > >
> >
> > This is all seems VERY delicate, and subject to somebody else coming along and
> > breaking it/causing some of these noinline/__always_inline invocations to make
> > things far worse.
>
> Who told you optimization isn't delicate?!
I don't want to accept changes that will randomly break in future, it's fairly
pointless, because I guarantee somebody will break things at some point if
they're fragile.
And we _really_ do not want or need to have functions where a bot will shout at
you and you spend X hours figuring out why you regressed a
microbenchmark that does something nobody cares about.
BUT, if we can _abstract_ the bigger wins by rethinking batching somewhat then
we could achieve something here, and that'd definitely be worthwhile.
I think that really the batching implementation is a bit of a mess because it
was done without fixing any existing technical debt first.
We need to stop accepting series that pile more code on top of existing code
that is already a mess like that.
>
> >
> > I also reserve the right to seriously rework this pile of crap software.
> >
> > I'd rather we try to find less fragile ways to optimise!
>
> But yes, I understand your point. Hm. I'll need to think about this some more.
> There's nothing I would love more than to simply slap a
>
> if (pte_batch_hint(ptep, pte) == 1)
> nr_ptes = 1;
>
> on !contpte archs. I don't see where most of the wins for those architectures
> would exist, but apparently they do. Confusing :/
>
> >
> > Maybe there's some steps that are bigger wins than others?
>
> Definitely :) So yeah I'll probably be dropping this patch, or at least
> reworking this one a good bit.
Thanks, sorry to be negative - I appreciate the detailed perf analysis, but I
want to make sure we do this in such a way that future Lorenzo and future Pedro
and future David and future whoever can remember WHY we did X and NOT to do Y
that breaks it :)
So that means abstracting it in a way where that's made easy for us (abstraction
stays stable and handles the perf details for us which are documented in e.g. a
comment, callers call, everybody's happy-ish).
>
> --
> Pedro
Thanks, Lorenzo
> This is all seems VERY delicate, and subject to somebody else coming along and
> breaking it/causing some of these noinline/__always_inline invocations to make
> things far worse.
>
> I also reserve the right to seriously rework this pile of crap software.
>
> I'd rather we try to find less fragile ways to optimise!
>
> Maybe there's some steps that are bigger wins than others?

What we can do is, collect similar folio_pte_batch_*() variants and
centralize them in mm/utils.c.

For

	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
					   max_nr_ptes, /* flags = */ 0)

We might just be able to use folio_pte_batch()?

For the other variant (soft-dirty+write) we'd have to create a helper like

	folio_pte_batch_sd_w() [better name suggestion welcome]

That will reduce the code footprint overall I guess.

--
Cheers,

David
On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> > This is all seems VERY delicate, and subject to somebody else coming along and
> > breaking it/causing some of these noinline/__always_inline invocations to make
> > things far worse.
> >
> > I also reserve the right to seriously rework this pile of crap software.
> >
> > I'd rather we try to find less fragile ways to optimise!
> >
> > Maybe there's some steps that are bigger wins than others?
>
> What we can do is, collect similar folio_pte_batch_*() variants and
> centralize them in mm/utils.c.

I'm not sure that addresses any of the comments above?

Putting logic specific to components of mm away from where those components
are and into mm/util.c seems like a complete regression in terms of
fragility and code separation.

And for what reason would you want to do that? To force a noinline of an
inline and people 'just have to know' that's why you randomly separated the
two?

Doesn't sound appealing overall.

I'd rather we find a way to implement the batching such that it doesn't
exhibit bad inlining decisions in the first place.

I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
only there is already silly, we should have a _general_ function that does
optimisations like that.

Isn't the issue rather that folio_pte_batch_flags() shouldn't be an inline
function in internal.h but rather a function in mm/util.c?

> For
>
> 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> 					   max_nr_ptes, /* flags = */ 0)
>
> We might just be able to use folio_pte_batch()?

Yup.

> For the other variant (soft-dirty+write) we'd have to create a helper like
>
> 	folio_pte_batch_sd_w() [better name suggestion welcome]
>
> That will reduce the code footprint overall I guess.

I mean yeah that's a terrible name so obviously it'd have to be something
better.

But again, this seems pretty stupid, now we're writing a bunch of duplicate
per-case code to force noinline because we made the central function inline
no?

That's also super fragile, because an engineer might later decide that
pattern is horrible and fix it, and we regress this.

But I mean overall, is the perf here really all that important? Are people
really that dependent on mprotect() et al. performing brilliantly fast?

Couldn't we do this with any mm interface and end up making efforts that
degrade code quality, increase fragility for dubious benefit?

> --
> Cheers,
>
> David

Thanks, Lorenzo
On 3/20/26 11:36, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
>>> This is all seems VERY delicate, and subject to somebody else coming along and
>>> breaking it/causing some of these noinline/__always_inline invocations to make
>>> things far worse.
>>>
>>> I also reserve the right to seriously rework this pile of crap software.
>>>
>>> I'd rather we try to find less fragile ways to optimise!
>>>
>>> Maybe there's some steps that are bigger wins than others?
>>
>> What we can do is, collect similar folio_pte_batch_*() variants and
>> centralize them in mm/utils.c.
>
> I'm not sure that addresses any of the comments above?

Well, the point is that you still end up with an optimized variant
regarding flags, and you don't have to play games in this file with
"inline".

> Putting logic specific to components of mm away from where those components
> are and into mm/util.c seems like a complete regression in terms of
> fragility and code separation.

My point is that in mm/rmap.c we have another caller that passes
FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY, that could use the same
optimized function.

For folio_pte_batch_flags() we primarily care about propagating the
flags such that __pte_batch_clear_ignored() will become as short and
small as possible.

How much that matters in practice? For fork() and unmap() it was
measurable. So I'd assume for mprotect() it would matter as well.

> And for what reason would you want to do that? To force a noinline of an
> inline and people 'just have to know' that's why you randomly separated the
> two?

Again, I don't want the noinline. But providing a single optimized
instance for two users that care about the same flags makes perfect
sense to me.

> Doesn't sound appealing overall.
>
> I'd rather we find a way to implement the batching such that it doesn't
> exhibit bad inlining decisions in the first place.
>
> I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> only there is already silly, we should have a _general_ function that does
> optimisations like that.
>
> Isn't the issue rather that folio_pte_batch_flags() shouldn't be an inline
> function in internal.h but rather a function in mm/util.c?

No, we want specialized code for fork and zapping. That's the whole
purpose: optimize out the flag checks in the loop.

>> For
>>
>> 	nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
>> 					   max_nr_ptes, /* flags = */ 0)
>>
>> We might just be able to use folio_pte_batch()?
>
> Yup.
>
>> For the other variant (soft-dirty+write) we'd have to create a helper like
>>
>> 	folio_pte_batch_sd_w() [better name suggestion welcome]
>>
>> That will reduce the code footprint overall I guess.
>
> I mean yeah that's a terrible name so obviously it'd have to be something
> better.
>
> But again, this seems pretty stupid, now we're writing a bunch of duplicate
> per-case code to force noinline because we made the central function inline
> no?
>
> That's also super fragile, because an engineer might later decide that
> pattern is horrible and fix it, and we regress this.
>
> But I mean overall, is the perf here really all that important? Are people
> really that dependent on mprotect() et al. performing brilliantly fast?

For basic primitives like mprotect/zap/fork I think yes. For other stuff
like rmap.c, maybe not.

> Couldn't we do this with any mm interface and end up making efforts that
> degrade code quality, increase fragility for dubious benefit?

I don't see a big issue here like you do.

--
Cheers,

David
On Fri, Mar 20, 2026 at 12:01:09PM +0100, David Hildenbrand (Arm) wrote:
> On 3/20/26 11:36, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> >>
> >>>
> >>> This is all seems VERY delicate, and subject to somebody else coming along and
> >>> breaking it/causing some of these noinline/__always_inline invocations to make
> >>> things far worse.
> >>>
> >>> I also reserve the right to seriously rework this pile of crap software.
> >>>
> >>> I'd rather we try to find less fragile ways to optimise!
> >>>
> >>> Maybe there's some steps that are bigger wins than others?
> >>
> >> What we can do is, collect similar folio_pte_batch_*() variants and
> >> centralize them in mm/utils.c.
> >
> > I'm not sure that addresses any of the comments above?
>
> Well, the point is that you still end up with an optimized variant
> regarding flags, and you don't have to play games in this file with
> "inline".
Right, I mean maybe we're talking at cross purposes here, if the variants are say
generalised specific sets of behaviour rather than 'mprotect folio pte batch' or
'mremap folio pte batch' or whatever then that's fine.
>
> >
> > Putting logic specific to components of mm away from where those components
> > are and into mm/util.c seems like a complete regression in terms of
> > fragility and code separation.
>
> My point is that in mm/rmap.c we have another caller that passes
> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY, that could use the same
> optimized function.
>
> For folio_pte_batch_flags() we primarily care about propagating the
> flags such that __pte_batch_clear_ignored() will become as short and
> small as possible.
>
> How much that matters in practice? For fork() and unmap() it was
> measurable. So I'd assume for mprotect() it would matter as well.
Yeah I think this was a misunderstanding, that's fine.
>
> >
> > And for what reason would you want to do that? To force a noinline of an
> > inline and people 'just have to know' that's why you randomly separated the
> > two?
>
> Again, I don't want the noinline. But providing a single optimized
> instance for two users that care about the same flags makes perfect
> sense to me.
Yup, that's fine! I agree with you now I realise this was a
misunderstanding :p
>
> >
> > Doesn't sound appealing overall.
> >
> > I'd rather we find a way to implement the batching such that it doesn't
> > exhibit bad inlining decisions in the first place.
> >
> > I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> > only there is already silly, we should have a _general_ function that does
> > optimisations like that.
> >
> > Isn't the issue rather than folio_pte_batch_flags() shouldn't be an inline
> > function in internal.h but rather a function in mm/util.c?
>
> No, we want specialized code for fork and zapping. That's the whole
> purpose: optimize out the flag checks in the loop.
Yeah, I think something isn't quite working right there though if we're
seeing measurable improvements by doing:
	static noinline unsigned int blahdy_blah()
	{
		...
		return folio_pte_batch_flags(...);
	}
But we need to figure out exactly what via actual perf analysis, as perf is
an area that totally confounds developer expectation and 'going with your
gut' is a perfect way of doing completely the wrong thing.
(And I've repeatedly seen otherwise smart developers do this OVER and OVER
again, it's like a honey trap for the clever).
Not saying any of that applies to you or Pedro :) just a general point.
I don't _love_ that function being inline like that, but I will cede to the
data, but right now it seems, at least for I guess order-0 folios that it's
not quite doing what we want.
Anyway we're agreed the best way forwards here, for now, is to specialise
order-0 which just seems like an all-round win, and maybe the rest is not
as much of a contributor?
>
> >
> >>
> >> For
> >>
> >> nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> >> max_nr_ptes, /* flags = */ 0)
> >>
> >> We might just be able to use folio_pte_batch()?
> >
> > Yup.
> >
> >>
> >> For the other variant (soft-dirt+write) we'd have to create a helper like
> >>
> >> folio_pte_batch_sd_w() [better name suggestion welcome]
> >>
> >> That will reduce the code footprint overall I guess.
> >
> > I mean yeah that's a terrible name so obviously it'd have to be something
> > better.
> >
> > But again, this seems pretty stupid, now we're writing a bunch of duplicate
> > per-case code to force noinline because we made the central function inline
> > no?
> >
> > That's also super fragile, because an engineer might later decide that
> > pattern is horrible and fix it, and we regress this.
> >
> > But I mean overall, is the perf here really all that important? Are people
> > really that dependent on mprotect() et al. performing brilliantly fast?
>
> For basic primitives like mprotect/zap/fork I think yes. For other stuff
> like rmap.c, maybe not.
Well on big ranges of mprotect() it could be, and I know often databases
like to do this kind of thing potentially, so yeah sure.
But more so the microbenchmark stuff of *a million mprotect() invocations*
is not something to optimise for so much.
Rather I'd say mprotect() over larger ranges is what we should look to.
Fork of course is utterly critical despite fork being both a terrible
mistake and a great idea at the exact same time in OS development history
:)
>
> >
> > Couldn't we do this with any mm interface and end up making efforts that
> > degrade code quality, increase fragility for dubious benefit?
>
> I don't see a big issue here like you do.
As I've said to Pedro elsewhere here, I guess my concern is nuanced:
So if we introduce stuff like carefully chosen __always_inline or noinline
or other things that have characteristics like:
- They're beneficial for the code AS-IS.
- They're based on compiler codegen that can easily be altered by other
changes.
- It is not obvious how other changes to the code might break them.
We are asking for trouble - because people WILL change that code and WILL
break that, OR a possibly worse outcome - something like a noinline sticks
around when it makes sense, but everybody's scared to remove it + _doesn't
know why it's there_ - so it becomes a part of 'oh yeah we don't touch
that' lore that exists for a lot of 'weird' stuff in the kernel.
Then it might end up actually _worsening_ the performance in future
accidentally because nobody dare touch it.
Or another hellish future is one in which such things cause bot perf
regression reports for otherwise fine series, on microoptimisations we're
not even clear matter, and cause developers to have to spend hours figuring
out how to avoid them, meanwhile potentially making it even more difficult
to understand why the code is the way it is.
So what is the solution?
1. Focus on the changes that are NOT brittle like this, e.g. special casing
order-0 is fine, adding profile/benchmark-proven likely()/unlikely(),
etc. - these are not things that have the above characteristics and are
just wins.
2. For cases where things MIGHT have the characteristics listed above,
avoid the issue by abstracting it as much as possible, adding lengthy
comments and making it as hard as possible to screw it up/misunderstand
it.
3. Often times perf issues coming up might be an indication that the
underlying mechanism is itself not well abstracted/already adding
unnecessary complexity that manifests in perf issues, so in that case -
rework first.
mm/mprotect.c is a good example of case 3 I think, in that it's a big ball
of mud with overly long functions (e.g. change_pte_range(),
do_mprotect_pkey()), where refactoring would likely see perf gains
just from the compiler not having to heuristically determine inlining/other
optimisations due to functions being in smaller chunks.
Anyway in this case, we can pull out an example of approach 3 (the softleaf
specialisation), and approach 1 (order-0) handling easily, and defer
considering taking approach 2 if it makes sense to later, if we get most of
the wins by doing the former 2 things.
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
>>> I mean yeah that's a terrible name so obviously it'd have to be something
>>> better.
>>>
>>> But again, this seems pretty stupid, now we're writing a bunch of duplicate
>>> per-case code to force noinline because we made the central function inline
>>> no?
>>>
>>> That's also super fragile, because an engineer might later decide that
>>> pattern is horrible and fix it, and we regress this.
>>>
>>> But I mean overall, is the perf here really all that important? Are people
>>> really that dependent on mprotect() et al. performing brilliantly fast?
>>
>> For basic primitives like mprotect/zap/fork I think yes. For other stuff
>> like rmap.c, maybe not.
>
> Well on big ranges of mprotect() it could be, and I know often databases
> like to do this kind of thing potentially, so yeah sure.
>
> But more so the microbenchmark stuff of *a million mprotect() invocations*
> is not something to optimise for so much.
>
> Rather I'd say mprotect() over larger ranges is what we should look to.

I tend to agree (and I think I made a similar point in previous discussions
around mprotect() performance).

There is the use case for userspace jits etc to call mprotect() on
individual pages. I suspected that TLB flushing and syscall overhead would
overshadow most micro-optimizations. :)

[...]

> As I've said to Pedro elsewhere here, I guess my concern is nuanced:
>
> So if we introduce stuff like carefully chosen __always_inline or noinline
> or other things that have characteristics like:
>
> - They're beneficial for the code AS-IS.
> - They're based on compiler codegen that can easily be altered by other
>   changes.
> - It is not obvious how other changes to the code might break them.
>
> We are asking for trouble - because people WILL change that code and WILL
> break that, OR a possibly worse outcome - something like a noinline sticks
> around when it makes sense, but everybody's scared to remove it + _doesn't
> know why it's there_ - so it becomes a part of 'oh yeah we don't touch
> that' lore that exists for a lot of 'weird' stuff in the kernel.
>
> Then it might end up actually _worsening_ the performance in future
> accidentally because nobody dare touch it.
>
> Or another hellish future is one in which such things cause bot perf
> regression reports for otherwise fine series, on microoptimisations we're
> not even clear matter, and cause developers to have to spend hours figuring
> out how to avoid them, meanwhile potentially making it even more difficult
> to understand why the code is the way it is.
>
> So what is the solution?
>
> 1. Focus on the changes that are NOT brittle like this, e.g. special casing
>    order-0 is fine, adding profile/benchmark-proven likely()/unlikely(),
>    etc. - these are not things that have the above characteristics and are
>    just wins.

Agreed.

> 2. For cases where things MIGHT have the characteristics listed above,
>    avoid the issue by abstracting it as much as possible, adding lengthy
>    comments and making it as hard as possible to screw it up/misunderstand
>    it.

Agreed.

> 3. Often times perf issues coming up might be an indication that the
>    underlying mechanism is itself not well abstracted/already adding
>    unnecessary complexity that manifests in perf issues, so in that case -
>    rework first.

Agreed. I think the usage of noinline for micro-performance optimization is
really questionable and should be avoided at all costs.

The folio_pte_batch() stuff likely really should just be a set of mm/util.c
helpers that specialize on the flags only to make the inner loop as
efficient as possible.

--
Cheers,

David
On Fri, Mar 20, 2026 at 10:36:59AM +0000, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> >
> > >
> > > This all seems VERY delicate, and subject to somebody else coming along and
> > > breaking it/causing some of these noinline/__always_inline invocations to make
> > > things far worse.
> > >
> > > I also reserve the right to seriously rework this pile of crap software.
> > >
> > > I'd rather we try to find less fragile ways to optimise!
> > >
> > > Maybe there's some steps that are bigger wins than others?
> >
> > What we can do is, collect similar folio_pte_batch_*() variants and
> > centralize them in mm/util.c.
>
> I'm not sure that addresses any of the comments above?
>
> Putting logic specific to components of mm away from where those components
> are and into mm/util.c seems like a complete regression in terms of
> fragility and code separation.
>
> And for what reason would you want to do that? To force a noinline of an
> inline and people 'just have to know' that's why you randomly separated the
> two?
>
> Doesn't sound appealing overall.
>
> I'd rather we find a way to implement the batching such that it doesn't
> exhibit bad inlining decisions in the first place.
Yes, you make a good point. At the end of the day (taking change_protection()
as the example at hand here), after these changes:
change_protection()
  loop over p4ds
    loop over puds
      loop over pmds
        loop over ptes
          nr_ptes = loop over ptes and find out how many we have
          if (making write) {
            loop over nr_ptes
              loop over ptes and find out how many are anonexclusive or not, in a row
              loop over ptes and set them
          } else {
            loop over ptes and set them
          }
which the compiler FWIW tries to inline it all into one function, but then
does a poor job at figuring things out. And the CPU gets confused too. It
was frankly shocking how much performance I could squeeze out of a
if (nr_ptes == 1) {
        /* don't bother with the loops and the funny logic */
}
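[Editorial sketch: the effect of the hoist in the patch at the top of this
thread can be mocked in plain userspace C. The struct and both helpers below
are heavily simplified stand-ins - the real folio_pte_batch_flags() takes more
arguments and compares actual PTEs - but the control flow mirrors the patch:
the cheap order-0 checks stay inline in the caller so the common case never
pays for the out-of-line batching loop.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock of struct folio: just enough state for the two checks. */
struct folio {
	bool large;
	int nr_pages;
};

static bool folio_test_large(const struct folio *folio)
{
	return folio->large;
}

/* Stand-in for the large, deliberately out-of-line batching loop. */
static __attribute__((noinline)) int
folio_pte_batch_flags(const struct folio *folio, int max_nr_ptes)
{
	int nr = folio->nr_pages;

	return nr < max_nr_ptes ? nr : max_nr_ptes;
}

/*
 * The hot-loop pattern from the patch: no folio, or an order-0 folio,
 * means a batch of exactly one - decided inline, without a call.
 */
static int nr_ptes_for(const struct folio *folio, int max_nr_ptes)
{
	/* No underlying folio (or not large), so cannot batch. */
	if (!folio || !folio_test_large(folio))
		return 1;
	return folio_pte_batch_flags(folio, max_nr_ptes);
}
```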
I would not be surprised if the other syscalls have similar problems.
>
> I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> only there is already silly, we should have a _general_ function that does
> optimisations like that.
>
> Isn't the issue rather that folio_pte_batch_flags() shouldn't be an inline
> function in internal.h but rather a function in mm/util.c?
>
> >
> > For
> >
> > nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > max_nr_ptes, /* flags = */ 0)
> >
> > We might just be able to use folio_pte_batch()?
>
> Yup.
>
> >
> > For the other variant (soft-dirt+write) we'd have to create a helper like
> >
> > folio_pte_batch_sd_w() [better name suggestion welcome]
> >
> > That will reduce the code footprint overall I guess.
>
> I mean yeah that's a terrible name so obviously it'd have to be something
> better.
>
> But again, this seems pretty stupid, now we're writing a bunch of duplicate
> per-case code to force noinline because we made the central function inline
> no?
Yeah.
>
> That's also super fragile, because an engineer might later decide that
> pattern is horrible and fix it, and we regress this.
>
> But I mean overall, is the perf here really all that important? Are people
> really that dependent on mprotect() et al. performing brilliantly fast?
Obviously no one truly depends on mprotect, but I believe in fast primitives :)
> Couldn't we do this with any mm interface and end up making efforts that
> degrade code quality, increase fragility for dubious benefit?
Yes, which is why I don't want to degrade code quality :) It would be ideal to
find something that works both ways. Per my description of the existing code
above, you can tell that it's neither fast, nor beautiful :p
--
Pedro
On Fri, Mar 20, 2026 at 10:59:55AM +0000, Pedro Falcato wrote:
> On Fri, Mar 20, 2026 at 10:36:59AM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Mar 19, 2026 at 10:41:54PM +0100, David Hildenbrand (Arm) wrote:
> > >
> > > >
> > > > This all seems VERY delicate, and subject to somebody else coming along and
> > > > breaking it/causing some of these noinline/__always_inline invocations to make
> > > > things far worse.
> > > >
> > > > I also reserve the right to seriously rework this pile of crap software.
> > > >
> > > > I'd rather we try to find less fragile ways to optimise!
> > > >
> > > > Maybe there's some steps that are bigger wins than others?
> > >
> > > What we can do is, collect similar folio_pte_batch_*() variants and
> > > centralize them in mm/util.c.
> >
> > I'm not sure that addresses any of the comments above?
> >
> > Putting logic specific to components of mm away from where those components
> > are and into mm/util.c seems like a complete regression in terms of
> > fragility and code separation.
> >
> > And for what reason would you want to do that? To force a noinline of an
> > inline and people 'just have to know' that's why you randomly separated the
> > two?
> >
> > Doesn't sound appealing overall.
> >
> > I'd rather we find a way to implement the batching such that it doesn't
> > exhibit bad inlining decisions in the first place.
>
> Yes, you make a good point. At the end of the day (taking change_protection()
> as the example at hand here), after these changes:
> change_protection()
>   loop over p4ds
>     loop over puds
>       loop over pmds
>         loop over ptes
>           nr_ptes = loop over ptes and find out how many we have
>           if (making write) {
>             loop over nr_ptes
>               loop over ptes and find out how many are anonexclusive or not, in a row
>               loop over ptes and set them
>           } else {
>             loop over ptes and set them
>           }
>
> which the compiler FWIW tries to inline it all into one function, but then
> does a poor job at figuring things out. And the CPU gets confused too. It
> was frankly shocking how much performance I could squeeze out of a
>
> if (nr_ptes == 1) {
>         /* don't bother with the loops and the funny logic */
> }
Let's maybe focus on generalising this then to start?
I think David + I are in agreement that these are obvious (TM) improvements. But
let's see if we can generalise them?
>
> I would not be surprised if the other syscalls have similar problems.
>
> >
> > I mean mprotect_folio_batch() having !folio, !folio_test_large() checks
> > only there is already silly, we should have a _general_ function that does
> > optimisations like that.
> >
> > Isn't the issue rather that folio_pte_batch_flags() shouldn't be an inline
> > function in internal.h but rather a function in mm/util.c?
> >
> > >
> > > For
> > >
> > > nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte,
> > > max_nr_ptes, /* flags = */ 0)
> > >
> > > We might just be able to use folio_pte_batch()?
> >
> > Yup.
> >
> > >
> > > For the other variant (soft-dirt+write) we'd have to create a helper like
> > >
> > > folio_pte_batch_sd_w() [better name suggestion welcome]
> > >
> > > That will reduce the code footprint overall I guess.
> >
> > I mean yeah that's a terrible name so obviously it'd have to be something
> > better.
> >
> > But again, this seems pretty stupid, now we're writing a bunch of duplicate
> > per-case code to force noinline because we made the central function inline
> > no?
>
> Yeah.
I guess we can discuss on other part of thread :)
>
> >
> > That's also super fragile, because an engineer might later decide that
> > pattern is horrible and fix it, and we regress this.
> >
> > But I mean overall, is the perf here really all that important? Are people
> > really that dependent on mprotect() et al. performing brilliantly fast?
>
> Obviously no one truly depends on mprotect, but I believe in fast primitives :)
>
> > Couldn't we do this with any mm interface and end up making efforts that
> > degrade code quality, increase fragility for dubious benefit?
>
> Yes, which is why I don't want to degrade code quality :) It would be ideal to
> find something that works both ways. Per my description of the existing code
> above, you can tell that it's neither fast, nor beautiful :p
I mean the code is terrible in most of these places (mremap() is much better
because I put a lot of time into de-insane-it-ising it), but it's less degrading
quality but rather:
static noinline void blahdy_blah(a trillion parameters)
{
... lots of code ...
}
A developer comes along, modifies code/params/etc. or uses the function in
another place and degrades perf somehow, and suddenly what made sense at point
X in time no longer makes sense at point Y, but developers don't understand it
so are scared to remove it, etc.
Which is why it's better to try to abstract as much as possible and have some
way of then putting the sensitive stuff in a specific place like:
/*
* Lorenzo-style overly long comment with lots of exposition, a beginning,
* middle and end, ASCII diagrams and all sorts...
*
* ...
*/
static noinline void __super_special_perf_bit_of_whatever_dont_touch_please(...)
{
}
That's wrapped up in some saner interface that people can use at will with the
perf-y component safely locked away in an insane asylum^W^W another compilation
unit or header.
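[Editorial sketch: the "saner interface wrapping the perf-y component" shape
being described might look like the following userspace toy. All names here
are hypothetical and the batching body is a one-liner stand-in; the idea is
just that the trivially-correct fast paths live in an inline wrapper, while
the fragile noinline core sits behind it with the warning comment attached.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct folio {
	bool large;
	int nr_pages;
};

static bool folio_test_large(const struct folio *folio)
{
	return folio->large;
}

/*
 * Perf-sensitive core, deliberately out of line: inlining this into every
 * caller was measured to hurt (see thread). Do not call this directly and
 * do not remove the noinline without re-running the benchmarks - use
 * folio_pte_batch_checked() below instead.
 */
static __attribute__((noinline)) int
__folio_pte_batch(const struct folio *folio, int max_nr_ptes)
{
	int nr = folio->nr_pages;

	return nr < max_nr_ptes ? nr : max_nr_ptes;
}

/*
 * The sane public interface: the order-0 fast path is encoded once, here,
 * instead of being duplicated (and eventually broken) at each call site.
 */
static inline int folio_pte_batch_checked(const struct folio *folio,
					  int max_nr_ptes)
{
	if (!folio || !folio_test_large(folio))
		return 1;
	return __folio_pte_batch(folio, max_nr_ptes);
}
```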
>
> --
> Pedro
Cheers, Lorenzo
>
> which the compiler FWIW tries to inline it all into one function, but then
> does a poor job at figuring things out. And the CPU gets confused too. It
> was frankly shocking how much performance I could squeeze out of a
>
> if (nr_ptes == 1) {
>         /* don't bother with the loops and the funny logic */
> }
Yes, same experience with fork/munmap.
--
Cheers,
David