Otherwise it gets inlined, notably in walk_component(), which convinces
the compiler to push/pop additional registers in the fast path to
accommodate the existence of the inlined version.
Shortens the fast path of that routine from 87 to 71 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
The intent is to get to a point where I can inline walk_component and
step_into. This is probably the last patch of the sort before I write a
patchset + provide bench results.
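
As an aside for readers (not part of the patch): the effect described above
can be reproduced with a standalone userspace sketch like the one below. All
names and the busywork in the helper are made up; the point is only that
keeping the cold helper out of line keeps its register needs out of the
caller's prologue/epilogue.

/*
 * Illustrative userspace analogue, not kernel code: a cold helper marked
 * noinline so its register pressure stays out of the caller's fast path.
 * All names and numbers here are made up.
 */
#include <stdio.h>

#define noinline __attribute__((noinline))

static noinline int slow_path(int key)
{
	/* Pretend this needs a lot of live state; if it were inlined, the
	 * caller would have to push/pop extra callee-saved registers even
	 * when only the fast path runs. */
	int acc = 0;

	for (int i = 0; i < key; i++)
		acc += i * key;
	return acc;
}

static int fast_path(int key, int cached)
{
	if (cached)			/* common case */
		return key;
	return slow_path(key);		/* rare case: out-of-line call */
}

int main(void)
{
	printf("%d\n", fast_path(42, 1));
	return 0;
}

Comparing the disassembly of fast_path() with and without the attribute
typically shows the same kind of prologue/spill difference as the byte
counts quoted above.
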
fs/namei.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 11295fcf877c..667360deef48 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1863,7 +1863,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
return dentry;
}
-static struct dentry *lookup_slow(const struct qstr *name,
+static noinline struct dentry *lookup_slow(const struct qstr *name,
struct dentry *dir,
unsigned int flags)
{
--
2.48.1
On Wed, 19 Nov 2025 15:49:30 +0100, Mateusz Guzik wrote:
> Otherwise it gets inlined, notably in walk_component(), which convinces
> the compiler to push/pop additional registers in the fast path to
> accommodate the existence of the inlined version.
>
> Shortens the fast path of that routine from 87 to 71 bytes.
>
>
> [...]
Applied to the vfs-6.19.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.misc
[1/1] fs: mark lookup_slow() as noinline
https://git.kernel.org/vfs/vfs/c/8d79ec9e7f63
On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> Otherwise it gets inlined, notably in walk_component(), which convinces
> the compiler to push/pop additional registers in the fast path to
> accommodate the existence of the inlined version.
>
> Shortens the fast path of that routine from 87 to 71 bytes.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
Fwiw, the biggest problem is that we need to end up with something
obvious so that we don't accidentally destroy any potential performance
gain in say 2 years because everyone forgot why we did things this way.
>
> The intent is to get to a point where I can inline walk_component and
> step_into. This is probably the last patch of the sort before I write a
> patchset + provide bench results.
>
> fs/namei.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 11295fcf877c..667360deef48 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1863,7 +1863,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
> return dentry;
> }
>
> -static struct dentry *lookup_slow(const struct qstr *name,
> +static noinline struct dentry *lookup_slow(const struct qstr *name,
> struct dentry *dir,
> unsigned int flags)
> {
> --
> 2.48.1
>
On Tue, Nov 25, 2025 at 10:05 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> > Otherwise it gets inlined, notably in walk_component(), which convinces
> > the compiler to push/pop additional registers in the fast path to
> > accommodate the existence of the inlined version.
> >
> > Shortens the fast path of that routine from 87 to 71 bytes.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
>
> Fwiw, the biggest problem is that we need to end up with something
> obvious so that we don't accidentally destroy any potential performance
> gain in say 2 years because everyone forgot why we did things this way.
>

I don't think there is a way around reviewing patches with an eye on
what they are doing to the fast path, on top of periodically checking
if whatever is considered the current compiler decided to start doing
something funky with it.

I'm going to save a rant about benchmarking changes like these in the
current kernel for another day.
On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> On Tue, Nov 25, 2025 at 10:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> > > Otherwise it gets inlined, notably in walk_component(), which convinces
> > > the compiler to push/pop additional registers in the fast path to
> > > accommodate the existence of the inlined version.
> > >
> > > Shortens the fast path of that routine from 87 to 71 bytes.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > ---
> >
> > Fwiw, the biggest problem is that we need to end up with something
> > obvious so that we don't accidentally destroy any potential performance
> > gain in say 2 years because everyone forgot why we did things this way.
> >
>
> I don't think there is a way around reviewing patches with an eye on
> what they are doing to the fast path, on top of periodically checking
> if whatever is considered the current compiler decided to start doing
> something funky with it.
>
> I'm going to save a rant about benchmarking changes like these in the
> current kernel for another day.

Without knocking any tooling that we currently have, I don't think we
have _meaningful_ performance testing - especially not automated.
On Wed, Nov 26, 2025 at 11:08 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> > I'm going to save a rant about benchmarking changes like these in the
> > current kernel for another day.
>
> Without knocking any tooling that we currently have, I don't think we
> have _meaningful_ performance testing - especially not automated.

so *this* is the day? ;)

Even if one was to pretend for a minute that an excellent benchmark suite
for vfs exists and is being used here, it would still fail to spot
numerous pessimizations.

To give you an example, legitimize_mnt has the smp_mb fence which makes it
visible when profiling things like access(2) or stat(2) (something around
2% on my profiles). However, if one was to whack the fence just to check
if it is worth writing a real patch to do it, access(2) perf would
increase a little bit while stat(2) would remain virtually the same.

I am not speculating here, I did it. stat for me is just shy of 4 mln
ops/s. Patching the kernel with a tunable to optionally skip the smp_mb
fence pushes legitimize_mnt way down, while *not* increasing performance
-- the win is eaten by stalls elsewhere (perf *does* increase for
access(2), which is less shafted). This is why the path walking benches I
posted are all lifted from access() usage as opposed to stat, btw.

Or to put it differently, stat(2) is already gimped and you can keep
adding slowdowns without measurably altering anything, but that's only
because the CPU is already stalled big time while executing the codepath.

Part of the systemic problem is the pesky 'rep movsq/rep stosq' usage by
gcc, notably emitted for stat (see vfs_getattr_nosec). It is my
understanding that future versions of the compiler will fix it, but that's
still years of damage to stay even if someone updates the kernel in their
distro, so that's "nice". The good news is that clang does not do it, but
it also optimizes things differently in other ways, so it may not even be
representative of what people will see with gcc.

Rant on that front aside, I don't know what would encompass a good test
suite. I am however confident it would include real-life usage lifted from
actual workloads for microbenchmarking purposes, like for example I did
with access() vs gcc. A better-quality bench for path lookup would involve
all the syscalls invoked by gcc which do it, but per the above the current
state of the kernel would downplay improvements to next to nothing.

Inspired by this little thing: https://lkml.org/lkml/2015/5/19/1009
... I was screwing around with going through *all* vfs syscalls, ordered
in a way which provides the biggest data and instruction cache busting
potential. Non-vfs code is deliberately not called so as to not be shafted
by slowdowns elsewhere. It's not ready, but definitely worth exploring.

I know there are some big bench suites out there (AIM?) but they look
weirdly unmaintained and I never verified if they do what they claim. The
microbenchmarks like will-it-scale are missing syscall coverage (for
example: no readlink or symlink), the syscalls which are covered have
spotty usage (for example there is a bench for parallel rw open of a file,
while opening *ro* is more common and has different scalability), and even
ignoring that, all the lookups are done against /tmp/willitscale.XXXXXX.
That's not representative of most real lookups in that there are few path
components *and* one of them is unusually long.

And so on.
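
For context on the access(2)-based path-walk benches mentioned above, a
heavily simplified userspace sketch of such a loop is below. The target
path and iteration count are arbitrary placeholders; the benches referred
to above replay paths taken from real workloads (e.g. a gcc invocation)
rather than a single hardcoded one.

/*
 * Simplified sketch of an access(2) path-walk microbenchmark.
 * Path and iteration count are made-up placeholders.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/usr/include/stdio.h";	/* placeholder path */
	const long iters = 10 * 1000 * 1000;
	struct timespec start, end;
	double secs;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iters; i++)
		access(path, R_OK);			/* exercises path walk */
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.0f ops/s\n", iters / secs);
	return 0;
}
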
That rant also aside:

1. Concerning legitimize_mnt: I strongly suspect the fence can be avoided
by guaranteeing that clearing ->mnt_ns waits for the RCU grace period
before issuing mntput. The question is how painful that is to implement.

2. Concerning stat: the current code boils down to going to statx and
telling it to not fill some of the fields, getting some fields stat is not
going to look at anyway, and finally converting the result to a
userspace-compatible layout. The last bit is universal across Unix kernels
afaics; curious how that happened. Anyway, my idea here is to instead
implement a ->stat inode op which would fill in 'struct stat' (not
kstat!), avoiding most of the current work. There is the obvious concern
of code duplication, which I think I can cover in an acceptable manner by
implementing generic helpers for the fields the filesystem does not want
to mess with on its own.

That legitimize_mnt thing has been annoying me for a long time now; I did
not post any patches as the namespace code is barely readable for me and
I'm trying to not dig into it.
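
To make item 2 above a bit more concrete, here is a purely hypothetical
userspace analogue of the shape being proposed: the per-filesystem hook
fills the final layout directly and a generic helper covers the fields the
filesystem does not care about. None of these names exist in the kernel;
this is only a sketch of the idea, not a proposed API.

/*
 * Hypothetical userspace analogue: fill the final stat layout directly
 * via a per-"filesystem" hook, with a generic helper for common fields,
 * instead of filling an intermediate struct and converting it.
 */
#include <stdio.h>
#include <string.h>

struct final_stat {		/* stands in for the userspace 'struct stat' */
	unsigned long ino;
	unsigned long size;
	unsigned int mode;
};

struct demo_inode {
	unsigned long i_ino;
	unsigned long i_size;
	unsigned int i_mode;
	/* per-fs hook filling the final layout directly */
	void (*stat)(const struct demo_inode *inode, struct final_stat *st);
};

/* generic helper for fields a filesystem does not want to handle itself */
static void generic_fill_stat(const struct demo_inode *inode,
			      struct final_stat *st)
{
	memset(st, 0, sizeof(*st));
	st->ino = inode->i_ino;
	st->mode = inode->i_mode;
}

static void demofs_stat(const struct demo_inode *inode, struct final_stat *st)
{
	generic_fill_stat(inode, st);
	st->size = inode->i_size;	/* fs-specific part */
}

int main(void)
{
	struct demo_inode inode = {
		.i_ino = 2, .i_size = 4096, .i_mode = 0755, .stat = demofs_stat,
	};
	struct final_stat st;

	inode.stat(&inode, &st);
	printf("ino=%lu size=%lu mode=%o\n", st.ino, st.size, st.mode);
	return 0;
}

The point of the shape above is to skip the intermediate kstat-style
struct and the conversion pass entirely.
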