Otherwise it gets inlined, notably in walk_component(), which convinces
the compiler to push/pop additional registers in the fast path to
accommodate the existence of the inlined version.
Shortens the fast path of that routine from 87 to 71 bytes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
The intent is to get to a point where I can inline walk_component and
step_into. This is probably the last patch of the sort before I write a
patchset + provide bench results.
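
As an aside for readers (not part of the patch): the effect described above
can be reproduced with a standalone userspace sketch like the one below. All
names and the busywork in the helper are made up; the point is only that
keeping the cold helper out of line keeps its register needs out of the
caller's prologue/epilogue.

/*
 * Illustrative userspace analogue, not kernel code: a cold helper marked
 * noinline so its register pressure stays out of the caller's fast path.
 * All names and numbers here are made up.
 */
#include <stdio.h>

#define noinline __attribute__((noinline))

static noinline int slow_path(int key)
{
	/* Pretend this needs a lot of live state; if it were inlined, the
	 * caller would have to push/pop extra callee-saved registers even
	 * when only the fast path runs. */
	int acc = 0;

	for (int i = 0; i < key; i++)
		acc += i * key;
	return acc;
}

static int fast_path(int key, int cached)
{
	if (cached)			/* common case */
		return key;
	return slow_path(key);		/* rare case: out-of-line call */
}

int main(void)
{
	printf("%d\n", fast_path(42, 1));
	return 0;
}

Comparing the disassembly of fast_path() with and without the attribute
typically shows the same kind of prologue/spill difference as the byte
counts quoted above.
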
fs/namei.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/namei.c b/fs/namei.c
index 11295fcf877c..667360deef48 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1863,7 +1863,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
return dentry;
}
-static struct dentry *lookup_slow(const struct qstr *name,
+static noinline struct dentry *lookup_slow(const struct qstr *name,
struct dentry *dir,
unsigned int flags)
{
--
2.48.1
On Wed, 19 Nov 2025 15:49:30 +0100, Mateusz Guzik wrote:
> Otherwise it gets inlined, notably in walk_component(), which convinces
> the compiler to push/pop additional registers in the fast path to
> accommodate the existence of the inlined version.
>
> Shortens the fast path of that routine from 87 to 71 bytes.
>
>
> [...]
Applied to the vfs-6.19.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.19.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.19.misc
[1/1] fs: mark lookup_slow() as noinline
https://git.kernel.org/vfs/vfs/c/8d79ec9e7f63
On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> Otherwise it gets inlined, notably in walk_component(), which convinces
> the compiler to push/pop additional registers in the fast path to
> accommodate the existence of the inlined version.
>
> Shortens the fast path of that routine from 87 to 71 bytes.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
Fwiw, the biggest problem is that we need to end up with something
obvious so that we don't accidentally destroy any potential performance
gain in say 2 years because everyone forgot why we did things this way.
>
> The intent is to get to a point where I can inline walk_component and
> step_into. This is probably the last patch of the sort before I write a
> patchset + provide bench results.
>
> fs/namei.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 11295fcf877c..667360deef48 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1863,7 +1863,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
> return dentry;
> }
>
> -static struct dentry *lookup_slow(const struct qstr *name,
> +static noinline struct dentry *lookup_slow(const struct qstr *name,
> struct dentry *dir,
> unsigned int flags)
> {
> --
> 2.48.1
>
On Tue, Nov 25, 2025 at 10:05 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> > Otherwise it gets inlined, notably in walk_component(), which convinces
> > the compiler to push/pop additional registers in the fast path to
> > accommodate the existence of the inlined version.
> >
> > Shortens the fast path of that routine from 87 to 71 bytes.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
>
> Fwiw, the biggest problem is that we need to end up with something
> obvious so that we don't accidentally destroy any potential performance
> gain in say 2 years because everyone forgot why we did things this way.
>

I don't think there is a way around reviewing patches with an eye on
what they are doing to the fast path, on top of periodically checking
if whatever is considered the current compiler decided to start doing
something funky with it.

I'm going to save a rant about benchmarking changes like these in the
current kernel for another day.
On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> On Tue, Nov 25, 2025 at 10:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Wed, Nov 19, 2025 at 03:49:30PM +0100, Mateusz Guzik wrote:
> > > Otherwise it gets inlined, notably in walk_component(), which convinces
> > > the compiler to push/pop additional registers in the fast path to
> > > accommodate the existence of the inlined version.
> > >
> > > Shortens the fast path of that routine from 87 to 71 bytes.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > > ---
> >
> > Fwiw, the biggest problem is that we need to end up with something
> > obvious so that we don't accidentally destroy any potential performance
> > gain in say 2 years because everyone forgot why we did things this way.
> >
>
> I don't think there is a way around reviewing patches with an eye on
> what they are doing to the fast path, on top of periodically checking
> if whatever is considered the current compiler decided to start doing
> something funky with it.
>
> I'm going to save a rant about benchmarking changes like these in the
> current kernel for another day.

Without knocking any tooling that we currently have, I don't think we
have _meaningful_ performance testing - especially not automated.
On Wed, Nov 26, 2025 at 11:08 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Nov 25, 2025 at 10:54:25AM +0100, Mateusz Guzik wrote:
> > I'm going to save a rant about benchmarking changes like these in the
> > current kernel for another day.
>
> Without knocking any tooling that we currently have, I don't think we
> have _meaningful_ performance testing - especially not automated.

so *this* is the day? ;)

Even if one was to pretend for a minute that an excellent benchmark suite
for vfs exists and is being used here, it would still fail to spot
numerous pessimizations.

To give you an example, legitimize_mnt has the smp_mb fence which makes it
visible when profiling things like access(2) or stat(2) (something around
2% on my profiles). However, if one was to whack the fence just to check
if it is worth writing a real patch to do it, access(2) perf would
increase a little bit while stat(2) would remain virtually the same.

I am not speculating here, I did it. stat for me is just shy of 4 mln
ops/s. Patching the kernel with a tunable to optionally skip the smp_mb
fence pushes legitimize_mnt way down, while *not* increasing performance
-- the win is eaten by stalls elsewhere (perf *does* increase for
access(2), which is less shafted). This is why the path walking benches I
posted are all lifted from access() usage as opposed to stat, btw.

Or to put it differently, stat(2) is already gimped and you can keep
adding slowdowns without measurably altering anything, but that's only
because the CPU is already stalled big time while executing the codepath.

Part of the systemic problem is the pesky 'rep movsq/rep stosq' usage by
gcc, notably emitted for stat (see vfs_getattr_nosec). It is my
understanding that future versions of the compiler will fix it, but that's
still years of damage to stay even if someone updates the kernel in their
distro, so that's "nice". The good news is that clang does not do it, but
it also optimizes things differently in other ways, so it may not even be
representative of what people will see with gcc.

Rant on that front aside, I don't know what would encompass a good test
suite. I am however confident it would include real-life usage lifted from
actual workloads for microbenchmarking purposes, like for example I did
with access() vs gcc. A better-quality bench for path lookup would involve
all the syscalls invoked by gcc which do it, but per the above the current
state of the kernel would downplay improvements to next to nothing.

Inspired by this little thing: https://lkml.org/lkml/2015/5/19/1009
... I was screwing around with going through *all* vfs syscalls, ordered
in a way which provides the biggest data and instruction cache busting
potential. Non-vfs code is deliberately not called so as to not be shafted
by slowdowns elsewhere. It's not ready, but definitely worth exploring.

I know there are some big bench suites out there (AIM?) but they look
weirdly unmaintained and I never verified if they do what they claim. The
microbenchmarks like will-it-scale are missing syscall coverage (for
example: no readlink or symlink), the syscalls which are covered have
spotty usage (for example there is a bench for parallel rw open of a file,
while opening *ro* is more common and has different scalability), and even
ignoring that, all the lookups are done against /tmp/willitscale.XXXXXX.
That's not representative of most real lookups in that there are few path
components *and* one of them is unusually long.

And so on.
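
For context on the access(2)-based path-walk benches mentioned above, a
heavily simplified userspace sketch of such a loop is below. The target
path and iteration count are arbitrary placeholders; the benches referred
to above replay paths taken from real workloads (e.g. a gcc invocation)
rather than a single hardcoded one.

/*
 * Simplified sketch of an access(2) path-walk microbenchmark.
 * Path and iteration count are made-up placeholders.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/usr/include/stdio.h";	/* placeholder path */
	const long iters = 10 * 1000 * 1000;
	struct timespec start, end;
	double secs;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iters; i++)
		access(path, R_OK);			/* exercises path walk */
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%.0f ops/s\n", iters / secs);
	return 0;
}
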
That rant also aside:

1. Concerning legitimize_mnt: I strongly suspect the fence can be avoided
by guaranteeing that clearing ->mnt_ns waits for the RCU grace period
before issuing mntput. The question is how painful that is to implement.

2. Concerning stat: the current code boils down to going to statx and
telling it to not fill some of the fields, getting some fields stat is not
going to look at anyway, and finally converting the result to a
userspace-compatible layout. The last bit is universal across Unix kernels
afaics; curious how that happened. Anyway, my idea here is to instead
implement a ->stat inode op which would fill in 'struct stat' (not
kstat!), avoiding most of the current work. There is the obvious concern
of code duplication, which I think I can cover in an acceptable manner by
implementing generic helpers for the fields the filesystem does not want
to mess with on its own.

That legitimize_mnt thing has been annoying me for a long time now; I did
not post any patches as the namespace code is barely readable for me and
I'm trying to not dig into it.
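
To make item 2 above a bit more concrete, here is a purely hypothetical
userspace analogue of the shape being proposed: the per-filesystem hook
fills the final layout directly and a generic helper covers the fields the
filesystem does not care about. None of these names exist in the kernel;
this is only a sketch of the idea, not a proposed API.

/*
 * Hypothetical userspace analogue: fill the final stat layout directly
 * via a per-"filesystem" hook, with a generic helper for common fields,
 * instead of filling an intermediate struct and converting it.
 */
#include <stdio.h>
#include <string.h>

struct final_stat {		/* stands in for the userspace 'struct stat' */
	unsigned long ino;
	unsigned long size;
	unsigned int mode;
};

struct demo_inode {
	unsigned long i_ino;
	unsigned long i_size;
	unsigned int i_mode;
	/* per-fs hook filling the final layout directly */
	void (*stat)(const struct demo_inode *inode, struct final_stat *st);
};

/* generic helper for fields a filesystem does not want to handle itself */
static void generic_fill_stat(const struct demo_inode *inode,
			      struct final_stat *st)
{
	memset(st, 0, sizeof(*st));
	st->ino = inode->i_ino;
	st->mode = inode->i_mode;
}

static void demofs_stat(const struct demo_inode *inode, struct final_stat *st)
{
	generic_fill_stat(inode, st);
	st->size = inode->i_size;	/* fs-specific part */
}

int main(void)
{
	struct demo_inode inode = {
		.i_ino = 2, .i_size = 4096, .i_mode = 0755, .stat = demofs_stat,
	};
	struct final_stat st;

	inode.stat(&inode, &st);
	printf("ino=%lu size=%lu mode=%o\n", st.ino, st.size, st.mode);
	return 0;
}

The point of the shape above is to skip the intermediate kstat-style
struct and the conversion pass entirely.
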