quote:
When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
speed up when issuing readlink on /initrd.img on ext4.

Benchmark code at the bottom.

ext4 and tmpfs are patched, other filesystems can also get there with
some more work.

Arguably the current get_link API should be patched to let the fs return
the size, but that's not churn I'm interested in diving into.

On my v1 Jan remarked 1.5% is not a particularly high win, questioning
whether doing this makes sense. I noted the value is only this small
because of other slowdowns. To elaborate, here are highlights from
benching on Sapphire Rapids:

1. putname using atomics (over 3.5% on my profile)

Sounds like Al has plans to do something here(?), I'm not touching it if
it can be helped. The atomic definitely does not have to be there in the
common case.

2. kmem_cache_alloc_noprof/kmem_cache_free (over 7% combined)

They are both dog slow due to cmpxchg16b. A patchset was posted which
adds a different allocation/free fast path which should whack the
majority of the problem, see:
https://lore.kernel.org/linux-mm/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz/

If this lands I'll definitely try to make the pathname allocs use it,
which should drop about 5-6 percentage points on this sucker.

3. __legitimize_mnt issues a full fence (again over 3%)

As far as avoiding the fence is concerned, waiting on an RCU grace
period on unmount should do the trick. However, I found there is a bunch
of complexity there to sort out before doing this will be feasible
(notably there are multiple mounts freed in one go, this needs to be
batched). There may be other issues which I missed and which make this
not worth it, but the fence is definitely avoidable in principle and I
would be surprised if there was no way to sensibly get there. No ETA,
for now I'm just pointing this out.

There is also the idea to speculatively elide lockref, but when I tried
that last time I ran into significant LSM-related trouble.
All that aside, there is also quite a bit of branching and func calling
which does not need to be there (example: make_vfsuid()/make_vfsgid()
could be combined into one routine etc.). Ultimately there is
single-threaded perf left on the table in various spots.

v2:
- add a dedicated field, flag and a helper instead of using i_size
- constify i_link

v1 can be found here:
https://lore.kernel.org/linux-fsdevel/20241118085357.494178-1-mjguzik@gmail.com/

benchmark: plug into will-it-scale as tests/readlink1.c:

#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <string.h>

char *testcase_description = "readlink /initrd.img";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	char *tmplink = "/initrd.img";
	char buf[1024];

	while (1) {
		int error = readlink(tmplink, buf, sizeof(buf));
		assert(error > 0);
		(*iterations)++;
	}
}

Mateusz Guzik (3):
  vfs: support caching symlink lengths in inodes
  ext4: use inode_set_cached_link()
  tmpfs: use inode_set_cached_link()

 fs/ext4/inode.c                |  3 ++-
 fs/ext4/namei.c                |  4 +++-
 fs/namei.c                     | 34 +++++++++++++++++++--------------
 fs/proc/namespaces.c           |  2 +-
 include/linux/fs.h             | 12 ++++++++++--
 mm/shmem.c                     |  6 ++++--
 security/apparmor/apparmorfs.c |  2 +-
 7 files changed, 40 insertions(+), 23 deletions(-)

-- 
2.43.0
On Tue, Nov 19, 2024 at 10:45:52AM +0100, Mateusz Guzik wrote:
> quote:
> When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> speed up when issuing readlink on /initrd.img on ext4.
>
> Benchmark code at the bottom.
>
> ext4 and tmpfs are patched, other filesystems can also get there with
> some more work.
>
> Arguably the current get_link API should be patched to let the fs return
> the size, but that's not a churn I'm interested into diving in.
>
> On my v1 Jan remarked 1.5% is not a particularly high win questioning
> whether doing this makes sense. I noted the value is only this small
> because of other slowdowns.

The thing is that you're stealing one of the holes I just put into struct
inode a cycle ago or so. The general idea has been to shrink struct
inode if we can and I'm not sure that caching the link length is
actually worth losing that hole. Otherwise I wouldn't object.

> All that aside there is also quite a bit of branching and func calling
> which does not need to be there (example: make vfsuid/vfsgid, could be
> combined into one routine etc.).

They should probably also be made inline functions and likely/unlikely
sprinkled in there.
On Wed, Nov 20, 2024 at 11:33 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Nov 19, 2024 at 10:45:52AM +0100, Mateusz Guzik wrote:
> > quote:
> > When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> > speed up when issuing readlink on /initrd.img on ext4.
> >
> > Benchmark code at the bottom.
> >
> > ext4 and tmpfs are patched, other filesystems can also get there with
> > some more work.
> >
> > Arguably the current get_link API should be patched to let the fs return
> > the size, but that's not a churn I'm interested into diving in.
> >
> > On my v1 Jan remarked 1.5% is not a particularly high win questioning
> > whether doing this makes sense. I noted the value is only this small
> > because of other slowdowns.
>
> The thing is that you're stealing one of the holes I just put into struct
> inode a cycle ago or so. The general idea has been to shrink struct
> inode if we can and I'm not sure that caching the link length is
> actually worth losing that hole. Otherwise I wouldn't object.
>

Per the patch description this can be a union with something not used
for symlinks. I'll find a nice field.

> > All that aside there is also quite a bit of branching and func calling
> > which does not need to be there (example: make vfsuid/vfsgid, could be
> > combined into one routine etc.).
>
> They should probably also be made inline functions and likely/unlikely
> sprinkled in there.

someone(tm) should at least do a sweep through in-vfs code. For example
LOOKUP_IS_SCOPED is sometimes marked as unlikely and other times has no
annotations whatsoever, even though ultimately it all executes in the
same setting.

Interestingly even __read_seqcount_begin (used *twice* in path_init())
is missing one. I sent a patch to fix it a long time ago but the
recipient did not respond, maybe I should resend with more people in cc
(who though?), see:
https://lore.kernel.org/all/20230727180355.813995-1-mjguzik@gmail.com/

-- 
Mateusz Guzik <mjguzik gmail.com>
On Wed, Nov 20, 2024 at 11:42:33AM +0100, Mateusz Guzik wrote:
> On Wed, Nov 20, 2024 at 11:33 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Tue, Nov 19, 2024 at 10:45:52AM +0100, Mateusz Guzik wrote:
> > > quote:
> > > When utilized it dodges strlen() in vfs_readlink(), giving about 1.5%
> > > speed up when issuing readlink on /initrd.img on ext4.
> > >
> > > Benchmark code at the bottom.
> > >
> > > ext4 and tmpfs are patched, other filesystems can also get there with
> > > some more work.
> > >
> > > Arguably the current get_link API should be patched to let the fs return
> > > the size, but that's not a churn I'm interested into diving in.
> > >
> > > On my v1 Jan remarked 1.5% is not a particularly high win questioning
> > > whether doing this makes sense. I noted the value is only this small
> > > because of other slowdowns.
> >
> > The thing is that you're stealing one of the holes I just put into struct
> > inode a cycle ago or so. The general idea has been to shrink struct
> > inode if we can and I'm not sure that caching the link length is
> > actually worth losing that hole. Otherwise I wouldn't object.
> >
>
> Per the patch description this can be a union with something not used
> for symlinks. I'll find a nice field.

Ok!

> > > All that aside there is also quite a bit of branching and func calling
> > > which does not need to be there (example: make vfsuid/vfsgid, could be
> > > combined into one routine etc.).
> >
> > They should probably also be made inline functions and likely/unlikely
> > sprinkled in there.
>
> someone(tm) should at least do a sweep through in-vfs code. for

Yeah, in this case I was specifically talking about make_vfs{g,u}id().
They should be inlines and they should contain likely/unlikely.

> example LOOKUP_IS_SCOPED is sometimes marked as unlikely and other
> times has no annotations whatsoever, even though ultimately it all
> executes in the same setting
>
> Interestingly even __read_seqcount_begin (used *twice* in path_init())
> is missing one. I sent a patch to fix it long time ago but the
> recipient did not respond

I snatched it.
On Wed, Nov 20, 2024 at 12:13 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Nov 20, 2024 at 11:42:33AM +0100, Mateusz Guzik wrote:
> > Interestingly even __read_seqcount_begin (used *twice* in path_init())
> > is missing one. I sent a patch to fix it long time ago but the
> > recipient did not respond
>
> I snatched it.

Thanks.

But I have to say having *two* counters to check for each lookup is
bothering me and making me wonder if they could be unified (or another
counter added to cover for either of those?). No clue about feasibility,
is there a known showstopper?

Both are defined like so:
__cacheline_aligned_in_smp DEFINE_SEQLOCK(mount_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

Suppose nothing can be done to only look at one counter on lookup. In
that case how about combining the suckers into one cacheline at least?
Sure, this will result in new bounces for threads modifying these, but
this is relatively infrequent compared to how often lookups are
performed, and with these slapped together there will be only one line
spent on it, instead of two.

Just RFC'ing it here.

-- 
Mateusz Guzik <mjguzik gmail.com>
On Tue, Nov 19, 2024 at 10:45:52AM +0100, Mateusz Guzik wrote:
>
> On my v1 Jan remarked 1.5% is not a particularly high win questioning
> whether doing this makes sense. I noted the value is only this small
> because of other slowdowns.

Do you have a workload in mind which calls readlink() at sufficiently
high numbers such that this would be noticeable in
non-micro-benchmarks? What motivated you to do this work?

Thanks,

					- Ted
On Tue, Nov 19, 2024 at 6:53 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Nov 19, 2024 at 10:45:52AM +0100, Mateusz Guzik wrote:
> >
> > On my v1 Jan remarked 1.5% is not a particularly high win questioning
> > whether doing this makes sense. I noted the value is only this small
> > because of other slowdowns.
>
> Do you have a workload in mind which calls readlink() at sufficiently
> high numbers such that this would be noticeable in
> non-micro-benchmarks? What motivated you to do this work?
>

I'm just messing about here. Given the triviality of the patch I'm not
sure where the objection is coming from. I can point a finger at changes
made by other people for supposed perf gains which are virtually
guaranteed to be invisible in isolation, just like this one.

Anyhow, I landed here from strlen -- the sucker is operating one byte at
a time, which I was looking to sort out, but then I noticed that one of
the more commonly executing consumers does not even need to call it.

-- 
Mateusz Guzik <mjguzik gmail.com>