[RFC PATCH 0/5] 9p: Performance improvements for build workloads
Posted by Remi Pommarel 1 month ago
This patchset introduces several performance optimizations for the 9p
filesystem when used with cache=loose option (exclusive or read only
mounts). These improvements particularly target workloads with frequent
lookups of non-existent paths and repeated symlink resolutions.

The very state-of-the-art benchmark of cloning a fresh hostap repository
and building hostapd and wpa_supplicant for the hwsim tests
(cd tests/hwsim; time ./build.sh) in a VM running on a 9pfs rootfs
(with trans=virtio,cache=loose options) was used to measure the impact
of these optimizations.

For reference, the build takes 0m56.492s natively on my laptop, while it
completes in 2m18.702s on the VM. This is a significant performance
penalty, considering that running the same build in a VM using a
virtiofs rootfs (with the "--cache always" virtiofsd option) takes
around 1m32.141s. This patchset aims to bring the 9pfs build time close
to that of virtiofs, rather than to the native host time, as a realistic
expectation.

The first two patches in this series focus on keeping negative dentries
in the cache, ensuring that subsequent lookups for paths known not to
exist do not require redundant 9P RPC calls. This optimization reduces
the time the compiler spends searching for header files across known
locations. These two patches introduce a new mount option, ndentrytmo,
which specifies the number of milliseconds to keep a negative dentry in
the cache. Using ndentrytmo=-1 (keeping negative dentries indefinitely)
brought the build time down to 1m46.198s.
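
To illustrate the idea, here is a minimal sketch (not the actual patch;
the helper name and the use of d_time as a timestamp are assumptions) of
how a cached negative dentry could be validated against the configured
timeout:

  #include <linux/dcache.h>
  #include <linux/jiffies.h>

  /*
   * Hypothetical helper: decide whether a cached negative dentry is
   * still usable.  Assumes d_time was set to jiffies when the dentry
   * turned negative and that tmo_ms comes from the ndentrytmo mount
   * option (-1 meaning "keep forever", 0 meaning "do not cache").
   */
  static bool v9fs_negative_dentry_valid(struct dentry *dentry, long tmo_ms)
  {
          if (d_really_is_positive(dentry))
                  return false;   /* not a negative dentry */
          if (tmo_ms < 0)
                  return true;    /* keep indefinitely */
          if (tmo_ms == 0)
                  return false;   /* negative caching disabled */
          return time_before(jiffies,
                             dentry->d_time + msecs_to_jiffies(tmo_ms));
  }

A d_revalidate implementation would then return 1 for a still-valid
negative dentry, so the -ENOENT is served from the dcache instead of a
new lookup RPC.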

The third patch extends page cache usage to symlinks by allowing
p9_client_readlink() results to be cached. Resolving symlinks apparently
happens quite frequently during the build process, and avoiding the cost
of a 9P RPC round trip for already-known symlinks helps reduce the build
time to 1m26.602s, outperforming the virtiofs setup.
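
For context, the usual way to get symlink targets into the page cache is
to point ->get_link at the generic page_get_link() helper and provide a
->read_folio that fills the folio from the server, here via
p9_client_readlink(). The sketch below only illustrates that pattern and
is not the actual patch; in particular the fid lookup helper is
hypothetical:

  #include <linux/fs.h>
  #include <linux/highmem.h>
  #include <linux/pagemap.h>
  #include <net/9p/client.h>

  /*
   * Hypothetical ->read_folio for 9p symlinks: fetch the target once,
   * after which page_get_link() serves readlink()/path resolution
   * straight from the page cache.
   */
  static int v9fs_symlink_read_folio(struct file *file, struct folio *folio)
  {
          struct inode *inode = folio->mapping->host;
          struct p9_fid *fid;
          char *target;
          int err;

          /* Hypothetical helper returning a usable fid for this inode. */
          fid = v9fs_fid_get_for_inode(inode);
          if (IS_ERR(fid)) {
                  err = PTR_ERR(fid);
                  goto out;
          }

          err = p9_client_readlink(fid, &target);
          p9_fid_put(fid);
          if (err)
                  goto out;

          /* Zero the folio so the copied target stays NUL-terminated. */
          folio_zero_range(folio, 0, folio_size(folio));
          memcpy_to_folio(folio, 0, target,
                          min_t(size_t, strlen(target),
                                folio_size(folio) - 1));
          folio_mark_uptodate(folio);
          kfree(target);
  out:
          folio_unlock(folio);
          return err;
  }

  static const struct address_space_operations v9fs_symlink_aops = {
          .read_folio = v9fs_symlink_read_folio,
  };

The symlink inodes would then set this as their i_mapping->a_ops and use
the generic page_get_link() as ->get_link, so only the first resolution
of a given symlink hits the server.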

The last two patches only serve to attribute the time spent waiting for
server responses during a 9P RPC call to I/O wait time in system
metrics.
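
Such a macro would presumably follow the existing io_wait_event() /
wait_event_killable() patterns in include/linux/wait.h: sleep in
TASK_KILLABLE state but go through io_schedule() so the time is
accounted as iowait. A plausible shape, which may differ from the actual
patch:

  #define __io_wait_event_killable(wq_head, condition)                   \
          ___wait_event(wq_head, condition, TASK_KILLABLE, 0, 0,         \
                        io_schedule())

  /*
   * Like wait_event_killable(), but the sleep is charged to iowait
   * (io_schedule() instead of schedule()).  Returns 0 once the
   * condition is true, -ERESTARTSYS if a fatal signal is received.
   */
  #define io_wait_event_killable(wq_head, condition)                     \
  ({                                                                     \
          int __ret = 0;                                                 \
          might_sleep();                                                 \
          if (!(condition))                                              \
                  __ret = __io_wait_event_killable(wq_head, condition);  \
          __ret;                                                         \
  })

net/9p/client.c would then presumably call this in place of
wait_event_killable() when waiting for the server's reply.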

Here is a summary of the different hostapd/wpa_supplicant build times:

  - Baseline (no patch): 2m18.702s
  - Negative dentry caching (patches 1-2): 1m46.198s (23% improvement)
  - Above + symlink caching (patches 1-3): 1m26.302s (an additional 18%
    improvement, 37% in total)

With this ~37% performance gain, 9pfs with cache=loose can compete with
virtiofs for (at least) this specific scenario. Although this benchmark
is not the most typical, I do think that these caching optimizations
could benefit a wide range of other workflows as well.

Further investigation may be needed to address the remaining gap with
native build performance. With the last two patches applied, it appears
there is still a fair amount of time spent waiting for I/O, though. This
could be related to the two systematic RPC calls made when opening a
file (one to clone the fid and another one to open the file). Maybe
reusing fids or opened files could reduce client/server transactions and
bring performance even closer to native levels? But those are just rough
thoughts I haven't dug into enough yet.

Any feedback on this approach would be welcome,

Thanks.

Best regards,

-- 
Remi

Remi Pommarel (5):
  9p: Cache negative dentries for lookup performance
  9p: Introduce option for negative dentry cache retention time
  9p: Enable symlink caching in page cache
  wait: Introduce io_wait_event_killable()
  9p: Track 9P RPC waiting time as IO

 fs/9p/fid.c            |  11 +++--
 fs/9p/v9fs.c           |  16 +++++-
 fs/9p/v9fs.h           |   3 ++
 fs/9p/v9fs_vfs.h       |  15 ++++++
 fs/9p/vfs_dentry.c     | 109 +++++++++++++++++++++++++++++++++++------
 fs/9p/vfs_inode.c      |  14 ++++--
 fs/9p/vfs_inode_dotl.c |  94 +++++++++++++++++++++++++++++++----
 include/linux/wait.h   |  15 ++++++
 net/9p/client.c        |   4 +-
 9 files changed, 244 insertions(+), 37 deletions(-)

-- 
2.50.1
Re: [RFC PATCH 0/5] 9p: Performance improvements for build workloads
Posted by Dominique Martinet 2 weeks, 4 days ago
Remi Pommarel wrote on Sun, Aug 31, 2025 at 09:03:38PM +0200:
> This patchset introduces several performance optimizations for the 9p
> filesystem when used with cache=loose option (exclusive or read only
> mounts). These improvements particularly target workloads with frequent
> lookups of non-existent paths and repeated symlink resolutions.

Sorry for slow reply, I think a negative cache and symlink cache make
sense.
I haven't tested these yet, and there's a conversion to the "new" mount
API that's brewing and will conflict with 2nd patch, but I'll be happy
to take these patches as time allows.
What was the reason this was sent as RFC, does something require more work?

I can't comment on io_wait_event_killable, it makes sense to me as well
but it's probably more appropriate to send through the scheduler tree.


> The third patch extends page cache usage to symlinks by allowing
> p9_client_readlink() results to be cached. Resolving symlinks apparently
> happens quite frequently during the build process, and avoiding the cost
> of a 9P RPC round trip for already-known symlinks helps reduce the build
> time to 1m26.602s, outperforming the virtiofs setup.

That's rather impressive!
(I assume virtiofs does not have such negative lookup or symlink cache so
they'll catch up soon enough if someone cares? But that's no reason to
refuse this with cache=loose)

> Further investigation may be needed to address the remaining gap with
> native build performance. With the last two patches applied, it appears
> there is still a fair amount of time spent waiting for I/O, though. This
> could be related to the two systematic RPC calls made when opening a
> file (one to clone the fid and another one to open the file). Maybe
> reusing fids or opened files could reduce client/server transactions and
> bring performance even closer to native levels? But those are just rough
> thoughts I haven't dug into enough yet.

Another thing I tried ages ago was making clunk asynchronous,
but that didn't go well;
protocol-wise clunk errors are ignored so I figured it was safe enough
to just fire it in the background, but it caused some regressions I
never had time to look into...

As for reusing fids, I'm not sure it's obvious because of things like
locking that basically consider one open file = one fid;
I think we're already re-using fids when we can, but I guess it's
technically possible to mark a fid as shared and only clone it if an
operation that requires an exclusive fid is done...?
I'm not sure I want to go down that hole though, sounds like an easy way
to mess up and give someone access to data they shouldn't be able to
access by sharing a fid opened by another user or something more
subtle..

-- 
Dominique Martinet | Asmadeus
Re: [RFC PATCH 0/5] 9p: Performance improvements for build workloads
Posted by Remi Pommarel 2 weeks ago
Hi Dominique,

On Sun, Sep 14, 2025 at 09:34:11PM +0900, Dominique Martinet wrote:
> Remi Pommarel wrote on Sun, Aug 31, 2025 at 09:03:38PM +0200:
> > This patchset introduces several performance optimizations for the 9p
> > filesystem when used with cache=loose option (exclusive or read only
> > mounts). These improvements particularly target workloads with frequent
> > lookups of non-existent paths and repeated symlink resolutions.
> 
> Sorry for slow reply, I think a negative cache and symlink cache make
> sense.
> I haven't tested these yet, and there's a conversion to the "new" mount
> API that's brewing and will conflict with 2nd patch, but I'll be happy
> to take these patches as time allows.
> What was the reason this was sent as RFC, does something require more work?
> 
> I can't comment on io_wait_event_killable, it makes sense to me as well
> but it's probably more appropriate to send through the scheduler tree.
> 

The RFC was mainly to find out whether an io_wait_event_killable() would
make sense before getting the scheduler tree involved. Also, as this is
my first contribution to v9fs (and the fs subsystem), I wanted to be
sure I wasn't missing something obvious; caching can be a complex
subject to grasp. This also comes with some drawbacks: if, for example,
the server removes a shared file or modifies a symlink, the client will
be desynchronized, so I first wanted to be sure we are OK with that when
using cache=loose.

I'll try to monitor the new mount API and rebase the series once that
gets merged. I'll probably separate io_wait_event_killable() into its
own patchset though.

> 
> > The third patch extends page cache usage to symlinks by allowing
> > p9_client_readlink() results to be cached. Resolving symlinks apparently
> > happens quite frequently during the build process, and avoiding the cost
> > of a 9P RPC round trip for already-known symlinks helps reduce the build
> > time to 1m26.602s, outperforming the virtiofs setup.
> 
> That's rather impressive!
> (I assume virtiofs does not have such negative lookup or symlink cache so
> they'll catch up soon enough if someone cares? But that's no reason to
> refuse this with cache=loose)
> 

virtiofs does have negative lookup caching (when used with cache=always)
and symlink caching (this series is even quite a bit inspired by what
fuse does). I don't really know what makes virtiofs a bit slower here; I
haven't dug into it either, but I wouldn't be surprised if it could
easily catch up.

> > Further investigation may be needed to address the remaining gap with
> > native build performance. With the last two patches applied, it appears
> > there is still a fair amount of time spent waiting for I/O, though. This
> > could be related to the two systematic RPC calls made when opening a
> > file (one to clone the fid and another one to open the file). Maybe
> > reusing fids or opened files could reduce client/server transactions and
> > bring performance even closer to native levels? But those are just rough
> > thoughts I haven't dug into enough yet.
> 
> Another thing I tried ages ago was making clunk asynchronous,
> but that didn't go well;
> protocol-wise clunk errors are ignored so I figured it was safe enough
> to just fire it in the background, but it caused some regressions I
> never had time to look into...
> 
> As for reusing fids, I'm not sure it's obvious because of things like
> locking that basically consider one open file = one fid;
> I think we're already re-using fids when we can, but I guess it's
> technically possible to mark a fid as shared and only clone it if an
> operation that requires an exclusive fid is done...?
> I'm not sure I want to go down that hole though, sounds like an easy way
> to mess up and give someone access to data they shouldn't be able to
> access by sharing a fid opened by another user or something more
> subtle..

Yes, I gave that a bit more thought and came to much the same
conclusion, so I gave up on this idea. The asynchronous clunk seems
interesting though; maybe I'll take a look into that.

Thanks for your time.

-- 
Remi
Re: [RFC PATCH 0/5] 9p: Performance improvements for build workloads
Posted by Dominique Martinet 2 weeks ago
Remi Pommarel wrote on Thu, Sep 18, 2025 at 09:17:33PM +0200:
> The RFC was mainly to find out whether an io_wait_event_killable() would
> make sense before getting the scheduler tree involved. Also, as this is
> my first contribution to v9fs (and the fs subsystem), I wanted to be
> sure I wasn't missing something obvious; caching can be a complex
> subject to grasp. This also comes with some drawbacks: if, for example,
> the server removes a shared file or modifies a symlink, the client will
> be desynchronized, so I first wanted to be sure we are OK with that when
> using cache=loose.

Ok!
I think it's completely fine for cache=loose, we're basically telling
the client we're alone in the world.

> I'll try to monitor the new mount API and rebase the series once that
> gets merged. I'll probably separate io_wait_event_killable() into its
> own patchset though.

Thanks, I need to find time to check the v9ses lifetime as I asked about
after a syzkaller bug showed up[1], so it might not be immediate, but
I'll get to it eventually.

[1] https://lore.kernel.org/v9fs/aKlg5Ci4WC11GZGz@codewreck.org/T/#u

> > Another thing I tried ages ago was making clunk asynchronous,
> > but that didn't go well;
> > protocol-wise clunk errors are ignored so I figured it was safe enough
> > to just fire it in the background, but it caused some regressions I
> > never had time to look into...
> > 
> > As for reusing fids, I'm not sure it's obvious because of things like
> > locking that basically consider one open file = one fid;
> > I think we're already re-using fids when we can, but I guess it's
> > technically possible to mark a fid as shared and only clone it if an
> > operation that requires an exclusive fid is done...?
> > I'm not sure I want to go down that hole though, sounds like an easy way
> > to mess up and give someone access to data they shouldn't be able to
> > access by sharing a fid opened by another user or something more
> > subtle..
> 
> Yes, I gave that a bit more thought and came to much the same
> conclusion, so I gave up on this idea. The asynchronous clunk seems
> interesting though; maybe I'll take a look into that.

It's been a while, but the last time I rebased the patches was around here:
https://github.com/martinetd/linux/commits/9p-async-v2/
(the v1 branch also had clunks async, with this comment
> This has a few problems, but mostly we can't just replace all clunks
> with async ones: depending on the server, explicit close() must clunk
> to make sure the IO is flushed, so these should wait for clunk to finish.
)

If you have time to play with this, happy to consider it again, but
it'll definitely need careful testing (possibly implement the clunk part
as a non-default option? although I'm not sure how that'd fly, linux
doesn't really like options that sacrifice reliability for performance...)

Anyway, that's something I definitely don't have time for short term,
but happy to discuss :)

Cheers,
-- 
Dominique Martinet | Asmadeus