fuse: allow server to increase max_readahead via FUSE_INIT reply

[PATCH] fuse: allow server to increase max_readahead via FUSE_INIT reply

Posted by Jim Harris 5 days, 7 hours ago

A FUSE server that advertises a large max_pages and max_write (e.g.
max_pages=256, max_write=1MB) cannot currently obtain matching
FUSE_READ request sizes from the kernel.  Buffered sequential writes
arrive at the server at the negotiated max_write size, but buffered
sequential reads remain capped at the kernel's default readahead
window (VM_READAHEAD_PAGES, 128KB; doubled to 256KB for files marked
POSIX_FADV_SEQUENTIAL).  A 1MB application read() therefore turns
into four sequential 256KB FUSE_READ round-trips instead of one.
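
The request count above is a ceiling division. A minimal sketch, assuming every FUSE_READ is capped at the effective readahead window; fuse_read_requests() is a hypothetical helper for illustration, not a kernel function:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of FUSE_READ requests needed to cover one
 * application read when each request is capped at window_bytes.
 */
static size_t fuse_read_requests(size_t read_bytes, size_t window_bytes)
{
	assert(window_bytes > 0);
	/* Round up so a partial trailing window still costs one request. */
	return (read_bytes + window_bytes - 1) / window_bytes;
}
```

With read_bytes = 1MB and window_bytes = 256KB this gives four requests; with a 1MB window it gives one.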

This is because process_init_reply() processes the server's
max_readahead response as:

	ra_pages = arg->max_readahead / PAGE_SIZE;
	fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);

Since the kernel sends its current bdi->ra_pages as
init_in->max_readahead, and bdi->ra_pages is the default
VM_READAHEAD_PAGES at this point, the server can only ever decrease
the readahead window -- never increase it.  Even if the server
replies with max_readahead=1MB, the min() clamps it back to 128KB.
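
The effect of the clamp can be modelled in user space. This is a sketch only, assuming 4KB pages and a default window of 32 pages (128KB); SKETCH_PAGE_SIZE and ra_pages_clamped() are hypothetical names that mirror the two lines quoted above:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/*
 * Model of the current behaviour: the server's reply (in bytes) is
 * converted to pages and then clamped to the existing bdi value.
 */
static unsigned int ra_pages_clamped(unsigned int bdi_ra_pages,
				     unsigned int reply_max_readahead)
{
	unsigned int ra_pages = reply_max_readahead / SKETCH_PAGE_SIZE;

	assert(bdi_ra_pages > 0);
	/* min(): the result can never exceed what the bdi already had. */
	return bdi_ra_pages < ra_pages ? bdi_ra_pages : ra_pages;
}
```

Under these assumptions a reply of 1MB (256 pages) against a 32-page default still yields 32 pages, while a reply of 64KB lowers the window to 16 pages.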

This clamp dates to commit 9cd684551124 ("[PATCH] fuse: fix async
read for legacy filesystems"), which introduced max_readahead at FUSE
protocol 7.6 and used min() to preserve legacy (<7.6) filesystem
behaviour.  Modern filesystems that explicitly advertise a larger
max_readahead are silently overridden.

Other filesystems set ra_pages or io_pages directly from negotiated
server/device capabilities: cifs sets ra_pages from rsize/rasize,
ceph from rasize/rsize mount options, 9p from maxdata, and nfs sets
io_pages from rpages.

Use the server's max_readahead response directly, bounded by
fc->max_pages (which is itself bounded by fc->max_pages_limit and,
for virtio-fs, by the virtqueue descriptor count):

	fm->sb->s_bdi->ra_pages = min_t(unsigned int, ra_pages,
					fc->max_pages);

This is backward compatible:

 - Servers that echo init_in->max_readahead back unchanged see the
   same effective readahead as today.
 - Servers that reply with a smaller value still reduce ra_pages.
 - Servers that do not negotiate FUSE_MAX_PAGES see no change, since
   fc->max_pages defaults to FUSE_DEFAULT_MAX_PAGES_PER_REQ (32),
   which equals VM_READAHEAD_PAGES with 4KB pages.
 - Only servers that both negotiate FUSE_MAX_PAGES and advertise a
   larger max_readahead see the new behaviour, and in that case
   fc->max_pages already gates per-request data size.
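
The four cases above can be checked against a small user-space model of the proposed rule. This is a sketch under stated assumptions (4KB pages, a 32-page default window); SKETCH_PAGE_SIZE and ra_pages_proposed() are hypothetical names, not kernel symbols:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB pages */

/*
 * Model of the proposed behaviour: use the server's reply (in bytes)
 * directly, bounded only by the negotiated max_pages.
 */
static unsigned int ra_pages_proposed(unsigned int reply_max_readahead,
				      unsigned int max_pages)
{
	unsigned int ra_pages = reply_max_readahead / SKETCH_PAGE_SIZE;

	assert(max_pages > 0);
	return ra_pages < max_pages ? ra_pages : max_pages;
}
```

In this model, echoing 128KB back gives 32 pages whether max_pages is 32 or 256, a 64KB reply gives 16 pages, a 1MB reply with the default max_pages of 32 stays at 32 pages, and only a 1MB reply combined with max_pages=256 reaches 256 pages.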

Signed-off-by: Jim Harris <jim.harris@nvidia.com>
Assisted-by: Cursor:claude-opus-4.7
---
Notes on AI assistance:

  The code analysis (tracing the readahead negotiation in
  process_init_reply(), confirming the behaviour of ractl_max_pages()
  in mm/readahead.c, and surveying how other filesystems set
  ra_pages/io_pages) and the bulk of this changelog were drafted with
  an AI coding assistant (see Assisted-by trailer).  The one-line code
  change was reviewed by me.  The motivating performance observation
  (a 1MB application read producing four 256KB FUSE_READ requests
  against a server advertising max_pages=256 and max_write=1MB) was
  observed by me on a real virtio-fs workload prior to any AI
  involvement, and verification of patched and unpatched behaviour
  was performed by me.

 fs/fuse/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index deddfffb037f..272026f11a34 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1494,7 +1494,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 		init_server_timeout(fc, timeout);
 
 		fm->sb->s_bdi->ra_pages =
-				min(fm->sb->s_bdi->ra_pages, ra_pages);
+				min_t(unsigned int, ra_pages, fc->max_pages);
 		fc->minor = arg->minor;
 		fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
 		fc->max_write = max_t(unsigned, 4096, fc->max_write);
-- 
2.43.0

Re: [PATCH] fuse: allow server to increase max_readahead via FUSE_INIT reply

Posted by Joanne Koong 2 days, 10 hours ago

On Tue, Jun 2, 2026 at 2:14 PM Jim Harris <jim.harris@nvidia.com> wrote:
>
> A FUSE server that advertises a large max_pages and max_write (e.g.
> max_pages=256, max_write=1MB) cannot currently obtain matching
> FUSE_READ request sizes from the kernel.  Buffered sequential writes
> arrive at the server at the negotiated max_write size, but buffered
> sequential reads remain capped at the kernel's default readahead
> window (VM_READAHEAD_PAGES, 128KB; doubled to 256KB for files marked
> POSIX_FADV_SEQUENTIAL).  A 1MB application read() therefore turns
> into four sequential 256KB FUSE_READ round-trips instead of one.
>
> This is because process_init_reply() processes the server's
> max_readahead response as:
>
>         ra_pages = arg->max_readahead / PAGE_SIZE;
>         fm->sb->s_bdi->ra_pages = min(fm->sb->s_bdi->ra_pages, ra_pages);
>
> Since the kernel sends its current bdi->ra_pages as
> init_in->max_readahead, and bdi->ra_pages is the default
> VM_READAHEAD_PAGES at this point, the server can only ever decrease
> the readahead window -- never increase it.  Even if the server
> replies with max_readahead=1MB, the min() clamps it back to 128KB.
>
> This clamp dates to commit 9cd684551124 ("[PATCH] fuse: fix async
> read for legacy filesystems"), which introduced max_readahead at FUSE
> protocol 7.6 and used min() to preserve legacy (<7.6) filesystem
> behaviour.  Modern filesystems that explicitly advertise a larger
> max_readahead are silently overridden.
>
> Other filesystems set ra_pages or io_pages directly from negotiated
> server/device capabilities: cifs sets ra_pages from rsize/rasize,
> ceph from rasize/rsize mount options, 9p from maxdata, and nfs sets
> io_pages from rpages.
>
> Use the server's max_readahead response directly, bounded by
> fc->max_pages (which is itself bounded by fc->max_pages_limit and,
> for virtio-fs, by the virtqueue descriptor count):
>
>         fm->sb->s_bdi->ra_pages = min_t(unsigned int, ra_pages,
>                                         fc->max_pages);
>
> This is backward compatible:
>
>  - Servers that echo init_in->max_readahead back unchanged see the
>    same effective readahead as today.
>  - Servers that reply with a smaller value still reduce ra_pages.
>  - Servers that do not negotiate FUSE_MAX_PAGES see no change, since
>    fc->max_pages defaults to FUSE_DEFAULT_MAX_PAGES_PER_REQ (32),
>    matching VM_READAHEAD_PAGES.
>  - Only servers that both negotiate FUSE_MAX_PAGES and advertise a
>    larger max_readahead see the new behaviour, and in that case
>    fc->max_pages already gates per-request data size.
>
> Signed-off-by: Jim Harris <jim.harris@nvidia.com>
> Assisted-by: Cursor:claude-opus-4.7
> ---
> Notes on AI assistance:
>
>   The code analysis (tracing the readahead negotiation in
>   process_init_reply(), confirming the behaviour of ractl_max_pages()
>   in mm/readahead.c, and surveying how other filesystems set
>   ra_pages/io_pages) and the bulk of this changelog were drafted with
>   an AI coding assistant (see Assisted-by trailer).  The one-line code
>   change was reviewed by me.  The motivating performance observation
>   (a 1MB application read producing four 256KB FUSE_READ requests
>   against a server advertising max_pages=256 and max_write=1MB) was
>   observed by me on a real virtio-fs workload prior to any AI
>   involvement, and verification of patched and unpatched behaviour
>   was performed by me.
>
>  fs/fuse/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index deddfffb037f..272026f11a34 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -1494,7 +1494,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
>                 init_server_timeout(fc, timeout);
>
>                 fm->sb->s_bdi->ra_pages =
> -                               min(fm->sb->s_bdi->ra_pages, ra_pages);
> +                               min_t(unsigned int, ra_pages, fc->max_pages);


Looking at how the mm code uses ra_pages, I think this will also have
the side effect of upping the number of pages read into the page cache
for speculative readahead. I don't think this is safe, at least not
for unprivileged servers.

I think setting s_bdi->io_pages would be a better fit here. From the
logic in page_cache_sync_ra() -> ractl_max_pages(), this would only
exceed the readahead window size limit for non-speculative readahead.
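
The distinction between ra_pages and io_pages can be illustrated with a user-space model. This is a sketch of the selection as described above, assuming a request larger than the readahead window may grow up to io_pages; model_max_pages() is a hypothetical name and the code is not copied from mm/readahead.c:

```c
#include <assert.h>

/*
 * Model: the limit starts at the readahead window (ra_pages). Only when
 * the caller's own request is larger than that window, and io_pages is
 * larger too, is the limit raised, up to min(req_size, io_pages).
 */
static unsigned long model_max_pages(unsigned long ra_pages,
				     unsigned long io_pages,
				     unsigned long req_size)
{
	unsigned long max_pages = ra_pages;

	assert(ra_pages > 0);
	if (req_size > max_pages && io_pages > max_pages)
		max_pages = req_size < io_pages ? req_size : io_pages;
	return max_pages;
}
```

With ra_pages=32 and io_pages=256, a 256-page request is allowed 256 pages, while an 8-page request keeps the 32-page window, so in this model raising io_pages does not enlarge speculative readahead for small reads.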

Thanks,
Joanne

>                 fc->minor = arg->minor;
>                 fc->max_write = arg->minor < 5 ? 4096 : arg->max_write;
>                 fc->max_write = max_t(unsigned, 4096, fc->max_write);
> --
> 2.43.0
>
>