[PATCH v2 00/21] netfs: Keep track of folios in a segmented bio_vec[] chain

David Howells posted 21 patches 6 days, 5 hours ago
Documentation/core-api/folio_queue.rst      |  209 ----
Documentation/core-api/index.rst            |    1 -
Documentation/filesystems/netfs_library.rst |    2 +-
fs/9p/vfs_addr.c                            |   49 +-
fs/afs/dir.c                                |   40 +-
fs/afs/dir_edit.c                           |   43 +-
fs/afs/dir_search.c                         |   33 +-
fs/afs/file.c                               |   28 +-
fs/afs/fsclient.c                           |    8 +-
fs/afs/inode.c                              |    2 +-
fs/afs/internal.h                           |   12 +-
fs/afs/symlink.c                            |   31 +-
fs/afs/write.c                              |   32 +-
fs/afs/yfsclient.c                          |    6 +-
fs/cachefiles/interface.c                   |   82 +-
fs/cachefiles/internal.h                    |   13 +-
fs/cachefiles/io.c                          |  530 +++++++---
fs/cachefiles/namei.c                       |   19 +-
fs/cachefiles/xattr.c                       |   24 +-
fs/ceph/Kconfig                             |    1 +
fs/ceph/addr.c                              |  119 ++-
fs/netfs/Kconfig                            |    3 +
fs/netfs/Makefile                           |    4 +-
fs/netfs/buffered_read.c                    |  508 +++++----
fs/netfs/buffered_write.c                   |   30 +-
fs/netfs/bvecq.c                            |  763 ++++++++++++++
fs/netfs/direct_read.c                      |  107 +-
fs/netfs/direct_write.c                     |  167 +--
fs/netfs/fscache_io.c                       |    8 +-
fs/netfs/internal.h                         |  112 +-
fs/netfs/iterator.c                         |  369 ++-----
fs/netfs/misc.c                             |  168 +--
fs/netfs/objects.c                          |   22 +-
fs/netfs/read_collect.c                     |  159 +--
fs/netfs/read_pgpriv2.c                     |  188 ++--
fs/netfs/read_retry.c                       |  243 ++---
fs/netfs/read_single.c                      |  169 +--
fs/netfs/rolling_buffer.c                   |  222 ----
fs/netfs/stats.c                            |    6 +-
fs/netfs/write_collect.c                    |  236 +++--
fs/netfs/write_issue.c                      | 1049 +++++++++++--------
fs/netfs/write_retry.c                      |  147 +--
fs/nfs/Kconfig                              |    1 +
fs/nfs/fscache.c                            |   23 +-
fs/smb/client/cifsglob.h                    |    2 +-
fs/smb/client/cifssmb.c                     |   13 +-
fs/smb/client/file.c                        |  137 +--
fs/smb/client/smb2ops.c                     |   79 +-
fs/smb/client/smb2pdu.c                     |   28 +-
fs/smb/client/transport.c                   |   15 +-
fs/smb/smbdirect/connection.c               |  134 ++-
include/linux/bvec.h                        |   17 +
include/linux/bvecq.h                       |  325 ++++++
include/linux/folio_queue.h                 |  282 -----
include/linux/fscache.h                     |   17 +
include/linux/iov_iter.h                    |   82 +-
include/linux/netfs.h                       |  166 +--
include/linux/pagemap.h                     |   10 +
include/linux/rolling_buffer.h              |   61 --
include/linux/uio.h                         |   17 +-
include/trace/events/cachefiles.h           |   17 +-
include/trace/events/netfs.h                |  155 ++-
kernel/bpf/btf.c                            |    2 -
lib/iov_iter.c                              |  545 +++++-----
lib/scatterlist.c                           |   59 +-
lib/tests/kunit_iov_iter.c                  |  135 ++-
mm/readahead.c                              |    5 +
net/9p/client.c                             |    8 +-
68 files changed, 4694 insertions(+), 3605 deletions(-)
delete mode 100644 Documentation/core-api/folio_queue.rst
create mode 100644 fs/netfs/bvecq.c
delete mode 100644 fs/netfs/rolling_buffer.c
create mode 100644 include/linux/bvecq.h
delete mode 100644 include/linux/folio_queue.h
delete mode 100644 include/linux/rolling_buffer.h
[PATCH v2 00/21] netfs: Keep track of folios in a segmented bio_vec[] chain
Posted by David Howells 6 days, 5 hours ago
Hi Christian,

Could you add these patches to the VFS tree for next?

The patches get rid of folio_queue, rolling_buffer and ITER_FOLIOQ,
replacing the folio queue construct used to manage buffers in netfslib with
one based around a segmented chain of bio_vec arrays instead.  There are
three main aims here:

 (1) The kernel file I/O subsystem seems to be moving towards consolidating
     on the use of bio_vec arrays, so embrace this by moving netfslib to
     keep track of its buffers for buffered I/O in bio_vec[] form.

 (2) Netfslib already uses a bio_vec[] to handle unbuffered/DIO, so the
     number of different buffering schemes used can be reduced to just a
     single one.

 (3) Always send an entire filesystem RPC request message to a TCP socket
     with a single kernel_sendmsg() call as this is faster, more efficient
     and doesn't require the use of corking as it puts the entire
     transmission loop inside of a single tcp_sendmsg().
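
Aim (3) has a userspace analogue that may help picture it: header, payload
and trailer are gathered into one sendmsg() call instead of being sent
piecemeal under TCP_CORK.  The sketch below is only that analogue, assuming a
connected stream socket; send_rpc_message() and its parameters are
hypothetical names, and it does not show the kernel_sendmsg()/ITER_BVECQ path
used by the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: hand a header, a payload and a trailer to the socket
 * in one sendmsg() call.  Returns the number of bytes queued or -1 on error;
 * a short count is possible and the caller would have to resume from there.
 */
static ssize_t send_rpc_message(int sock,
				const void *hdr, size_t hdr_len,
				const void *data, size_t data_len,
				const void *trl, size_t trl_len)
{
	struct iovec iov[3] = {
		{ .iov_base = (void *)hdr,  .iov_len = hdr_len  },
		{ .iov_base = (void *)data, .iov_len = data_len },
		{ .iov_base = (void *)trl,  .iov_len = trl_len  },
	};
	struct msghdr msg = {
		.msg_iov    = iov,
		.msg_iovlen = 3,
	};

	assert(sock >= 0);
	return sendmsg(sock, &msg, MSG_NOSIGNAL);
}
```

The three iovec slots play the role that the bio_vec slots of a bvecq chain
would play in the kernel; the point is only that the socket layer sees the
whole message at once.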

For the replacement of folio_queue, a segmented chain of bio_vec arrays
rather than a single monolithic array is provided:

	struct bvecq {
		struct bvecq		*next;
		struct bvecq		*prev;
		unsigned long long	fpos;
		refcount_t		ref;
		u32			priv;
		u16			nr_segs;
		u16			max_segs;
		enum bvecq_mem		mem_type:2;
		bool			inline_bv:1;
		bool			discontig:1;
		struct bio_vec		*bv;
		struct bio_vec		__bv[];
	};

The fields are:

 (1) next, prev - Link segments together in a list.  I want this to be
     NULL-terminated linear rather than circular to make it possible to
     arbitrarily glue bits on the front.

 (2) fpos, discontig - Note the current file position of the first byte of
     the segment; all the bio_vecs in ->bv[] must be contiguous in the file
     space.  The fpos can be used to find the folio by file position rather
     than from the info in the bio_vec.

     If there's a discontiguity, this should break over into a new bvecq
     segment with the discontig flag set (though this is redundant if you
     keep track of the file position).  Note that the beginning and end
     file positions in a segment need not be aligned to any filesystem
     block size.

 (3) ref - Refcount.  Each bvecq keeps a ref on the next.  I'm not sure
     this is entirely necessary, but it makes sharing slices easier.

 (4) priv - Private data for the owner.  Dispensable; currently only used
     for storing a debug ID for tracing in a patch not included here.

 (5) max_segs, nr_segs.  The size of bv[] and the number of elements used.
     I've assumed a maximum of 65535 bio_vecs in the array (which would
     represent a ~1MiB allocation).

 (6) bv, __bv, inline_bv.  bv points to the bio_vec[] array handled by
     this segment.  This may begin at __bv and if it does inline_bv should
     be set (otherwise it's impossible to distinguish a separately
     allocated bio_vec[] that follows immediately by coincidence).

 (7) mem_type.  Indicates how the memory attached to the bio_vecs should be
     disposed of when the bvecq is destroyed.  It can be one of:

	BVECQ_MEM_EXTERNAL	- Externally tracked ref; don't put
	BVECQ_MEM_PAGECACHE	- Pagecache; must be put
	BVECQ_MEM_GUP		- Pinned by GUP; needs unpin
	BVECQ_MEM_ALLOCED	- Plain alloc'd pages; can be mempooled
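
To make the fpos/discontig rule in (2) concrete, here is a minimal userspace
sketch.  The struct definitions are cut-down stand-ins rather than the kernel
ones, and bvecq_chain_is_consistent() is a hypothetical helper.  It also
assumes that discontig is set only when there really is a gap, which the
description above implies but does not state outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; the real definitions live in the kernel headers. */
struct bio_vec {
	void		*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

struct bvecq {
	struct bvecq		*next;
	struct bvecq		*prev;
	unsigned long long	fpos;
	unsigned short		nr_segs;
	unsigned short		max_segs;
	bool			discontig;
	struct bio_vec		*bv;
};

/* Bytes described by one segment: the sum of bv_len over the used slots. */
static unsigned long long bvecq_seg_bytes(const struct bvecq *bq)
{
	unsigned long long total = 0;

	assert(bq->nr_segs <= bq->max_segs);
	for (unsigned int i = 0; i < bq->nr_segs; i++)
		total += bq->bv[i].bv_len;
	return total;
}

/*
 * Hypothetical consistency check: a segment's discontig flag should agree
 * with the file positions, i.e. be set if and only if the segment does not
 * start where its predecessor ended.
 */
static bool bvecq_chain_is_consistent(const struct bvecq *head)
{
	for (const struct bvecq *bq = head; bq && bq->next; bq = bq->next) {
		unsigned long long end = bq->fpos + bvecq_seg_bytes(bq);
		bool gap = bq->next->fpos != end;

		if (gap != bq->next->discontig)
			return false;
	}
	return true;
}
```

Since the gap can be computed from fpos and the bv_len sum alone, the check
also shows why the flag is redundant when file positions are tracked.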


I've also defined an iov_iter iterator type ITER_BVECQ to walk this sort of
construct so that it can be passed directly to sendmsg() or block-based DIO
(as cachefiles does).


This series makes the following changes to netfslib:

 (1) The folio_queue chain used to hold folios for buffered I/O is replaced
     with a bvecq chain.  Each bio_vec then holds (a portion of) one folio.
     Each bvecq holds a contiguous sequence of folios, but adjacent bvecqs
     in a chain may be discontiguous.

 (2) For unbuffered/DIO, the source iov_iter is extracted into a bvecq
     chain.

 (3) An abstract position representation ('bvecq_pos') is created that can
     be used to hold a position in a bvecq chain.  For the moment, this takes
     a ref on the bvecq it points to, but that may be excessive.

 (4) Buffer tracking is managed with three cursors:  The load_cursor, at
     which new folios are added as we go; the dispatch_cursor, at which new
     subrequests' buffers start when they're created; and the
     collect_cursor, the point at which folios are being unlocked.

     Not all cursors are necessarily needed in all situations and during
     buffered writeback, we need a dispatch cursor per stream (one for the
     network filesystem and one for the cache).

 (5) ->prepare_read(), buffer setting up and ->issue_read() are merged, as
     are the write variants, with the filesystem calling back up to
     netfslib to prepare its buffer.  This simplifies the process of
     setting up a subrequest.  It may even make sense to have the
     filesystem allocate the subrequest.

 (6) Retry dispatch tracking is added to netfs_io_request so that the
     buffer preparation functions can find it.  Retry requires an
     additional buffer cursor.

 (7) Netfslib dispatches I/O by accumulating enough bufferage to dispatch
     at least one subrequest, then looping to generate as many as the
     filesystem wants to (they may be limited by other constraints,
     e.g. max RDMA segment count or negotiated max size).  This loop could
     be moved down into the filesystem.  A new method is provided by which
     netfslib can ask the filesystem to provide an estimate of the data
     that should be accumulated before dispatch begins.

 (8) Reading from the cache is now managed by querying the cache to provide
     a list of the next two data extents within the cache.

 (9) AFS directories are switched to using a bvecq rather than a
     folio_queue to hold their contents.

(10) CIFS is switched to using a bvecq rather than a folio_queue for holding
     a temporary encryption buffer.

(11) CIFS RDMA is given the ability to extract ITER_BVECQ and support for
     extracting ITER_FOLIOQ is removed.

(12) All the folio_queue and rolling_buffer code is removed.
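
Items (3) and (4) can be pictured with a small userspace sketch.  The layout
of the real bvecq_pos is not given in this posting, so struct chain_pos, its
fields and chain_pos_advance() are all hypothetical; the sketch also leaves
out the ref that a position currently takes on its bvecq.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down userspace stand-ins for the kernel structures. */
struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };

struct bvecq {
	struct bvecq	*next;
	unsigned short	nr_segs;
	struct bio_vec	*bv;
};

/* Hypothetical cursor: which segment, which slot, how far into the slot. */
struct chain_pos {
	struct bvecq	*bq;
	unsigned int	slot;
	unsigned int	offset;
};

/* The three cursors of item (4), as plain positions in one chain. */
struct buffer_cursors {
	struct chain_pos load;		/* Where new folios get added */
	struct chain_pos dispatch;	/* Where the next subrequest's buffer starts */
	struct chain_pos collect;	/* Where folios are being unlocked */
};

/*
 * Move a cursor forward by up to @nbytes, stepping across slots and
 * segments.  Returns the number of bytes actually skipped, which is less
 * than @nbytes only if the end of the chain was reached.
 */
static size_t chain_pos_advance(struct chain_pos *pos, size_t nbytes)
{
	size_t done = 0;

	while (pos->bq) {
		if (pos->slot >= pos->bq->nr_segs) {
			/* Segment exhausted; step to the next one. */
			pos->bq = pos->bq->next;
			pos->slot = 0;
			pos->offset = 0;
			continue;
		}
		if (done == nbytes)
			break;

		size_t left = pos->bq->bv[pos->slot].bv_len - pos->offset;
		size_t part = nbytes - done < left ? nbytes - done : left;

		pos->offset += part;
		done += part;
		if (pos->offset == pos->bq->bv[pos->slot].bv_len) {
			pos->slot++;
			pos->offset = 0;
		}
	}
	return done;
}
```

In this picture the collect cursor trails the dispatch cursor, which in turn
trails the load cursor, and all three only ever move forwards.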

Cachefiles is also modified:

 (1) The object type in the cachefiles file xattr is now correctly set to
     CACHEFILES_CONTENT_{SINGLE,ALL,BACKFS_MAP} rather than just being 0,
     to indicate whether we have a single monolithic blob, all the data up
     to cache i_size with no holes or a sparse file with the data mapped by
     the backing file system (as currently upstream).

 (2) For "ALL" type files, the cache's i_size is used to track how much
     data is saved in the cache and no longer bears any relation to the
     netfs i_size.  The actual object size is stored in the xattr.

 (3) For most typical files which are contiguous and written progressively,
     the object type is now set to "ALL".  For anything else, cachefiles
     uses SEEK_DATA/HOLE to find extent outlines as before (this is the
     current behaviour and needs to be fixed, but in a separate set of
     patches as it's not trivial).
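
For readers unfamiliar with SEEK_DATA/HOLE, the probing mentioned in (3) looks
roughly like this from userspace.  This is a sketch using lseek(2), not the
cachefiles code (which runs in the kernel); struct data_extent,
next_data_extent() and first_two_extents() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* For SEEK_DATA and SEEK_HOLE in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA		/* Linux values, in case libc hid them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/* Hypothetical extent descriptor: [start, end) holds data. */
struct data_extent {
	off_t start;
	off_t end;
};

/*
 * Find the first data extent at or after @from in @fd.  Returns true and
 * fills in @ext if one was found; returns false if there is no more data
 * (lseek() fails with ENXIO) or on any other error.
 */
static bool next_data_extent(int fd, off_t from, struct data_extent *ext)
{
	off_t start, end;

	assert(ext);
	start = lseek(fd, from, SEEK_DATA);
	if (start < 0)
		return false;
	end = lseek(fd, start, SEEK_HOLE);
	if (end < 0)
		return false;
	ext->start = start;
	ext->end = end;
	return true;
}

/* Collect up to two extents from the start of @path; returns the count. */
static int first_two_extents(const char *path, struct data_extent ext[2])
{
	int fd = open(path, O_RDONLY);
	off_t from = 0;
	int n = 0;

	if (fd < 0)
		return -1;
	while (n < 2 && next_data_extent(fd, from, &ext[n])) {
		from = ext[n].end;
		n++;
	}
	close(fd);
	return n;
}
```

What counts as a hole is up to the backing filesystem, so the extents
reported this way are only as accurate as that filesystem makes them.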

Two further things that I'm working on (but not in this branch) are:

 (1) Make it so that a filesystem can be given a copy of a subchain which
     it can then tack header and trailer protocol elements upon to form a
     single message (I have this working for cifs) and even join copies
     together with intervening protocol elements to form compounds.

 (2) Make it so that a filesystem can 'splice' out the contents of the TCP
     receive queue into a bvecq chain.  This allows the socket lock to be
     dropped much more quickly and the copying of data read to the
     destination buffers to happen without the lock.  I have this working
     for cifs too.  Kernel recvmsg() doesn't then block kernel sendmsg()
     for anywhere near as long.

There are also some things I want to consider for the future:

 (1) Create one or more batched iteration functions to 'unlock' all the
     folios in a bio_vec[], where 'unlock' is the appropriate action for
     ending a read or a write.  Batching should hopefully also improve the
     efficiency of wrangling the marks on the xarray.  Very often these
     marks are going to be represented by contiguous bits, so there may be
     a way to change them in bulk.

 (2) Rather than walking the bvecq chain to get each individual folio out
     via bv_page, use the file position stored on the bvecq and the sum of
     bv_len to iterate over the appropriate range in i_pages.

 (3) Change iov_iter to store the initial starting point and for
     iov_iter_revert() to reset to that and advance.  This would (a) help
     prevent over-reversion and (b) dispense with the need for a prev
     pointer.

 (4) Use bvecq to replace scatterlist.  One problem with replacing
     scatterlist is that crypto drivers like to glue bits on the front of
     the scatterlists they're given (something trivial with that API) - and
     this is one way to achieve it.

The patches can also be found here:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-next

Thanks,
David

Changes
=======
ver #2)
- Fixed a number of bugs reported by Sashiko[1].
- Split a bunch of fixes out and posted them separately[2].

[1] https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com
[2] https://lore.kernel.org/linux-fsdevel/20260512-infozentrum-becher-7f86c47c96c8@brauner/T/#t

David Howells (21):
  cachefiles: Don't rely on backing fs storage map for most use cases
  netfs: Add the cache object ID to netfs_read/write tracepoints
  mm: Make readahead store folio count in readahead_control
  netfs: Bulk load the readahead-provided folios up front
  Add a function to kmap one page of a multipage bio_vec
  iov_iter: Make iov_iter_get_pages*() wrap iov_iter_extract_pages()
  iov_iter: Add a segmented queue of bio_vec[]
  netfs: Add some tools for managing bvecq chains
  netfs: Add a function to extract from an iter into a bvecq
  afs: Use a bvecq to hold dir content rather than folioq
  cifs: Use a bvecq for buffering instead of a folioq
  cifs: Support ITER_BVECQ in smb_extract_iter_to_rdma()
  netfs: Switch to using bvecq rather than folio_queue and
    rolling_buffer
  cifs: Remove support for ITER_FOLIOQ from smb_extract_iter_to_rdma()
  netfs: Remove netfs_alloc/free_folioq_buffer()
  netfs: Remove netfs_extract_user_iter()
  iov_iter: Remove ITER_FOLIOQ
  netfs: Remove folio_queue and rolling_buffer
  netfs: Check for too much data being read
  netfs: Limit the minimum trigger for progress reporting
  netfs: Combine prepare and issue ops and grab the buffers on request

Re: [PATCH v2 00/21] netfs: Keep track of folios in a segmented bio_vec[] chain
Posted by David Laight 5 days, 19 hours ago
On Mon, 18 May 2026 23:29:32 +0100
David Howells <dhowells@redhat.com> wrote:

> Hi Christian,
> 
> Could you add these patches to the VFS tree for next?
> 
> The patches get rid of folio_queue, rolling_buffer and ITER_FOLIOQ,
> replacing the folio queue construct used to manage buffers in netfslib with
> one based around a segmented chain of bio_vec arrays instead.  There are
> three main aims here:
> 
>  (1) The kernel file I/O subsystem seems to be moving towards consolidating
>      on the use of bio_vec arrays, so embrace this by moving netfslib to
>      keep track of its buffers for buffered I/O in bio_vec[] form.
> 
>  (2) Netfslib already uses a bio_vec[] to handle unbuffered/DIO, so the
>      number of different buffering schemes used can be reduced to just a
>      single one.
> 
>  (3) Always send an entire filesystem RPC request message to a TCP socket
>      with a single kernel_sendmsg() call as this is faster, more efficient
>      and doesn't require the use of corking as it puts the entire
>      transmission loop inside of a single tcp_sendmsg().
> 
> For the replacement of folio_queue, a segmented chain of bio_vec arrays
> rather than a single monolithic array is provided:
> 
> 	struct bvecq {
> 		struct bvecq		*next;
> 		struct bvecq		*prev;
> 		unsigned long long	fpos;
> 		refcount_t		ref;
> 		u32			priv;
> 		u16			nr_segs;
> 		u16			max_segs;
> 		enum bvecq_mem		mem_type:2;
> 		bool			inline_bv:1;
> 		bool			discontig:1;

There doesn't seem to be any point using bitfields.
There is a massive hole here anyway.

> 		struct bio_vec		*bv;
> 		struct bio_vec		__bv[];
> 	};
> 
> The fields are:
> 
>  (1) next, prev - Link segments together in a list.  I want this to be
>      NULL-terminated linear rather than circular to make it possible to
>      arbitrarily glue bits on the front.

Do you ever need to follow the list backwards?
If not, making prev point to the pointer to the entry (probably a tailq?)
makes the logic simpler (and safer) because you can remove an item without
knowing whether it is the head or which list it is on.
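
As an illustration of the "prev points to the pointer to the entry" idea, here
is a minimal sketch in the style of the kernel's hlist (whose nodes carry
next and pprev).  struct seg and the two helpers are hypothetical names and
are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: @pprev points at whichever pointer points at this node. */
struct seg {
	struct seg	*next;
	struct seg	**pprev;
	int		id;
};

/* Insert @s at the front of the list whose head pointer is @head. */
static void seg_push_front(struct seg **head, struct seg *s)
{
	s->next = *head;
	s->pprev = head;
	if (*head)
		(*head)->pprev = &s->next;
	*head = s;
}

/* Unlink @s without being told which list it is on or whether it is first. */
static void seg_unlink(struct seg *s)
{
	assert(s->pprev);
	*s->pprev = s->next;
	if (s->next)
		s->next->pprev = s->pprev;
	s->next = NULL;
	s->pprev = NULL;
}
```

The cost is that the list can no longer be walked backwards, which is why the
question of whether that is ever needed matters.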

> 
>  (2) fpos, discontig - Note the current file position of the first byte of
>      the segment; all the bio_vecs in ->bv[] must be contiguous in the file
>      space.  The fpos can be used to find the folio by file position rather
>      than from the info in the bio_vec.

Should fpos be off_t (or u64) rather than 'long long' (they are all the
same underlying type).

>      If there's a discontiguity, this should break over into a new bvecq
>      segment with the discontig flag set (though this is redundant if you
>      keep track of the file position).  Note that the beginning and end
>      file positions in a segment need not be aligned to any filesystem
>      block size.

At this point you lose me :-)

-- David
Re: [PATCH v2 00/21] netfs: Keep track of folios in a segmented bio_vec[] chain
Posted by David Howells 5 days, 18 hours ago
David Laight <david.laight.linux@gmail.com> wrote:

> > 	struct bvecq {
> > 		struct bvecq		*next;
> > 		struct bvecq		*prev;
> > 		unsigned long long	fpos;
> > 		refcount_t		ref;
> > 		u32			priv;
> > 		u16			nr_segs;
> > 		u16			max_segs;
> > 		enum bvecq_mem		mem_type:2;
> > 		bool			inline_bv:1;
> > 		bool			discontig:1;
> 
> There doesn't seem to be any point using bitfields.
> There is a massive hole here anyway.

Depends on how you define "massive".  On a 64-bit machine, the whole thing
fits into 48 bytes - 6 words (or 3 bio_vec slots).  next, prev, fpos, bv and
ref+priv take up 5 of those words; nr_segs and max_segs take up half of the
6th, leaving a 4 byte hole.

You're right, though, I could make them all non-bitfields as the enum is
marked mode(byte).
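
The arithmetic can be checked with a userspace mirror of the struct.  This is
a sketch assuming a 64-bit build with GCC or Clang on Linux; the typedefs are
stand-ins for the kernel types, the enum value order is a guess, unsigned char
stands in for the byte-mode enum, and bvecq_inline_alloc_size() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; sizes match a typical 64-bit build. */
typedef struct { int refs; } refcount_t;
typedef unsigned int u32;
typedef unsigned short u16;

struct bio_vec {
	void		*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

/* Value order is an assumption; only the four names come from the posting. */
enum bvecq_mem {
	BVECQ_MEM_EXTERNAL,
	BVECQ_MEM_PAGECACHE,
	BVECQ_MEM_GUP,
	BVECQ_MEM_ALLOCED,
};

/* Layout as posted, with the three flags as bitfields. */
struct bvecq_bitfields {
	struct bvecq_bitfields	*next;
	struct bvecq_bitfields	*prev;
	unsigned long long	fpos;
	refcount_t		ref;
	u32			priv;
	u16			nr_segs;
	u16			max_segs;
	enum bvecq_mem		mem_type:2;
	bool			inline_bv:1;
	bool			discontig:1;
	struct bio_vec		*bv;
	struct bio_vec		__bv[];
};

/* Same fields with one byte per flag instead of bitfields. */
struct bvecq_bytes {
	struct bvecq_bytes	*next;
	struct bvecq_bytes	*prev;
	unsigned long long	fpos;
	refcount_t		ref;
	u32			priv;
	u16			nr_segs;
	u16			max_segs;
	unsigned char		mem_type;
	bool			inline_bv;
	bool			discontig;
	struct bio_vec		*bv;
	struct bio_vec		__bv[];
};

/* Hypothetical helper: bytes needed for a header plus an inline bv array. */
static size_t bvecq_inline_alloc_size(u16 max_segs)
{
	return sizeof(struct bvecq_bytes) +
		(size_t)max_segs * sizeof(struct bio_vec);
}
```

Under those assumptions both variants come to 48 bytes with bv at offset 40,
so dropping the bitfields costs nothing, and a full 65535-slot inline array
adds 1048560 bytes, which is the ~1MiB figure from the cover letter.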

> >  (1) next, prev - Link segments together in a list.  I want this to be
> >      NULL-terminated linear rather than circular to make it possible to
> >      arbitrarily glue bits on the front.
> 
> Do you ever need to follow the list backwards?

iov_iter_revert() exists, unfortunately, but yes, I would like to avoid having
a prev pointer.

I have a couple of ideas on how to get rid of that - or at least store the
start in struct iov_iter and always work forwards - but I haven't got round to
trying that yet.
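
One way the "store the start and always work forwards" idea could look is
sketched below in userspace C.  It is not the iov_iter code; struct seg,
struct fwd_iter and the three helpers are hypothetical, and revert is
implemented as rewind-and-replay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical singly-linked segment: no prev pointer. */
struct seg {
	struct seg	*next;
	size_t		len;
};

/* Hypothetical iterator that remembers where it started. */
struct fwd_iter {
	struct seg	*start;		/* First segment; never changes */
	struct seg	*cur;		/* Current segment */
	size_t		offset;		/* Offset into the current segment */
	size_t		consumed;	/* Bytes advanced since the start */
};

static void fwd_iter_init(struct fwd_iter *it, struct seg *start)
{
	assert(it);
	it->start = start;
	it->cur = start;
	it->offset = 0;
	it->consumed = 0;
}

/* Advance by up to @n bytes, moving forwards only; returns bytes advanced. */
static size_t fwd_iter_advance(struct fwd_iter *it, size_t n)
{
	size_t done = 0;

	while (it->cur && done < n) {
		size_t left = it->cur->len - it->offset;
		size_t part = n - done < left ? n - done : left;

		it->offset += part;
		done += part;
		if (it->offset == it->cur->len) {
			it->cur = it->cur->next;
			it->offset = 0;
		}
	}
	it->consumed += done;
	return done;
}

/*
 * Revert by @n bytes without walking backwards: rewind to the start and
 * replay the advance.  Asking for more than was consumed is clamped, which
 * gives the over-reversion protection.
 */
static void fwd_iter_revert(struct fwd_iter *it, size_t n)
{
	size_t keep = n < it->consumed ? it->consumed - n : 0;

	fwd_iter_init(it, it->start);
	fwd_iter_advance(it, keep);
}
```

The trade-off is that a revert costs a walk from the start of the chain
rather than a short step back, in exchange for not needing prev at all.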

> >  (2) fpos, discontig - Note the current file position of the first byte of
> >      the segment; all the bio_vecs in ->bv[] must be contiguous in the file
> >      space.  The fpos can be used to find the folio by file position rather
> >      than from the info in the bio_vec.
> 
> Should fpos be off_t (or u64) rather than 'long long' (they are all the
> same underlying type).

It's not 'long long' and off_t is actually 'long' in asm-generic.  Actually, I
should probably switch to using uoff_t.  Note that this file position should
never be seen as negative; I think loff_t should only really be used in
llseek.

> >      If there's a discontiguity, this should break over into a new bvecq
> >      segment with the discontig flag set (though this is redundant if you
> >      keep track of the file position).  Note that the beginning and end
> >      file positions in a segment need not be aligned to any filesystem
> >      block size.
> 
> At this point you lose me :-)

Apologies, but I'm trying to define how a bvecq chain works.  I need to codify
it more coherently.

So there's a number of reasons I want to be able to maintain the file position
information in the chain:

 (1) I can treat buffered writeback and DIO write more similarly if there's no
     requirement to access the folios in the list to get file position
     information.

 (2) When cleaning up lists of folios in buffered writeback, the file position
     is needed to access the i_pages xarray in order to clean up the marks on
     it.  This means I don't need to go from my list to access each folio, but
     can look them up through the xarray instead.

 (3) Some network filesystems, e.g. ceph, allow discontiguous (sparse) writes
     to be made to the server in a single RPC operation.  This gives a means
     to convey that information to them, but then allows the data to be
     conveyed in a single blob to the socket (the mapping between blob offsets
     and file regions is tabulated separately within the RPC call).

Note that some of this applies to reads too.

The last bit about filesystem block size alignment is because network
filesystems don't typically require any block alignment, doing RMW locally on
the server.  I should really have separated that from the discontiguity bit.
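
To illustrate reason (2), here is a minimal userspace sketch of deriving a
slot's file range, and from that a page-cache index range, purely from fpos
and the bv_len values.  The structs are cut-down stand-ins, the helper names
are hypothetical and a 4KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* Assumes 4KiB pages; not true everywhere */

/* Cut-down userspace stand-ins for the kernel structures. */
struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };

struct bvecq {
	unsigned long long	fpos;
	unsigned short		nr_segs;
	struct bio_vec		*bv;
};

struct file_range {
	unsigned long long start;	/* First byte covered */
	unsigned long long end;		/* One past the last byte covered */
};

/* File range covered by one slot: fpos plus the lengths of earlier slots. */
static struct file_range bvecq_slot_range(const struct bvecq *bq,
					  unsigned int slot)
{
	struct file_range r = { .start = bq->fpos };

	assert(slot < bq->nr_segs);
	for (unsigned int i = 0; i < slot; i++)
		r.start += bq->bv[i].bv_len;
	r.end = r.start + bq->bv[slot].bv_len;
	return r;
}

/* Index of the first page touched by a non-empty range. */
static unsigned long range_first_index(struct file_range r)
{
	return r.start >> SKETCH_PAGE_SHIFT;
}

/* Index of the last page touched by a non-empty range. */
static unsigned long range_last_index(struct file_range r)
{
	assert(r.end > r.start);
	return (r.end - 1) >> SKETCH_PAGE_SHIFT;
}
```

Nothing here dereferences bv_page, which is the point: the indices needed for
an xarray lookup fall out of the positions alone, and the start of the
segment need not be block aligned.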

David