From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0F8F12E5B21
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:30:33 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143438; cv=none;
 b=Oa1zOKu8E2kzAynQl+YpcjYJBRqGt3mKx2ozs7BB7nPcwSZISNkV4oBmOViN/LkTygrsYNI1pe4eNeT2FFoU6utA0ABLHyBJJwJrK5PpswHvUczHztgBCmAYbmTTCC0yJpdSEOUKGEdzjKRBj3KrDBxFuvpg0RrBAKA34tVF1vc=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143438; c=relaxed/simple;
	bh=2SWqT9+4RPZ8fMQB92OaRR2VrZ7cXEFWr8+XCjfQGBY=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=JMPnsN95zrzlqOuAwALrDnsF9mr0vGdrfza+0y91kNFt5kBxqGA3RCYKdisbFnw3TzUlrp1tW1cDjE9nFqoSlGAejnhI8Lwv9TxUZttnbFCs4/er3xfUYrlKjEqZVIEZoFHVToN1os1F1oANTZssZyqzpsUiGlj4+q8wuTEgnK0=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=CUlimxQf; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="CUlimxQf"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143433;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=7Gqa0zYJLlYipcNVUIF1baw/6uMaf1APMmWnO/ud3Sw=;
	b=CUlimxQfstWScq7IXax69v6tFgJ7RCaHw8FihHJOiZ/e4tiuXrSQIlaQPqMjuCvluNZEOe
	+eFCBWoBkSnQNdhyMp4WZJJcdkd0S1Kyu1VE+YpbrQFmdhcsrUoxWb+8/YZlgU5m6ZnuZd
	Ov6u9RVTukROAl8PU0B0gPITZv/y3sQ=
Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-42-dcy-VFHVMXW3M-X-a1k4hA-1; Mon,
 18 May 2026 18:30:27 -0400
X-MC-Unique: dcy-VFHVMXW3M-X-a1k4hA-1
X-Mimecast-MFC-AGG-ID: dcy-VFHVMXW3M-X-a1k4hA_1779143424
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id B7A351800451;
	Mon, 18 May 2026 22:30:22 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id B512519560A2;
	Mon, 18 May 2026 22:30:14 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 01/21] cachefiles: Don't rely on backing fs storage map for
 most use cases
Date: Mon, 18 May 2026 23:29:33 +0100
Message-ID: <20260518222959.488126-2-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Content-Type: text/plain; charset="utf-8"

Cachefiles currently uses the backing filesystem's idea of what data is
held in a backing file and queries this by means of SEEK_DATA and
SEEK_HOLE.  However, this means it does two seek operations on the backing
file for each individual read call it wants to prepare (unless the first
returns -ENXIO).  Worse, the backing filesystem is at liberty to insert or
remove blocks of zeros in order to optimise its layout which may cause
false positives and false negatives.
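
As an illustration of the two-seek cost described above, here is a minimal
userspace sketch (not kernel code and not part of this patch) of locating
one run of data with lseek().  The helper name next_data_extent() is
hypothetical, and the fallback SEEK_DATA/SEEK_HOLE values are the Linux
ones, supplied only in case the headers do not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper (not part of the patch): find the next run of data
 * at or after 'from' in the file open on 'fd'.
 *
 * Returns 1 and fills in *start and *end if a run was found, 0 if there
 * is no data at or after 'from', and -1 on any other error (errno set).
 */
static int next_data_extent(int fd, off_t from, off_t *start, off_t *end)
{
	off_t data, hole;

	assert(start && end);

	/* First seek: where does the next data begin? */
	data = lseek(fd, from, SEEK_DATA);
	if (data == (off_t)-1)
		return errno == ENXIO ? 0 : -1;

	/* Second seek: where does that run of data stop? */
	hole = lseek(fd, data, SEEK_HOLE);
	if (hole == (off_t)-1)
		return -1;

	*start = data;
	*end = hole;
	return 1;
}
```

Each extent costs two system calls here, and the answer reflects only what
the filesystem chooses to report as data, which is the weakness noted
above.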

The problem is that keeping track of what is dirty is tricky (if storing
info in xattrs, which may have limited capacity and must be read and
written as one piece) and expensive (in terms of disk space at least), and
it basically duplicates what a filesystem does.

However, the most common write case, in which the application does {
open(O_TRUNC); write(); write(); ... write(); close(); } where each write
follows directly on from the previous and leaves no gaps in the file, is
reasonably easy to detect and can be noted in the primary xattr as
CACHEFILES_CONTENT_ALL, indicating we have everything up to the object size
stored.

In this specific case, given that it is known that there are no holes in
the file, there's no need to call SEEK_DATA/HOLE or use any other mechanism
to track the contents.  That speeds things up enormously.

Even when it is necessary to use SEEK_DATA/HOLE, it may not be necessary to
call it for each cache read subrequest generated.

Implement this by adding support for the CACHEFILES_CONTENT_ALL content
type (which is defined, but currently unused), which requires a slight
adjustment in how backing files are managed.  Specifically, the driver
needs to know how much of the tail block is data and whether storing more
data will create a hole.

To this end, the way that the size of a backing file is managed is changed.
Currently, the backing file is expanded to strictly match the size of the
network file, but this can be changed to carry more useful information.
This makes two pieces of metadata available: xattr.object_size and the
backing file's i_size.  Apply the following schema:

  (a) i_size is always a multiple of the DIO block size.

  (b) i_size is only updated to the end of the highest write stored.  This
      is used to work out if we are following on without leaving a hole.

  (c) xattr.object_size is the size of the network filesystem file cached
      in this backing file.

  (d) xattr.object_size must point after the start of the last block
      (unless both are 0).

  (e) If xattr.object_size is at or after the block at the current end of
      the backing file (ie. i_size), then we have all the contents of the
      block (if xattr.content =3D=3D CACHEFILES_CONTENT_ALL).

  (f) If xattr.object_size is somewhere in the middle of the last block,
      then the data following it is invalid and must be ignored.

  (g) If data is added to the last block, then that block must be fetched,
      modified and rewritten (it must be a buffered write through the
      pagecache and not DIO).

  (h) Writes to cache are rounded out to blocks on both sides and the
      folios used as sources must contain data for any lower gap and must
      have been cleared for any upper gap, and so will rewrite any
      non-data area in the tail block.
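
The size rules above can be sketched as plain arithmetic.  This is one
reading of rules (a), (d), (e) and (f) for the CACHEFILES_CONTENT_ALL
case, written as standalone C rather than kernel code.  Both function
names are hypothetical, and treating an empty backing file as always
consistent is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Check rules (a) and (d): i_size must be block aligned and object_size
 * must point after the start of the last block of the backing file.
 */
static bool backing_sizes_consistent(unsigned long long i_size,
				     unsigned long long object_size,
				     unsigned long long bsize)
{
	/* Assumption: the block size is a non-zero power of two. */
	assert(bsize && !(bsize & (bsize - 1)));

	if (i_size & (bsize - 1))
		return false;	/* (a) i_size must be block aligned */
	if (i_size == 0)
		return true;	/* Nothing stored: no last block to check */
	return object_size > i_size - bsize;	/* (d) */
}

/*
 * Rules (e) and (f): data in the backing file can be trusted up to
 * whichever of i_size and object_size is smaller.  Anything in the tail
 * block beyond object_size is padding and must be ignored.
 */
static unsigned long long valid_data_end(unsigned long long i_size,
					 unsigned long long object_size)
{
	return object_size < i_size ? object_size : i_size;
}
```

For example, with 4096-byte blocks, an i_size of 8192 and an object_size
of 5000, the sizes are consistent and only the first 5000 bytes are
valid; the remaining bytes of the second block are padding.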

To implement this, the following changes are made:

 (1) cookie->object_size is no longer updated when writes are copied into
     the pagecache, but rather only updated when a write request completes.

     This prevents object size miscomparison when checking the xattr
     causing the backing file to be invalidated (opening and marking the
     backing file and modifying the pagecache run in parallel).

 (2) The cache's current idea of the amount of data that should be stored
     in the backing file is kept track of in object->object_size.

     Possibly this is redundant with cookie->object_size, but the latter
     gets updated in some additional circumstances.

 (3) The size of the backing file at the start of a request is now tracked
     in struct netfs_cache_resources so that the partial EOF block can be
     located and cleaned.

 (4) The cache block size is now used consistently rather than using
     CACHEFILES_DIO_BLOCK_SIZE (4096).

 (5) The backing file size is no longer adjusted when looking up an object.

 (6) When shortening a file, if the new size is not block aligned, the part
     beyond the new size is cleared.  If the file is truncated to zero, the
     content_info gets reset to CACHEFILES_CONTENT_NO_DATA.

 (7) A new struct, fscache_occupancy, is instituted to track the region
     being read.  Netfslib allocates it and fills in the start and end of
     the region to be read then calls the ->query_occupancy() method to
     find and fill in the extents.  It also indicates whether a recorded
     extent contains data or just contains a region that's all zeros
     (FSCACHE_EXTENT_DATA or FSCACHE_EXTENT_ZERO).

 (8) The ->prepare_read() cache method is changed such that, if given, it
     just limits the amount that can be read from the cache in one go.  It
     no longer indicates what source of read should be done; that
     information is now obtained from ->query_occupancy().

 (9) A new cache method, ->collect_write(), is added that is called when a
     contiguous series of writes has completed and a discontiguity or the
     end of the request has been hit.  It is supplied with the start and
     length of the write made to the backing file and can use this
     information to update the cache metadata.

(10) cachefiles_query_occupancy() is altered to find the next two "extents"
     of data stored in the backing file by doing SEEK_DATA/HOLE between the
     bounds set - unless it is known that there are no holes, in which case
     a whole-file first extent can be set.

(11) cachefiles_collect_write() is implemented to take the collated write
     completion information and use this to update the cache metadata, in
     particular working out whether there's now a hole in the backing file
     requiring future use of SEEK_DATA/HOLE instead of just assuming the
     data is all present.

     It also uses fallocate(FALLOC_FL_ZERO_RANGE) to clean the part of a
     partial block that extended beyond the old object size.  It might be
     better to perform a synchronous DIO write for this purpose, but that
     would mandate an RMW cycle.  Ideally, it should be all zeros anyway,
     but, unfortunately, shared-writable mmap can interfere.

(12) cachefiles_begin_operation() is updated to note the current backing
     file size and the cache DIO size.

(13) cachefiles_create_tmpfile() no longer expands the backing file when it
     creates it.

(14) cachefiles_set_object_xattr() is changed to use object->object_size
     rather than cookie->object_size.

(15) cachefiles_check_auxdata() is altered to actually store the content
     type and to also set object->object_size.  The cachefiles_coherency
     tracepoint is also modified to display xattr.object_size.

(16) netfs_read_to_pagecache() is reworked.  The cache ->prepare_read()
     method is replaced with ->query_occupancy() as the arbiter of what
     region of the file is read from where, and that retrieves up to two
     occupied extents of the backing file at once.

     The cache ->prepare_read() method is now repurposed to be the same as
     the equivalent network filesystem method and allows the cache to limit
     the size of the read before the iterator is prepared.

     netfs_single_dispatch_read() is similarly modified.

(17) netfs_update_i_size() and afs_update_i_size() no longer call
     fscache_update_cookie() to update cookie->object_size.

(18) Write collection now collates contiguous sequences of writes to the
     cache and calls the cache ->collect_write() method.
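
The decision made when a collated write is collected can be modelled
outside the kernel.  The sketch below is a simplified userspace rendering
of the case analysis in cachefiles_collect_write() in this patch.  It
omits the gap/cancel handling, the single-file case and all I/O, and the
enum and function names are hypothetical.

```c
#include <assert.h>

enum write_outcome {
	WRITE_UPDATE_SIZES,	/* Contiguous: just advance the sizes */
	WRITE_CLEAR_GAP,	/* Contiguous, but tail padding must be zeroed */
	WRITE_DISCONTIGUOUS,	/* A hole was left: SEEK_DATA/HOLE needed */
};

/*
 * old_size: backing file size when the request started (block aligned)
 * data_to:  object size recorded for the backing file
 * start:    start of the collated write (block aligned)
 * len:      length of the collated write (block aligned, non-zero)
 */
static enum write_outcome classify_write(unsigned long long old_size,
					 unsigned long long data_to,
					 unsigned long long start,
					 unsigned long long len)
{
	unsigned long long end = start + len;

	assert(len > 0);

	/* Empty backing file: contiguous only if the write starts at 0. */
	if (old_size == 0)
		return start == 0 ? WRITE_UPDATE_SIZES : WRITE_DISCONTIGUOUS;

	/* No partial tail block: contiguous unless the write skips ahead. */
	if (old_size <= data_to)
		return start > old_size ? WRITE_DISCONTIGUOUS :
					  WRITE_UPDATE_SIZES;

	/* Write within the file, or overwriting the partial tail block. */
	if (end <= old_size || start < old_size)
		return WRITE_UPDATE_SIZES;

	/* Write follows directly on from a partial tail block. */
	if (start == old_size)
		return WRITE_CLEAR_GAP;

	/* Write landed beyond the end of the backing file. */
	return WRITE_DISCONTIGUOUS;
}
```

In the patch itself, the discontiguous case also clears any stale tail
padding before the sizes are updated; the model reports only the first
decision reached.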

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/afs/file.c                     |   1 -
 fs/cachefiles/interface.c         |  82 +-------
 fs/cachefiles/internal.h          |  13 +-
 fs/cachefiles/io.c                | 300 +++++++++++++++++++++++-------
 fs/cachefiles/namei.c             |  19 +-
 fs/cachefiles/xattr.c             |  24 ++-
 fs/netfs/buffered_read.c          | 176 +++++++++++-------
 fs/netfs/buffered_write.c         |   3 -
 fs/netfs/internal.h               |   2 +
 fs/netfs/read_retry.c             |   2 +
 fs/netfs/read_single.c            |  39 ++--
 fs/netfs/write_collect.c          | 133 ++++++++++---
 fs/netfs/write_issue.c            |  18 ++
 fs/netfs/write_retry.c            |   3 +
 include/linux/fscache.h           |  17 ++
 include/linux/netfs.h             |  40 +++-
 include/trace/events/cachefiles.h |  17 +-
 include/trace/events/netfs.h      |   9 +-
 18 files changed, 614 insertions(+), 284 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 0467742bfeee..67f38e99ada7 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -448,7 +448,6 @@ void afs_set_i_size(struct afs_vnode *vnode, loff_t new=
_i_size)
 	}
 	spin_unlock(&inode->i_lock);
 	write_sequnlock(&vnode->cb_lock);
-	fscache_update_cookie(afs_vnode_cache(vnode), NULL, &new_i_size);
 }
=20
 static void afs_update_i_size(struct inode *inode, loff_t new_i_size)
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index a08250d244ea..736bfcaa4e1d 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -105,73 +105,6 @@ void cachefiles_put_object(struct cachefiles_object *o=
bject,
 	_leave("");
 }
=20
-/*
- * Adjust the size of a cache file if necessary to match the DIO size.  We=
 keep
- * the EOF marker a multiple of DIO blocks so that we don't fall back to d=
oing
- * non-DIO for a partial block straddling the EOF, but we also have to be
- * careful of someone expanding the file and accidentally accreting the
- * padding.
- */
-static int cachefiles_adjust_size(struct cachefiles_object *object)
-{
-	struct iattr newattrs;
-	struct file *file =3D object->file;
-	uint64_t ni_size;
-	loff_t oi_size;
-	int ret;
-
-	ni_size =3D object->cookie->object_size;
-	ni_size =3D round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE);
-
-	_enter("{OBJ%x},[%llu]",
-	       object->debug_id, (unsigned long long) ni_size);
-
-	if (!file)
-		return -ENOBUFS;
-
-	oi_size =3D i_size_read(file_inode(file));
-	if (oi_size =3D=3D ni_size)
-		return 0;
-
-	inode_lock(file_inode(file));
-
-	/* if there's an extension to a partial page at the end of the backing
-	 * file, we need to discard the partial page so that we pick up new
-	 * data after it */
-	if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
-		_debug("discard tail %llx", oi_size);
-		newattrs.ia_valid =3D ATTR_SIZE;
-		newattrs.ia_size =3D oi_size & PAGE_MASK;
-		ret =3D cachefiles_inject_remove_error();
-		if (ret =3D=3D 0)
-			ret =3D notify_change(&nop_mnt_idmap, file->f_path.dentry,
-					    &newattrs, NULL);
-		if (ret < 0)
-			goto truncate_failed;
-	}
-
-	newattrs.ia_valid =3D ATTR_SIZE;
-	newattrs.ia_size =3D ni_size;
-	ret =3D cachefiles_inject_write_error();
-	if (ret =3D=3D 0)
-		ret =3D notify_change(&nop_mnt_idmap, file->f_path.dentry,
-				    &newattrs, NULL);
-
-truncate_failed:
-	inode_unlock(file_inode(file));
-
-	if (ret < 0)
-		trace_cachefiles_io_error(NULL, file_inode(file), ret,
-					  cachefiles_trace_notify_change_error);
-	if (ret =3D=3D -EIO) {
-		cachefiles_io_error_obj(object, "Size set failed");
-		ret =3D -ENOBUFS;
-	}
-
-	_leave(" =3D %d", ret);
-	return ret;
-}
-
 /*
  * Attempt to look up the nominated node in this cache
  */
@@ -204,7 +137,6 @@ static bool cachefiles_lookup_cookie(struct fscache_coo=
kie *cookie)
 	spin_lock(&cache->object_list_lock);
 	list_add(&object->cache_link, &cache->object_list);
 	spin_unlock(&cache->object_list_lock);
-	cachefiles_adjust_size(object);
=20
 	cachefiles_end_secure(cache, saved_cred);
 	_leave(" =3D t");
@@ -238,7 +170,7 @@ static bool cachefiles_shorten_object(struct cachefiles=
_object *object,
 	loff_t i_size, dio_size;
 	int ret;
=20
-	dio_size =3D round_up(new_size, CACHEFILES_DIO_BLOCK_SIZE);
+	dio_size =3D round_up(new_size, cache->bsize);
 	i_size =3D i_size_read(inode);
=20
 	trace_cachefiles_trunc(object, inode, i_size, dio_size,
@@ -270,6 +202,7 @@ static bool cachefiles_shorten_object(struct cachefiles=
_object *object,
 		}
 	}
=20
+	object->object_size =3D new_size;
 	return true;
 }
=20
@@ -284,15 +217,20 @@ static void cachefiles_resize_cookie(struct netfs_cac=
he_resources *cres,
 	struct fscache_cookie *cookie =3D object->cookie;
 	const struct cred *saved_cred;
 	struct file *file =3D cachefiles_cres_file(cres);
-	loff_t old_size =3D cookie->object_size;
+	unsigned long long i_size =3D i_size_read(file_inode(file));
=20
-	_enter("%llu->%llu", old_size, new_size);
+	_enter("%llu->%llu", i_size, new_size);
=20
-	if (new_size < old_size) {
+	if (new_size < i_size) {
+		/* The file is being shrunk - we need to downsize the backing
+		 * file and clear the end of the final block.
+		 */
 		cachefiles_begin_secure(cache, &saved_cred);
 		cachefiles_shorten_object(object, file, new_size);
 		cachefiles_end_secure(cache, saved_cred);
 		object->cookie->object_size =3D new_size;
+		if (new_size =3D=3D 0)
+			object->content_info =3D CACHEFILES_CONTENT_NO_DATA;
 		return;
 	}
=20
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index b62cd3e9a18e..fb1a92e45ca1 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -18,8 +18,6 @@
 #include <linux/xarray.h>
 #include <linux/cachefiles.h>
=20
-#define CACHEFILES_DIO_BLOCK_SIZE 4096
-
 struct cachefiles_cache;
 struct cachefiles_object;
=20
@@ -68,12 +66,17 @@ struct cachefiles_object {
 	struct list_head		cache_link;	/* Link in cache->*_list */
 	struct file			*file;		/* The file representing this object */
 	char				*d_name;	/* Backing file name */
+	unsigned long			flags;
+#define CACHEFILES_OBJECT_USING_TMPFILE	0		/* Have an unlinked tmpfile */
+	unsigned long long		object_size;	/* Size of the object stored
+							 * (independent of cookie->object_size for
+							 * coherency reasons)
+							 */
+	atomic64_t			read_limit;	/* Point beyond which uncommitted writes */
 	int				debug_id;
 	spinlock_t			lock;
 	refcount_t			ref;
-	enum cachefiles_content		content_info:8;	/* Info about content presence */
-	unsigned long			flags;
-#define CACHEFILES_OBJECT_USING_TMPFILE	0		/* Have an unlinked tmpfile */
+	enum cachefiles_content		content_info;	/* Info about content presence */
 #ifdef CONFIG_CACHEFILES_ONDEMAND
 	struct cachefiles_ondemand_info	*ondemand;
 #endif
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index d879b80a0bed..42265fdcc17e 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -32,6 +32,8 @@ struct cachefiles_kiocb {
 	u64			b_writing;
 };
=20
+#define IS_ERR_VALUE_LL(x) unlikely((x) >=3D (unsigned long long)-MAX_ERRN=
O)
+
 static inline void cachefiles_put_kiocb(struct cachefiles_kiocb *ki)
 {
 	if (refcount_dec_and_test(&ki->ki_refcnt)) {
@@ -193,60 +195,81 @@ static int cachefiles_read(struct netfs_cache_resourc=
es *cres,
 }
=20
 /*
- * Query the occupancy of the cache in a region, returning where the next =
chunk
- * of data starts and how long it is.
+ * Query the occupancy of the cache in a region, returning the extent of t=
he
+ * next two chunks of cached data and the next hole.
  */
 static int cachefiles_query_occupancy(struct netfs_cache_resources *cres,
-				      loff_t start, size_t len, size_t granularity,
-				      loff_t *_data_start, size_t *_data_len)
+				      struct fscache_occupancy *occ)
 {
 	struct cachefiles_object *object;
+	struct inode *inode;
 	struct file *file;
-	loff_t off, off2;
-
-	*_data_start =3D -1;
-	*_data_len =3D 0;
+	unsigned long long read_limit;
+	loff_t ret;
+	int i;
=20
 	if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ))
 		return -ENOBUFS;
=20
 	object =3D cachefiles_cres_object(cres);
 	file =3D cachefiles_cres_file(cres);
-	granularity =3D max_t(size_t, object->volume->cache->bsize, granularity);
+	inode =3D file_inode(file);
+	occ->granularity =3D object->volume->cache->bsize;
+	/* Read read_limit before content_info. */
+	read_limit =3D atomic64_read_acquire(&object->read_limit);
+
+	_enter("%pD,%llu,%llx-%llx/%llx",
+	       file, inode->i_ino, occ->query_from, occ->query_to, read_limit);
+
+	if (read_limit =3D=3D 0)
+		goto done;
+
+	switch (READ_ONCE(object->content_info)) {
+	case CACHEFILES_CONTENT_ALL:
+	case CACHEFILES_CONTENT_SINGLE:
+		if (read_limit > occ->query_from) {
+			occ->cached_from[0] =3D 0;
+			occ->cached_to[0] =3D read_limit;
+			occ->cached_type[0] =3D FSCACHE_EXTENT_DATA;
+			occ->query_from =3D ULLONG_MAX;
+		}
+		goto done;
+	default:
+		break;
+	}
=20
-	_enter("%pD,%llu,%llx,%zx/%llx",
-	       file, file_inode(file)->i_ino, start, len,
-	       i_size_read(file_inode(file)));
+	for (i =3D 0; i < ARRAY_SIZE(occ->cached_from); i++) {
+		ret =3D cachefiles_inject_read_error();
+		if (ret =3D=3D 0)
+			ret =3D vfs_llseek(file, occ->query_from, SEEK_DATA);
+		if (IS_ERR_VALUE_LL(ret)) {
+			if (ret !=3D -ENXIO)
+				return ret;
+			occ->query_from =3D ULLONG_MAX;
+			goto done;
+		}
+		occ->cached_type[i] =3D FSCACHE_EXTENT_DATA;
+		occ->cached_from[i] =3D ret;
+		occ->query_from =3D ret;
+
+		ret =3D cachefiles_inject_read_error();
+		if (ret =3D=3D 0)
+			ret =3D vfs_llseek(file, occ->query_from, SEEK_HOLE);
+		if (IS_ERR_VALUE_LL(ret)) {
+			if (ret !=3D -ENXIO)
+				return ret;
+			occ->query_from =3D ULLONG_MAX;
+			goto done;
+		}
+		occ->cached_to[i] =3D ret;
+		occ->query_from =3D ret;
+		if (occ->query_from >=3D occ->query_to)
+			break;
+	}
=20
-	off =3D cachefiles_inject_read_error();
-	if (off =3D=3D 0)
-		off =3D vfs_llseek(file, start, SEEK_DATA);
-	if (off =3D=3D -ENXIO)
-		return -ENODATA; /* Beyond EOF */
-	if (off < 0 && off >=3D (loff_t)-MAX_ERRNO)
-		return -ENOBUFS; /* Error. */
-	if (round_up(off, granularity) >=3D start + len)
-		return -ENODATA; /* No data in range */
-
-	off2 =3D cachefiles_inject_read_error();
-	if (off2 =3D=3D 0)
-		off2 =3D vfs_llseek(file, off, SEEK_HOLE);
-	if (off2 =3D=3D -ENXIO)
-		return -ENODATA; /* Beyond EOF */
-	if (off2 < 0 && off2 >=3D (loff_t)-MAX_ERRNO)
-		return -ENOBUFS; /* Error. */
-
-	/* Round away partial blocks */
-	off =3D round_up(off, granularity);
-	off2 =3D round_down(off2, granularity);
-	if (off2 <=3D off)
-		return -ENODATA;
-
-	*_data_start =3D off;
-	if (off2 > start + len)
-		*_data_len =3D len;
-	else
-		*_data_len =3D off2 - off;
+done:
+	_debug("query[0] %llx-%llx", occ->cached_from[0], occ->cached_to[0]);
+	_debug("query[1] %llx-%llx", occ->cached_from[1], occ->cached_to[1]);
 	return 0;
 }
=20
@@ -489,18 +512,6 @@ cachefiles_do_prepare_read(struct netfs_cache_resource=
s *cres,
 	return ret;
 }
=20
-/*
- * Prepare a read operation, shortening it to a cached/uncached
- * boundary as appropriate.
- */
-static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subreq=
uest *subreq,
-						    unsigned long long i_size)
-{
-	return cachefiles_do_prepare_read(&subreq->rreq->cache_resources,
-					  subreq->start, &subreq->len, i_size,
-					  &subreq->flags, subreq->rreq->inode->i_ino);
-}
-
 /*
  * Prepare an on-demand read operation, shortening it to a cached/uncached
  * boundary as appropriate.
@@ -527,7 +538,7 @@ int __cachefiles_prepare_write(struct cachefiles_object=
 *object,
 	int ret;
=20
 	/* Round to DIO size */
-	start =3D round_down(*_start, PAGE_SIZE);
+	start =3D round_down(*_start, cache->bsize);
 	if (start !=3D *_start || *_len > upper_len) {
 		/* Probably asked to cache a streaming write written into the
 		 * pagecache when the cookie was temporarily out of service to
@@ -537,7 +548,7 @@ int __cachefiles_prepare_write(struct cachefiles_object=
 *object,
 		return -ENOBUFS;
 	}
=20
-	*_len =3D round_up(len, PAGE_SIZE);
+	*_len =3D round_up(len, cache->bsize);
=20
 	/* We need to work out whether there's sufficient disk space to perform
 	 * the write - but we can skip that check if we have space already
@@ -563,7 +574,7 @@ int __cachefiles_prepare_write(struct cachefiles_object=
 *object,
 	 * space, we need to see if it's fully allocated.  If it's not, we may
 	 * want to cull it.
 	 */
-	if (cachefiles_has_space(cache, 0, *_len / PAGE_SIZE,
+	if (cachefiles_has_space(cache, 0, *_len / cache->bsize,
 				 cachefiles_has_space_check) =3D=3D 0)
 		return 0; /* Enough space to simply overwrite the whole block */
=20
@@ -595,7 +606,7 @@ int __cachefiles_prepare_write(struct cachefiles_object=
 *object,
 	return ret;
=20
 check_space:
-	return cachefiles_has_space(cache, 0, *_len / PAGE_SIZE,
+	return cachefiles_has_space(cache, 0, *_len / cache->bsize,
 				    cachefiles_has_space_for_write);
 }
=20
@@ -658,9 +669,9 @@ static void cachefiles_issue_write(struct netfs_io_subr=
equest *subreq)
 	       wreq->debug_id, subreq->debug_index, start, start + len - 1);
=20
 	/* We need to start on the cache granularity boundary */
-	off =3D start & (CACHEFILES_DIO_BLOCK_SIZE - 1);
+	off =3D start & (cache->bsize - 1);
 	if (off) {
-		pre =3D CACHEFILES_DIO_BLOCK_SIZE - off;
+		pre =3D cache->bsize - off;
 		if (pre >=3D len) {
 			fscache_count_dio_misfit();
 			netfs_write_subrequest_terminated(subreq, len);
@@ -674,8 +685,8 @@ static void cachefiles_issue_write(struct netfs_io_subr=
equest *subreq)
=20
 	/* We also need to end on the cache granularity boundary */
 	if (start + len =3D=3D wreq->i_size) {
-		size_t part =3D len % CACHEFILES_DIO_BLOCK_SIZE;
-		size_t need =3D CACHEFILES_DIO_BLOCK_SIZE - part;
+		size_t part =3D len & (cache->bsize - 1);
+		size_t need =3D cache->bsize - part;
=20
 		if (part && stream->submit_extendable_to >=3D need) {
 			len +=3D need;
@@ -684,7 +695,7 @@ static void cachefiles_issue_write(struct netfs_io_subr=
equest *subreq)
 		}
 	}
=20
-	post =3D len & (CACHEFILES_DIO_BLOCK_SIZE - 1);
+	post =3D len & (cache->bsize - 1);
 	if (post) {
 		len -=3D post;
 		if (len =3D=3D 0) {
@@ -711,6 +722,161 @@ static void cachefiles_issue_write(struct netfs_io_su=
brequest *subreq)
 			 netfs_write_subrequest_terminated, subreq);
 }
=20
+/*
+ * Collect the result of buffered writeback to the cache.  This includes
+ * copying a read to the cache.  Netfslib collates the results, which might
+ * occur out of order, and delivers them to the cache so that it can updat=
e its
+ * content record.
+ *
+ * block_type is one of:
+ * - NETFS_CACHE_COLLECT_WRITE_DATA for a contiguous block of data
+ * - NETFS_CACHE_COLLECT_WRITE_GAP if a discontiguity was skipped
+ * - NETFS_CACHE_COLLECT_WRITE_CANCEL for a hole due to a failed/cancelled=
 write
+ *
+ * The writes we made are all rounded out at both sides to the nearest DIO
+ * block boundary, so if the final block contains the EOF in the middle of=
 it
+ * (rather than at the end), padding will have been written to the file.  =
The
+ * backing file's filesize will have been updated if the write extended the
+ * file; the filesize may still change due to outstanding subreqs.
+ *
+ * The metadata in the cache file xattr records the size of the object we =
have
+ * stored, but the cache file EOF only goes up to where we've cached data =
to
+ * and, furthermore, is rounded up to the nearest DIO block boundary.
+ */
+static void cachefiles_collect_write(struct netfs_io_request *wreq,
+				     unsigned long long start, size_t len,
+				     enum netfs_cache_collect block_type)
+{
+	struct netfs_cache_resources *cres =3D &wreq->cache_resources;
+	struct cachefiles_object *object =3D cachefiles_cres_object(cres);
+	struct cachefiles_cache *cache =3D object->volume->cache;
+	struct file *file =3D cachefiles_cres_file(cres);
+	struct inode *inode =3D file_inode(file);
+	unsigned long long read_limit;
+	unsigned long long old_size =3D cres->cache_i_size;
+	unsigned long long new_size =3D i_size_read(inode);
+	unsigned long long data_to =3D object->object_size;
+	unsigned long long end =3D start + len;
+	int ret;
+
+	_enter("%llx,%zx,%x", start, len, cache->bsize);
+
+	if (WARN_ON(old_size	& (cache->bsize - 1)) ||
+	    WARN_ON(new_size	& (cache->bsize - 1)) ||
+	    WARN_ON(start	& (cache->bsize - 1)) ||
+	    WARN_ON(len		& (cache->bsize - 1))) {
+		trace_cachefiles_io_error(object, inode, -EIO,
+					  cachefiles_trace_alignment_error);
+		cachefiles_remove_object_xattr(cache, object, file->f_path.dentry);
+		return;
+	}
+
+	/* If this is recording a gap, due to discontiguous writes or lack of
+	 * cache space, then a hole may have been introduced into the backing
+	 * file.  Treat it as a zero-length data block.
+	 */
+	if (block_type =3D=3D NETFS_CACHE_COLLECT_WRITE_GAP ||
+	    block_type =3D=3D NETFS_CACHE_COLLECT_WRITE_CANCEL) {
+		start =3D end;
+		len =3D 0;
+	}
+
+	/* Zeroth case: Single monolithic files are handled specially.
+	 */
+	if (wreq->origin =3D=3D NETFS_WRITEBACK_SINGLE) {
+		object->content_info =3D CACHEFILES_CONTENT_SINGLE;
+		goto update_sizes;
+	}
+
+	/* First case: The backing file was empty. */
+	if (old_size =3D=3D 0) {
+		if (start =3D=3D 0)
+			object->content_info =3D CACHEFILES_CONTENT_ALL;
+		else
+			object->content_info =3D CACHEFILES_CONTENT_BACKFS_MAP;
+		goto update_sizes;
+	}
+
+	/* Second case: The backing file is entirely within the old object size
+	 * and thus there can be no partial tail block to deal with in the
+	 * cache file.
+	 */
+	if (old_size <=3D data_to) {
+		if (start > old_size)
+			goto discontiguous;
+		goto update_sizes;
+	}
+
+	/* Third case: The write happened entirely within the bounds of the
+	 * current cache file's size.
+	 */
+	if (end <=3D old_size)
+		goto update_sizes;
+
+	/* Fourth case: The write overwrote the partial tail block and extended
+	 * the file.  We only need to update the object size because netfslib
+	 * rounds out/pads cache writes to whole disk blocks.
+	 */
+	if (start < old_size)
+		goto update_sizes;
+
+	/* Fifth case: The write started from the end of the whole tail block
+	 * and extended the file.  Just extend our notion of the filesize.
+	 */
+	if (start =3D=3D old_size && old_size =3D=3D data_to)
+		goto update_sizes;
+
+	/* Sixth case: The write continued on from the partial tail block and
+	 * extended the file.  Need to clear the gap.
+	 */
+	if (start =3D=3D old_size && old_size > data_to)
+		goto clear_gap;
+
+discontiguous:
+	/* Seventh case: The write was beyond the EOF on the cache file, so now
+	 * there's a hole in the file and we can no longer say in the metadata
+	 * that we can assume we have it all.  We may also need to clear the
+	 * end of the partial tail block.
+	 */
+	/* TODO: For the moment, we will have to use SEEK_HOLE/SEEK_DATA. */
+	if (object->content_info !=3D CACHEFILES_CONTENT_BACKFS_MAP) {
+		object->content_info =3D CACHEFILES_CONTENT_BACKFS_MAP;
+		trace_cachefiles_coherency(object, inode->i_ino, data_to,
+					   be64_to_cpup((__be64 *)object->cookie->inline_aux),
+					   CACHEFILES_CONTENT_BACKFS_MAP,
+					   cachefiles_coherency_discontiguous);
+	}
+
+clear_gap:
+	/* We need to clear any partial padding that got jumped over.  It
+	 * *should* be all zeros, but shared-writable mmap exists...
+	 */
+	if (old_size > data_to) {
+		trace_cachefiles_trunc(object, inode, data_to, old_size,
+				       cachefiles_trunc_clear_padding);
+		ret =3D cachefiles_inject_write_error();
+		if (ret =3D=3D 0)
+			ret =3D vfs_fallocate(file, FALLOC_FL_ZERO_RANGE,
+					    data_to, old_size - data_to);
+		if (ret < 0) {
+			trace_cachefiles_io_error(object, inode, ret,
+						  cachefiles_trace_fallocate_error);
+			cachefiles_io_error_obj(object, "fallocate zero pad failed %d", ret);
+			cachefiles_remove_object_xattr(cache, object, file->f_path.dentry);
+			return;
+		}
+	}
+
+update_sizes:
+	read_limit =3D umax(old_size, end);
+	cres->cache_i_size =3D read_limit;
+	object->object_size =3D umin(read_limit, wreq->i_size);
+
+	/* Raise the limit at which reads can access the file. */
+	/* Update read_limit after content_info */
+	atomic64_set_release(&object->read_limit, read_limit);
+}
+
 /*
  * Clean up an operation.
  */
@@ -728,11 +894,11 @@ static const struct netfs_cache_ops cachefiles_netfs_=
cache_ops =3D {
 	.read			=3D cachefiles_read,
 	.write			=3D cachefiles_write,
 	.issue_write		=3D cachefiles_issue_write,
-	.prepare_read		=3D cachefiles_prepare_read,
 	.prepare_write		=3D cachefiles_prepare_write,
 	.prepare_write_subreq	=3D cachefiles_prepare_write_subreq,
 	.prepare_ondemand_read	=3D cachefiles_prepare_ondemand_read,
 	.query_occupancy	=3D cachefiles_query_occupancy,
+	.collect_write		=3D cachefiles_collect_write,
 };
=20
 /*
@@ -742,14 +908,18 @@ bool cachefiles_begin_operation(struct netfs_cache_re=
sources *cres,
 				enum fscache_want_state want_state)
 {
 	struct cachefiles_object *object =3D cachefiles_cres_object(cres);
+	struct file *file;
=20
 	if (!cachefiles_cres_file(cres)) {
 		cres->ops =3D &cachefiles_netfs_cache_ops;
 		if (object->file) {
 			spin_lock(&object->lock);
-			if (!cres->cache_priv2 && object->file)
-				cres->cache_priv2 =3D get_file(object->file);
+			file =3D object->file;
+			if (!cres->cache_priv2 && file)
+				cres->cache_priv2 =3D get_file(file);
 			spin_unlock(&object->lock);
+			cres->cache_i_size =3D i_size_read(file_inode(file));
+			cres->dio_size =3D object->volume->cache->bsize;
 		}
 	}
=20
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 2937db690b40..71d249344c8a 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -415,7 +415,6 @@ struct file *cachefiles_create_tmpfile(struct cachefile=
s_object *object)
 	struct dentry *fan =3D volume->fanout[(u8)object->cookie->key_hash];
 	struct file *file;
 	const struct path parentpath =3D { .mnt =3D cache->mnt, .dentry =3D fan };
-	uint64_t ni_size;
 	long ret;
=20
=20
@@ -447,23 +446,6 @@ struct file *cachefiles_create_tmpfile(struct cachefil=
es_object *object)
 	if (ret < 0)
 		goto err_unuse;
=20
-	ni_size =3D object->cookie->object_size;
-	ni_size =3D round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE);
-
-	if (ni_size > 0) {
-		trace_cachefiles_trunc(object, file_inode(file), 0, ni_size,
-				       cachefiles_trunc_expand_tmpfile);
-		ret =3D cachefiles_inject_write_error();
-		if (ret =3D=3D 0)
-			ret =3D vfs_truncate(&file->f_path, ni_size);
-		if (ret < 0) {
-			trace_cachefiles_vfs_error(
-				object, file_inode(file), ret,
-				cachefiles_trace_trunc_error);
-			goto err_unuse;
-		}
-	}
-
 	ret =3D -EINVAL;
 	if (unlikely(!file->f_op->read_iter) ||
 	    unlikely(!file->f_op->write_iter)) {
@@ -473,6 +455,7 @@ struct file *cachefiles_create_tmpfile(struct cachefile=
s_object *object)
 	}
 out:
 	cachefiles_end_secure(cache, saved_cred);
+	object->content_info =3D CACHEFILES_CONTENT_ALL;
 	return file;
=20
 err_unuse:
diff --git a/fs/cachefiles/xattr.c b/fs/cachefiles/xattr.c
index f8ae78b3f7b6..25f2a906f984 100644
--- a/fs/cachefiles/xattr.c
+++ b/fs/cachefiles/xattr.c
@@ -54,7 +54,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object =
*object)
 	if (!buf)
 		return -ENOMEM;
=20
-	buf->object_size	=3D cpu_to_be64(object->cookie->object_size);
+	buf->object_size	=3D cpu_to_be64(object->object_size);
 	buf->zero_point		=3D 0;
 	buf->type		=3D CACHEFILES_COOKIE_TYPE_DATA;
 	buf->content		=3D object->content_info;
@@ -77,6 +77,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object =
*object)
 		trace_cachefiles_vfs_error(object, file_inode(file), ret,
 					   cachefiles_trace_setxattr_error);
 		trace_cachefiles_coherency(object, file_inode(file)->i_ino,
+					   object->object_size,
 					   be64_to_cpup((__be64 *)buf->data),
 					   buf->content,
 					   cachefiles_coherency_set_fail);
@@ -86,6 +87,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object =
*object)
 				"Failed to set xattr with error %d", ret);
 	} else {
 		trace_cachefiles_coherency(object, file_inode(file)->i_ino,
+					   object->object_size,
 					   be64_to_cpup((__be64 *)buf->data),
 					   buf->content,
 					   cachefiles_coherency_set_ok);
@@ -103,9 +105,11 @@ int cachefiles_check_auxdata(struct cachefiles_object =
*object, struct file *file
 {
 	struct cachefiles_xattr *buf;
 	struct dentry *dentry =3D file->f_path.dentry;
+	struct inode *inode =3D file_inode(file);
 	unsigned int len =3D object->cookie->aux_len, tlen;
 	const void *p =3D fscache_get_aux(object->cookie);
 	enum cachefiles_coherency_trace why;
+	unsigned long long obj_size;
 	ssize_t xlen;
 	int ret =3D -ESTALE;
=20
@@ -120,36 +124,41 @@ int cachefiles_check_auxdata(struct cachefiles_object=
 *object, struct file *file
 	if (xlen !=3D tlen) {
 		if (xlen < 0) {
 			ret =3D xlen;
-			trace_cachefiles_vfs_error(object, file_inode(file), xlen,
+			trace_cachefiles_vfs_error(object, inode, xlen,
 						   cachefiles_trace_getxattr_error);
 		}
 		if (xlen =3D=3D -EIO)
 			cachefiles_io_error_obj(
 				object,
 				"Failed to read aux with error %zd", xlen);
-		why =3D cachefiles_coherency_check_xattr;
+		trace_cachefiles_coherency(object, inode->i_ino, 0, 0, 0,
+					   cachefiles_coherency_check_xattr);
 		goto out;
 	}
=20
+	obj_size =3D be64_to_cpu(buf->object_size);
 	if (buf->type !=3D CACHEFILES_COOKIE_TYPE_DATA) {
 		why =3D cachefiles_coherency_check_type;
 	} else if (memcmp(buf->data, p, len) !=3D 0) {
 		why =3D cachefiles_coherency_check_aux;
-	} else if (be64_to_cpu(buf->object_size) !=3D object->cookie->object_size=
) {
+	} else if (obj_size !=3D object->cookie->object_size) {
 		why =3D cachefiles_coherency_check_objsize;
 	} else if (buf->content =3D=3D CACHEFILES_CONTENT_DIRTY) {
 		// TODO: Begin conflict resolution
 		pr_warn("Dirty object in cache\n");
 		why =3D cachefiles_coherency_check_dirty;
 	} else {
+		object->content_info =3D buf->content;
+		object->object_size =3D obj_size;
+		atomic64_set(&object->read_limit, i_size_read(inode));
 		why =3D cachefiles_coherency_check_ok;
 		ret =3D 0;
 	}
=20
-out:
-	trace_cachefiles_coherency(object, file_inode(file)->i_ino,
+	trace_cachefiles_coherency(object, inode->i_ino, obj_size,
 				   be64_to_cpup((__be64 *)buf->data),
 				   buf->content, why);
+out:
 	kfree(buf);
 	return ret;
 }
@@ -163,6 +172,9 @@ int cachefiles_remove_object_xattr(struct cachefiles_ca=
che *cache,
 {
 	int ret;
=20
+	trace_cachefiles_coherency(object, d_inode(dentry)->i_ino, 0, 0, 0,
+				   cachefiles_coherency_remove);
+
 	ret =3D cachefiles_inject_remove_error();
 	if (ret =3D=3D 0) {
 		ret =3D mnt_want_write(cache->mnt);
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 76d0f6a29aba..8f96bc0f6c03 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -127,21 +127,6 @@ static ssize_t netfs_prepare_read_iterator(struct netf=
s_io_subrequest *subreq,
 	return subreq->len;
 }
=20
-static enum netfs_io_source netfs_cache_prepare_read(struct netfs_io_reque=
st *rreq,
-						     struct netfs_io_subrequest *subreq,
-						     loff_t i_size)
-{
-	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
-	enum netfs_io_source source;
-
-	if (!cres->ops)
-		return NETFS_DOWNLOAD_FROM_SERVER;
-	source =3D cres->ops->prepare_read(subreq, i_size);
-	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-	return source;
-
-}
-
 /*
  * Issue a read against the cache.
  * - Eats the caller's ref on subreq.
@@ -156,6 +141,19 @@ static void netfs_read_cache_to_pagecache(struct netfs=
_io_request *rreq,
 			netfs_cache_read_terminated, subreq);
 }
=20
+int netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_o=
ccupancy *occ)
+{
+	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
+
+	occ->granularity =3D PAGE_SIZE;
+	if (occ->query_from >=3D occ->query_to)
+		return 0;
+	if (!cres->ops)
+		return 0;
+	occ->query_from =3D round_up(occ->query_from, occ->granularity);
+	return cres->ops->query_occupancy(cres, occ);
+}
+
 void netfs_queue_read(struct netfs_io_request *rreq,
 		      struct netfs_io_subrequest *subreq)
 {
@@ -209,15 +207,54 @@ static void netfs_issue_read(struct netfs_io_request =
*rreq,
 static void netfs_read_to_pagecache(struct netfs_io_request *rreq,
 				    struct readahead_control *ractl)
 {
+	struct fscache_occupancy _occ =3D {
+		.query_from	=3D rreq->start,
+		.query_to	=3D rreq->start + rreq->len,
+		.cached_from[0]	=3D 0,
+		.cached_to[0]	=3D 0,
+		.cached_from[1]	=3D ULLONG_MAX,
+		.cached_to[1]	=3D ULLONG_MAX,
+	};
+	struct fscache_occupancy *occ =3D &_occ;
 	unsigned long long start =3D rreq->start;
 	ssize_t size =3D rreq->len;
 	int ret =3D 0;
=20
 	do {
+		int (*prepare_read)(struct netfs_io_subrequest *subreq) =3D NULL;
 		struct netfs_io_subrequest *subreq;
-		enum netfs_io_source source =3D NETFS_SOURCE_UNKNOWN;
+		unsigned long long hole_to, cache_to;
 		ssize_t slice;
=20
+		/* If we don't have any, find out the next couple of data
+		 * extents from the cache, containing or following the
+		 * specified start offset.  Holes have to be fetched from the
+		 * server; data regions from the cache.
+		 */
+		hole_to =3D occ->cached_from[0];
+		cache_to =3D occ->cached_to[0];
+		if (start >=3D cache_to) {
+			/* Extent exhausted; shuffle down. */
+			int i;
+
+			for (i =3D 0; i < ARRAY_SIZE(occ->cached_from) - 1; i++) {
+				occ->cached_from[i] =3D occ->cached_from[i + 1];
+				occ->cached_to[i]   =3D occ->cached_to[i + 1];
+				occ->cached_type[i] =3D occ->cached_type[i + 1];
+			}
+			occ->cached_from[i] =3D ULLONG_MAX;
+			occ->cached_to[i]   =3D ULLONG_MAX;
+
+			if (occ->cached_from[0] !=3D ULLONG_MAX)
+				continue;
+
+			/* Get new extents */
+			ret =3D netfs_read_query_cache(rreq, occ);
+			if (ret < 0)
+				break;
+			continue;
+		}
+
 		subreq =3D netfs_alloc_subrequest(rreq);
 		if (!subreq) {
 			ret =3D -ENOMEM;
@@ -229,63 +266,75 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq,
=20
 		netfs_queue_read(rreq, subreq);
=20
-		source =3D netfs_cache_prepare_read(rreq, subreq, rreq->i_size);
-		subreq->source =3D source;
-		if (source =3D=3D NETFS_DOWNLOAD_FROM_SERVER) {
-			unsigned long long zero_point =3D netfs_read_zero_point(rreq->inode);
-			unsigned long long zp =3D umin(zero_point, rreq->i_size);
-			size_t len =3D subreq->len;
-
-			if (unlikely(rreq->origin =3D=3D NETFS_READ_SINGLE))
-				zp =3D rreq->i_size;
-			if (subreq->start >=3D zp) {
-				subreq->source =3D source =3D NETFS_FILL_WITH_ZEROES;
-				goto fill_with_zeroes;
+		unsigned long long zero_point =3D netfs_read_zero_point(rreq->inode);
+		unsigned long long zlimit =3D umin(zero_point, rreq->i_size);
+
+		_debug("rsub %llx %llx-%llx", subreq->start, hole_to, cache_to);
+
+		if (start >=3D hole_to && start < cache_to) {
+			/* Overlap with a cached region, where the cache may
+			 * record a block of zeroes.
+			 */
+			_debug("cached s=3D%llx c=3D%llx l=3D%zx", start, cache_to, size);
+			subreq->len =3D umin(cache_to - start, size);
+			subreq->len =3D round_up(subreq->len, occ->granularity);
+			if (occ->cached_type[0] =3D=3D FSCACHE_EXTENT_ZERO) {
+				subreq->source =3D NETFS_FILL_WITH_ZEROES;
+				netfs_stat(&netfs_n_rh_zero);
+			} else {
+				subreq->source =3D NETFS_READ_FROM_CACHE;
+				prepare_read =3D rreq->cache_resources.ops->prepare_read;
 			}
=20
-			if (len > zp - subreq->start)
-				len =3D zp - subreq->start;
-			if (len =3D=3D 0) {
-				pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D=
%llx",
-				       rreq->debug_id, subreq->debug_index,
-				       subreq->len, size,
-				       subreq->start, zero_point, rreq->i_size);
-				netfs_cancel_read(subreq, ret);
-				break;
-			}
-			subreq->len =3D len;
-
-			netfs_stat(&netfs_n_rh_download);
-			if (rreq->netfs_ops->prepare_read) {
-				ret =3D rreq->netfs_ops->prepare_read(subreq);
-				if (ret < 0) {
-					netfs_cancel_read(subreq, ret);
-					break;
-				}
-				trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-			}
-			goto issue;
-		}
+			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
=20
-	fill_with_zeroes:
-		if (source =3D=3D NETFS_FILL_WITH_ZEROES) {
+		} else if (subreq->start >=3D zlimit && size > 0) {
+			/* If this range lies beyond the zero-point or the EOF,
+			 * it can just be cleared locally.
+			 */
+			_debug("zero %llx-%llx", start, start + size);
+			subreq->len =3D size;
 			subreq->source =3D NETFS_FILL_WITH_ZEROES;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+			if (rreq->cache_resources.ops)
+				__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
 			netfs_stat(&netfs_n_rh_zero);
-			goto issue;
+		} else {
+			/* Read a cache hole from the server.  If any part of
+			 * this range lies beyond the zero-point or the EOF,
+			 * that part can just be cleared locally.
+			 */
+			unsigned long long limit =3D min3(zlimit, start + size, hole_to);
+
+			_debug("limit %llx %llx", rreq->i_size, zero_point);
+			_debug("download %llx-%llx", start, start + size);
+			subreq->len =3D umin(limit - subreq->start, ULONG_MAX);
+			subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER;
+			if (rreq->cache_resources.ops)
+				__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
+			netfs_stat(&netfs_n_rh_download);
 		}
=20
-		if (source =3D=3D NETFS_READ_FROM_CACHE) {
-			trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-			goto issue;
+		if (size =3D=3D 0) {
+			pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D%=
llx",
+			       rreq->debug_id, subreq->debug_index,
+			       subreq->len, size,
+			       subreq->start, zero_point, rreq->i_size);
+			netfs_cancel_read(subreq, ret);
+			break;
 		}
=20
-		pr_err("Unexpected read source %u\n", source);
-		WARN_ON_ONCE(1);
-		netfs_cancel_read(subreq, ret);
-		break;
+		rreq->io_streams[0].sreq_max_len =3D MAX_RW_COUNT;
+		rreq->io_streams[0].sreq_max_segs =3D INT_MAX;
+
+		if (prepare_read) {
+			ret =3D prepare_read(subreq);
+			if (ret < 0) {
+				netfs_cancel_read(subreq, ret);
+				break;
+			}
+			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
+		}
=20
-	issue:
 		slice =3D netfs_prepare_read_iterator(subreq, ractl);
 		if (slice < 0) {
 			ret =3D slice;
@@ -299,6 +348,7 @@ static void netfs_read_to_pagecache(struct netfs_io_req=
uest *rreq,
 			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		}
=20
+		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 		netfs_issue_read(rreq, subreq);
=20
 		if (test_bit(NETFS_RREQ_PAUSE, &rreq->flags))
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index 6bde3320bcec..fb3120fb24db 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -54,9 +54,6 @@ void netfs_update_i_size(struct netfs_inode *ctx, struct =
inode *inode,
 	i_size =3D i_size_read(inode);
 	if (end > i_size) {
 		i_size_write(inode, end);
-#if IS_ENABLED(CONFIG_FSCACHE)
-		fscache_update_cookie(ctx->cache, NULL, &end);
-#endif
=20
 		gap =3D SECTOR_SIZE - (i_size & (SECTOR_SIZE - 1));
 		if (copied > gap) {
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 645996ecfc80..d82f2116f8e0 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -23,6 +23,8 @@
 /*
  * buffered_read.c
  */
+int netfs_read_query_cache(struct netfs_io_request *rreq,
+			   struct fscache_occupancy *occ);
 void netfs_queue_read(struct netfs_io_request *rreq,
 		      struct netfs_io_subrequest *subreq);
 void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error);
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index f59a70f3a086..bf45b1f5f3e0 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -267,6 +267,7 @@ void netfs_retry_reads(struct netfs_io_request *rreq)
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
=20
 	netfs_stat(&netfs_n_rh_retry_read_req);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_retry_begin);
=20
 	/* Wait for all outstanding I/O to quiesce before performing retries as
 	 * we may need to renegotiate the I/O sizes.
@@ -277,6 +278,7 @@ void netfs_retry_reads(struct netfs_io_request *rreq)
=20
 	trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
 	netfs_retry_read_subrequests(rreq);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_retry_end);
 }
=20
 /*
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index 8833550d2eb6..af16c91947b5 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -58,20 +58,6 @@ static int netfs_single_begin_cache_read(struct netfs_io=
_request *rreq, struct n
 	return fscache_begin_read_operation(&rreq->cache_resources, netfs_i_cooki=
e(ctx));
 }
=20
-static void netfs_single_cache_prepare_read(struct netfs_io_request *rreq,
-					    struct netfs_io_subrequest *subreq)
-{
-	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
-
-	if (!cres->ops) {
-		subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER;
-		return;
-	}
-	subreq->source =3D cres->ops->prepare_read(subreq, rreq->i_size);
-	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-
-}
-
 static void netfs_single_read_cache(struct netfs_io_request *rreq,
 				    struct netfs_io_subrequest *subreq)
 {
@@ -89,6 +75,14 @@ static void netfs_single_read_cache(struct netfs_io_requ=
est *rreq,
  */
 static int netfs_single_dispatch_read(struct netfs_io_request *rreq)
 {
+	struct fscache_occupancy occ =3D {
+		.query_from	=3D 0,
+		.query_to	=3D rreq->len,
+		.cached_from[0]	=3D ULLONG_MAX,
+		.cached_to[0]	=3D ULLONG_MAX,
+		.cached_from[1]	=3D ULLONG_MAX,
+		.cached_to[1]	=3D ULLONG_MAX,
+	};
 	struct netfs_io_subrequest *subreq;
 	int ret =3D 0;
=20
@@ -96,14 +90,21 @@ static int netfs_single_dispatch_read(struct netfs_io_r=
equest *rreq)
 	if (!subreq)
 		return -ENOMEM;
=20
-	subreq->source	=3D NETFS_SOURCE_UNKNOWN;
+	subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
 	subreq->start	=3D 0;
 	subreq->len	=3D rreq->len;
 	subreq->io_iter	=3D rreq->buffer.iter;
=20
 	netfs_queue_read(rreq, subreq);
=20
-	netfs_single_cache_prepare_read(rreq, subreq);
+	/* Try to use the cache if the cache content matches the size of the
+	 * remote file.
+	 */
+	netfs_read_query_cache(rreq, &occ);
+	if (occ.cached_from[0] =3D=3D 0 &&
+	    occ.cached_to[0] =3D=3D rreq->len)
+		subreq->source =3D NETFS_READ_FROM_CACHE;
+
 	switch (subreq->source) {
 	case NETFS_DOWNLOAD_FROM_SERVER:
 		netfs_stat(&netfs_n_rh_download);
@@ -119,6 +120,12 @@ static int netfs_single_dispatch_read(struct netfs_io_=
request *rreq)
 		rreq->submitted +=3D subreq->len;
 		break;
 	case NETFS_READ_FROM_CACHE:
+		if (rreq->cache_resources.ops->prepare_read) {
+			ret =3D rreq->cache_resources.ops->prepare_read(subreq);
+			if (ret < 0)
+				goto cancel;
+		}
+
 		smp_wmb(); /* Write lists before ALL_QUEUED. */
 		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index 24fc2bb2f8a4..9e837cf0eb8f 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -188,6 +188,26 @@ static void netfs_writeback_unlock_folios(struct netfs=
_io_request *wreq,
 	wreq->buffer.first_tail_slot =3D slot;
 }
=20
+/*
+ * Collect cache results.
+ */
+static void netfs_cache_collect(struct netfs_io_request *wreq,
+				struct netfs_io_stream *stream,
+				enum netfs_cache_collect block_type)
+{
+	struct netfs_cache_resources *cres =3D &wreq->cache_resources;
+
+	if (stream->source !=3D NETFS_WRITE_TO_CACHE ||
+	    wreq->cache_coll_to >=3D stream->collected_to)
+		return;
+
+	if (cres->ops && cres->ops->collect_write)
+		cres->ops->collect_write(wreq, wreq->cache_coll_to,
+					 stream->collected_to - wreq->cache_coll_to,
+					 block_type);
+	wreq->cache_coll_to =3D stream->collected_to;
+}
+
 /*
  * Collect and assess the results of various write subrequests.  We may ne=
ed to
  * retry some of the results - or even do an RMW cycle for content crypto.
@@ -236,13 +256,16 @@ static void netfs_collect_write_results(struct netfs_=
io_request *wreq)
 		/* Read first subreq pointer before IN_PROGRESS flag. */
=20
 		while (front) {
+			bool cancelled;
 			trace_netfs_collect_sreq(wreq, front);
 			//_debug("sreq [%x] %llx %zx/%zx",
 			//       front->debug_index, front->start, front->transferred, front->l=
en);
=20
 			if (stream->collected_to < front->start) {
 				trace_netfs_collect_gap(wreq, stream, issued_to, 'F');
+				netfs_cache_collect(wreq, stream, NETFS_CACHE_COLLECT_WRITE_DATA);
 				stream->collected_to =3D front->start;
+				netfs_cache_collect(wreq, stream, NETFS_CACHE_COLLECT_WRITE_GAP);
 			}
=20
 			/* Stall if the front is still undergoing I/O. */
@@ -250,7 +273,6 @@ static void netfs_collect_write_results(struct netfs_io=
_request *wreq)
 				notes |=3D HIT_PENDING;
 				break;
 			}
-			smp_rmb(); /* Read counters after I-P flag. */
=20
 			if (stream->failed) {
 				stream->collected_to =3D front->start + front->len;
@@ -263,15 +285,45 @@ static void netfs_collect_write_results(struct netfs_=
io_request *wreq)
 				stream->transferred_valid =3D true;
 				notes |=3D MADE_PROGRESS;
 			}
-			if (test_bit(NETFS_SREQ_FAILED, &front->flags)) {
-				stream->failed =3D true;
-				stream->error =3D front->error;
-				if (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER)
-					mapping_set_error(wreq->mapping, front->error);
-				notes |=3D NEED_REASSESS | SAW_FAILURE;
+
+			/* Handle failed or cancelled subreqs.  Cache write
+			 * failures are handled differently to upload
+			 * failures.  They aren't fatal, provided we're not
+			 * doing disconnected operation, and so we can kind of
+			 * treat them as if they had succeeded - except that
+			 * we need to log any holes they cause.
+			 */
+			switch (stream->source) {
+			case NETFS_UPLOAD_TO_SERVER:
+				if (test_bit(NETFS_SREQ_FAILED, &front->flags)) {
+					if (!stream->failed) {
+						stream->failed =3D true;
+						stream->error =3D front->error;
+						mapping_set_error(wreq->mapping, front->error);
+						break;
+					}
+					notes |=3D NEED_REASSESS | SAW_FAILURE;
+				}
+				break;
+
+			case NETFS_WRITE_TO_CACHE:
+				cancelled =3D test_bit(NETFS_SREQ_CANCELLED, &front->flags);
+				if (cancelled !=3D stream->cancelled &&
+				    stream->collected_to < front->start) {
+					trace_netfs_rreq(wreq, netfs_rreq_trace_cache_fail_collect);
+					netfs_cache_collect(wreq, stream,
+							    cancelled ?
+							    NETFS_CACHE_COLLECT_WRITE_CANCEL :
+							    NETFS_CACHE_COLLECT_WRITE_DATA);
+					stream->cancelled =3D !stream->cancelled;
+				}
+				break;
+
+			default:
+				WARN_ON(1);
 				break;
 			}
-			if (front->transferred < front->len) {
+			if (test_bit(NETFS_SREQ_NEED_RETRY, &front->flags)) {
 				stream->need_retry =3D true;
 				notes |=3D NEED_RETRY | MADE_PROGRESS;
 				break;
@@ -360,6 +412,7 @@ static void netfs_collect_write_results(struct netfs_io=
_request *wreq)
  */
 bool netfs_write_collection(struct netfs_io_request *wreq)
 {
+	struct netfs_io_stream *cstream =3D &wreq->io_streams[1];
 	struct netfs_inode *ictx =3D netfs_inode(wreq->inode);
 	size_t transferred;
 	bool transferred_valid =3D false;
@@ -395,13 +448,22 @@ bool netfs_write_collection(struct netfs_io_request *=
wreq)
 		wreq->transferred =3D transferred;
 	trace_netfs_rreq(wreq, netfs_rreq_trace_write_done);
=20
-	if (wreq->io_streams[1].active &&
-	    wreq->io_streams[1].failed &&
-	    ictx->ops->invalidate_cache) {
-		/* Cache write failure doesn't prevent writeback completion
-		 * unless we're in disconnected mode.
-		 */
-		ictx->ops->invalidate_cache(wreq);
+	if (cstream->active) {
+		if (test_bit(NETFS_RREQ_CACHE_ERROR, &wreq->flags)) {
+			if (ictx->ops->invalidate_cache) {
+				/* Cache write failure doesn't prevent
+				 * writeback completion unless we're in
+				 * disconnected mode.
+				 */
+				trace_netfs_rreq(wreq, netfs_rreq_trace_inval_cache);
+				ictx->ops->invalidate_cache(wreq);
+			}
+		} else if (!cstream->failed) {
+			netfs_cache_collect(wreq, cstream,
+					    cstream->cancelled ?
+					    NETFS_CACHE_COLLECT_WRITE_CANCEL :
+					    NETFS_CACHE_COLLECT_WRITE_DATA);
+		}
 	}
=20
 	_debug("finished");
@@ -476,24 +538,51 @@ void netfs_write_subrequest_terminated(void *_op, ssi=
ze_t transferred_or_error)
=20
 	if (IS_ERR_VALUE(transferred_or_error)) {
 		subreq->error =3D transferred_or_error;
-		/* if need retry is set, error should not matter */
-		if (!test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-			set_bit(NETFS_SREQ_FAILED, &subreq->flags);
-			trace_netfs_failure(wreq, subreq, transferred_or_error, netfs_fail_writ=
e);
-		}
=20
 		switch (subreq->source) {
 		case NETFS_WRITE_TO_CACHE:
+			/* We don't mark a cache-write subreq as failed.
+			 * Instead we tell the issuer to produce dummy subreqs
+			 * and make a note if we need to invalidate the
+			 * cache at the end.  We also don't pause the loop that
+			 * grabs pages and launches upload subreqs.
+			 *
+			 * Note that we need to distinguish between -ENOBUFS
+			 * (no space available in the cache) and other errors.
+			 * In the former case, we can keep the data we have,
+			 * though we might have to change the way the on-disk
+			 * data is tracked.
+			 */
 			netfs_stat(&netfs_n_wh_write_failed);
+			if (test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
+				break;
+
+			trace_netfs_failure(wreq, subreq, transferred_or_error, netfs_fail_writ=
e);
+			__set_bit(NETFS_SREQ_CANCELLED, &subreq->flags);
+			set_bit(NETFS_RREQ_CACHE_STOP, &wreq->flags);
+			if (transferred_or_error =3D=3D -ENOBUFS)
+				trace_netfs_rreq(wreq, netfs_rreq_trace_cache_no_space);
+			else if (!test_and_set_bit(NETFS_RREQ_CACHE_ERROR, &wreq->flags))
+				trace_netfs_rreq(wreq, netfs_rreq_trace_cache_failed);
+			subreq->transferred =3D subreq->len;
 			break;
+
 		case NETFS_UPLOAD_TO_SERVER:
+			/* If need_retry is set, error should not matter */
+			if (!test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
+				set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+				trace_netfs_failure(wreq, subreq, transferred_or_error,
+						    netfs_fail_upload);
+			}
+
+			set_bit(NETFS_RREQ_PAUSE, &wreq->flags);
+			trace_netfs_rreq(wreq, netfs_rreq_trace_set_pause);
 			netfs_stat(&netfs_n_wh_upload_failed);
 			break;
+
 		default:
 			break;
 		}
-		trace_netfs_rreq(wreq, netfs_rreq_trace_set_pause);
-		set_bit(NETFS_RREQ_PAUSE, &wreq->flags);
 	} else {
 		if (WARN(transferred_or_error > subreq->len - subreq->transferred,
 			 "Subreq excess write: R=3D%x[%x] %zd > %zu - %zu",
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index c03c7cc45e47..7f38b6676002 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -112,6 +112,8 @@ struct netfs_io_request *netfs_create_write_req(struct =
address_space *mapping,
 		goto nomem;
=20
 	wreq->cleaned_to =3D wreq->start;
+	if (wreq->cache_resources.dio_size > 1)
+		wreq->cache_coll_to =3D round_down(wreq->start, wreq->cache_resources.di=
o_size);
=20
 	wreq->io_streams[0].stream_nr		=3D 0;
 	wreq->io_streams[0].source		=3D NETFS_UPLOAD_TO_SERVER;
@@ -231,6 +233,21 @@ static void netfs_do_issue_write(struct netfs_io_strea=
m *stream,
=20
 	_enter("R=3D%x[%x],%zx", wreq->debug_id, subreq->debug_index, subreq->len=
);
=20
+	if (stream->source =3D=3D NETFS_WRITE_TO_CACHE &&
+	    unlikely(test_bit(NETFS_RREQ_CACHE_STOP, &wreq->flags))) {
+		size_t dio_size =3D wreq->cache_resources.dio_size;
+		size_t len, disp;
+
+		disp =3D subreq->start & (dio_size - 1);
+		len =3D round_up(subreq->len + disp, dio_size);
+
+		subreq->start -=3D disp;
+		subreq->len =3D len;
+
+		__set_bit(NETFS_SREQ_CANCELLED, &subreq->flags);
+		return netfs_write_subrequest_terminated(subreq, subreq->len);
+	}
+
 	if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
 		return netfs_write_subrequest_terminated(subreq, subreq->error);
=20
@@ -264,6 +281,7 @@ void netfs_issue_write(struct netfs_io_request *wreq,
=20
 	if (!subreq)
 		return;
+
 	stream->construct =3D NULL;
 	subreq->io_iter.count =3D subreq->len;
 	netfs_do_issue_write(stream, subreq);
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 32735abfa03f..32e1058bf252 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -206,6 +206,7 @@ void netfs_retry_writes(struct netfs_io_request *wreq)
 	int s;
=20
 	netfs_stat(&netfs_n_wh_retry_write_req);
+	trace_netfs_rreq(wreq, netfs_rreq_trace_retry_begin);
=20
 	/* Wait for all outstanding I/O to quiesce before performing retries as
 	 * we may need to renegotiate the I/O sizes.
@@ -230,4 +231,6 @@ void netfs_retry_writes(struct netfs_io_request *wreq)
 			netfs_retry_write_stream(wreq, stream);
 		}
 	}
+
+	trace_netfs_rreq(wreq, netfs_rreq_trace_retry_end);
 }
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 58fdb9605425..850d20241075 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -147,6 +147,23 @@ struct fscache_cookie {
 	};
 };
=20
+enum fscache_extent_type {
+	FSCACHE_EXTENT_DATA,
+	FSCACHE_EXTENT_ZERO,
+} __mode(byte);
+
+/*
+ * Cache occupancy information.
+ */
+struct fscache_occupancy {
+	unsigned long long	query_from;	/* Point to query from */
+	unsigned long long	query_to;	/* Point to query to */
+	unsigned long long	cached_from[2];	/* Point at which cache extents start =
*/
+	unsigned long long	cached_to[2];	/* Point at which cache extents end */
+	unsigned int		granularity;	/* Granularity desired */
+	enum fscache_extent_type cached_type[2];	/* Type of cache extent */
+};
+
 /*
  * slow-path functions for when there is actually caching available, and t=
he
  * netfs does actually have a valid token
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 243c0f737938..a83a4ea86e2b 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -22,6 +22,7 @@
=20
 enum netfs_sreq_ref_trace;
 typedef struct mempool mempool_t;
+struct fscache_occupancy;
 struct folio_queue;
=20
 /**
@@ -150,6 +151,7 @@ struct netfs_io_stream {
 	bool			need_retry;	/* T if this stream needs retrying */
 	bool			failed;		/* T if this stream failed */
 	bool			transferred_valid; /* T is ->transferred is valid */
+	bool			cancelled;	/* T if stream is cancelled */
 };
=20
 /*
@@ -159,8 +161,10 @@ struct netfs_cache_resources {
 	const struct netfs_cache_ops	*ops;
 	void				*cache_priv;
 	void				*cache_priv2;
+	unsigned long long		cache_i_size;	/* Initial size of cache file */
 	unsigned int			debug_id;	/* Cookie debug ID */
 	unsigned int			inval_counter;	/* object->inval_counter at begin_op */
+	unsigned int			dio_size;	/* DIO block size */
 };
=20
 /*
@@ -195,6 +199,7 @@ struct netfs_io_subrequest {
 #define NETFS_SREQ_IN_PROGRESS		8	/* Unlocked when the subrequest complete=
s */
 #define NETFS_SREQ_NEED_RETRY		9	/* Set if the filesystem requests a retry=
 */
 #define NETFS_SREQ_FAILED		10	/* Set if the subreq failed unretryably */
+#define NETFS_SREQ_CANCELLED		11	/* Set if the subreq was cancelled by net=
fslib */
 };
=20
 enum netfs_io_origin {
@@ -250,6 +255,7 @@ struct netfs_io_request {
 	unsigned long long	start;		/* Start position */
 	atomic64_t		issued_to;	/* Write issuer folio cursor */
 	unsigned long long	collected_to;	/* Point we've collected to */
+	unsigned long long	cache_coll_to;	/* Point the cache has collected to */
 	unsigned long long	cleaned_to;	/* Position we've cleaned folios to */
 	unsigned long long	abandon_to;	/* Position to abandon folios to */
 	const struct folio	*no_unlock_folio; /* Don't unlock this folio after rea=
d */
@@ -271,11 +277,13 @@ struct netfs_io_request {
 #define NETFS_RREQ_FAILED		3	/* The request failed */
 #define NETFS_RREQ_RETRYING		4	/* Set if we're in the retry path */
 #define NETFS_RREQ_SHORT_TRANSFER	5	/* Set if we have a short transfer */
-#define NETFS_RREQ_OFFLOAD_COLLECTION	8	/* Offload collection to workqueue=
 */
-#define NETFS_RREQ_NO_UNLOCK_FOLIO	9	/* Don't unlock no_unlock_folio on co=
mpletion */
-#define NETFS_RREQ_FOLIO_COPY_TO_CACHE	10	/* Copy current folio to cache f=
rom read */
-#define NETFS_RREQ_UPLOAD_TO_SERVER	11	/* Need to write to the server */
-#define NETFS_RREQ_USE_IO_ITER		12	/* Use ->io_iter rather than ->i_pages =
*/
+#define NETFS_RREQ_CACHE_STOP		8	/* Set to stop caching (ENOBUFS or error)=
 */
+#define NETFS_RREQ_CACHE_ERROR		9	/* Set if we got an error from the cache=
 */
+#define NETFS_RREQ_OFFLOAD_COLLECTION	12	/* Offload collection to workqueu=
e */
+#define NETFS_RREQ_NO_UNLOCK_FOLIO	13	/* Don't unlock no_unlock_folio on c=
ompletion */
+#define NETFS_RREQ_FOLIO_COPY_TO_CACHE	14	/* Copy current folio to cache f=
rom read */
+#define NETFS_RREQ_UPLOAD_TO_SERVER	15	/* Need to write to the server */
+#define NETFS_RREQ_USE_IO_ITER		16	/* Use ->io_iter rather than ->i_pages =
*/
 #define NETFS_RREQ_USE_PGPRIV2		31	/* [DEPRECATED] Use PG_private_2 to mark
 						 * write to cache on read */
 	const struct netfs_request_ops *netfs_ops;
@@ -320,6 +328,12 @@ enum netfs_read_from_hole {
 	NETFS_READ_HOLE_FAIL,
 };
=20
+enum netfs_cache_collect {
+	NETFS_CACHE_COLLECT_WRITE_DATA,
+	NETFS_CACHE_COLLECT_WRITE_GAP,
+	NETFS_CACHE_COLLECT_WRITE_CANCEL,
+};
+
 /*
  * Table of operations for access to a cache.
  */
@@ -354,8 +368,7 @@ struct netfs_cache_ops {
 	/* Prepare a read operation, shortening it to a cached/uncached
 	 * boundary as appropriate.
 	 */
-	enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
-					     unsigned long long i_size);
+	int (*prepare_read)(struct netfs_io_subrequest *subreq);
=20
 	/* Prepare a write subrequest, working out if we're allowed to do it
 	 * and finding out the maximum amount of data to gather before
@@ -383,8 +396,17 @@ struct netfs_cache_ops {
 	 * next chunk of data starts and how long it is.
 	 */
 	int (*query_occupancy)(struct netfs_cache_resources *cres,
-			       loff_t start, size_t len, size_t granularity,
-			       loff_t *_data_start, size_t *_data_len);
+			       struct fscache_occupancy *occ);
+
+	/* Collect the result of buffered writeback to the cache.  This
+	 * includes copying a read to the cache.  block_type is one of:
+	 * - NETFS_CACHE_COLLECT_WRITE_DATA for a block of data
+	 * - NETFS_CACHE_COLLECT_WRITE_GAP if a discontiguity was skipped
+	 * - NETFS_CACHE_COLLECT_WRITE_CANCEL for a cancellation gap
+	 */
+	void (*collect_write)(struct netfs_io_request *wreq,
+			      unsigned long long start, size_t len,
+			      enum netfs_cache_collect block_type);
 };
=20
 /* High-level read API. */
diff --git a/include/trace/events/cachefiles.h b/include/trace/events/cache=
files.h
index 6e3b1424eea4..a03f085793f2 100644
--- a/include/trace/events/cachefiles.h
+++ b/include/trace/events/cachefiles.h
@@ -56,6 +56,8 @@ enum cachefiles_coherency_trace {
 	cachefiles_coherency_check_ok,
 	cachefiles_coherency_check_type,
 	cachefiles_coherency_check_xattr,
+	cachefiles_coherency_discontiguous,
+	cachefiles_coherency_remove,
 	cachefiles_coherency_set_fail,
 	cachefiles_coherency_set_ok,
 	cachefiles_coherency_vol_check_cmp,
@@ -67,6 +69,7 @@ enum cachefiles_coherency_trace {
 };
=20
 enum cachefiles_trunc_trace {
+	cachefiles_trunc_clear_padding,
 	cachefiles_trunc_dio_adjust,
 	cachefiles_trunc_expand_tmpfile,
 	cachefiles_trunc_shrink,
@@ -84,6 +87,7 @@ enum cachefiles_prepare_read_trace {
 };
=20
 enum cachefiles_error_trace {
+	cachefiles_trace_alignment_error,
 	cachefiles_trace_fallocate_error,
 	cachefiles_trace_getxattr_error,
 	cachefiles_trace_link_error,
@@ -144,6 +148,8 @@ enum cachefiles_error_trace {
 	EM(cachefiles_coherency_check_ok,	"OK      ")		\
 	EM(cachefiles_coherency_check_type,	"BAD type")		\
 	EM(cachefiles_coherency_check_xattr,	"BAD xatt")		\
+	EM(cachefiles_coherency_discontiguous,	"--- gap ")		\
+	EM(cachefiles_coherency_remove,		"REMOVE  ")		\
 	EM(cachefiles_coherency_set_fail,	"SET fail")		\
 	EM(cachefiles_coherency_set_ok,		"SET ok  ")		\
 	EM(cachefiles_coherency_vol_check_cmp,	"VOL BAD cmp ")		\
@@ -154,6 +160,7 @@ enum cachefiles_error_trace {
 	E_(cachefiles_coherency_vol_set_ok,	"VOL SET ok  ")
=20
 #define cachefiles_trunc_traces						\
+	EM(cachefiles_trunc_clear_padding,	"CLRPAD")		\
 	EM(cachefiles_trunc_dio_adjust,		"DIOADJ")		\
 	EM(cachefiles_trunc_expand_tmpfile,	"EXPTMP")		\
 	E_(cachefiles_trunc_shrink,		"SHRINK")
@@ -169,6 +176,7 @@ enum cachefiles_error_trace {
 	E_(cachefiles_trace_read_seek_nxio,	"seek-enxio")
=20
 #define cachefiles_error_traces						\
+	EM(cachefiles_trace_alignment_error,	"align")		\
 	EM(cachefiles_trace_fallocate_error,	"fallocate")		\
 	EM(cachefiles_trace_getxattr_error,	"getxattr")		\
 	EM(cachefiles_trace_link_error,		"link")			\
@@ -379,12 +387,12 @@ TRACE_EVENT(cachefiles_rename,
=20
 TRACE_EVENT(cachefiles_coherency,
 	    TP_PROTO(struct cachefiles_object *obj,
-		     ino_t ino,
+		     ino_t ino, unsigned long long obj_size,
 		     u64 disk_aux,
 		     enum cachefiles_content content,
 		     enum cachefiles_coherency_trace why),
=20
-	    TP_ARGS(obj, ino, disk_aux, content, why),
+	    TP_ARGS(obj, ino, obj_size, disk_aux, content, why),
=20
 	    /* Note that obj may be NULL */
 	    TP_STRUCT__entry(
@@ -392,6 +400,7 @@ TRACE_EVENT(cachefiles_coherency,
 		    __field(enum cachefiles_coherency_trace,	why)
 		    __field(enum cachefiles_content,		content)
 		    __field(u64,				ino)
+		    __field(u64,				obj_size)
 		    __field(u64,				aux)
 		    __field(u64,				disk_aux)
 			     ),
@@ -401,14 +410,16 @@ TRACE_EVENT(cachefiles_coherency,
 		    __entry->why	=3D why;
 		    __entry->content	=3D content;
 		    __entry->ino	=3D ino;
+		    __entry->obj_size	=3D obj_size;
 		    __entry->aux	=3D be64_to_cpup((__be64 *)obj->cookie->inline_aux);
 		    __entry->disk_aux	=3D disk_aux;
 			   ),
=20
-	    TP_printk("o=3D%08x %s B=3D%llx c=3D%u aux=3D%llx dsk=3D%llx",
+	    TP_printk("o=3D%08x %s B=3D%llx oz=3D%llx c=3D%u aux=3D%llx dsk=3D%ll=
x",
 		      __entry->obj,
 		      __print_symbolic(__entry->why, cachefiles_coherency_traces),
 		      __entry->ino,
+		      __entry->obj_size,
 		      __entry->content,
 		      __entry->aux,
 		      __entry->disk_aux)
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 082cb03c6131..83d161f8c726 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -50,6 +50,10 @@
=20
 #define netfs_rreq_traces					\
 	EM(netfs_rreq_trace_assess,		"ASSESS ")	\
+	EM(netfs_rreq_trace_cache_cancelled,	"CA-CNCL")	\
+	EM(netfs_rreq_trace_cache_failed,	"CA-FAIL")	\
+	EM(netfs_rreq_trace_cache_fail_collect,	"CA-F-CO")	\
+	EM(netfs_rreq_trace_cache_no_space,	"CA-NOSP")	\
 	EM(netfs_rreq_trace_collect,		"COLLECT")	\
 	EM(netfs_rreq_trace_complete,		"COMPLET")	\
 	EM(netfs_rreq_trace_copy,		"COPY   ")	\
@@ -58,10 +62,13 @@
 	EM(netfs_rreq_trace_end_copy_to_cache,	"END-C2C")	\
 	EM(netfs_rreq_trace_free,		"FREE   ")	\
 	EM(netfs_rreq_trace_intr,		"INTR   ")	\
+	EM(netfs_rreq_trace_inval_cache,	"INVL-CA")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
 	EM(netfs_rreq_trace_redirty,		"REDIRTY")	\
 	EM(netfs_rreq_trace_resubmit,		"RESUBMT")	\
+	EM(netfs_rreq_trace_retry_begin,	"RETRY-BEGIN")	\
+	EM(netfs_rreq_trace_retry_end,		"RETRY-END")	\
 	EM(netfs_rreq_trace_set_abandon,	"S-ABNDN")	\
 	EM(netfs_rreq_trace_set_pause,		"PAUSE  ")	\
 	EM(netfs_rreq_trace_unlock,		"UNLOCK ")	\
@@ -131,12 +138,12 @@
=20
 #define netfs_failures							\
 	EM(netfs_fail_check_write_begin,	"check-write-begin")	\
-	EM(netfs_fail_copy_to_cache,		"copy-to-cache")	\
 	EM(netfs_fail_dio_read_short,		"dio-read-short")	\
 	EM(netfs_fail_dio_read_zero,		"dio-read-zero")	\
 	EM(netfs_fail_read,			"read")			\
 	EM(netfs_fail_short_read,		"short-read")		\
 	EM(netfs_fail_prepare_write,		"prep-write")		\
+	EM(netfs_fail_upload,			"upload")		\
 	E_(netfs_fail_write,			"write")
=20
 #define netfs_rreq_ref_traces					\
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2C20D352C5B
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:30:42 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143443; cv=none;
 b=QwKT3Wj0+WdqQHBfnH8aDP35rSn5R3fs4ziWY/KxOf99TxNvFMzz1kHvXQyBdigfq1wFiQLAtNh5eU0zYJ59Z8ouyUolidx188LF2Av30FR9OUZbSxkC4uU4YZvYSudMbkSk3ZSIIhz0Ckg/xsddS9+foK6JZ4AiFVbdaDpXj2Q=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143443; c=relaxed/simple;
	bh=aFC/AfAts6RejXMeXjUtlrV6sVMEbFGujrlPfAy6yX4=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=sRVU9Q++YsFs7h6z6V34szXYVmSgTYIgpHXCGx/0KvcaSIb4bL8kxsp9tjQq47QOSvsOCKBrX7CIStIPWMZ5pVizXTkfeix4dI7dbTRSwr6T4ZIEiLqzuzE94YcLG3fddxY112BBBdHmIJULbxOKNLhxZQ+5+gFrqvHJ6zUd7h4=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=Tmn+tC+W; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="Tmn+tC+W"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143441;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=soBo8nUow0/Mt0ICicV5RVH+h1Do/YU51PNZLQVaIMs=;
	b=Tmn+tC+WNVH0UDcBWXavVFIiWqH5c18ALV6On9+Y72rK7WO4luPyWuu078daYEr6ZBaPHG
	h8kx4LtP/OdVc92g5EpCvmqAaFqrQb4RGaPAC1vvp+eVNCd/7xVdJWozSNbewe8IuROHNd
	PaPO6r6wZ1ma7ODqHy15gUVBz8fugwA=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-388-Si3L0j3lO72wXYJQSzMZPQ-1; Mon,
 18 May 2026 18:30:35 -0400
X-MC-Unique: Si3L0j3lO72wXYJQSzMZPQ-1
X-Mimecast-MFC-AGG-ID: Si3L0j3lO72wXYJQSzMZPQ_1779143433
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 7B57F1956089;
	Mon, 18 May 2026 22:30:31 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 71ED819560A3;
	Mon, 18 May 2026 22:30:24 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 02/21] netfs: Add the cache object ID to netfs_read/write
 tracepoints
Date: Mon, 18 May 2026 23:29:34 +0100
Message-ID: <20260518222959.488126-3-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Content-Type: text/plain; charset="utf-8"

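Add the cache object debug ID to the netfs_read, netfs_write and
netfs_copy2cache tracepoints.  This makes debugging easier as it gives a
direct cross-reference to the cachefiles tracepoints, which only log that
debug ID.

To keep the two IDs distinct, rename the debug_id field in
netfs_cache_resources to cookie_id and add object_id alongside it.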

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/cachefiles/io.c           |  1 +
 fs/netfs/fscache_io.c        |  2 +-
 include/linux/netfs.h        |  3 ++-
 include/trace/events/netfs.h | 27 +++++++++++++++------------
 4 files changed, 19 insertions(+), 14 deletions(-)
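
As an illustration of the cross-reference described above, here is a
minimal userspace sketch, not kernel code.  It pulls the "o" field out
of a formatted trace line so that netfs and cachefiles lines can be
matched on it.  trace_object_id() is a hypothetical helper, and any
sample lines fed to it are assumed to follow the TP_printk() formats in
this series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find the "o=<hex>" field in a trace line.
 * Returns 1 and stores the value in *id if found, otherwise returns 0.
 */
static int trace_object_id(const char *line, unsigned int *id)
{
	const char *p = line;

	while ((p = strstr(p, "o=")) != NULL) {
		/* Only accept a whole field, not the tail of "foo=". */
		if (p == line || p[-1] == ' ') {
			char *end;
			unsigned long v = strtoul(p + 2, &end, 16);

			if (end == p + 2)
				return 0;	/* No hex digits followed. */
			*id = (unsigned int)v;
			return 1;
		}
		p += 2;
	}
	return 0;
}
```

Two lines that yield the same value from this helper would refer to the
same cache object, assuming both were produced by the formats above.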

diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 42265fdcc17e..7e32b1caf6fe 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -918,6 +918,7 @@ bool cachefiles_begin_operation(struct netfs_cache_reso=
urces *cres,
 			if (!cres->cache_priv2 && file)
 				cres->cache_priv2 =3D get_file(file);
 			spin_unlock(&object->lock);
+			cres->object_id =3D object->debug_id;
 			cres->cache_i_size =3D i_size_read(file_inode(file));
 			cres->dio_size =3D object->volume->cache->bsize;
 		}
diff --git a/fs/netfs/fscache_io.c b/fs/netfs/fscache_io.c
index 37f05b4d3469..fafa8c6bec57 100644
--- a/fs/netfs/fscache_io.c
+++ b/fs/netfs/fscache_io.c
@@ -79,7 +79,7 @@ static int fscache_begin_operation(struct netfs_cache_res=
ources *cres,
 	cres->ops		=3D NULL;
 	cres->cache_priv	=3D cookie;
 	cres->cache_priv2	=3D NULL;
-	cres->debug_id		=3D cookie->debug_id;
+	cres->cookie_id		=3D cookie->debug_id;
 	cres->inval_counter	=3D cookie->inval_counter;
=20
 	if (!fscache_begin_cookie_access(cookie, why)) {
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a83a4ea86e2b..d175c63ff659 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -162,7 +162,8 @@ struct netfs_cache_resources {
 	void				*cache_priv;
 	void				*cache_priv2;
 	unsigned long long		cache_i_size;	/* Initial size of cache file */
-	unsigned int			debug_id;	/* Cookie debug ID */
+	unsigned int			cookie_id;	/* Cache cookie debug ID */
+	unsigned int			object_id;	/* Cache object debug ID */
 	unsigned int			inval_counter;	/* object->inval_counter at begin_op */
 	unsigned int			dio_size;	/* DIO block size */
 };
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 83d161f8c726..63ed1d771bd8 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -311,6 +311,7 @@ TRACE_EVENT(netfs_read,
 	    TP_STRUCT__entry(
 		    __field(unsigned int,		rreq)
 		    __field(unsigned int,		cookie)
+		    __field(unsigned int,		object)
 		    __field(loff_t,			i_size)
 		    __field(loff_t,			start)
 		    __field(size_t,			len)
@@ -320,7 +321,8 @@ TRACE_EVENT(netfs_read,
=20
 	    TP_fast_assign(
 		    __entry->rreq	=3D rreq->debug_id;
-		    __entry->cookie	=3D rreq->cache_resources.debug_id;
+		    __entry->cookie	=3D rreq->cache_resources.cookie_id;
+		    __entry->object	=3D rreq->cache_resources.object_id;
 		    __entry->i_size	=3D rreq->i_size;
 		    __entry->start	=3D start;
 		    __entry->len	=3D len;
@@ -328,10 +330,10 @@ TRACE_EVENT(netfs_read,
 		    __entry->netfs_inode =3D rreq->inode->i_ino;
 			   ),
=20
-	    TP_printk("R=3D%08x %s c=3D%08x ni=3D%llx s=3D%llx l=3D%zx sz=3D%llx",
+	    TP_printk("R=3D%08x %s c=3D%08x o=3D%08x ni=3D%llx s=3D%llx l=3D%zx s=
z=3D%llx",
 		      __entry->rreq,
 		      __print_symbolic(__entry->what, netfs_read_traces),
-		      __entry->cookie,
+		      __entry->cookie, __entry->object,
 		      __entry->netfs_inode,
 		      __entry->start, __entry->len, __entry->i_size)
 	    );
@@ -552,6 +554,7 @@ TRACE_EVENT(netfs_write,
 	    TP_STRUCT__entry(
 		    __field(unsigned int,		wreq)
 		    __field(unsigned int,		cookie)
+		    __field(unsigned int,		object)
 		    __field(unsigned int,		ino)
 		    __field(enum netfs_write_trace,	what)
 		    __field(unsigned long long,		start)
@@ -559,20 +562,19 @@ TRACE_EVENT(netfs_write,
 			     ),
=20
 	    TP_fast_assign(
-		    struct netfs_inode *__ctx =3D netfs_inode(wreq->inode);
-		    struct fscache_cookie *__cookie =3D netfs_i_cookie(__ctx);
 		    __entry->wreq	=3D wreq->debug_id;
-		    __entry->cookie	=3D __cookie ? __cookie->debug_id : 0;
+		    __entry->cookie	=3D wreq->cache_resources.cookie_id;
+		    __entry->object	=3D wreq->cache_resources.object_id;
 		    __entry->ino	=3D wreq->inode->i_ino;
 		    __entry->what	=3D what;
 		    __entry->start	=3D wreq->start;
 		    __entry->len	=3D wreq->len;
 			   ),
=20
-	    TP_printk("R=3D%08x %s c=3D%08x i=3D%x by=3D%llx-%llx",
+	    TP_printk("R=3D%08x %s c=3D%08x o=3D%08x i=3D%x by=3D%llx-%llx",
 		      __entry->wreq,
 		      __print_symbolic(__entry->what, netfs_write_traces),
-		      __entry->cookie,
+		      __entry->cookie, __entry->object,
 		      __entry->ino,
 		      __entry->start, __entry->start + __entry->len - 1)
 	    );
@@ -587,22 +589,23 @@ TRACE_EVENT(netfs_copy2cache,
 		    __field(unsigned int,		rreq)
 		    __field(unsigned int,		creq)
 		    __field(unsigned int,		cookie)
+		    __field(unsigned int,		object)
 		    __field(unsigned int,		ino)
 			     ),
=20
 	    TP_fast_assign(
-		    struct netfs_inode *__ctx =3D netfs_inode(rreq->inode);
-		    struct fscache_cookie *__cookie =3D netfs_i_cookie(__ctx);
 		    __entry->rreq	=3D rreq->debug_id;
 		    __entry->creq	=3D creq->debug_id;
-		    __entry->cookie	=3D __cookie ? __cookie->debug_id : 0;
+		    __entry->cookie	=3D rreq->cache_resources.cookie_id;
+		    __entry->object	=3D rreq->cache_resources.object_id;
 		    __entry->ino	=3D rreq->inode->i_ino;
 			   ),
=20
-	    TP_printk("R=3D%08x CR=3D%08x c=3D%08x i=3D%x ",
+	    TP_printk("R=3D%08x CR=3D%08x c=3D%08x o=3D%08x i=3D%x ",
 		      __entry->rreq,
 		      __entry->creq,
 		      __entry->cookie,
+		      __entry->object,
 		      __entry->ino)
 	    );
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1350331F9BA
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:30:49 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143452; cv=none;
 b=ETEMX+2LRsBlIVRekgSwi87amjbE6rvkpBVthttQaJVzmFDfOaDe2R3OT1ItaipvjrzRL9vUl7cbgyOCIvBS5pzl9B4CXWsaVc8WzCIl/85zGQ6m/hbA7lCcGaQv+sef/UuiGbWosIqiwbS/59PcV97L0AloSZlZh9v9SuZiRw4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143452; c=relaxed/simple;
	bh=WXerj7RGtu33oTH5VH8v/x9NiUIiMf4VG85D69rQykA=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=cNumHOlk3klNUVJgnpg9E/AnR0eLngSju1p2OkRpFAzcLSAgjsUcAGOqpaxxwwF3qHSJChSSQ3V0Vyr1K4dsMdthPLjFRYFsyFeGNoMleE7lHeeBSJub9i++DtM967ZmF9WTZsaGs98+E8R3RyALenzfvpOqw/4gLLYhdHvSzsc=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=fEi5FOFs; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="fEi5FOFs"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143448;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=HD9l3rshhdvjF/dR6TAQujaSrodzR3TFHgxLsri2Hp0=;
	b=fEi5FOFsm0kcx2p1+/v+PHLBM5SsSbPAwaDUFezaBFn/3GOPNv1j3FK85SqnCP3DiAVW95
	e947D4gOGnrISKyxpH5pBnB47Y1iBKZspxY8uSDA51qKg/PW4DRqqnAxT7wm2wZT8JT5v4
	sgXksOE5+mDsVNS7vt3N5AjwCqZbj+g=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-208-RvsjchMwN_myCVkbLd44bg-1; Mon,
 18 May 2026 18:30:43 -0400
X-MC-Unique: RvsjchMwN_myCVkbLd44bg-1
X-Mimecast-MFC-AGG-ID: RvsjchMwN_myCVkbLd44bg_1779143441
Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 20738195609D;
	Mon, 18 May 2026 22:30:40 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 16B91180034E;
	Mon, 18 May 2026 22:30:32 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH v2 03/21] mm: Make readahead store folio count in
 readahead_control
Date: Mon, 18 May 2026 23:29:35 +0100
Message-ID: <20260518222959.488126-4-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Make readahead store the folio count in readahead_control so that the
filesystem can know in advance how many folios it needs to keep track of.

The count is reset by read_pages() in case it is called from a loop.

The filesystem reads the count with readahead_folio_count().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/pagemap.h | 10 ++++++++++
 mm/readahead.c          |  5 +++++
 2 files changed, 15 insertions(+)
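
As an illustration of the accounting described above, here is a minimal
userspace sketch, not kernel code.  struct ra_model and the ra_model_*()
helpers are hypothetical stand-ins for readahead_control and the code
that touches _nr_folios.  The model treats each batch as fully consumed
by the time it is reset, which is a simplification.

```c
#include <assert.h>

/* Hypothetical stand-in for the counters in struct readahead_control. */
struct ra_model {
	unsigned int nr_folios;	/* Models _nr_folios */
	unsigned int nr_pages;	/* Models _nr_pages */
};

/* Account for one folio of 2^order pages being added to the batch. */
static void ra_model_add_folio(struct ra_model *rac, unsigned int order)
{
	rac->nr_folios++;
	rac->nr_pages += 1U << order;
}

/* Models readahead_folio_count(): how many folios are in the batch. */
static unsigned int ra_model_folio_count(const struct ra_model *rac)
{
	return rac->nr_folios;
}

/* Models the end of read_pages(): reset so the next batch starts at 0. */
static void ra_model_finish_batch(struct ra_model *rac)
{
	rac->nr_pages = 0;	/* Simplification: batch fully consumed. */
	rac->nr_folios = 0;
}
```

A caller can size its tracking structures from ra_model_folio_count()
before walking the batch, rather than growing them folio by folio.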

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 31a848485ad9..1de60ecfd6e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1350,6 +1350,7 @@ struct readahead_control {
 	struct file_ra_state *ra;
 /* private: use the readahead_* accessors instead */
 	pgoff_t _index;
+	unsigned int _nr_folios;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
 	bool dropbehind;
@@ -1529,6 +1530,15 @@ static inline size_t readahead_batch_length(const st=
ruct readahead_control *rac)
 	return rac->_batch_count * PAGE_SIZE;
 }
=20
+/**
+ * readahead_folio_count - Get the number of folios in this readahead requ=
est.
+ * @rac: The readahead request.
+ */
+static inline unsigned int readahead_folio_count(const struct readahead_co=
ntrol *rac)
+{
+	return rac->_nr_folios;
+}
+
 static inline unsigned long dir_pages(const struct inode *inode)
 {
 	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..eba194f4635f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -177,6 +177,7 @@ static void read_pages(struct readahead_control *rac)
 	if (unlikely(rac->_workingset))
 		psi_memstall_leave(&rac->_pflags);
 	rac->_workingset =3D false;
+	rac->_nr_folios =3D 0;
=20
 	BUG_ON(readahead_count(rac));
 }
@@ -292,6 +293,7 @@ void page_cache_ra_unbounded(struct readahead_control *=
ractl,
 		if (i =3D=3D mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |=3D folio_test_workingset(folio);
+		ractl->_nr_folios++;
 		ractl->_nr_pages +=3D min_nrpages;
 		i +=3D min_nrpages;
 	}
@@ -459,6 +461,7 @@ static inline int ra_alloc_folio(struct readahead_contr=
ol *ractl, pgoff_t index,
 		return err;
 	}
=20
+	ractl->_nr_folios++;
 	ractl->_nr_pages +=3D 1UL << order;
 	ractl->_workingset |=3D folio_test_workingset(folio);
 	return 0;
@@ -802,6 +805,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset =3D true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
+		ractl->_nr_folios++;
 		ractl->_nr_pages +=3D min_nrpages;
 		ractl->_index =3D folio->index;
 	}
@@ -831,6 +835,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset =3D true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
+		ractl->_nr_folios++;
 		ractl->_nr_pages +=3D min_nrpages;
 		if (ra) {
 			ra->size +=3D min_nrpages;
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5AFCC3AF66E
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:30:58 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143460; cv=none;
 b=dzIJ5cGmDXPqlUoQKLZ8VfmV3Pr3/zmn2tELeRmWfRf0Gt0lZLJVWHiHY0RS4ucZJuu8bpYHHP5d8jDVrCnWO7A5lf7uGLoY5Am4g1gJRENhei9sghWWLRM+tPrLw/t+mcPiBttlNTYnyeLTBrC8/srqhyxDD9upm/SWZRhD/QI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143460; c=relaxed/simple;
	bh=TMRLkzHmyPtAviQfJlAF6bAmvotz23vnD791zjA9bMA=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=fDym9OtaRSuIt5U5ob5uwuczYvRKQKwtH4qmBObpRJcmzUwgy9dR/uEmtVYaOjMUKXF5nWiAlcSWegtbiRRkpCYqWHCjGsM/W8I/FopGhLHWRRFbfUOOv0MO7QeBG77RB/395PDzBcCmKT7dcTLt4iBxDX6jCHv5WaM5ADhlTlo=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=DOF5luEn; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="DOF5luEn"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143457;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=teEUuZhuWoXgSFheLlMPHTNVwXxTVlWqdQpm4SzhaW4=;
	b=DOF5luEnAdrD2LDss3GWx2gUFN+m/jTu+EJA0fQ+DElqHdoC6IH/2cU7VtzkJ/NlMw+mpl
	BFwx/xEDa02L3YHlaA75JM3A+YQQitOYUa87AdIJaTiV2DWC5FuKr61vGxj9WHCpziH6AM
	8lSFGaYEqWOnfqwpp7R3BgFcdD+IaUU=
Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-510-WHRttJ9sPAK5UpG52xVOrw-1; Mon,
 18 May 2026 18:30:52 -0400
X-MC-Unique: WHRttJ9sPAK5UpG52xVOrw-1
X-Mimecast-MFC-AGG-ID: WHRttJ9sPAK5UpG52xVOrw_1779143448
Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 9215E1956096;
	Mon, 18 May 2026 22:30:48 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id B1F9D1800357;
	Mon, 18 May 2026 22:30:41 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH v2 04/21] netfs: Bulk load the readahead-provided folios up
 front
Date: Mon, 18 May 2026 23:29:36 +0100
Message-ID: <20260518222959.488126-5-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93
Content-Type: text/plain; charset="utf-8"

Load all the folios provided by the VM for readahead up front into the
folio queue.  Because the VM tells us how many folios there are, the folio
queue can be fully allocated first and the loading can then happen in one
go inside the RCU read lock.  The folio refs acquired from readahead are
dropped in bulk once the first subrequest is dispatched as it's quite a
slow operation.  The collector waits for NETFS_RREQ_NEED_PUT_RA_REFS to be
cleared so that it doesn't unlock folios before the xarray has been scanned
for them.

This simplifies the buffer handling later and isn't noticeably slower as
the xarray doesn't need to be modified and the folios are all already
locked.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
---
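Note for readers: the sizing and accounting done by
rolling_buffer_bulk_load_from_ra() can be illustrated with a small
userspace sketch.  This is not kernel code; the sketch_* names and the
4096-byte page size are assumptions made for illustration, and the number
of slots per queue segment is passed in rather than taken from
folioq_nr_slots().

```c
#include <stddef.h>

/* Assumed page size for illustration; the real PAGE_SIZE is per-arch. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Count the queue segments that a do/while preallocation loop of the same
 * shape as the one in the patch would create: always at least one, then
 * one more for as long as folios remain uncovered.
 */
static unsigned int sketch_segments_needed(int nr_folios,
					   int slots_per_segment)
{
	unsigned int segments = 0;
	int nr = nr_folios;

	do {
		segments++;
		nr -= slots_per_segment;
	} while (nr > 0);
	return segments;
}

/*
 * Total the bytes and pages contributed by folios of the given orders,
 * mirroring "loaded += PAGE_SIZE << order" and "npages += 1 << order".
 */
static size_t sketch_bytes_loaded(const unsigned int *orders, size_t n,
				  size_t *npages)
{
	size_t loaded = 0;

	*npages = 0;
	for (size_t i = 0; i < n; i++) {
		loaded += SKETCH_PAGE_SIZE << orders[i];
		*npages += (size_t)1 << orders[i];
	}
	return loaded;
}
```

Because every segment exists before the xarray walk starts, the walk itself
never has to allocate and so can run entirely under rcu_read_lock().
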
 fs/netfs/buffered_read.c       | 97 +++++++++++++++++++++-------------
 fs/netfs/internal.h            |  1 +
 fs/netfs/misc.c                | 19 +++++++
 fs/netfs/read_collect.c        |  7 +++
 fs/netfs/rolling_buffer.c      | 75 ++++++++++++++++++++++++++
 include/linux/netfs.h          |  1 +
 include/linux/rolling_buffer.h |  3 ++
 include/trace/events/netfs.h   |  3 ++
 8 files changed, 169 insertions(+), 37 deletions(-)

diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 8f96bc0f6c03..146a2cf64af0 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -54,6 +54,42 @@ static void netfs_rreq_expand(struct netfs_io_request *r=
req,
 	}
 }
=20
+/*
+ * Drop the folio refs acquired from the readahead API.
+ */
+static void netfs_bulk_drop_ra_refs(struct netfs_io_request *rreq)
+{
+	struct folio_batch fbatch;
+	struct folio *folio;
+	pgoff_t nr_pages =3D DIV_ROUND_UP(rreq->len, PAGE_SIZE);
+	pgoff_t first =3D rreq->start / PAGE_SIZE;
+	XA_STATE(xas, &rreq->mapping->i_pages, first);
+
+	folio_batch_init(&fbatch);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio,  first + nr_pages - 1) {
+		if (xas_retry(&xas, folio))
+			continue;
+
+		if (!folio_batch_add(&fbatch, folio))
+			folio_batch_release(&fbatch);
+	}
+
+	rcu_read_unlock();
+	folio_batch_release(&fbatch);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_ra_put_ref);
+	clear_bit_unlock(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags);
+	wake_up(&rreq->waitq);
+}
+
+static void netfs_maybe_bulk_drop_ra_refs(struct netfs_io_request *rreq)
+{
+	if (test_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags))
+		netfs_bulk_drop_ra_refs(rreq);
+}
+
 /*
  * Begin an operation, and fetch the stored zero point value from the cook=
ie if
  * available.
@@ -74,12 +110,8 @@ static int netfs_begin_cache_read(struct netfs_io_reque=
st *rreq, struct netfs_in
  *
  * Returns the limited size if successful and -ENOMEM if insufficient memo=
ry
  * available.
- *
- * [!] NOTE: This must be run in the same thread as ->issue_read() was cal=
led
- * in as we access the readahead_control struct.
  */
-static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub=
req,
-					   struct readahead_control *ractl)
+static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub=
req)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
 	size_t rsize =3D subreq->len;
@@ -87,28 +119,6 @@ static ssize_t netfs_prepare_read_iterator(struct netfs=
_io_subrequest *subreq,
 	if (subreq->source =3D=3D NETFS_DOWNLOAD_FROM_SERVER)
 		rsize =3D umin(rsize, rreq->io_streams[0].sreq_max_len);
=20
-	if (ractl) {
-		/* If we don't have sufficient folios in the rolling buffer,
-		 * extract a folioq's worth from the readahead region at a time
-		 * into the buffer.  Note that this acquires a ref on each page
-		 * that we will need to release later - but we don't want to do
-		 * that until after we've started the I/O.
-		 */
-		struct folio_batch put_batch;
-
-		folio_batch_init(&put_batch);
-		while (rreq->submitted < subreq->start + rsize) {
-			ssize_t added;
-
-			added =3D rolling_buffer_load_from_ra(&rreq->buffer, ractl,
-							    &put_batch);
-			if (added < 0)
-				return added;
-			rreq->submitted +=3D added;
-		}
-		folio_batch_release(&put_batch);
-	}
-
 	subreq->len =3D rsize;
 	if (unlikely(rreq->io_streams[0].sreq_max_segs)) {
 		size_t limit =3D netfs_limit_iter(&rreq->buffer.iter, 0, rsize,
@@ -204,8 +214,7 @@ static void netfs_issue_read(struct netfs_io_request *r=
req,
  * slicing up the region to be read according to available cache blocks and
  * network rsize.
  */
-static void netfs_read_to_pagecache(struct netfs_io_request *rreq,
-				    struct readahead_control *ractl)
+static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 {
 	struct fscache_occupancy _occ =3D {
 		.query_from	=3D rreq->start,
@@ -335,7 +344,7 @@ static void netfs_read_to_pagecache(struct netfs_io_req=
uest *rreq,
 			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
 		}
=20
-		slice =3D netfs_prepare_read_iterator(subreq, ractl);
+		slice =3D netfs_prepare_read_iterator(subreq);
 		if (slice < 0) {
 			ret =3D slice;
 			netfs_cancel_read(subreq, ret);
@@ -350,6 +359,7 @@ static void netfs_read_to_pagecache(struct netfs_io_req=
uest *rreq,
=20
 		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 		netfs_issue_read(rreq, subreq);
+		netfs_maybe_bulk_drop_ra_refs(rreq);
=20
 		if (test_bit(NETFS_RREQ_PAUSE, &rreq->flags))
 			netfs_wait_for_paused_read(rreq);
@@ -388,6 +398,7 @@ void netfs_readahead(struct readahead_control *ractl)
 	struct netfs_io_request *rreq;
 	struct netfs_inode *ictx =3D netfs_inode(ractl->mapping->host);
 	unsigned long long start =3D readahead_pos(ractl);
+	ssize_t added;
 	size_t size =3D readahead_length(ractl);
 	int ret;
=20
@@ -408,11 +419,23 @@ void netfs_readahead(struct readahead_control *ractl)
=20
 	netfs_rreq_expand(rreq, ractl);
=20
-	rreq->submitted =3D rreq->start;
-	if (rolling_buffer_init(&rreq->buffer, rreq->debug_id, ITER_DEST) < 0)
+	/* Load the folios to be read into a bvecq chain.  Note that this
+	 * acquires a ref on each folio that we will need to release later -
+	 * but we don't want to do that until after we've started the I/O.
+	 */
+	added =3D rolling_buffer_bulk_load_from_ra(&rreq->buffer, ractl, rreq->de=
bug_id);
+	if (added < 0) {
+		ret =3D added;
 		goto cleanup_free;
-	netfs_read_to_pagecache(rreq, ractl);
+	}
+	__set_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags);
+
+	rreq->submitted =3D rreq->start + added;
+	rreq->cleaned_to =3D rreq->start;
+	rreq->front_folio_order =3D folio_order(rreq->buffer.tail->vec.folios[0]);
=20
+	netfs_read_to_pagecache(rreq);
+	netfs_maybe_bulk_drop_ra_refs(rreq);
 	return netfs_put_request(rreq, netfs_rreq_trace_put_return);
=20
 cleanup_free:
@@ -505,7 +528,7 @@ static int netfs_read_gaps(struct file *file, struct fo=
lio *folio)
 	iov_iter_bvec(&rreq->buffer.iter, ITER_DEST, bvec, i, rreq->len);
 	rreq->submitted =3D rreq->start + flen;
=20
-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
=20
 	ret =3D netfs_wait_for_read(rreq);
 	if (ret >=3D 0) {
@@ -580,7 +603,7 @@ int netfs_read_folio(struct file *file, struct folio *f=
olio)
 	if (ret < 0)
 		goto discard;
=20
-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
 	ret =3D netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret < 0 ? ret : 0;
@@ -737,7 +760,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
 	if (ret < 0)
 		goto error_put;
=20
-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
 	ret =3D netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	if (ret < 0)
@@ -802,7 +825,7 @@ int netfs_prefetch_for_write(struct file *file, struct =
folio *folio,
 	if (ret < 0)
 		goto error_put;
=20
-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
 	ret =3D netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret < 0 ? ret : 0;
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index d82f2116f8e0..4b0f9304b970 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -80,6 +80,7 @@ ssize_t netfs_wait_for_read(struct netfs_io_request *rreq=
);
 ssize_t netfs_wait_for_write(struct netfs_io_request *rreq);
 void netfs_wait_for_paused_read(struct netfs_io_request *rreq);
 void netfs_wait_for_paused_write(struct netfs_io_request *rreq);
+void netfs_wait_for_put_ra_refs(struct netfs_io_request *rreq);
=20
 /*
  * objects.c
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index 5d554512ed23..f5c1c463f4ff 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -563,3 +563,22 @@ void netfs_wait_for_paused_write(struct netfs_io_reque=
st *rreq)
 {
 	return netfs_wait_for_pause(rreq, netfs_write_collection);
 }
+
+/*
+ * Wait for the readahead-acquired refs to be put.
+ */
+void netfs_wait_for_put_ra_refs(struct netfs_io_request *rreq)
+{
+	DEFINE_WAIT(myself);
+
+	for (;;) {
+		trace_netfs_rreq(rreq, netfs_rreq_trace_wait_put_ra_refs);
+		prepare_to_wait(&rreq->waitq, &myself, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags))
+			break;
+		schedule();
+	}
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_waited_put_ra_refs);
+	finish_wait(&rreq->waitq, &myself);
+}
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 23660a590124..edf7cea7e2f9 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -118,6 +118,13 @@ static void netfs_read_unlock_folios(struct netfs_io_r=
equest *rreq,
 		slot =3D 0;
 	}
=20
+	/* We have to wait for readahead refs to have been released before we
+	 * can unlock any folios as the ref-dropper walks i_pages and the only
+	 * thing preventing these folios from being removed is the folio lock.
+	 */
+	if (test_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags))
+		netfs_wait_for_put_ra_refs(rreq);
+
 	for (;;) {
 		struct folio *folio;
 		unsigned long long fpos, fend;
diff --git a/fs/netfs/rolling_buffer.c b/fs/netfs/rolling_buffer.c
index a17fbf9853a4..576b425a227d 100644
--- a/fs/netfs/rolling_buffer.c
+++ b/fs/netfs/rolling_buffer.c
@@ -149,6 +149,81 @@ ssize_t rolling_buffer_load_from_ra(struct rolling_buf=
fer *roll,
 	return size;
 }
=20
+/*
+ * Decant the entire list of folios to read into a rolling buffer.
+ */
+ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
+					 struct readahead_control *ractl,
+					 unsigned int rreq_id)
+{
+	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
+	struct folio_queue *fq;
+	struct folio *folio;
+	ssize_t loaded =3D 0;
+	int nr, slot =3D 0, npages =3D 0;
+
+	/* First allocate all the folioqs we're going to need to avoid having
+	 * to deal with ENOMEM later.
+	 */
+	nr =3D ractl->_nr_folios;
+	do {
+		fq =3D netfs_folioq_alloc(rreq_id, GFP_KERNEL,
+					netfs_trace_folioq_make_space);
+		if (!fq) {
+			rolling_buffer_clear(roll);
+			return -ENOMEM;
+		}
+		fq->prev =3D roll->head;
+		if (!roll->tail)
+			roll->tail =3D fq;
+		else
+			roll->head->next =3D fq;
+		roll->head =3D fq;
+
+		nr -=3D folioq_nr_slots(fq);
+	} while (nr > 0);
+
+	rcu_read_lock();
+
+	fq =3D roll->tail;
+	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
+		unsigned int order;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+		order =3D folio_order(folio);
+		fq->orders[slot] =3D order;
+		fq->vec.folios[slot] =3D folio;
+		loaded +=3D PAGE_SIZE << order;
+		npages +=3D 1 << order;
+		trace_netfs_folio(folio, netfs_folio_trace_read);
+
+		slot++;
+		if (slot >=3D folioq_nr_slots(fq)) {
+			fq->vec.nr =3D slot;
+			fq =3D fq->next;
+			if (!fq) {
+				WARN_ON_ONCE(npages < readahead_count(ractl));
+				break;
+			}
+			slot =3D 0;
+		}
+	}
+
+	rcu_read_unlock();
+
+	if (fq)
+		fq->vec.nr =3D slot;
+
+	WRITE_ONCE(roll->iter.count, loaded);
+	iov_iter_folio_queue(&roll->iter, ITER_DEST, roll->tail, 0, 0, loaded);
+	ractl->_index    +=3D npages;
+	ractl->_nr_pages -=3D npages;
+	return loaded;
+}
+
 /*
  * Append a folio to the rolling buffer.
  */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index d175c63ff659..f7f55b7621f3 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -285,6 +285,7 @@ struct netfs_io_request {
 #define NETFS_RREQ_FOLIO_COPY_TO_CACHE	14	/* Copy current folio to cache f=
rom read */
 #define NETFS_RREQ_UPLOAD_TO_SERVER	15	/* Need to write to the server */
 #define NETFS_RREQ_USE_IO_ITER		16	/* Use ->io_iter rather than ->i_pages =
*/
+#define NETFS_RREQ_NEED_PUT_RA_REFS	17	/* Need to put the folio refs RA ga=
ve us */
 #define NETFS_RREQ_USE_PGPRIV2		31	/* [DEPRECATED] Use PG_private_2 to mark
 						 * write to cache on read */
 	const struct netfs_request_ops *netfs_ops;
diff --git a/include/linux/rolling_buffer.h b/include/linux/rolling_buffer.h
index ac15b1ffdd83..b35ef43f325f 100644
--- a/include/linux/rolling_buffer.h
+++ b/include/linux/rolling_buffer.h
@@ -48,6 +48,9 @@ int rolling_buffer_make_space(struct rolling_buffer *roll=
);
 ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
 				    struct readahead_control *ractl,
 				    struct folio_batch *put_batch);
+ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
+					 struct readahead_control *ractl,
+					 unsigned int rreq_id);
 ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *f=
olio,
 			      unsigned int flags);
 struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *rol=
l);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 63ed1d771bd8..83266835b7ad 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -64,6 +64,7 @@
 	EM(netfs_rreq_trace_intr,		"INTR   ")	\
 	EM(netfs_rreq_trace_inval_cache,	"INVL-CA")	\
 	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
+	EM(netfs_rreq_trace_ra_put_ref,		"RA-PUT ")	\
 	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
 	EM(netfs_rreq_trace_redirty,		"REDIRTY")	\
 	EM(netfs_rreq_trace_resubmit,		"RESUBMT")	\
@@ -77,9 +78,11 @@
 	EM(netfs_rreq_trace_unpause,		"UNPAUSE")	\
 	EM(netfs_rreq_trace_wait_ip,		"WAIT-IP")	\
 	EM(netfs_rreq_trace_wait_pause,		"--PAUSED--")	\
+	EM(netfs_rreq_trace_wait_put_ra_refs,	"WAIT-P-RA")	\
 	EM(netfs_rreq_trace_wait_quiesce,	"WAIT-QUIESCE")	\
 	EM(netfs_rreq_trace_waited_ip,		"DONE-IP")	\
 	EM(netfs_rreq_trace_waited_pause,	"--UNPAUSED--")	\
+	EM(netfs_rreq_trace_waited_put_ra_refs,	"DONE-P-RA")	\
 	EM(netfs_rreq_trace_waited_quiesce,	"DONE-QUIESCE")	\
 	EM(netfs_rreq_trace_wake_ip,		"WAKE-IP")	\
 	EM(netfs_rreq_trace_wake_queue,		"WAKE-Q ")	\
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5565C3B0AE6
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:31:04 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143465; cv=none;
 b=h276Dg8Z4Qa0WNdo6xlPx75KJ9TKP+ftRR/1xfNSyEmjZm13CfI9Ji3ZOZDZeciHD7mrAgJaGuSs3drPZIoZa5CMIWoIzWaeP7Ro4uRylXgjkq1O991UtjTGA7Q69CLPoGbntquByHC/3/P5lu1eMKbFELPResJqW78MDncKuhU=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143465; c=relaxed/simple;
	bh=v9qX9pDrVHgmnAQEadpynPw/iCBrNo2C6n+A4BzV/tc=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=EHOyKw5xoWqjYY/i7gO460jH3LjhYkYfsVmlEZOgWtxYy28JN/kAG2YPrRMlyszYxQgDKqXhl1SyyFYtYCKIkNLfN92AgRxxOt7pJrYLuASdk5f7rpPvuy0uK38NroiNGzyNui3N6umQyrVazGPk+biAbH+VJr8xePOhX9U/EEs=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=cS2YkyJj; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="cS2YkyJj"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143463;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=vcN8r0sMSJfS+kOrSGHEP+bv70A9Lsgfu7bf9ESMv5s=;
	b=cS2YkyJjmCKjMWHVcGMi++9EVHHM3bPJS5p/9FiTT1Cj8eegbHLEjZuOTwtKyz91ioJ0/B
	mCG7chHYYwlPthHf3pAwe3pEVOYN4ppkb01ezxPyBjQ/ayW4tDPzhmaQx3LM7BYZM4aKg5
	W5Xb3xzViMns+EdiYAIWEMuexZq4Liw=
Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-151-m7JavSrTOOOJ-1I2d2y8Cw-1; Mon,
 18 May 2026 18:30:59 -0400
X-MC-Unique: m7JavSrTOOOJ-1I2d2y8Cw-1
X-Mimecast-MFC-AGG-ID: m7JavSrTOOOJ-1I2d2y8Cw_1779143457
Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 7E245180044D;
	Mon, 18 May 2026 22:30:56 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 334351800352;
	Mon, 18 May 2026 22:30:49 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: [PATCH v2 05/21] Add a function to kmap one page of a multipage
 bio_vec
Date: Mon, 18 May 2026 23:29:37 +0100
Message-ID: <20260518222959.488126-6-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
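Note for readers: since the caller has to work out how much of the mapped
page is usable, here is a userspace sketch of that arithmetic.  It is an
illustration only; struct sketch_bvec, the sketch_* helpers and the
4096-byte page size are assumptions, and it presumes offset < bv_len.

```c
#include <stddef.h>

/* Assumed page size for illustration; the real PAGE_SIZE is per-arch. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical stand-in for the length/offset part of struct bio_vec. */
struct sketch_bvec {
	size_t bv_len;
	size_t bv_offset;
};

/* Index, relative to bv_page, of the page holding the byte at @offset. */
static size_t sketch_page_index(const struct sketch_bvec *bv, size_t offset)
{
	return (bv->bv_offset + offset) / SKETCH_PAGE_SIZE;
}

/* Position of that byte within its page. */
static size_t sketch_page_offset(const struct sketch_bvec *bv, size_t offset)
{
	return (bv->bv_offset + offset) % SKETCH_PAGE_SIZE;
}

/*
 * Bytes usable from the mapped address: limited by whichever comes first,
 * the end of the mapped page or the end of the bvec.
 */
static size_t sketch_avail(const struct sketch_bvec *bv, size_t offset)
{
	size_t to_page_end = SKETCH_PAGE_SIZE - sketch_page_offset(bv, offset);
	size_t to_bvec_end = bv->bv_len - offset;

	return to_page_end < to_bvec_end ? to_page_end : to_bvec_end;
}
```

A caller walking a multipage bvec would map, consume sketch_avail() bytes,
unmap, advance the offset by that amount and repeat.
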
 include/linux/bvec.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..9df4a56fef61 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -299,4 +299,21 @@ static inline phys_addr_t bvec_phys(const struct bio_v=
ec *bvec)
 	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
 }
=20
+/**
+ * kmap_local_bvec - Map part of a bvec into the kernel virtual address sp=
ace
+ * @bvec: bvec to map
+ * @offset: Offset into bvec
+ *
+ * Map the page containing the byte at @offset into the kernel virtual add=
ress
+ * space.  The caller is responsible for making sure this doesn't overrun.
+ *
+ * Call kunmap_local() on the returned address to unmap.
+ */
+static inline void *kmap_local_bvec(struct bio_vec *bvec, size_t offset)
+{
+	offset +=3D bvec->bv_offset;
+
+	return kmap_local_page(bvec->bv_page + offset / PAGE_SIZE) + offset % PAG=
E_SIZE;
+}
+
 #endif /* __LINUX_BVEC_H */
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id F253C391826
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:31:15 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143477; cv=none;
 b=llY1nPjT5IeSSEUeGEGztM4Ol3c0rlnjoDFoC+1CeWGFQ3aixgZJpkACT4XVrMMYqKiE6H4Q5HQ9TIl58p1g1HiC3agqGe7m8R9NfZp4UWKf3jnM7XrATzoJDdT+iB/6cqWCdvYyTIcbMZMApRhGYttO/YwU9VVEN3QromWiT24=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143477; c=relaxed/simple;
	bh=0C4gcVZKHQyqmJ8NewTG/Zw/3dNQrnlhH5C2JawXraI=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=smcI6+zMplvL346Za0J07NtFPuFCdlX5S8xPrJhrvkL47RnB7bPckoD2x8UzQO4aC6bluqw5FUl6n4EyIiFKICSP//taJifekEwjMwQgR1dN5Bz4EiGKtXDv9lrfH3NL6Pc4+Y3LBJ9SBPgGlv6D+17Zw/KGD4qIdwefhXEfq7A=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=OzTVmqP3; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="OzTVmqP3"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143475;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=x2ry9yfrsp4bz8l1rJhaUu5Q7jTF4XqPXt4HsPXH488=;
	b=OzTVmqP3yP/amtYR7hjOwMfak6D2dhYMAZ5BRTP+A8rbUs9IKUOgBmMZa8E7QSGjZaI1BH
	kBYudnwd/rTcWcKlb+lfZ3GspyLzJFHpng1t7rHf/nnG7rGYrmw1b9XuoZgHObQx4Df+Eu
	bZxcGHvVYp8ULHbYHepNC3F9ixCd68Y=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-624-bvIr_lgANT-eK6jnpcjFjQ-1; Mon,
 18 May 2026 18:31:08 -0400
X-MC-Unique: bvIr_lgANT-eK6jnpcjFjQ-1
X-Mimecast-MFC-AGG-ID: bvIr_lgANT-eK6jnpcjFjQ_1779143465
Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 51D821956053;
	Mon, 18 May 2026 22:31:05 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 4E40E1800576;
	Mon, 18 May 2026 22:30:57 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: [PATCH v2 06/21] iov_iter: Make iov_iter_get_pages*() wrap
 iov_iter_extract_pages()
Date: Mon, 18 May 2026 23:29:38 +0100
Message-ID: <20260518222959.488126-7-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Make iov_iter_get_pages*() wrap iov_iter_extract_pages() for kernel
iterator types (e.g. ITER_BVEC, ITER_FOLIOQ, ITER_XARRAY).  The pages
obtained have their refcounts incremented afterwards if they're not slab
pages.  ITER_KVEC is left returning -EFAULT.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
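Note for readers: the ref-taking loop in iter_get_kernel_pages() visits one
page pointer per page spanned by the extracted range, counting from the
start of the first page.  A userspace sketch of that count follows; the
sketch_* name and the 4096-byte page size are assumptions for illustration.

```c
#include <stddef.h>

/* Assumed page size for illustration; the real PAGE_SIZE is per-arch. */
#define SKETCH_PAGE_SIZE 4096L

/*
 * Count the iterations of a loop of the same shape as
 * "for (done = ret + *_start_offset; done > 0; done -= PAGE_SIZE)",
 * i.e. the number of pages touched by @extracted bytes that begin
 * @start_offset bytes into the first page.
 */
static unsigned int sketch_pages_spanned(long extracted, size_t start_offset)
{
	unsigned int n = 0;
	long done;

	for (done = extracted + (long)start_offset; done > 0;
	     done -= SKETCH_PAGE_SIZE)
		n++;
	return n;
}
```

Including the start offset matters: 4096 bytes starting one byte into a
page straddle two pages, so two refs are needed rather than one.
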
 lib/iov_iter.c | 164 ++++++-------------------------------------------
 1 file changed, 19 insertions(+), 145 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 243662af1af7..cac7d7364bc2 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -910,118 +910,34 @@ static int want_pages_array(struct page ***res, size=
_t size,
 	return count;
 }
=20
-static ssize_t iter_folioq_get_pages(struct iov_iter *iter,
+/*
+ * Wrap iov_iter_extract_pages() and then get a ref on each non-slab page
+ * we got back.  This only works for non-user iterator types as get_pages
+ * uses get_user_pages() not pin_user_pages().
+ */
+static ssize_t iter_get_kernel_pages(struct iov_iter *iter,
 				     struct page ***ppages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
 {
-	const struct folio_queue *folioq =3D iter->folioq;
 	struct page **pages;
-	unsigned int slot =3D iter->folioq_slot;
-	size_t extracted =3D 0, count =3D iter->count, iov_offset =3D iter->iov_o=
ffset;
+	ssize_t ret, done;
=20
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		slot =3D 0;
-		if (WARN_ON(iov_offset !=3D 0))
-			return -EIO;
-	}
+	ret =3D iov_iter_extract_pages(iter, ppages, maxsize, maxpages,
+				     0, _start_offset);
+	if (ret <=3D 0)
+		return ret;
=20
-	maxpages =3D want_pages_array(ppages, maxsize, iov_offset & ~PAGE_MASK, m=
axpages);
-	if (!maxpages)
-		return -ENOMEM;
-	*_start_offset =3D iov_offset & ~PAGE_MASK;
 	pages =3D *ppages;
+	for (done =3D ret + *_start_offset; done > 0; done -=3D PAGE_SIZE) {
+		struct folio *folio =3D page_folio(*pages);
=20
-	for (;;) {
-		struct folio *folio =3D folioq_folio(folioq, slot);
-		size_t offset =3D iov_offset, fsize =3D folioq_folio_size(folioq, slot);
-		size_t part =3D PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part =3D umin(part, umin(maxsize - extracted, fsize - offset));
-			count -=3D part;
-			iov_offset +=3D part;
-			extracted +=3D part;
-
-			*pages =3D folio_page(folio, offset / PAGE_SIZE);
-			get_page(*pages);
-			pages++;
-			maxpages--;
-		}
-
-		if (maxpages =3D=3D 0 || extracted >=3D maxsize)
-			break;
-
-		if (iov_offset >=3D fsize) {
-			iov_offset =3D 0;
-			slot++;
-			if (slot =3D=3D folioq_nr_slots(folioq) && folioq->next) {
-				folioq =3D folioq->next;
-				slot =3D 0;
-			}
-		}
-	}
-
-	iter->count =3D count;
-	iter->iov_offset =3D iov_offset;
-	iter->folioq =3D folioq;
-	iter->folioq_slot =3D slot;
-	return extracted;
-}
-
-static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarr=
ay *xa,
-					  pgoff_t index, unsigned int nr_pages)
-{
-	XA_STATE(xas, xa, index);
-	struct folio *folio;
-	unsigned int ret =3D 0;
-
-	rcu_read_lock();
-	for (folio =3D xas_load(&xas); folio; folio =3D xas_next(&xas)) {
-		if (xas_retry(&xas, folio))
-			continue;
-
-		/* Has the folio moved or been split? */
-		if (unlikely(folio !=3D xas_reload(&xas))) {
-			xas_reset(&xas);
-			continue;
-		}
-
-		pages[ret] =3D folio_file_page(folio, xas.xa_index);
-		folio_get(folio);
-		if (++ret =3D=3D nr_pages)
-			break;
+		if (!folio_test_slab(folio))
+			folio_get(folio);
+		pages++;
 	}
-	rcu_read_unlock();
 	return ret;
 }
=20
-static ssize_t iter_xarray_get_pages(struct iov_iter *i,
-				     struct page ***pages, size_t maxsize,
-				     unsigned maxpages, size_t *_start_offset)
-{
-	unsigned nr, offset, count;
-	pgoff_t index;
-	loff_t pos;
-
-	pos =3D i->xarray_start + i->iov_offset;
-	index =3D pos >> PAGE_SHIFT;
-	offset =3D pos & ~PAGE_MASK;
-	*_start_offset =3D offset;
-
-	count =3D want_pages_array(pages, maxsize, offset, maxpages);
-	if (!count)
-		return -ENOMEM;
-	nr =3D iter_xarray_populate_pages(*pages, i->xarray, index, count);
-	if (nr =3D=3D 0)
-		return 0;
-
-	maxsize =3D min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
-	i->iov_offset +=3D maxsize;
-	i->count -=3D maxsize;
-	return maxsize;
-}
-
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i, size_t =
*size)
 {
@@ -1044,22 +960,6 @@ static unsigned long first_iovec_segment(const struct=
 iov_iter *i, size_t *size)
 	BUG(); // if it had been empty, we wouldn't get called
 }
=20
-/* must be done on non-empty ITER_BVEC one */
-static struct page *first_bvec_segment(const struct iov_iter *i,
-				       size_t *size, size_t *start)
-{
-	struct page *page;
-	size_t skip =3D i->iov_offset, len;
-
-	len =3D i->bvec->bv_len - skip;
-	if (*size > len)
-		*size =3D len;
-	skip +=3D i->bvec->bv_offset;
-	page =3D i->bvec->bv_page + skip / PAGE_SIZE;
-	*start =3D skip % PAGE_SIZE;
-	return page;
-}
-
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start)
@@ -1095,36 +995,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov=
_iter *i,
 		iov_iter_advance(i, maxsize);
 		return maxsize;
 	}
-	if (iov_iter_is_bvec(i)) {
-		struct page **p;
-		struct page *page;
=20
-		page =3D first_bvec_segment(i, &maxsize, start);
-		n =3D want_pages_array(pages, maxsize, *start, maxpages);
-		if (!n)
-			return -ENOMEM;
-		p =3D *pages;
-		for (int k =3D 0; k < n; k++) {
-			struct folio *folio =3D page_folio(page + k);
-			p[k] =3D page + k;
-			if (!folio_test_slab(folio))
-				folio_get(folio);
-		}
-		maxsize =3D min_t(size_t, maxsize, n * PAGE_SIZE - *start);
-		i->count -=3D maxsize;
-		i->iov_offset +=3D maxsize;
-		if (i->iov_offset =3D=3D i->bvec->bv_len) {
-			i->iov_offset =3D 0;
-			i->bvec++;
-			i->nr_segs--;
-		}
-		return maxsize;
-	}
-	if (iov_iter_is_folioq(i))
-		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
-	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
-	return -EFAULT;
+	if (iov_iter_is_kvec(i))
+		return -EFAULT;
+	return iter_get_kernel_pages(i, pages, maxsize, maxpages, start);
 }
=20
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org
Subject: [PATCH v2 07/21] iov_iter: Add a segmented queue of bio_vec[]
Date: Mon, 18 May 2026 23:29:39 +0100
Message-ID: <20260518222959.488126-8-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue to
be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated together with it, but
this is not required.  A flag indicates whether this is the case, as
comparing ->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
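
Illustration only, not part of the patch: a minimal userspace sketch of how
a cursor might walk a chain of segments forwards and backwards, which is
the reason the prev pointer exists.  All of the names here (struct seg,
struct segq, struct cursor, cursor_advance(), cursor_revert()) are
hypothetical stand-ins rather than kernel API, and the sketch assumes the
caller never advances or reverts beyond the data actually in the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: only the length matters here. */
struct seg {
	size_t len;
};

/* Hypothetical stand-in for struct bvecq: one segment of the queue. */
struct segq {
	struct segq *next;	/* NULL at the tail */
	struct segq *prev;	/* NULL at the head */
	unsigned int nr_slots;	/* Elements of bv[] in use */
	struct seg *bv;
};

/* Hypothetical stand-in for the ITER_BVECQ state in struct iov_iter. */
struct cursor {
	const struct segq *q;
	unsigned int slot;
	size_t offset;		/* Offset into q->bv[slot] */
	size_t count;		/* Bytes remaining */
};

/* Step forwards by 'by' bytes, following ->next between segments. */
static void cursor_advance(struct cursor *c, size_t by)
{
	assert(by <= c->count);
	c->count -= by;
	by += c->offset;	/* Measure from the start of the slot */
	for (;;) {
		while (c->slot >= c->q->nr_slots && c->q->next) {
			c->q = c->q->next;
			c->slot = 0;
		}
		if (c->slot >= c->q->nr_slots)
			break;	/* Reached the end of the chain */
		if (by < c->q->bv[c->slot].len)
			break;
		by -= c->q->bv[c->slot].len;
		c->slot++;
	}
	c->offset = by;
}

/* Step backwards by 'unroll' bytes, following ->prev between segments. */
static void cursor_revert(struct cursor *c, size_t unroll)
{
	c->count += unroll;
	if (unroll <= c->offset) {
		c->offset -= unroll;
		return;
	}
	unroll -= c->offset;
	c->offset = 0;
	for (;;) {
		size_t len;

		while (c->slot == 0) {
			c->q = c->q->prev;
			c->slot = c->q->nr_slots;
		}
		c->slot--;
		len = c->q->bv[c->slot].len;
		if (unroll <= len) {
			c->offset = len - unroll;
			break;
		}
		unroll -= len;
	}
}
```

With two chained segments holding elements of 100 and 50 bytes, then 200
bytes, advancing by 120 and then 130 lands 100 bytes into the second
segment's only element; reverting by 110 then has to follow prev back into
the first segment, ending 40 bytes into its second element.
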
 include/linux/bvecq.h      |  56 +++++++
 include/linux/iov_iter.h   |  69 +++++++-
 include/linux/uio.h        |  11 ++
 lib/iov_iter.c             | 322 ++++++++++++++++++++++++++++++++++++-
 lib/scatterlist.c          |  70 +++++++-
 lib/tests/kunit_iov_iter.c | 262 ++++++++++++++++++++++++++++++
 6 files changed, 784 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/bvecq.h

diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
new file mode 100644
index 000000000000..15f16f905877
--- /dev/null
+++ b/include/linux/bvecq.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Implementation of a segmented queue of bio_vec[].
+ *
+ * Copyright (C) 2026 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_BVECQ_H
+#define _LINUX_BVECQ_H
+
+#include <linux/bvec.h>
+
+/*
+ * The type of memory retention used by the elements in bvecq->bv[] and ho=
w to
+ * clean it up.
+ */
+enum bvecq_mem {
+	BVECQ_MEM_EXTERNAL,	/* Externally retained memory - no freeing */
+	BVECQ_MEM_PAGECACHE,	/* Ref'd pagecache pages - must put */
+	BVECQ_MEM_GUP,		/* Pinned memory from get_user_pages() - unpin */
+	BVECQ_MEM_ALLOCED,	/* Memory alloc'd by bvecq - can be freed/pooled */
+} __mode(byte);
+
+/*
+ * Segmented bio_vec queue.
+ *
+ * These can be linked together to form messages of indefinite length and
+ * iterated over with an ITER_BVECQ iterator.  The list is non-circular; n=
ext
+ * and prev are NULL at the ends.
+ *
+ * The bv pointer points to the bio_vec array; this may be __bv if allocat=
ed
+ * together.  The caller is responsible for determining whether or not thi=
s is
+ * the case as the array pointed to by bv may follow on directly from the
+ * bvecq by accident of allocation (ie. ->bv =3D=3D ->__bv is *not* suffic=
ient to
+ * determine this).
+ *
+ * The file position and discontiguity flag allow non-contiguous data sets=
 to
+ * be chained together, but still teased apart without the need to convert=
 the
+ * info in the bio_vec back into a folio pointer.
+ */
+struct bvecq {
+	struct bvecq	*next;		/* Next bvecq in the list or NULL */
+	struct bvecq	*prev;		/* Prev bvecq in the list or NULL */
+	unsigned long long fpos;	/* File position */
+	refcount_t	ref;
+	u32		priv;		/* Private data */
+	u16		nr_slots;	/* Number of elements in bv[] used */
+	u16		max_slots;	/* Number of elements allocated in bv[] */
+	enum bvecq_mem	mem_type:3;	/* What sort of memory and how to free it */
+	bool		inline_bv:1;	/* T if __bv[] is being used */
+	bool		discontig:1;	/* T if not contiguous with previous bvecq */
+	struct bio_vec	*bv;		/* Pointer to array of page fragments */
+	struct bio_vec	__bv[];		/* Default array (if ->inline_bv) */
+};
+
+#endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index f9a17fbbd398..c19a4c561ab4 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -10,6 +10,7 @@
=20
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
=20
 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
@@ -141,6 +142,66 @@ size_t iterate_bvec(struct iov_iter *iter, size_t len,=
 void *priv, void *priv2,
 	return progress;
 }
=20
+/*
+ * Handle ITER_BVECQ.
+ */
+static __always_inline
+size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *=
priv2,
+		     iov_step_f step)
+{
+	const struct bvecq *bq =3D iter->bvecq;
+	unsigned int slot =3D iter->bvecq_slot;
+	size_t progress =3D 0, skip =3D iter->iov_offset;
+
+	do {
+		const struct bio_vec *bvec;
+		struct page *page;
+		size_t poff, plen;
+		void *base;
+
+		if (slot >=3D bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq =3D bq->next;
+			slot =3D 0;
+		}
+
+		bvec =3D &bq->bv[slot];
+		page =3D bvec->bv_page + (bvec->bv_offset + skip) / PAGE_SIZE;
+		poff =3D (bvec->bv_offset + skip) % PAGE_SIZE;
+		plen =3D umin(bvec->bv_len - skip, len);
+
+		while (plen > 0) {
+			size_t part, remain, consumed;
+
+			part =3D umin(plen, PAGE_SIZE - poff);
+			base =3D kmap_local_page(page) + poff;
+			remain =3D step(base, progress, part, priv, priv2);
+			kunmap_local(base);
+
+			consumed =3D part - remain;
+			progress +=3D consumed;
+			skip +=3D consumed;
+			len -=3D consumed;
+			if (!len || remain)
+				goto stop;
+			page++;
+			poff =3D 0;
+			plen -=3D consumed;
+		}
+
+		skip =3D 0;
+		slot++;
+	} while (len);
+
+stop:
+	iter->bvecq_slot =3D slot;
+	iter->bvecq =3D bq;
+	iter->iov_offset =3D skip;
+	iter->count -=3D progress;
+	return progress;
+}
+
 /*
  * Handle ITER_FOLIOQ.
  */
@@ -306,6 +367,8 @@ size_t iterate_and_advance2(struct iov_iter *iter, size=
_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
@@ -342,8 +405,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_=
t len, void *priv,
  * buffer is presented in segments, which for kernel iteration are broken =
up by
  * physical pages and mapped, with the mapped address being presented.
  *
- * [!] Note This will only handle BVEC, KVEC, FOLIOQ, XARRAY and DISCARD-t=
ype
- * iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
+ * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
  *
  * A step functions, @step, must be provided, one for handling mapped kern=
el
  * addresses and the other is given user addresses which have the potentia=
l to
@@ -370,6 +433,8 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter=
, size_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e3..f7cfa6ea8213 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -26,6 +26,7 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_BVEC,
 	ITER_KVEC,
+	ITER_BVECQ,
 	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
@@ -68,6 +69,7 @@ struct iov_iter {
 				const struct iovec *__iov;
 				const struct kvec *kvec;
 				const struct bio_vec *bvec;
+				const struct bvecq *bvecq;
 				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
@@ -77,6 +79,7 @@ struct iov_iter {
 	};
 	union {
 		unsigned long nr_segs;
+		u16 bvecq_slot;
 		u8 folioq_slot;
 		loff_t xarray_start;
 	};
@@ -145,6 +148,11 @@ static inline bool iov_iter_is_discard(const struct io=
v_iter *i)
 	return iov_iter_type(i) =3D=3D ITER_DISCARD;
 }
=20
+static inline bool iov_iter_is_bvecq(const struct iov_iter *i)
+{
+	return iov_iter_type(i) =3D=3D ITER_BVECQ;
+}
+
 static inline bool iov_iter_is_folioq(const struct iov_iter *i)
 {
 	return iov_iter_type(i) =3D=3D ITER_FOLIOQ;
@@ -295,6 +303,9 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int dir=
ection, const struct kvec
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struc=
t bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t c=
ount);
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq,
+			 unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 			  const struct folio_queue *folioq,
 			  unsigned int first_slot, unsigned int offset, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index cac7d7364bc2..63fc75c2bc48 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -538,6 +538,39 @@ static void iov_iter_iovec_advance(struct iov_iter *i,=
 size_t size)
 	i->__iov =3D iov;
 }
=20
+static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
+{
+	const struct bvecq *bq =3D i->bvecq;
+	unsigned int slot =3D i->bvecq_slot;
+
+	if (!i->count)
+		return;
+	i->count -=3D by;
+
+	by +=3D i->iov_offset; /* From beginning of current segment. */
+	do {
+		size_t len;
+
+		while (slot >=3D bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq =3D bq->next;
+			slot =3D 0;
+		}
+
+		len =3D bq->bv[slot].bv_len;
+
+		if (likely(by < len))
+			break;
+		by -=3D len;
+		slot++;
+	} while (by);
+
+	i->iov_offset =3D by;
+	i->bvecq_slot =3D slot;
+	i->bvecq =3D bq;
+}
+
 static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
 {
 	const struct folio_queue *folioq =3D i->folioq;
@@ -583,6 +616,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
+	} else if (iov_iter_is_bvecq(i)) {
+		iov_iter_bvecq_advance(i, size);
 	} else if (iov_iter_is_folioq(i)) {
 		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
@@ -591,6 +626,32 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(iov_iter_advance);
=20
+static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
+{
+	const struct bvecq *bq =3D i->bvecq;
+	unsigned int slot =3D i->bvecq_slot;
+
+	for (;;) {
+		size_t len;
+
+		if (slot =3D=3D 0) {
+			bq =3D bq->prev;
+			slot =3D bq->nr_slots;
+		}
+		slot--;
+
+		len =3D bq->bv[slot].bv_len;
+		if (unroll <=3D len) {
+			i->iov_offset =3D len - unroll;
+			break;
+		}
+		unroll -=3D len;
+	}
+
+	i->bvecq_slot =3D slot;
+	i->bvecq =3D bq;
+}
+
 static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
 {
 	const struct folio_queue *folioq =3D i->folioq;
@@ -648,6 +709,9 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			}
 			unroll -=3D n;
 		}
+	} else if (iov_iter_is_bvecq(i)) {
+		i->iov_offset =3D 0;
+		iov_iter_bvecq_revert(i, unroll);
 	} else if (iov_iter_is_folioq(i)) {
 		i->iov_offset =3D 0;
 		iov_iter_folioq_revert(i, unroll);
@@ -678,9 +742,24 @@ size_t iov_iter_single_seg_count(const struct iov_iter=
 *i)
 		if (iov_iter_is_bvec(i))
 			return min(i->count, i->bvec->bv_len - i->iov_offset);
 	}
+	if (!i->count)
+		return 0;
+	if (unlikely(iov_iter_is_bvecq(i))) {
+		const struct bvecq *bq =3D i->bvecq;
+		unsigned int slot =3D i->bvecq_slot;
+		size_t offset =3D i->iov_offset;
+
+		while (slot >=3D bq->nr_slots) {
+			bq =3D bq->next;
+			if (!bq)
+				return 0;
+			slot =3D 0;
+			offset =3D 0;
+		}
+		return umin(i->count, bq->bv[slot].bv_len - offset);
+	}
 	if (unlikely(iov_iter_is_folioq(i)))
-		return !i->count ? 0 :
-			umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
 	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -717,6 +796,35 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int di=
rection,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
=20
+/**
+ * iov_iter_bvec_queue - Initialise an I/O iterator to use a segmented bve=
c queue
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @bvecq: The starting point in the bvec queue.
+ * @first_slot: The first slot in the bvec queue to use
+ * @offset: The offset into the bvec in the first slot to start at
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the buffers attached =
to an
+ * inode or to inject data into those buffers.  The pages *must* be preven=
ted
+ * from evaporation by the caller.
+ */
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq, unsigned int first_slot,
+			 unsigned int offset, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i =3D (struct iov_iter) {
+		.iter_type	=3D ITER_BVECQ,
+		.data_source	=3D direction,
+		.bvecq		=3D bvecq,
+		.bvecq_slot	=3D first_slot,
+		.count		=3D count,
+		.iov_offset	=3D offset,
+	};
+}
+EXPORT_SYMBOL(iov_iter_bvec_queue);
+
 /**
  * iov_iter_folio_queue - Initialise an I/O iterator to use the folios in =
a folio queue
  * @i: The iterator to initialise.
@@ -839,6 +947,37 @@ static unsigned long iov_iter_alignment_bvec(const str=
uct iov_iter *i)
 	return res;
 }
=20
+static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
+{
+	const struct bvecq *bq;
+	unsigned long res =3D 0;
+	unsigned int slot =3D iter->bvecq_slot;
+	size_t skip =3D iter->iov_offset;
+	size_t size =3D iter->count;
+
+	if (!size)
+		return res;
+
+	for (bq =3D iter->bvecq; bq; bq =3D bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec =3D &bq->bv[slot];
+			size_t part =3D umin(bvec->bv_len - skip, size);
+
+			res |=3D bvec->bv_offset + skip;
+			res |=3D part;
+
+			size -=3D part;
+			if (size =3D=3D 0)
+				return res;
+			skip =3D 0;
+		}
+
+		slot =3D 0;
+	}
+
+	return res;
+}
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	if (likely(iter_is_ubuf(i))) {
@@ -854,6 +993,8 @@ unsigned long iov_iter_alignment(const struct iov_iter =
*i)
=20
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_alignment_bvecq(i);
=20
 	/* With both xarray and folioq types, we're dealing with whole folios. */
 	if (iov_iter_is_folioq(i))
@@ -1066,6 +1207,36 @@ static int bvec_npages(const struct iov_iter *i, int=
 maxpages)
 	return npages;
 }
=20
+static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpage=
s)
+{
+	const struct bvecq *bq;
+	unsigned int slot =3D iter->bvecq_slot;
+	size_t npages =3D 0;
+	size_t skip =3D iter->iov_offset;
+	size_t size =3D iter->count;
+
+	for (bq =3D iter->bvecq; bq; bq =3D bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec =3D &bq->bv[slot];
+			size_t offs =3D (bvec->bv_offset + skip) % PAGE_SIZE;
+			size_t part =3D umin(bvec->bv_len - skip, size);
+
+			npages +=3D DIV_ROUND_UP(offs + part, PAGE_SIZE);
+			if (npages >=3D maxpages)
+				goto out;
+
+			size -=3D part;
+			if (!size)
+				goto out;
+			skip =3D 0;
+		}
+
+		slot =3D 0;
+	}
+out:
+	return umin(npages, maxpages);
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1080,6 +1251,8 @@ int iov_iter_npages(const struct iov_iter *i, int max=
pages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
+	if (iov_iter_is_bvecq(i))
+		return iov_npages_bvecq(i, maxpages);
 	if (iov_iter_is_folioq(i)) {
 		unsigned offset =3D i->iov_offset % PAGE_SIZE;
 		int npages =3D DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1366,6 +1539,147 @@ void iov_iter_restore(struct iov_iter *i, struct io=
v_iter_state *state)
 	i->nr_segs =3D state->nr_segs;
 }
=20
+/*
+ * Count the number of virtually contiguous pages coming up next in an
+ * ITER_BVECQ iterator, up to the specified maxima.
+ */
+static unsigned int iter_count_bvecq_pages(const struct iov_iter *iter,
+					   size_t maxsize,
+					   unsigned int maxpages)
+{
+	const struct bvecq *bvecq =3D iter->bvecq;
+	unsigned int slot =3D iter->bvecq_slot;
+	ssize_t remain =3D umin(maxsize, iter->count);
+	size_t count =3D 0, offset =3D iter->iov_offset;
+
+	for (;; slot++) {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		while (slot >=3D bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(remain > 0);
+				break;
+			}
+			bvecq =3D bvecq->next;
+			slot =3D 0;
+		}
+
+		bv =3D &bvecq->bv[slot];
+		boff =3D bv->bv_offset;
+		blen =3D bv->bv_len;
+
+		if (unlikely(!bv->bv_page)) {
+			if (blen && count > 0)
+				break;
+			continue;
+		}
+		if (!PAGE_ALIGNED(boff) && count > 0)
+			break;
+
+		boff +=3D offset;
+		blen -=3D offset;
+		offset =3D 0;
+		if (!blen)
+			continue;
+
+		blen =3D umin(blen, remain);
+		remain -=3D blen;
+		blen +=3D offset_in_page(boff);
+		count +=3D DIV_ROUND_UP(blen, PAGE_SIZE);
+
+		if (!PAGE_ALIGNED(blen))
+			break;
+		if (remain <=3D 0)
+			break;
+		if (count >=3D maxpages)
+			break;
+	}
+
+	return umin(count, maxpages);
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_BVECQ iterato=
r.
+ * This does not get references on the pages, nor does it get a pin on the=
m.
+ */
+static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
+					    struct page ***pages, size_t maxsize,
+					    unsigned int maxpages,
+					    iov_iter_extraction_t extraction_flags,
+					    size_t *offset0)
+{
+	const struct bvecq *bvecq;
+	struct page **p;
+	unsigned int slot, nr =3D 0;
+	size_t extracted =3D 0, offset;
+
+	/* Count the next run of virtually contiguous pages. */
+	maxpages =3D iter_count_bvecq_pages(iter, maxsize, maxpages);
+
+	if (!*pages) {
+		*pages =3D kvmalloc_array(maxpages, sizeof(struct page *), GFP_KERNEL);
+		if (!*pages)
+			return -ENOMEM;
+	}
+
+	p =3D *pages;
+
+	/* Now transcribe the page pointers. */
+	extracted =3D 0;
+	bvecq =3D iter->bvecq;
+	offset =3D iter->iov_offset;
+	slot =3D iter->bvecq_slot;
+
+	do {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		while (slot >=3D bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(extracted < iter->count);
+				break;
+			}
+			bvecq =3D bvecq->next;
+			slot =3D 0;
+		}
+
+		bv =3D &bvecq->bv[slot];
+		boff =3D bv->bv_offset;
+		blen =3D bv->bv_len;
+
+		if (!bv->bv_page)
+			blen =3D 0;
+
+		if (offset < blen) {
+			size_t part =3D umin(maxsize - extracted, blen - offset);
+			size_t poff =3D (boff + offset) % PAGE_SIZE;
+			size_t pix =3D (boff + offset) / PAGE_SIZE;
+
+			if (poff + part > PAGE_SIZE)
+				part =3D PAGE_SIZE - poff;
+
+			if (!extracted)
+				*offset0 =3D poff;
+
+			p[nr++] =3D bv->bv_page + pix;
+			offset +=3D part;
+			extracted +=3D part;
+		}
+
+		if (offset >=3D blen) {
+			offset =3D 0;
+			slot++;
+		}
+	} while (nr < maxpages && extracted < maxsize);
+
+	iter->bvecq =3D bvecq;
+	iter->bvecq_slot =3D slot;
+	iter->iov_offset =3D offset;
+	iter->count -=3D extracted;
+	return extracted;
+}
+
 /*
  * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This =
does
  * not get references on the pages, nor does it get a pin on them.
@@ -1708,6 +2022,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_bvec_pages(i, pages, maxsize,
 						   maxpages, extraction_flags,
 						   offset0);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
+						    maxpages, extraction_flags,
+						    offset0);
 	if (iov_iter_is_folioq(i))
 		return iov_iter_extract_folioq_pages(i, pages, maxsize,
 						     maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index b7fe91ef35b8..b92144659543 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/kmemleak.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/uio.h>
 #include <linux/folio_queue.h>
=20
@@ -1267,6 +1268,68 @@ static ssize_t extract_kvec_to_sg(struct iov_iter *i=
ter,
 	return ret;
 }
=20
+/*
+ * Extract up to sg_max folios from a BVECQ-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
+				   ssize_t maxsize,
+				   struct sg_table *sgtable,
+				   unsigned int sg_max,
+				   iov_iter_extraction_t extraction_flags)
+{
+	const struct bvecq *bvecq =3D iter->bvecq;
+	struct scatterlist *sg =3D sgtable->sgl + sgtable->nents;
+	unsigned int seg =3D iter->bvecq_slot;
+	ssize_t ret =3D 0;
+	size_t offset =3D iter->iov_offset;
+
+	if (seg >=3D bvecq->nr_slots) {
+		bvecq =3D bvecq->next;
+		if (WARN_ON_ONCE(!bvecq))
+			return 0;
+		seg =3D 0;
+	}
+
+	do {
+		const struct bio_vec *bv =3D &bvecq->bv[seg];
+		size_t blen =3D bv->bv_len;
+
+		if (!bv->bv_page)
+			blen =3D 0;
+
+		if (offset < blen) {
+			size_t part =3D umin(maxsize - ret, blen - offset);
+
+			sg_set_page(sg, bv->bv_page, part, bv->bv_offset + offset);
+			sgtable->nents++;
+			sg++;
+			sg_max--;
+			offset +=3D part;
+			ret +=3D part;
+		}
+
+		if (offset >=3D blen) {
+			offset =3D 0;
+			seg++;
+			if (seg >=3D bvecq->nr_slots) {
+				if (!bvecq->next) {
+					WARN_ON_ONCE(ret < iter->count);
+					break;
+				}
+				bvecq =3D bvecq->next;
+				seg =3D 0;
+			}
+		}
+	} while (sg_max > 0 && ret < maxsize);
+
+	iter->bvecq =3D bvecq;
+	iter->bvecq_slot =3D seg;
+	iter->iov_offset =3D offset;
+	iter->count -=3D ret;
+	return ret;
+}
+
 /*
  * Extract up to sg_max folios from an FOLIOQ-type iterator and add them to
  * the scatterlist.  The pages are not pinned.
@@ -1390,8 +1453,8 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *=
iter,
  * addition of @sg_max elements.
  *
  * The pages referred to by UBUF- and IOVEC-type iterators are extracted a=
nd
- * pinned; BVEC-, KVEC-, FOLIOQ- and XARRAY-type are extracted but aren't
- * pinned; DISCARD-type is not supported.
+ * pinned; BVEC-, BVECQ-, KVEC-, FOLIOQ- and XARRAY-type are extracted but
+ * aren't pinned; DISCARD-type is not supported.
  *
  * No end mark is placed on the scatterlist; that's left to the caller.
  *
@@ -1423,6 +1486,9 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, siz=
e_t maxsize,
 	case ITER_KVEC:
 		return extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
 					  extraction_flags);
+	case ITER_BVECQ:
+		return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max,
+					   extraction_flags);
 	case ITER_FOLIOQ:
 		return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max,
 					    extraction_flags);
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index 37bd6eb25896..1342487dd211 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
 #include <linux/scatterlist.h>
 #include <linux/minmax.h>
@@ -545,6 +546,185 @@ static void __init iov_kunit_copy_from_folioq(struct =
kunit *test)
 	KUNIT_SUCCEED(test);
 }
=20
+static void iov_kunit_destroy_bvecq(void *data)
+{
+	struct bvecq *bq, *next;
+
+	for (bq =3D data; bq; bq =3D next) {
+		next =3D bq->next;
+		for (int i =3D 0; i < bq->nr_slots; i++)
+			if (bq->bv[i].bv_page)
+				put_page(bq->bv[i].bv_page);
+		kfree(bq);
+	}
+}
+
+static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned in=
t max_slots)
+{
+	struct bvecq *bq;
+
+	bq =3D kzalloc(struct_size(bq, __bv, max_slots), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq);
+	bq->max_slots =3D max_slots;
+	bq->bv =3D bq->__bv;
+	bq->inline_bv =3D true;
+	return bq;
+}
+
+static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned i=
nt max_slots)
+{
+	struct bvecq *bq;
+
+	bq =3D iov_kunit_alloc_bvecq(test, max_slots);
+	kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq);
+	return bq;
+}
+
+static void __init iov_kunit_load_bvecq(struct kunit *test,
+					struct iov_iter *iter, int dir,
+					struct bvecq *bq_head,
+					struct page **pages, size_t npages)
+{
+	struct bvecq *bq =3D bq_head;
+	size_t size =3D 0;
+
+	for (int i =3D 0; i < npages; i++) {
+		if (bq->nr_slots >=3D bq->max_slots) {
+			bq->next =3D iov_kunit_alloc_bvecq(test, 13);
+			bq->next->prev =3D bq;
+			bq =3D bq->next;
+		}
+		bvec_set_page(&bq->bv[bq->nr_slots], pages[i], PAGE_SIZE, 0);
+		bq->nr_slots++;
+		size +=3D PAGE_SIZE;
+	}
+	iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size);
+}
+
+/*
+ * Test copying to an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_to_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, patt;
+
+	bufsize =3D 0x100000;
+	npages =3D bufsize / PAGE_SIZE;
+
+	bq =3D iov_kunit_create_bvecq(test, 13);
+
+	scratch =3D iov_kunit_create_buffer(test, &spages, npages);
+	for (i =3D 0; i < bufsize; i++)
+		scratch[i] =3D pattern(i);
+
+	buffer =3D iov_kunit_create_buffer(test, &bpages, npages);
+	memset(buffer, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i =3D 0;
+	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
+		size =3D pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, READ, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied =3D copy_to_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i +=3D size;
+		if (test->status =3D=3D KUNIT_FAILURE)
+			goto stop;
+	}
+
+	/* Build the expected image in the scratch buffer. */
+	patt =3D 0;
+	memset(scratch, 0, bufsize);
+	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++)
+		for (i =3D pr->from; i < pr->to; i++)
+			scratch[i] =3D pattern(patt++);
+
+	/* Compare the images */
+	for (i =3D 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=3D%x", i);
+		if (buffer[i] !=3D scratch[i])
+			return;
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
+/*
+ * Test copying from an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_from_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, j;
+
+	bufsize =3D 0x100000;
+	npages =3D bufsize / PAGE_SIZE;
+
+	bq =3D iov_kunit_create_bvecq(test, 13);
+
+	buffer =3D iov_kunit_create_buffer(test, &bpages, npages);
+	for (i =3D 0; i < bufsize; i++)
+		buffer[i] =3D pattern(i);
+
+	scratch =3D iov_kunit_create_buffer(test, &spages, npages);
+	memset(scratch, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i =3D 0;
+	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
+		size =3D pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied =3D copy_from_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i +=3D size;
+	}
+
+	/* Build the expected image in the main buffer. */
+	i =3D 0;
+	memset(buffer, 0, bufsize);
+	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
+		for (j =3D pr->from; j < pr->to; j++) {
+			buffer[i++] =3D pattern(j);
+			if (i >=3D bufsize)
+				goto stop;
+		}
+	}
+stop:
+
+	/* Compare the images */
+	for (i =3D 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=3D%x", i);
+		if (scratch[i] !=3D buffer[i])
+			return;
+	}
+
+	KUNIT_SUCCEED(test);
+}
+
 static void iov_kunit_destroy_xarray(void *data)
 {
 	struct xarray *xarray =3D data;
@@ -860,6 +1040,85 @@ static void __init iov_kunit_extract_pages_bvec(struc=
t kunit *test)
 	KUNIT_SUCCEED(test);
 }
=20
+/*
+ * Test the extraction of ITER_BVECQ-type iterators.
+ */
+static void __init iov_kunit_extract_pages_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **bpages, *pagelist[8], **pages =3D pagelist;
+	ssize_t len;
+	size_t bufsize, size =3D 0, npages;
+	int i, from;
+
+	bufsize =3D 0x100000;
+	npages =3D bufsize / PAGE_SIZE;
+
+	bq =3D iov_kunit_create_bvecq(test, 13);
+
+	iov_kunit_create_buffer(test, &bpages, npages);
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
+		from =3D pr->from;
+		size =3D pr->to - from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, from);
+
+		do {
+			size_t offset0 =3D LONG_MAX;
+
+			for (i =3D 0; i < ARRAY_SIZE(pagelist); i++)
+				pagelist[i] =3D (void *)(unsigned long)0xaa55aa55aa55aa55ULL;
+
+			len =3D iov_iter_extract_pages(&iter, &pages, 100 * 1024,
+						     ARRAY_SIZE(pagelist), 0, &offset0);
+			KUNIT_EXPECT_GE(test, len, 0);
+			if (len < 0)
+				break;
+			KUNIT_EXPECT_LE(test, len, size);
+			KUNIT_EXPECT_EQ(test, iter.count, size - len);
+			if (len =3D=3D 0)
+				break;
+			size -=3D len;
+			KUNIT_EXPECT_GE(test, (ssize_t)offset0, 0);
+			KUNIT_EXPECT_LT(test, offset0, PAGE_SIZE);
+
+			for (i =3D 0; i < ARRAY_SIZE(pagelist); i++) {
+				struct page *p;
+				ssize_t part =3D min_t(ssize_t, len, PAGE_SIZE - offset0);
+				int ix;
+
+				KUNIT_ASSERT_GE(test, part, 0);
+				ix =3D from / PAGE_SIZE;
+				KUNIT_ASSERT_LT(test, ix, npages);
+				p =3D bpages[ix];
+				KUNIT_EXPECT_PTR_EQ(test, pagelist[i], p);
+				KUNIT_EXPECT_EQ(test, offset0, from % PAGE_SIZE);
+				from +=3D part;
+				len -=3D part;
+				KUNIT_ASSERT_GE(test, len, 0);
+				if (len =3D=3D 0)
+					break;
+				offset0 =3D 0;
+			}
+
+			if (test->status =3D=3D KUNIT_FAILURE)
+				goto stop;
+		} while (iov_iter_count(&iter) > 0);
+
+		KUNIT_EXPECT_EQ(test, size, 0);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
 /*
  * Test the extraction of ITER_FOLIOQ-type iterators.
  */
@@ -1219,12 +1478,15 @@ static struct kunit_case __refdata iov_kunit_cases[=
] =3D {
 	KUNIT_CASE(iov_kunit_copy_from_kvec),
 	KUNIT_CASE(iov_kunit_copy_to_bvec),
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
+	KUNIT_CASE(iov_kunit_copy_to_bvecq),
+	KUNIT_CASE(iov_kunit_copy_from_bvecq),
 	KUNIT_CASE(iov_kunit_copy_to_folioq),
 	KUNIT_CASE(iov_kunit_copy_from_folioq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvec),
+	KUNIT_CASE(iov_kunit_extract_pages_bvecq),
 	KUNIT_CASE(iov_kunit_extract_pages_folioq),
 	KUNIT_CASE(iov_kunit_extract_pages_xarray),
 	KUNIT_CASE(iov_kunit_iter_to_sg_kvec),
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id EF7FF3B19D9
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:31:29 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143492; cv=none;
 b=NqrsBskj1JSIJ8RERFuIfSfssU0gcb2Cn5mcVcpOyvwbYGZlMIpN06TEvXZFZPxVmCrvc7GA24vRJN6AHH1k7TT3G/+l7Fvd65aSyCySE1P1MpsP/p2HbCtIl2uHJyIf+6/JYSWYAnsx15Mp6xLQ+hq76qtr5YHr8srWJqa5+m8=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143492; c=relaxed/simple;
	bh=n7VZ60VhzHmlQt4Oryuyi9u8opzqejbmXt0GXaOk3a8=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=qk64Ecl/QAzxwmoDp3yIKoyaZSkuC6p58OPdYtjsJ0j5rueQrRFk4BcFjIlYDU6EIFTOzSqv+pO2jMYBxQIKjyhyw5kBlgoN/v7547nmydTyAW22ENNwUW4uddNBnz3uh+aOxJEKiCDY017JO68SQjOPkrIQoro6jNmDYVVdTiM=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=DwM/UPgf; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="DwM/UPgf"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143489;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=h6cK/+wZZBgV9aDVhNdrE2DJVenyEWleKupZz8Kyaho=;
	b=DwM/UPgffvBNx6XgcQhOp9vkL7Leb7Iabd2KvdfQpilDf/1X3S356JlTQookKl3S/LJjlt
	GE2Sl8tqJgqQuUtF4SPoPYCz9+iGCh2gfRCw7KpkiFUozRAfP4+w7BDnbjwC0lRWTLShzx
	8eUUO/ng3wFR+/vrU/r+OXmcn9mDues=
Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-695-4VU8WuiPP_uwOBmn2F2DFw-1; Mon,
 18 May 2026 18:31:25 -0400
X-MC-Unique: 4VU8WuiPP_uwOBmn2F2DFw-1
X-Mimecast-MFC-AGG-ID: 4VU8WuiPP_uwOBmn2F2DFw_1779143482
Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 728C6180047F;
	Mon, 18 May 2026 22:31:22 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3D78B1956053;
	Mon, 18 May 2026 22:31:14 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 08/21] netfs: Add some tools for managing bvecq chains
Date: Mon, 18 May 2026 23:29:40 +0100
Message-ID: <20260518222959.488126-9-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Type: text/plain; charset="utf-8"

Provide a selection of tools for managing bvec queue chains.  This
includes:

 (1) Allocation, prepopulation, expansion, shortening and refcounting of
     bvecqs and bvecq chains.

     This can be used to do things like creating an encryption buffer in
     cifs or a directory content buffer in afs.  The memory segments will
     be appropriately disposed of according to the flags on the bvecq.

 (2) Management of a bvecq chain as a rolling buffer and the management of
     positions within it.

 (3) Loading folios, slicing chains and clearing content.
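
The node sizing behind the allocation in (1) can be sketched in user
space roughly as follows.  This is a simplified model, assuming a 64-bit
build in which the bvecq header is 6 words (48 bytes) and each slot is 2
words (16 bytes), as the BVECQ_STD_SLOTS comment states.  HDR_SIZE,
SLOT_SIZE, MAX_SIZE and the model_*() helpers are hypothetical stand-ins
for illustration, not kernel symbols.

```c
#include <stddef.h>

/* Assumed 64-bit sizes; stand-ins, not taken from the real headers. */
#define HDR_SIZE	48u	/* models sizeof(struct bvecq) */
#define SLOT_SIZE	16u	/* models sizeof(struct bio_vec) */
#define MAX_SIZE	512u	/* largest slab object a node may use */

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Double p until it is no longer less than n; call with p set to 1. */
static size_t pow2_at_least(size_t n, size_t p)
{
	return p < n ? pow2_at_least(n, p * 2) : p;
}

/* Allocation size chosen for a request of nr_slots slots. */
static size_t model_size(size_t nr_slots)
{
	return pow2_at_least(HDR_SIZE +
			     min_sz(nr_slots,
				    (MAX_SIZE - HDR_SIZE) / SLOT_SIZE) *
			     SLOT_SIZE, 1);
}

/* Slots that actually fit in the chosen allocation. */
static size_t model_max_slots(size_t nr_slots)
{
	return (model_size(nr_slots) - HDR_SIZE) / SLOT_SIZE;
}
```

Under those assumed sizes, a request for 13 slots gets a 256-byte node
holding exactly 13, a request for 2 slots gets a 128-byte node holding 5,
and a request for 100 slots is capped at a 512-byte node holding 29, with
the remainder left for further nodes in the chain.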

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/Makefile            |   1 +
 fs/netfs/bvecq.c             | 763 +++++++++++++++++++++++++++++++++++
 fs/netfs/internal.h          |   1 +
 fs/netfs/stats.c             |   4 +-
 include/linux/bvecq.h        | 269 ++++++++++++
 include/linux/netfs.h        |   1 +
 include/trace/events/netfs.h |  24 ++
 7 files changed, 1062 insertions(+), 1 deletion(-)
 create mode 100644 fs/netfs/bvecq.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index b43188d64bd8..e1f12ecb5abf 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@
 netfs-y :=3D \
 	buffered_read.o \
 	buffered_write.o \
+	bvecq.o \
 	direct_read.o \
 	direct_write.o \
 	iterator.o \
diff --git a/fs/netfs/bvecq.c b/fs/netfs/bvecq.c
new file mode 100644
index 000000000000..b3822fe87f64
--- /dev/null
+++ b/fs/netfs/bvecq.c
@@ -0,0 +1,763 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Buffering helpers for bvec queues
+ *
+ * Copyright (C) 2026 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/bvecq.h>
+#include "internal.h"
+
+void bvecq_dump(const struct bvecq *bq)
+{
+	int b =3D 0;
+
+	for (; bq; bq =3D bq->next, b++) {
+		int skipz =3D 0;
+
+		pr_notice("BQ[%u] %u/%u fp=3D%llx%s\n",
+			  b, bq->nr_slots, bq->max_slots, bq->fpos,
+			  bq->discontig ? " discontig" : "");
+		for (int s =3D 0; s < bq->nr_slots; s++) {
+			const struct bio_vec *bv =3D &bq->bv[s];
+
+			if (!bv->bv_page && !bv->bv_len && skipz < 2) {
+				skipz =3D 1;
+				continue;
+			}
+			if (skipz =3D=3D 1)
+				pr_notice("BQ[%u:00-%02u] ...\n", b, s - 1);
+			skipz =3D 2;
+			pr_notice("BQ[%u:%02u] %10lx %04x %04x %u\n",
+				  b, s,
+				  bv->bv_page ? page_to_pfn(bv->bv_page) : 0,
+				  bv->bv_offset, bv->bv_len,
+				  bv->bv_page ? page_count(bv->bv_page) : 0);
+		}
+	}
+}
+EXPORT_SYMBOL(bvecq_dump);
+
+/**
+ * bvecq_alloc_one - Allocate a single bvecq node with unpopulated slots
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a single bvecq node and initialise the header.  A number of in=
line
+ * slots are also allocated, rounded up to fit after the header in a power=
-of-2
+ * slab object of up to 512 bytes (up to 29 slots on a 64-bit cpu).  The s=
lot
+ * array is not initialised.
+ *
+ * Return: The node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp)
+{
+	struct bvecq *bq;
+	const size_t max_size =3D 512;
+	const size_t max_slots =3D (max_size - sizeof(*bq)) / sizeof(bq->__bv[0]);
+	size_t part =3D umin(nr_slots, max_slots);
+	size_t size =3D roundup_pow_of_two(struct_size(bq, __bv, part));
+
+	bq =3D kmalloc(size, gfp & ~GFP_ZONEMASK);
+	if (bq) {
+		*bq =3D (struct bvecq) {
+			.ref		=3D REFCOUNT_INIT(1),
+			.bv		=3D bq->__bv,
+			.inline_bv	=3D true,
+			.max_slots	=3D (size - sizeof(*bq)) / sizeof(bq->__bv[0]),
+		};
+		netfs_stat(&netfs_n_bvecq);
+	}
+	return bq;
+}
+EXPORT_SYMBOL(bvecq_alloc_one);
+
+/**
+ * bvecq_alloc_chain - Allocate an unpopulated bvecq chain
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a chain of bvecq nodes providing at least the requested cumula=
tive
+ * number of slots.
+ *
+ * Return: The first node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp)
+{
+	struct bvecq *head =3D NULL, *tail =3D NULL;
+
+	_enter("%zu", nr_slots);
+
+	for (;;) {
+		struct bvecq *bq;
+
+		bq =3D bvecq_alloc_one(nr_slots, gfp);
+		if (!bq)
+			goto oom;
+
+		if (tail) {
+			tail->next =3D bq;
+			bq->prev =3D tail;
+		} else {
+			head =3D bq;
+		}
+		tail =3D bq;
+		if (tail->max_slots >=3D nr_slots)
+			break;
+		nr_slots -=3D tail->max_slots;
+	}
+
+	return head;
+oom:
+	bvecq_put(head);
+	return NULL;
+}
+EXPORT_SYMBOL(bvecq_alloc_chain);
+
+/**
+ * bvecq_alloc_buffer2 - Allocate a bvecq chain and populate with buffers
+ * @size: Target size of the buffer (can be 0 for an empty buffer)
+ * @pre_slots: Number of preamble slots to set aside
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a chain of bvecq nodes and populate the slots with sufficient =
pages
+ * to provide at least the requested amount of space, leaving the first
+ * @pre_slots slots unset.  The pre-slots must all fit into the first
+ * bvecq.
+ *
+ * The pages allocated may be compound pages larger than PAGE_SIZE and thus
+ * occupy fewer slots.  The pages have their refcounts set to 1 and can be
+ * passed to MSG_SPLICE_PAGES.
+ *
+ * Return: The first node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_buffer2(size_t size, unsigned int pre_slots, gfp=
_t gfp)
+{
+	struct bvecq *head =3D NULL, *tail =3D NULL, *p =3D NULL;
+	size_t nr_per_bq =3D BVECQ_STD_SLOTS;
+	size_t count =3D DIV_ROUND_UP(size, PAGE_SIZE);
+
+	_enter("%zx,%zx,%u", size, count, pre_slots);
+
+	if (WARN_ON_ONCE(pre_slots > nr_per_bq))
+		return NULL;
+
+	do {
+		struct page **pages;
+		int want, got;
+
+		p =3D bvecq_alloc_one(min(pre_slots + count, nr_per_bq), gfp);
+		if (!p)
+			goto oom;
+
+		p->mem_type =3D BVECQ_MEM_ALLOCED;
+
+		if (tail) {
+			tail->next =3D p;
+			p->prev =3D tail;
+		} else {
+			head =3D p;
+		}
+		tail =3D p;
+		if (!count)
+			break;
+
+		/* Need to clear pre slots and pages[], so just clear all. */
+		memset(p->bv, 0, p->max_slots * sizeof(p->bv[0]));
+
+		pages =3D (struct page **)&p->bv[p->max_slots];
+		pages -=3D p->max_slots - pre_slots;
+
+		want =3D min(count, p->max_slots - pre_slots);
+		got =3D alloc_pages_bulk(gfp, want, pages);
+		if (got < want) {
+			for (int i =3D 0; i < got; i++) {
+				__free_page(pages[i]);
+				pages[i] =3D NULL;
+			}
+			goto oom;
+		}
+
+		tail->nr_slots =3D pre_slots + got;
+		for (int i =3D 0; i < got; i++) {
+			int j =3D pre_slots + i;
+
+			set_page_count(pages[i], 1);
+			bvec_set_page(&tail->bv[j], pages[i], PAGE_SIZE, 0);
+		}
+
+		count -=3D got;
+		pre_slots =3D 0;
+	} while (count > 0);
+
+	return head;
+oom:
+	bvecq_put(head);
+	return NULL;
+}
+EXPORT_SYMBOL(bvecq_alloc_buffer2);
+
+/*
+ * Free the page pointed to by a slot as necessary.
+ */
+static void bvecq_free_slot(struct bvecq *bq, unsigned int slot)
+{
+	struct page *page =3D bq->bv[slot].bv_page;
+
+	if (!page)
+		return;
+
+	switch (bq->mem_type) {
+	case BVECQ_MEM_EXTERNAL:
+		break;
+	case BVECQ_MEM_PAGECACHE:
+		put_page(page);
+		break;
+	case BVECQ_MEM_GUP:
+		unpin_user_page(page);
+		break;
+	case BVECQ_MEM_ALLOCED:
+		__free_pages(page, compound_order(page));
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+}
+
+/**
+ * bvecq_put - Put a ref on a bvec queue
+ * @bq: The start of the bvec queue to put
+ *
+ * Put the ref(s) on the nodes in a bvec queue, freeing up the node and the
+ * page fragments it points to as the refcounts become zero.
+ */
+void bvecq_put(struct bvecq *bq)
+{
+	struct bvecq *next;
+
+	for (; bq; bq =3D next) {
+		if (!refcount_dec_and_test(&bq->ref))
+			break;
+		for (int slot =3D 0; slot < bq->nr_slots; slot++)
+			bvecq_free_slot(bq, slot);
+		next =3D bq->next;
+		netfs_stat_d(&netfs_n_bvecq);
+		kfree(bq);
+	}
+}
+EXPORT_SYMBOL(bvecq_put);
+
+/**
+ * bvecq_expand_buffer - Allocate buffer space into a bvec queue
+ * @_buffer: Pointer to the bvecq chain to expand (may point to a NULL; up=
dated).
+ * @_cur_size: Current size of the buffer (updated).
+ * @size: Target size of the buffer.
+ * @gfp: The allocation constraints.
+ */
+int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t=
 size, gfp_t gfp)
+{
+	struct bvecq *tail =3D *_buffer;
+
+	size =3D round_up(size, PAGE_SIZE);
+	if (*_cur_size >=3D size)
+		return 0;
+
+	if (tail)
+		while (tail->next)
+			tail =3D tail->next;
+
+	do {
+		struct page *page;
+		int order =3D 0;
+
+		if (!tail || bvecq_is_full(tail)) {
+			struct bvecq *p;
+
+			p =3D bvecq_alloc_one(BVECQ_STD_SLOTS, gfp);
+			if (!p)
+				return -ENOMEM;
+			if (tail) {
+				tail->next =3D p;
+				p->prev =3D tail;
+			} else {
+				*_buffer =3D p;
+			}
+			tail =3D p;
+			p->mem_type =3D BVECQ_MEM_ALLOCED;
+		}
+
+		if (size - *_cur_size > PAGE_SIZE)
+			order =3D umin(ilog2(size - *_cur_size) - PAGE_SHIFT,
+				     MAX_PAGECACHE_ORDER);
+
+		page =3D alloc_pages(gfp | __GFP_COMP, order);
+		if (!page && order > 0) {
+			page =3D alloc_pages(gfp | __GFP_COMP, 0);
+			order =3D 0;
+		}
+		if (!page)
+			return -ENOMEM;
+
+		bvec_set_page(&tail->bv[tail->nr_slots++], page, PAGE_SIZE << order, 0);
+		*_cur_size +=3D PAGE_SIZE << order;
+	} while (*_cur_size < size);
+
+	return 0;
+}
+EXPORT_SYMBOL(bvecq_expand_buffer);
+
+/**
+ * bvecq_shorten_buffer - Shorten a bvec queue buffer
+ * @bq: The start of the buffer to shorten
+ * @slot: The slot to start from
+ * @size: The size to retain
+ *
+ * Shorten the content of a bvec queue down to the minimum number of slots,
+ * starting at the specified slot, to retain the specified size.
+ *
+ * Return: 0 if successful; -EMSGSIZE if there is insufficient content.
+ */
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size)
+{
+	ssize_t retain =3D size;
+
+	/* Skip through the segments we want to keep. */
+	for (; bq; bq =3D bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			retain -=3D bq->bv[slot].bv_len;
+			if (retain < 0)
+				goto found;
+		}
+		slot =3D 0;
+	}
+	if (WARN_ON_ONCE(retain > 0))
+		return -EMSGSIZE;
+	return 0;
+
+found:
+	/* Shorten the entry to be retained and clean the rest of this bvecq. */
+	bq->bv[slot].bv_len +=3D retain;
+	slot++;
+	for (int i =3D slot; i < bq->nr_slots; i++)
+		bvecq_free_slot(bq, i);
+	bq->nr_slots =3D slot;
+
+	/* Free the queue tail. */
+	bvecq_put(bq->next);
+	bq->next =3D NULL;
+	return 0;
+}
+EXPORT_SYMBOL(bvecq_shorten_buffer);
+
+/**
+ * bvecq_buffer_init - Initialise a buffer and set position
+ * @pos: The position to point at the new buffer.
+ * @gfp: The allocation constraints.
+ *
+ * Initialise a rolling buffer.  We allocate an unpopulated bvecq node so
+ * that the pointers can be independently driven by the producer and the
+ * consumer.
+ *
+ * Return: 0 if successful; -ENOMEM on allocation failure.
+ */
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp)
+{
+	struct bvecq *bq;
+
+	bq =3D bvecq_alloc_one(BVECQ_STD_SLOTS, gfp);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq  =3D bq; /* Comes with a ref. */
+	pos->slot   =3D 0;
+	pos->offset =3D 0;
+	return 0;
+}
+
+/**
+ * bvecq_buffer_append - Append a new bvecq node to a buffer
+ * @pos: The position of the last node.
+ * @bq: The buffer to add.
+ *
+ * Add a new node on to the buffer chain at the specified position, either
+ * because the previous one is full or because we have a discontiguity to
+ * contend with, and update @pos to point to it.
+ */
+void bvecq_buffer_append(struct bvecq_pos *pos, struct bvecq *bq)
+{
+	struct bvecq *head =3D pos->bvecq;
+
+	bq->prev =3D head;
+
+	pos->bvecq =3D bvecq_get(bq);
+	pos->slot =3D 0;
+	pos->offset =3D 0;
+
+	/* Make sure the initialisation is stored before the next pointer.
+	 *
+	 * [!] NOTE: After we set head->next, the consumer is at liberty to
+	 * immediately delete the old head.
+	 */
+	smp_store_release(&head->next, bq);
+	bvecq_put(head);
+}
+
+/**
+ * bvecq_pos_advance - Advance a bvecq position
+ * @pos: The position to advance.
+ * @amount: The number of bytes to advance by.
+ *
+ * Advance the specified bvecq position by @amount bytes.  @pos is updated=
 and
+ * bvecq ref counts may have been manipulated.  If the position hits the e=
nd of
+ * the queue, then it is left pointing beyond the last slot of the last bv=
ecq
+ * so that it doesn't break the chain.
+ */
+void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount)
+{
+	struct bvecq *bq =3D pos->bvecq;
+	unsigned int slot =3D pos->slot;
+	size_t offset =3D pos->offset;
+
+	while (amount) {
+		size_t part;
+
+		while (bvecq_acquire_slot(bq, slot)) {
+			if (!bq->next) {
+				WARN_ON_ONCE(amount > 0);
+				break;
+			}
+			bq =3D bq->next;
+			slot =3D 0;
+		}
+
+		part =3D bq->bv[slot].bv_len - offset;
+
+		if (part > amount) {
+			offset +=3D amount;
+			break;
+		}
+		amount -=3D part;
+		offset =3D 0;
+		slot++;
+	}
+
+	pos->slot   =3D slot;
+	pos->offset =3D offset;
+	bvecq_pos_move(pos, bq);
+}
+
+/*
+ * Clear part of the memory pointed to by a bio_vec.
+ */
+static void bvec_zero(const struct bio_vec *bv, size_t offset, size_t len)
+{
+	struct page *page =3D bv->bv_page;
+
+	offset +=3D bv->bv_offset;
+
+	page  +=3D offset / PAGE_SIZE;
+	offset =3D offset % PAGE_SIZE;
+
+	while (len) {
+		size_t part =3D umin(len, PAGE_SIZE - offset);
+		char *p =3D kmap_local_page(page);
+
+		memset(p + offset, 0, part);
+		kunmap_local(p);
+
+		len -=3D part;
+		offset =3D 0;
+		page++;
+	}
+}
+
+/**
+ * bvecq_zero - Clear memory starting at the bvecq position.
+ * @pos: The position in the bvecq chain to start clearing.
+ * @amount: The number of bytes to clear.
+ *
+ * Clear memory fragments pointed to by a bvec queue.  @pos is updated and
+ * bvecq ref counts may have been manipulated.  If the position hits the e=
nd of
+ * the queue, then it is left pointing beyond the last slot of the last bv=
ecq
+ * so that it doesn't break the chain.
+ *
+ * Return: The number of bytes cleared.
+ */
+ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount)
+{
+	struct bvecq *bq;
+	unsigned int slot =3D pos->slot;
+	size_t cleared =3D 0, offset =3D pos->offset;
+
+	bq =3D pos->bvecq;
+	for (;;) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec =3D &bq->bv[slot];
+
+			if (offset < bvec->bv_len && bvec->bv_page) {
+				size_t part =3D umin(bvec->bv_len - offset, amount);
+
+				bvec_zero(bvec, offset, part);
+
+				cleared +=3D part;
+				offset +=3D part;
+				amount -=3D part;
+				if (!amount)
+					goto out;
+			}
+			offset =3D 0;
+		}
+
+		/* pos->bvecq isn't allowed to go NULL as the queue may get
+		 * extended and we would lose our place.
+		 */
+		if (!bq->next)
+			break;
+		slot =3D 0;
+		bq =3D bq->next;
+	}
+
+out:
+	if (slot =3D=3D bq->nr_slots && bq->next) {
+		bq =3D bq->next;
+		slot =3D 0;
+		offset =3D 0;
+	}
+	bvecq_pos_move(pos, bq);
+	pos->slot =3D slot;
+	pos->offset =3D offset;
+	return cleared;
+}
+
+/**
+ * bvecq_slice - Find a slice of a bvecq queue
+ * @pos: The position to start at.
+ * @max_size: The maximum size of the slice (or ULONG_MAX).
+ * @max_slots: The maximum number of slots in the slice (or INT_MAX).
+ * @_nr_slots: Where to put the number of slots (updated).
+ *
+ * Determine the size and number of slots that can be obtained from the
+ * next slice of the bvec queue, up to the maximum size and slot count
+ * specified.  The slice is also limited if a discontiguity is found.
+ *
+ * @pos is updated to the end of the slice.  If the position hits the end =
of
+ * the queue, then it is left pointing beyond the last slot of the last bv=
ecq
+ * so that it doesn't break the chain.
+ *
+ * Return: The number of bytes in the slice.
+ */
+size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size,
+		   unsigned int max_slots, unsigned int *_nr_slots)
+{
+	struct bvecq *bq;
+	unsigned int slot =3D pos->slot, nslots =3D 0;
+	size_t size =3D 0, offset =3D pos->offset;
+
+	bq =3D pos->bvecq;
+	for (;;) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec =3D &bq->bv[slot];
+
+			if (offset < bvec->bv_len && bvec->bv_page) {
+				size_t part =3D umin(bvec->bv_len - offset, max_size);
+
+				size +=3D part;
+				offset +=3D part;
+				max_size -=3D part;
+				nslots++;
+				if (!max_size || nslots >=3D max_slots)
+					goto out;
+			}
+			offset =3D 0;
+		}
+
+		/* pos->bvecq isn't allowed to go NULL as the queue may get
+		 * extended and we would lose our place.
+		 */
+		if (!bq->next)
+			break;
+		slot =3D 0;
+		bq =3D bq->next;
+		if (bq->discontig && size > 0)
+			break;
+	}
+
+out:
+	*_nr_slots =3D nslots;
+	if (slot =3D=3D bq->nr_slots && bq->next) {
+		bq =3D bq->next;
+		slot =3D 0;
+		offset =3D 0;
+	}
+	bvecq_pos_move(pos, bq);
+	pos->slot =3D slot;
+	pos->offset =3D offset;
+	return size;
+}
+
+/**
+ * bvecq_extract - Extract a slice of a bvecq queue into a new bvecq queue
+ * @pos: The position to start at.
+ * @max_size: The maximum size of the slice (or ULONG_MAX).
+ * @max_slots: The maximum number of slots in the slice (or INT_MAX).
+ * @to: Where to put the extraction bvecq chain head (updated).
+ *
+ * Allocate a new bvecq and extract into it memory fragments from a slice =
of
+ * bvec queue, starting at @pos.  The slice is also limited if a discontig=
uity
+ * is found.  No refs are taken on the page.
+ *
+ * @pos is updated to the end of the slice.  If the position hits the end =
of
+ * the queue, then it is left pointing beyond the last slot of the last bv=
ecq
+ * so that it doesn't break the chain.
+ *
+ * If successful, *@to is set to point to the head of the newly allocated =
chain
+ * and the caller inherits a ref to it.
+ *
+ * Return: The number of bytes extracted; -ENOMEM on allocation failure or=
 -EIO
+ * if no slots were available to extract.
+ */
+ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size,
+		      unsigned int max_slots, struct bvecq **to)
+{
+	struct bvecq_pos tmp_pos;
+	struct bvecq *src, *dst =3D NULL;
+	unsigned int slot =3D pos->slot, dslot =3D 0, nslots;
+	ssize_t extracted =3D 0;
+	size_t offset =3D pos->offset, amount;
+
+	*to =3D NULL;
+	if (WARN_ON_ONCE(!max_slots))
+		max_slots =3D INT_MAX;
+
+	bvecq_pos_set(&tmp_pos, pos);
+	amount =3D bvecq_slice(&tmp_pos, max_size, max_slots, &nslots);
+	bvecq_pos_unset(&tmp_pos);
+	if (nslots =3D=3D 0)
+		return -EIO;
+
+	dst =3D bvecq_alloc_chain(nslots, GFP_KERNEL);
+	if (!dst)
+		return -ENOMEM;
+	*to =3D dst;
+	max_slots =3D nslots;
+	nslots =3D 0;
+
+	/* Transcribe the slots */
+	src =3D pos->bvecq;
+	for (;;) {
+		for (; slot < src->nr_slots; slot++) {
+			const struct bio_vec *sv =3D &src->bv[slot];
+			struct bio_vec *dv =3D &dst->bv[dslot];
+
+			_debug("EXTR BQ=3D%x[%x] off=3D%zx am=3D%zx p=3D%lx",
+			       src->priv, slot, offset, amount, page_to_pfn(sv->bv_page));
+
+			if (offset < sv->bv_len && sv->bv_page) {
+				size_t part =3D umin(sv->bv_len - offset, amount);
+
+				bvec_set_page(dv, sv->bv_page, part,
+					      sv->bv_offset + offset);
+				extracted +=3D part;
+				amount -=3D part;
+				offset +=3D part;
+				trace_netfs_bv_slot(dst, dslot);
+				dslot++;
+				nslots++;
+				if (dslot >=3D dst->max_slots) {
+					bvecq_filled_to(dst, dslot);
+					dst =3D dst->next;
+					dslot =3D 0;
+				}
+				if (nslots >=3D max_slots)
+					goto out;
+				if (amount =3D=3D 0)
+					goto out;
+			}
+			offset =3D 0;
+		}
+
+		/* pos->bvecq isn't allowed to go NULL as the queue may get
+		 * extended and we would lose our place.
+		 */
+		if (!src->next)
+			break;
+		slot =3D 0;
+		src =3D src->next;
+		if (src->discontig && extracted > 0)
+			break;
+	}
+
+out:
+	if (dst)
+		bvecq_filled_to(dst, dslot);
+	if (slot =3D=3D src->nr_slots && src->next) {
+		src =3D src->next;
+		slot =3D 0;
+		offset =3D 0;
+	}
+	bvecq_pos_move(pos, src);
+	pos->slot =3D slot;
+	pos->offset =3D offset;
+	return extracted;
+}
+
+/**
+ * bvecq_load_from_ra - Allocate a bvecq chain and load from readahead
+ * @pos: Blank position object to attach the new chain to.
+ * @ractl: The readahead control context.
+ *
+ * Decant the set of folios to be read from the readahead context into a b=
vecq
+ * chain.  Each folio occupies one bio_vec element.
+ *
+ * Return: Amount of data loaded or -ENOMEM on allocation failure.
+ */
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control=
 *ractl)
+{
+	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
+	struct folio *folio;
+	struct bvecq *bq;
+	unsigned int slot =3D 0;
+	size_t loaded =3D 0;
+
+	bq =3D bvecq_alloc_chain(ractl->_nr_folios, GFP_NOFS);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq  =3D bq;
+	pos->slot   =3D 0;
+	pos->offset =3D 0;
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
+		size_t len;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+		len =3D folio_size(folio);
+		bvec_set_folio(&bq->bv[slot], folio, len, 0);
+		loaded +=3D len;
+		slot++;
+		trace_netfs_folio(folio, netfs_folio_trace_read);
+
+		if (slot >=3D bq->max_slots) {
+			bvecq_filled_to(bq, slot);
+			bq =3D bq->next;
+			if (!bq)
+				break;
+			slot =3D 0;
+		}
+	}
+
+	rcu_read_unlock();
+
+	if (bq)
+		bvecq_filled_to(bq, slot);
+
+	ractl->_index +=3D ractl->_nr_pages;
+	ractl->_nr_pages =3D 0;
+	return loaded;
+}
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 4b0f9304b970..53e1fcc42a19 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -172,6 +172,7 @@ extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
 extern atomic_t netfs_n_folioq;
+extern atomic_t netfs_n_bvecq;
=20
 int netfs_stats_show(struct seq_file *m, void *v);
=20
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index ab6b916addc4..84c2a4bcc762 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -48,6 +48,7 @@ atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
 atomic_t netfs_n_folioq;
+atomic_t netfs_n_bvecq;
=20
 int netfs_stats_show(struct seq_file *m, void *v)
 {
@@ -90,9 +91,10 @@ int netfs_stats_show(struct seq_file *m, void *v)
 		   atomic_read(&netfs_n_rh_retry_read_subreq),
 		   atomic_read(&netfs_n_wh_retry_write_req),
 		   atomic_read(&netfs_n_wh_retry_write_subreq));
-	seq_printf(m, "Objs   : rr=3D%u sr=3D%u foq=3D%u wsc=3D%u\n",
+	seq_printf(m, "Objs   : rr=3D%u sr=3D%u bq=3D%u foq=3D%u wsc=3D%u\n",
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq),
+		   atomic_read(&netfs_n_bvecq),
 		   atomic_read(&netfs_n_folioq),
 		   atomic_read(&netfs_n_wh_wstream_conflict));
 	seq_printf(m, "WbLock : skip=3D%u wait=3D%u\n",
diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
index 15f16f905877..dd2e60e3b743 100644
--- a/include/linux/bvecq.h
+++ b/include/linux/bvecq.h
@@ -53,4 +53,273 @@ struct bvecq {
 	struct bio_vec	__bv[];		/* Default array (if ->inline_bv) */
 };
=20
+#if BITS_PER_LONG =3D=3D 64
+/* Number of slots in __bv[] for a bvecq in a 512-byte kmalloc block. */
+#define BVECQ_STD_SLOTS		29	/* 2 words/slot; 32 slots; bvecq is 6 words (3=
 slots) */
+#elif  BITS_PER_LONG =3D=3D 32
+/* Number of slots in __bv[] for a bvecq in a 256-byte kmalloc block. */
+#define BVECQ_STD_SLOTS		18	/* 3 words/slot; 21 slots; bvecq is 9 words (3=
 slots) */
+#else
+#error BVECQ_STD_SLOTS undetermined
+#endif
+
+/*
+ * Position in a bio_vec queue.  The position holds a ref on the queue
+ * segment it points to.
+ */
+struct bvecq_pos {
+	struct bvecq		*bvecq;		/* The first bvecq */
+	unsigned int		offset;		/* The offset within the starting slot */
+	u16			slot;		/* The starting slot */
+};
+
+void bvecq_dump(const struct bvecq *bq);
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_buffer2(size_t size, unsigned int pre_slots, gfp=
_t gfp);
+void bvecq_put(struct bvecq *bq);
+int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t=
 size, gfp_t gfp);
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size);
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp);
+void bvecq_buffer_append(struct bvecq_pos *pos, struct bvecq *bq);
+void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount);
+ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount);
+size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size,
+		   unsigned int max_slots, unsigned int *_nr_slots);
+ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size,
+		      unsigned int max_slots, struct bvecq **to);
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control=
 *ractl);
+
+/**
+ * bvecq_alloc_buffer - Allocate a bvecq chain and populate with buffers
+ * @size: Target size of the buffer (can be 0 for an empty buffer)
+ * @gfp: The allocation constraints.
+ *
+ * Wrapper around bvecq_alloc_buffer2() with no pre-allocated slots.
+ */
+static inline struct bvecq *bvecq_alloc_buffer(size_t size, gfp_t gfp)
+{
+	return bvecq_alloc_buffer2(size, 0, gfp);
+}
+
+/**
+ * bvecq_get - Get a ref on a bvecq
+ * @bq: The bvecq to get a ref on
+ */
+static inline struct bvecq *bvecq_get(struct bvecq *bq)
+{
+	refcount_inc(&bq->ref);
+	return bq;
+}
+
+/**
+ * bvecq_is_full - Determine if a bvecq is full
+ * @bvecq: The object to query
+ *
+ * Return: true if full; false if not.
+ */
+static inline bool bvecq_is_full(const struct bvecq *bvecq)
+{
+	return bvecq->nr_slots >=3D bvecq->max_slots;
+}
+
+/**
+ * bvecq_filled_to - Release filled slots with release barrier
+ * @bvecq: The object modified
+ * @to: The latest slot filled + 1
+ */
+static inline void bvecq_filled_to(struct bvecq *bvecq, unsigned int to)
+{
+	/* Set the slot counter after filling the slot */
+	smp_store_release(&bvecq->nr_slots, to);
+}
+
+/**
+ * bvecq_nr_slots_acquire - Get the number of filled slots with acquire ba=
rrier
+ * @bvecq: The object to query
+ *
+ * Return: The number of filled slots
+ */
+static inline unsigned int bvecq_nr_slots_acquire(const struct bvecq *bvec=
q)
+{
+	/* Read the slot counter before looking at the slot */
+	return smp_load_acquire(&bvecq->nr_slots);
+}
+
+/**
+ * bvecq_acquire_slot - Determine if a slot is valid with acquire barrier
+ * @bvecq: The object to query
+ * @slot: The next slot
+ *
+ * Return: true if valid; false if might not be valid
+ */
+static inline bool bvecq_acquire_slot(const struct bvecq *bvecq, unsigned =
int slot)
+{
+	/* Read the slot counter before looking at the slot */
+	return slot < bvecq_nr_slots_acquire(bvecq);
+}
+
+/**
+ * bvecq_pos_set - Set one position to be the same as another
+ * @pos: The position object to set
+ * @at: The source position.
+ *
+ * Set @pos to have the same position as @at.  This may take a ref on the
+ * bvecq pointed to.
+ */
+static inline void bvecq_pos_set(struct bvecq_pos *pos, const struct bvecq=
_pos *at)
+{
+	*pos =3D *at;
+	bvecq_get(pos->bvecq);
+}
+
+/**
+ * bvecq_pos_unset - Unset a position
+ * @pos: The position object to unset
+ *
+ * Unset @pos.  This does any needed ref cleanup.
+ */
+static inline void bvecq_pos_unset(struct bvecq_pos *pos)
+{
+	bvecq_put(pos->bvecq);
+	pos->bvecq =3D NULL;
+	pos->slot =3D 0;
+	pos->offset =3D 0;
+}
+
+/**
+ * bvecq_pos_transfer - Transfer one position to another, clearing the fir=
st
+ * @pos: The position object to set
+ * @from: The source position to clear.
+ *
+ * Set @pos to have the same position as @from and then clear @from.  This=
 may
+ * transfer a ref on the bvecq pointed to.
+ */
+static inline void bvecq_pos_transfer(struct bvecq_pos *pos, struct bvecq_=
pos *from)
+{
+	*pos =3D *from;
+	from->bvecq =3D NULL;
+	from->slot =3D 0;
+	from->offset =3D 0;
+}
+
+/**
+ * bvecq_pos_move - Update a position to a new bvecq
+ * @pos: The position object to update.
+ * @to: The new bvecq to point at.
+ *
+ * Update @pos to point to @to if it doesn't already do so.  This may
+ * manipulate refs on the bvecqs pointed to.
+ */
+static inline void bvecq_pos_move(struct bvecq_pos *pos, struct bvecq *to)
+{
+	struct bvecq *old =3D pos->bvecq;
+
+	if (old !=3D to) {
+		pos->bvecq =3D bvecq_get(to);
+		bvecq_put(old);
+	}
+}
+
+/**
+ * bvecq_pos_nudge - Nudge a position onto the next segment if current use=
d up
+ * @pos: The position object to nudge.
+ *
+ * Update @pos to point to the next segment in the chain if we've used up =
the
+ * current segment.  This may manipulate refs on the bvecqs pointed to.
+ *
+ * Return: true if found a new segment, false if hit the end.
+ */
+static inline bool bvecq_pos_nudge(struct bvecq_pos *pos)
+{
+	struct bvecq *bq =3D pos->bvecq;
+
+	for (;;) {
+		if (!bvecq_acquire_slot(bq, pos->slot)) {
+			bq =3D bq->next;
+			if (!bq)
+				return false;
+			bvecq_pos_move(pos, bq);
+			pos->slot =3D 0;
+			pos->offset =3D 0;
+			continue;
+		}
+		if (pos->offset >=3D bq->bv[pos->slot].bv_len) {
+			pos->slot++;
+			pos->offset =3D 0;
+			continue;
+		}
+		return true;
+	}
+}
+
+/**
+ * bvecq_pos_step - Step a position to the next slot if possible
+ * @pos: The position object to step.
+ *
+ * Update @pos to point to the next slot in the queue if not at the end.  =
This
+ * may manipulate refs on the bvecqs pointed to.
+ *
+ * Return: true if successful, false if was at the end.
+ */
+static inline bool bvecq_pos_step(struct bvecq_pos *pos)
+{
+	struct bvecq *bq =3D pos->bvecq;
+
+	pos->slot++;
+	pos->offset =3D 0;
+	if (pos->slot <=3D bq->nr_slots)
+		return true;
+	if (!bq->next)
+		return false;
+	bvecq_pos_move(pos, bq->next);
+	return true;
+}
+
+/**
+ * bvecq_delete_spent - Delete the bvecq at the front if possible
+ * @pos: The position object to update.
+ * @slot: Current slot.
+ *
+ * Delete the used up bvecq at the front of the queue that @pos points to =
if it
+ * is not the last node in the queue; if it is the last node in the queue,=
 it
+ * is kept so that the queue doesn't become detached from the other end.  =
This
+ * may manipulate refs on the bvecqs pointed to.  It is also possible that=
 the
+ * producer will fill more slots in the current bvecq.
+ *
+ * Also, we have to be very careful: the consumer can catch the producer, =
which
+ * could lead to us having nothing left in the queue, causing the front and
+ * back pointers to end up on different tracks.  To avoid this, we must al=
ways
+ * keep at least one segment in the queue.
+ *
+ * The caller must reload from @pos after calling this.
+ *
+ * Return: true if there's more available; false if not.
+ */
+static inline bool bvecq_delete_spent(struct bvecq_pos *pos, unsigned int =
slot)
+{
+	struct bvecq *spent =3D pos->bvecq;
+	struct bvecq *next;
+
+again:
+	/* Read the contents of the queue node after the pointer to it. */
+	next =3D smp_load_acquire(&spent->next);
+	if (!next)
+		return false; /* Nothing more to consume at the moment. */
+	if (slot < bvecq_nr_slots_acquire(spent))
+		return true; /* The producer added more. */
+	next->prev =3D NULL;
+	spent->next =3D NULL;
+	bvecq_put(spent);
+	pos->bvecq =3D next; /* We take spent's ref. */
+	pos->slot =3D 0;
+	pos->offset =3D 0;
+	if (!bvecq_acquire_slot(next, 0)) {
+		spent =3D next;
+		slot =3D 0;
+		goto again;
+	}
+	return true;
+}
+
 #endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index f7f55b7621f3..12e5c51c11c8 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/bvecq.h>
 #include <linux/uio.h>
 #include <linux/rolling_buffer.h>
=20
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 83266835b7ad..d5723ce18cbb 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -799,6 +799,30 @@ TRACE_EVENT(netfs_folioq,
 		      __print_symbolic(__entry->trace, netfs_folioq_traces))
 	    );
=20
+TRACE_EVENT(netfs_bv_slot,
+	    TP_PROTO(const struct bvecq *bq, int slot),
+
+	    TP_ARGS(bq, slot),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned long,		pfn)
+		    __field(unsigned int,		offset)
+		    __field(unsigned int,		len)
+		    __field(unsigned int,		slot)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->slot =3D slot;
+		    __entry->pfn =3D page_to_pfn(bq->bv[slot].bv_page);
+		    __entry->offset =3D bq->bv[slot].bv_offset;
+		    __entry->len =3D bq->bv[slot].bv_len;
+			   ),
+
+	    TP_printk("bq[%x] p=3D%lx %x-%x",
+		      __entry->slot,
+		      __entry->pfn, __entry->offset, __entry->offset + __entry->len)
+	    );
+
 #undef EM
 #undef E_
 #endif /* _TRACE_NETFS_H */
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 09/21] netfs: Add a function to extract from an iter into a
 bvecq
Date: Mon, 18 May 2026 23:29:41 +0100
Message-ID: <20260518222959.488126-10-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Add a function to extract a slice of data from an iterator of any type into
a bvec queue chain.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
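Note for reviewers (not part of the commit): the innermost loop of
netfs_extract_iter() turns one extracted run of bytes into one bio_vec per
page, where only the first page may start at a non-zero offset.  The
userspace sketch below shows just that arithmetic.  The names sketch_seg,
sketch_split and SKETCH_PAGE_SIZE are hypothetical and exist only for this
illustration; the 4096-byte page size is an assumption, and the caller is
assumed to pass an offset smaller than the page size.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

struct sketch_seg {		/* hypothetical stand-in for a bio_vec */
	size_t offset;		/* offset of the data within its page */
	size_t len;		/* number of bytes used in that page */
};

/*
 * Split a run of 'got' bytes, starting 'offset' bytes into its first page,
 * into per-page segments.  Stops early if 'max' segments have been filled.
 * Returns the number of segments written.
 */
static size_t sketch_split(size_t got, size_t offset,
			   struct sketch_seg *segs, size_t max)
{
	size_t n = 0;

	while (got > 0 && n < max) {
		size_t room = SKETCH_PAGE_SIZE - offset;
		size_t len = got < room ? got : room;

		segs[n].offset = offset;
		segs[n].len = len;
		n++;
		got -= len;
		offset = 0;	/* only the first page can start mid-page */
	}
	return n;
}
```

For example, a 10000-byte run starting 100 bytes into a page would occupy
three segments of 3996, 4096 and 1908 bytes under these assumptions.
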
 fs/netfs/iterator.c   | 125 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |   3 +
 2 files changed, 128 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index b375567e0520..d2c3055a488c 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -13,6 +13,131 @@
 #include <linux/netfs.h>
 #include "internal.h"
=20
+/**
+ * netfs_extract_iter - Extract virtually contiguous pages from an iterato=
r into a bvecq
+ * @orig: The original iterator
+ * @max_len: Maximum number of bytes to extract
+ * @max_pages: Maximum number of pages to extract
+ * @fpos: Starting file position to label the bvecq with
+ * @_bvecq_head: Where to cache the bvec queue
+ * @extraction_flags: Flags to qualify the request
+ *
+ * Extract virtually contiguous page fragments from the source iterator up=
 to
+ * the given maxima and build a bvec queue that refers to all of those
+ * bits.  This allows the original iterator to be disposed of.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-pee=
r DMA be
+ * allowed on the pages extracted.
+ *
+ * On success, the amount of data in the bvec queue is returned and the
+ * original iterator will have been advanced by the amount extracted.
+ *
+ * The bvecq segments are marked with indications of how to clean up the
+ * extracted fragments.
+ */
+ssize_t netfs_extract_iter(struct iov_iter *orig, size_t max_len, size_t m=
ax_pages,
+			   unsigned long long fpos, struct bvecq **_bvecq_head,
+			   iov_iter_extraction_t extraction_flags)
+{
+	struct bvecq *bq_tail =3D NULL;
+	ssize_t ret =3D 0;
+	size_t extracted =3D 0;
+
+	_enter("{%u,%zx},%zx", orig->iter_type, orig->count, max_len);
+
+	if (max_len > orig->count)
+		max_len =3D orig->count;
+	if (WARN_ON_ONCE(!max_len || !max_pages))
+		return 0;
+
+	max_pages =3D iov_iter_npages(orig, max_pages);
+	if (!max_pages)
+		return 0;
+
+	do {
+		struct bvecq *bq;
+
+		bq =3D bvecq_alloc_one(max_pages, GFP_NOFS);
+		if (!bq) {
+			ret =3D -ENOMEM;
+			break;
+		}
+		if (user_backed_iter(orig))
+			bq->mem_type =3D iov_iter_extract_will_pin(orig) ?
+				BVECQ_MEM_GUP : BVECQ_MEM_PAGECACHE;
+		bq->prev	=3D bq_tail;
+		bq->fpos	=3D fpos + extracted;
+
+		if (bq_tail)
+			bq_tail->next =3D bq;
+		else
+			*_bvecq_head =3D bq;
+		bq_tail =3D bq;
+
+		if (max_len =3D=3D 0)
+			break;
+
+		struct bio_vec *bv =3D bq->bv;
+		do {
+			struct page **pages;
+			ssize_t got;
+			size_t offset;
+			size_t space =3D bq->max_slots - bq->nr_slots;
+			size_t bv_size =3D array_size(bq->max_slots, sizeof(*bv));
+			size_t pg_size =3D array_size(space, sizeof(*pages));
+
+			/* Put the page list at the end of the bvec list
+			 * storage.  bvec elements are larger than page
+			 * pointers, so as long as we work 0->last, we should
+			 * be fine.
+			 */
+			pages =3D (void *)bv + bv_size - pg_size;
+
+			got =3D iov_iter_extract_pages(orig, &pages, max_len,
+						     space, extraction_flags, &offset);
+			if (got < 0) {
+				ret =3D got;
+				goto out;
+			}
+
+			if (got =3D=3D 0) {
+				pr_err("extract_pages gave nothing from %zu, %zu\n",
+				       extracted, max_len);
+				ret =3D -EIO;
+				goto out;
+			}
+
+			if (WARN(got > max_len,
+				 "%s: extract_pages overrun %zd > %zu bytes\n",
+				 __func__, got, max_len)) {
+				ret =3D -EIO;
+				break;
+			}
+
+			extracted +=3D got;
+			max_len -=3D got;
+
+			do {
+				size_t len =3D umin(got, PAGE_SIZE - offset);
+
+				BUG_ON(bq->nr_slots >=3D bq->max_slots);
+
+				bvec_set_page(&bq->bv[bq->nr_slots],
+					      *pages++, len, offset);
+				bq->nr_slots++;
+				got -=3D len;
+				offset =3D 0;
+			} while (got > 0);
+		} while (max_len > 0 && !bvecq_is_full(bq));
+
+		max_pages -=3D bq->nr_slots;
+	} while (max_len > 0 && max_pages > 0);
+
+out:
+	return extracted ?: ret;
+}
+EXPORT_SYMBOL_GPL(netfs_extract_iter);
+
 /**
  * netfs_extract_user_iter - Extract the pages from a user iterator into a=
 bvec
  * @orig: The original iterator
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 12e5c51c11c8..40f45ecf1db8 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -460,6 +460,9 @@ void netfs_get_subrequest(struct netfs_io_subrequest *s=
ubreq,
 			  enum netfs_sreq_ref_trace what);
 void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 			  enum netfs_sreq_ref_trace what);
+ssize_t netfs_extract_iter(struct iov_iter *orig, size_t max_len, size_t m=
ax_pages,
+			   unsigned long long fpos, struct bvecq **_bvecq_head,
+			   iov_iter_extraction_t extraction_flags);
 ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 				struct iov_iter *new,
 				iov_iter_extraction_t extraction_flags);
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 10/21] afs: Use a bvecq to hold dir content rather than
 folioq
Date: Mon, 18 May 2026 23:29:42 +0100
Message-ID: <20260518222959.488126-11-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Use a bvecq to hold the contents of a directory rather than the folioq so
that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
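Note for reviewers (not part of the commit): afs_dir_get_block() now walks
the bvecq chain, summing segment lengths until it reaches the segment that
contains the wanted directory block.  The userspace sketch below mirrors
that search.  The types sketch_node and sketch_hit and the function
sketch_find_block are hypothetical stand-ins, not kernel API.  It assumes
AFS_DIR_BLOCK_SIZE is 2048 and that every segment length is a multiple of
the block size, so a block never straddles two segments.

```c
#include <stddef.h>

#define SKETCH_BLOCK_SIZE 2048u	/* assumed value of AFS_DIR_BLOCK_SIZE */

struct sketch_node {		/* hypothetical stand-in for struct bvecq */
	struct sketch_node *next;
	unsigned int nr_slots;	/* number of filled slots */
	const size_t *seg_len;	/* length of each filled slot */
};

struct sketch_hit {
	const struct sketch_node *node;	/* node holding the block */
	unsigned int slot;		/* slot within that node */
	size_t offset;			/* block offset within the slot */
};

/* Locate directory block 'block'; returns 1 if found, 0 if out of range. */
static int sketch_find_block(const struct sketch_node *head, size_t block,
			     struct sketch_hit *hit)
{
	size_t blpos = block * SKETCH_BLOCK_SIZE;
	size_t blend = blpos + SKETCH_BLOCK_SIZE;
	size_t fpos = 0;	/* file position of the slot being examined */

	for (const struct sketch_node *n = head; n; n = n->next) {
		for (unsigned int s = 0; s < n->nr_slots; s++) {
			size_t len = n->seg_len[s];

			if (blend <= fpos + len) {
				hit->node = n;
				hit->slot = s;
				hit->offset = blpos - fpos;
				return 1;
			}
			fpos += len;
		}
	}
	return 0;
}
```

Unlike the patch, the sketch always starts from the head of the chain; the
real code caches the last node, slot and file position in the iterator so
that a later lookup can resume from there.
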
 fs/afs/dir.c           |  35 +++++----
 fs/afs/dir_edit.c      |  42 +++++------
 fs/afs/dir_search.c    |  33 ++++-----
 fs/afs/inode.c         |   2 +-
 fs/afs/internal.h      |   6 +-
 fs/afs/symlink.c       |  28 +++-----
 fs/netfs/write_issue.c | 156 ++++++-----------------------------------
 7 files changed, 88 insertions(+), 214 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 498b99ccdf0e..774d86bf878e 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -138,9 +138,9 @@ static void afs_dir_dump(struct afs_vnode *dvnode)
 	pr_warn("DIR %llx:%llx is=3D%llx\n",
 		dvnode->fid.vid, dvnode->fid.vnode, i_size);
=20
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
-	iterate_folioq(&iter, iov_iter_count(&iter), NULL, NULL,
-		       afs_dir_dump_step);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	iterate_bvecq(&iter, iov_iter_count(&iter), NULL, NULL,
+		      afs_dir_dump_step);
 }
=20
 /*
@@ -201,9 +201,9 @@ static int afs_dir_check(struct afs_vnode *dvnode)
 	if (unlikely(!i_size))
 		return 0;
=20
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
-	checked =3D iterate_folioq(&iter, iov_iter_count(&iter), dvnode, NULL,
-				 afs_dir_check_step);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	checked =3D iterate_bvecq(&iter, iov_iter_count(&iter), dvnode, NULL,
+				afs_dir_check_step);
 	if (checked !=3D i_size) {
 		afs_dir_dump(dvnode);
 		return -EIO;
@@ -248,15 +248,14 @@ static ssize_t afs_do_read_single(struct afs_vnode *d=
vnode, struct file *file)
 	if (dvnode->directory_size < i_size) {
 		size_t cur_size =3D dvnode->directory_size;
=20
-		ret =3D netfs_alloc_folioq_buffer(NULL,
-						&dvnode->directory, &cur_size, i_size,
-						mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
+		ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size, i_size,
+					  GFP_KERNEL);
 		dvnode->directory_size =3D cur_size;
 		if (ret < 0)
 			return ret;
 	}
=20
-	iov_iter_folio_queue(&iter, ITER_DEST, dvnode->directory, 0, 0, dvnode->d=
irectory_size);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dvnode->directory, 0, 0, dvnode->di=
rectory_size);
=20
 	/* AFS requires us to perform the read of a directory synchronously as
 	 * a single unit to avoid issues with the directory contents being
@@ -292,8 +291,8 @@ static ssize_t afs_read_single(struct afs_vnode *dvnode=
, struct file *file)
 }
=20
 /*
- * Read the directory into a folio_queue buffer in one go, scrubbing the
- * previous contents.  We return -ESTALE if the caller needs to call us ag=
ain.
+ * Read the directory into the buffer in one go, scrubbing the previous
+ * contents.  We return -ESTALE if the caller needs to call us again.
  */
 ssize_t afs_read_dir(struct afs_vnode *dvnode, struct file *file)
 	__acquires(&dvnode->validate_lock)
@@ -474,7 +473,7 @@ static size_t afs_dir_iterate_step(void *iter_base, siz=
e_t progress, size_t len,
 }
=20
 /*
- * Iterate through the directory folios.
+ * Iterate through the directory content.
  */
 static int afs_dir_iterate_contents(struct inode *dir, struct dir_context =
*dir_ctx)
 {
@@ -489,11 +488,11 @@ static int afs_dir_iterate_contents(struct inode *dir=
, struct dir_context *dir_c
 	if (i_size <=3D 0 || dir_ctx->pos >=3D i_size)
 		return 0;
=20
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
 	iov_iter_advance(&iter, round_down(dir_ctx->pos, AFS_DIR_BLOCK_SIZE));
=20
-	iterate_folioq(&iter, iov_iter_count(&iter), dvnode, &ctx,
-		       afs_dir_iterate_step);
+	iterate_bvecq(&iter, iov_iter_count(&iter), dvnode, &ctx,
+		      afs_dir_iterate_step);
=20
 	if (ctx.error =3D=3D -ESTALE)
 		afs_invalidate_dir(dvnode, afs_dir_invalid_iter_stale);
@@ -2218,8 +2217,8 @@ static int afs_dir_writepages(struct address_space *m=
apping,
 	}
=20
 	if (test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) {
-		iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0,
-				     i_size_read(&dvnode->netfs.inode));
+		iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0,
+				    i_size_read(&dvnode->netfs.inode));
 		ret =3D netfs_writeback_single(mapping, wbc, &iter);
 		if (ret =3D=3D 1)
 			ret =3D 0; /* Skipped write due to lock conflict. */
diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c
index fd3aa9f97ce6..fc918b3d8f68 100644
--- a/fs/afs/dir_edit.c
+++ b/fs/afs/dir_edit.c
@@ -110,9 +110,8 @@ static void afs_clear_contig_bits(union afs_xdr_dir_blo=
ck *block,
  */
 static union afs_xdr_dir_block *afs_dir_get_block(struct afs_dir_iter *ite=
r, size_t block)
 {
-	struct folio_queue *fq;
 	struct afs_vnode *dvnode =3D iter->dvnode;
-	struct folio *folio;
+	struct bvecq *bq;
 	size_t blpos =3D block * AFS_DIR_BLOCK_SIZE;
 	size_t blend =3D (block + 1) * AFS_DIR_BLOCK_SIZE, fpos =3D iter->fpos;
 	int ret;
@@ -120,41 +119,38 @@ static union afs_xdr_dir_block *afs_dir_get_block(str=
uct afs_dir_iter *iter, siz
 	if (dvnode->directory_size < blend) {
 		size_t cur_size =3D dvnode->directory_size;
=20
-		ret =3D netfs_alloc_folioq_buffer(
-			NULL, &dvnode->directory, &cur_size, blend,
-			mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
+		ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size, blend,
+					  GFP_KERNEL);
 		dvnode->directory_size =3D cur_size;
 		if (ret < 0)
 			goto fail;
 	}
=20
-	fq =3D iter->fq;
-	if (!fq)
-		fq =3D dvnode->directory;
+	bq =3D iter->bq;
+	if (!bq)
+		bq =3D dvnode->directory;
=20
-	/* Search the folio queue for the folio containing the block... */
-	for (; fq; fq =3D fq->next) {
-		for (int s =3D iter->fq_slot; s < folioq_count(fq); s++) {
-			size_t fsize =3D folioq_folio_size(fq, s);
+	/* Search the contents for the region containing the block... */
+	for (; bq; bq =3D bq->next) {
+		for (int s =3D iter->bq_slot; s < bq->nr_slots; s++) {
+			struct bio_vec *bv =3D &bq->bv[s];
+			size_t bsize =3D bv->bv_len;
=20
-			if (blend <=3D fpos + fsize) {
+			if (blend <=3D fpos + bsize) {
 				/* ... and then return the mapped block. */
-				folio =3D folioq_folio(fq, s);
-				if (WARN_ON_ONCE(folio_pos(folio) !=3D fpos))
-					goto fail;
-				iter->fq =3D fq;
-				iter->fq_slot =3D s;
+				iter->bq =3D bq;
+				iter->bq_slot =3D s;
 				iter->fpos =3D fpos;
-				return kmap_local_folio(folio, blpos - fpos);
+				return kmap_local_bvec(bv, blpos - fpos);
 			}
-			fpos +=3D fsize;
+			fpos +=3D bsize;
 		}
-		iter->fq_slot =3D 0;
+		iter->bq_slot =3D 0;
 	}
=20
 fail:
-	iter->fq =3D NULL;
-	iter->fq_slot =3D 0;
+	iter->bq =3D NULL;
+	iter->bq_slot =3D 0;
 	afs_invalidate_dir(dvnode, afs_dir_invalid_edit_get_block);
 	return NULL;
 }
diff --git a/fs/afs/dir_search.c b/fs/afs/dir_search.c
index 104411c0692f..71c9a8a526f4 100644
--- a/fs/afs/dir_search.c
+++ b/fs/afs/dir_search.c
@@ -66,12 +66,11 @@ bool afs_dir_init_iter(struct afs_dir_iter *iter, const=
 struct qstr *name)
  */
 union afs_xdr_dir_block *afs_dir_find_block(struct afs_dir_iter *iter, siz=
e_t block)
 {
-	struct folio_queue *fq =3D iter->fq;
 	struct afs_vnode *dvnode =3D iter->dvnode;
-	struct folio *folio;
+	struct bvecq *bq =3D iter->bq;
 	size_t blpos =3D block * AFS_DIR_BLOCK_SIZE;
 	size_t blend =3D (block + 1) * AFS_DIR_BLOCK_SIZE, fpos =3D iter->fpos;
-	int slot =3D iter->fq_slot;
+	int slot =3D iter->bq_slot;
=20
 	_enter("%zx,%d", block, slot);
=20
@@ -83,36 +82,34 @@ union afs_xdr_dir_block *afs_dir_find_block(struct afs_=
dir_iter *iter, size_t bl
 	if (dvnode->directory_size < blend)
 		goto fail;
=20
-	if (!fq || blpos < fpos) {
-		fq =3D dvnode->directory;
+	if (!bq || blpos < fpos) {
+		bq =3D dvnode->directory;
 		slot =3D 0;
 		fpos =3D 0;
 	}
=20
-	/* Search the folio queue for the folio containing the block... */
+	/* Search the contents for the region containing the block... */
-	for (; fq; fq =3D fq->next) {
-		for (; slot < folioq_count(fq); slot++) {
-			size_t fsize =3D folioq_folio_size(fq, slot);
+	for (; bq; bq =3D bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			struct bio_vec *bv =3D &bq->bv[slot];
+			size_t bsize =3D bv->bv_len;
=20
-			if (blend <=3D fpos + fsize) {
+			if (blend <=3D fpos + bsize) {
 				/* ... and then return the mapped block. */
-				folio =3D folioq_folio(fq, slot);
-				if (WARN_ON_ONCE(folio_pos(folio) !=3D fpos))
-					goto fail;
-				iter->fq =3D fq;
-				iter->fq_slot =3D slot;
+				iter->bq =3D bq;
+				iter->bq_slot =3D slot;
 				iter->fpos =3D fpos;
-				iter->block =3D kmap_local_folio(folio, blpos - fpos);
+				iter->block =3D kmap_local_bvec(bv, blpos - fpos);
 				return iter->block;
 			}
-			fpos +=3D fsize;
+			fpos +=3D bsize;
 		}
 		slot =3D 0;
 	}
=20
 fail:
-	iter->fq =3D NULL;
-	iter->fq_slot =3D 0;
+	iter->bq =3D NULL;
+	iter->bq_slot =3D 0;
 	afs_invalidate_dir(dvnode, afs_dir_invalid_edit_get_block);
 	return NULL;
 }
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3f48458694ba..1e7bfde6189a 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -684,7 +684,7 @@ void afs_evict_inode(struct inode *inode)
=20
 	netfs_wait_for_outstanding_io(inode);
 	truncate_inode_pages_final(&inode->i_data);
-	netfs_free_folioq_buffer(vnode->directory);
+	bvecq_put(vnode->directory);
 	if (vnode->symlink)
 		afs_evict_symlink(vnode);
=20
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 0b72a8566299..d2641efc756f 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -709,7 +709,7 @@ struct afs_vnode {
 #define AFS_VNODE_MODIFYING	10		/* Set if we're performing a modification =
op */
 #define AFS_VNODE_DIR_READ	11		/* Set if we've read a dir's contents */
=20
-	struct folio_queue	*directory;	/* Directory contents */
+	struct bvecq		*directory;	/* Directory contents */
 	struct afs_symlink __rcu *symlink;	/* Symlink content */
 	struct list_head	wb_keys;	/* List of keys available for writeback */
 	struct list_head	pending_locks;	/* locks waiting to be granted */
@@ -992,9 +992,9 @@ static inline void afs_invalidate_cache(struct afs_vnod=
e *vnode, unsigned int fl
 struct afs_dir_iter {
 	struct afs_vnode	*dvnode;
 	union afs_xdr_dir_block *block;
-	struct folio_queue	*fq;
+	struct bvecq		*bq;
 	unsigned int		fpos;
-	int			fq_slot;
+	int			bq_slot;
 	unsigned int		loop_check;
 	u8			nr_slots;
 	u8			bucket;
diff --git a/fs/afs/symlink.c b/fs/afs/symlink.c
index ed5868369f37..6709b119e8a0 100644
--- a/fs/afs/symlink.c
+++ b/fs/afs/symlink.c
@@ -56,7 +56,6 @@ void afs_evict_symlink(struct afs_vnode *vnode)
 void afs_init_new_symlink(struct afs_vnode *vnode, struct afs_operation *o=
p)
 {
 	struct afs_symlink *symlink =3D op->create.symlink;
-	size_t dsize =3D 0;
 	size_t size =3D strlen(symlink->content) + 1;
 	char *p;
=20
@@ -66,12 +65,12 @@ void afs_init_new_symlink(struct afs_vnode *vnode, stru=
ct afs_operation *op)
 	if (!fscache_cookie_enabled(netfs_i_cookie(&vnode->netfs)))
 		return;
=20
-	if (netfs_alloc_folioq_buffer(NULL, &vnode->directory, &dsize, size,
-				      mapping_gfp_mask(vnode->netfs.inode.i_mapping)) < 0)
+	vnode->directory =3D bvecq_alloc_buffer(PAGE_SIZE, GFP_KERNEL);
+	if (!vnode->directory)
 		return;
=20
-	vnode->directory_size =3D dsize;
-	p =3D kmap_local_folio(folioq_folio(vnode->directory, 0), 0);
+	vnode->directory_size =3D size;
+	p =3D kmap_local_bvec(&vnode->directory->bv[0], 0);
 	memcpy(p, symlink->content, size);
 	kunmap_local(p);
 	netfs_single_mark_inode_dirty(&vnode->netfs.inode);
@@ -94,17 +93,12 @@ static ssize_t afs_do_read_symlink(struct afs_vnode *vn=
ode)
 	}
=20
 	if (!vnode->directory) {
-		size_t cur_size =3D 0;
-
-		ret =3D netfs_alloc_folioq_buffer(NULL,
-						&vnode->directory, &cur_size, PAGE_SIZE,
-						mapping_gfp_mask(vnode->netfs.inode.i_mapping));
-		vnode->directory_size =3D PAGE_SIZE - 1;
+		vnode->directory =3D bvecq_alloc_buffer(PAGE_SIZE, GFP_KERNEL);
-		if (ret < 0)
-			return ret;
+		if (!vnode->directory)
+			return -ENOMEM;
 	}
=20
-	iov_iter_folio_queue(&iter, ITER_DEST, vnode->directory, 0, 0, PAGE_SIZE);
+	iov_iter_bvec_queue(&iter, ITER_DEST, vnode->directory, 0, 0, PAGE_SIZE);
=20
 	/* AFS requires us to perform the read of a symlink as a single unit to
 	 * avoid issues with the content being changed between reads.
@@ -127,7 +121,7 @@ static ssize_t afs_do_read_symlink(struct afs_vnode *vn=
ode)
 		refcount_set(&symlink->ref, 1);
 		symlink->content[i_size] =3D 0;
=20
-		const char *s =3D kmap_local_folio(folioq_folio(vnode->directory, 0), 0);
+		const char *s =3D kmap_local_bvec(&vnode->directory->bv[0], 0);
=20
 		memcpy(symlink->content, s, i_size);
 		kunmap_local(s);
@@ -136,7 +130,7 @@ static ssize_t afs_do_read_symlink(struct afs_vnode *vn=
ode)
 	}
=20
 	if (!fscache_cookie_enabled(netfs_i_cookie(&vnode->netfs))) {
-		netfs_free_folioq_buffer(vnode->directory);
+		bvecq_put(vnode->directory);
 		vnode->directory =3D NULL;
 		vnode->directory_size =3D 0;
 	}
@@ -249,14 +243,14 @@ int afs_symlink_writepages(struct address_space *mapp=
ing,
=20
 	if (vnode->directory &&
 	    atomic64_read(&vnode->cb_expires_at) !=3D AFS_NO_CB_PROMISE) {
-		iov_iter_folio_queue(&iter, ITER_SOURCE, vnode->directory, 0, 0,
-				     i_size_read(&vnode->netfs.inode));
+		iov_iter_bvec_queue(&iter, ITER_SOURCE, vnode->directory, 0, 0,
+				    i_size_read(&vnode->netfs.inode));
 		ret =3D netfs_writeback_single(mapping, wbc, &iter);
 	}
=20
 	if (ret =3D=3D 0) {
 		mutex_lock(&vnode->netfs.wb_lock);
-		netfs_free_folioq_buffer(vnode->directory);
+		bvecq_put(vnode->directory);
 		vnode->directory =3D NULL;
 		vnode->directory_size =3D 0;
 		mutex_unlock(&vnode->netfs.wb_lock);
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 7f38b6676002..b2f626568fe5 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -727,124 +727,11 @@ ssize_t netfs_end_writethrough(struct netfs_io_reque=
st *wreq, struct writeback_c
 	return ret;
 }
=20
-/*
- * Write some of a pending folio data back to the server and/or the cache.
- */
-static int netfs_write_folio_single(struct netfs_io_request *wreq,
-				    struct folio *folio)
-{
-	struct netfs_io_stream *upload =3D &wreq->io_streams[0];
-	struct netfs_io_stream *cache  =3D &wreq->io_streams[1];
-	struct netfs_io_stream *stream;
-	size_t iter_off =3D 0;
-	size_t fsize =3D folio_size(folio), flen;
-	loff_t fpos =3D folio_pos(folio);
-	bool to_eof =3D false;
-	bool no_debug =3D false;
-
-	_enter("");
-
-	flen =3D folio_size(folio);
-	if (flen > wreq->i_size - fpos) {
-		flen =3D wreq->i_size - fpos;
-		folio_zero_segment(folio, flen, fsize);
-		to_eof =3D true;
-	} else if (flen =3D=3D wreq->i_size - fpos) {
-		to_eof =3D true;
-	}
-
-	_debug("folio %zx/%zx", flen, fsize);
-
-	if (!upload->avail && !cache->avail) {
-		trace_netfs_folio(folio, netfs_folio_trace_cancel_store);
-		return 0;
-	}
-
-	if (!upload->construct)
-		trace_netfs_folio(folio, netfs_folio_trace_store);
-	else
-		trace_netfs_folio(folio, netfs_folio_trace_store_plus);
-
-	/* Attach the folio to the rolling buffer. */
-	folio_get(folio);
-	rolling_buffer_append(&wreq->buffer, folio, NETFS_ROLLBUF_PUT_MARK);
-
-	/* Move the submission point forward to allow for write-streaming data
-	 * not starting at the front of the page.  We don't do write-streaming
-	 * with the cache as the cache requires DIO alignment.
-	 *
-	 * Also skip uploading for data that's been read and just needs copying
-	 * to the cache.
-	 */
-	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
-		stream =3D &wreq->io_streams[s];
-		stream->submit_off =3D 0;
-		stream->submit_len =3D flen;
-		if (!stream->avail) {
-			stream->submit_off =3D UINT_MAX;
-			stream->submit_len =3D 0;
-		}
-	}
-
-	/* Attach the folio to one or more subrequests.  For a big folio, we
-	 * could end up with thousands of subrequests if the wsize is small -
-	 * but we might need to wait during the creation of subrequests for
-	 * network resources (eg. SMB credits).
-	 */
-	for (;;) {
-		ssize_t part;
-		size_t lowest_off =3D ULONG_MAX;
-		int choose_s =3D -1;
-
-		/* Always add to the lowest-submitted stream first. */
-		for (int s =3D 0; s < NR_IO_STREAMS; s++) {
-			stream =3D &wreq->io_streams[s];
-			if (stream->submit_len > 0 &&
-			    stream->submit_off < lowest_off) {
-				lowest_off =3D stream->submit_off;
-				choose_s =3D s;
-			}
-		}
-
-		if (choose_s < 0)
-			break;
-		stream =3D &wreq->io_streams[choose_s];
-
-		/* Advance the iterator(s). */
-		if (stream->submit_off > iter_off) {
-			rolling_buffer_advance(&wreq->buffer, stream->submit_off - iter_off);
-			iter_off =3D stream->submit_off;
-		}
-
-		atomic64_set(&wreq->issued_to, fpos + stream->submit_off);
-		stream->submit_extendable_to =3D fsize - stream->submit_off;
-		part =3D netfs_advance_write(wreq, stream, fpos + stream->submit_off,
-					   stream->submit_len, to_eof);
-		stream->submit_off +=3D part;
-		if (part > stream->submit_len)
-			stream->submit_len =3D 0;
-		else
-			stream->submit_len -=3D part;
-		if (part > 0)
-			no_debug =3D true;
-	}
-
-	wreq->buffer.iter.iov_offset =3D 0;
-	if (fsize > iter_off)
-		rolling_buffer_advance(&wreq->buffer, fsize - iter_off);
-	atomic64_set(&wreq->issued_to, fpos + fsize);
-
-	if (!no_debug)
-		kdebug("R=3D%x: No submit", wreq->debug_id);
-	_leave(" =3D 0");
-	return 0;
-}
-
 /**
  * netfs_writeback_single - Write back a monolithic payload
  * @mapping: The mapping to write from
  * @wbc: Hints from the VM
- * @iter: Data to write, must be ITER_FOLIOQ.
+ * @iter: Data to write.
  *
  * Write a monolithic, non-pagecache object back to the server and/or
  * the cache.
@@ -858,13 +745,8 @@ int netfs_writeback_single(struct address_space *mappi=
ng,
 {
 	struct netfs_io_request *wreq;
 	struct netfs_inode *ictx =3D netfs_inode(mapping->host);
-	struct folio_queue *fq;
-	size_t size =3D iov_iter_count(iter);
 	int ret;
=20
-	if (WARN_ON_ONCE(!iov_iter_is_folioq(iter)))
-		return -EIO;
-
 	if (!mutex_trylock(&ictx->wb_lock)) {
 		if (wbc->sync_mode =3D=3D WB_SYNC_NONE) {
 			/* The VFS will have undirtied the inode. */
@@ -882,6 +764,9 @@ int netfs_writeback_single(struct address_space *mappin=
g,
 		goto couldnt_start;
 	}
=20
+	wreq->buffer.iter =3D *iter;
+	wreq->len =3D iov_iter_count(iter);
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback_single);
 	netfs_stat(&netfs_n_wh_writepages);
@@ -889,31 +774,34 @@ int netfs_writeback_single(struct address_space *mapp=
ing,
 	if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
 		wreq->netfs_ops->begin_writeback(wreq);
=20
-	for (fq =3D (struct folio_queue *)iter->folioq; fq; fq =3D fq->next) {
-		for (int slot =3D 0; slot < folioq_count(fq); slot++) {
-			struct folio *folio =3D folioq_folio(fq, slot);
-			size_t part =3D umin(folioq_folio_size(fq, slot), size);
+	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
+		struct netfs_io_subrequest *subreq;
+		struct netfs_io_stream *stream =3D &wreq->io_streams[s];
+
+		if (!stream->avail)
+			continue;
=20
-			_debug("wbiter %lx %llx", folio->index, atomic64_read(&wreq->issued_to)=
);
+		netfs_prepare_write(wreq, stream, 0);
=20
-			ret =3D netfs_write_folio_single(wreq, folio);
-			if (ret < 0)
-				goto stop;
-			size -=3D part;
-			if (size <=3D 0)
-				goto stop;
-		}
+		subreq =3D stream->construct;
+		subreq->len =3D wreq->len;
+		stream->submit_len =3D subreq->len;
+		stream->submit_extendable_to =3D round_up(wreq->len, PAGE_SIZE);
+
+		netfs_issue_write(wreq, stream);
 	}
=20
-stop:
-	for (int s =3D 0; s < NR_IO_STREAMS; s++)
-		netfs_issue_write(wreq, &wreq->io_streams[s]);
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
=20
 	mutex_unlock(&ictx->wb_lock);
 	netfs_wake_collector(wreq);
=20
+	/* TODO: Might want to be async here if WB_SYNC_NONE, but then need to
+	 * wait before modifying.
+	 */
+	ret =3D netfs_wait_for_write(wreq);
+
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
 	_leave(" =3D %d", ret);
 	return ret;
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 11/21] cifs: Use a bvecq for buffering instead of a folioq
Date: Mon, 18 May 2026 23:29:43 +0100
Message-ID: <20260518222959.488126-12-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.4
Content-Type: text/plain; charset="utf-8"

Use a bvecq for internal buffering for crypto purposes instead of a folioq
so that the latter can be phased out.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
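Note for readers: the copy helper converted below walks a chain of
segment arrays and copies min(segment length, bytes remaining) into each
slot in turn.  The following is only a userspace sketch of that walk,
not the kernel code: struct seg, struct segq and copy_to_segq() are
hypothetical stand-ins for bio_vec, bvecq and cifs_copy_iter_to_bvecq(),
and a flat source buffer takes the place of the iov_iter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins; NOT the kernel's bio_vec/bvecq. */
struct seg {
	unsigned char *base;	/* Start of the segment's memory */
	size_t len;		/* Bytes available in the segment */
};

struct segq {
	struct segq *next;	/* Next node in the chain, or NULL */
	unsigned int nr_slots;	/* Number of populated entries in sv[] */
	struct seg sv[4];
};

/*
 * Copy up to size bytes from src into the segments of the chain, filling
 * each segment from its start.  Returns the number of bytes copied, which
 * is less than size only if the chain runs out of space first.
 */
size_t copy_to_segq(struct segq *q, const unsigned char *src, size_t size)
{
	size_t copied = 0;

	assert(src != NULL);

	for (; q && size > 0; q = q->next) {
		for (unsigned int s = 0; s < q->nr_slots && size > 0; s++) {
			struct seg *sg = &q->sv[s];
			size_t part = sg->len < size ? sg->len : size;

			memcpy(sg->base, src + copied, part);
			copied += part;
			size -= part;
		}
	}
	return copied;
}
```

The sketch returns a byte count so that a short copy is visible to the
caller; the helper in the patch instead reports failure as a bool when a
slot cannot be filled with the expected amount.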
 fs/smb/client/cifsglob.h |  2 +-
 fs/smb/client/smb2ops.c  | 71 +++++++++++++++++++---------------------
 2 files changed, 35 insertions(+), 38 deletions(-)

diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 82e0adc1dabd..fc4028b5b5c8 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -288,7 +288,7 @@ struct smb_rqst {
 	struct kvec	*rq_iov;	/* array of kvecs */
 	unsigned int	rq_nvec;	/* number of kvecs in array */
 	struct iov_iter	rq_iter;	/* Data iterator */
-	struct folio_queue *rq_buffer;	/* Buffer for encryption */
+	struct bvecq	*rq_buffer;	/* Buffer for encryption */
 };
=20
 struct mid_q_entry;
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index 189bb863a9af..230102f2e411 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -4542,19 +4542,18 @@ crypt_message(struct TCP_Server_Info *server, int n=
um_rqst,
 }
=20
 /*
- * Copy data from an iterator to the folios in a folio queue buffer.
+ * Copy data from an iterator to the pages in a bvec queue buffer.
  */
-static bool cifs_copy_iter_to_folioq(struct iov_iter *iter, size_t size,
-				     struct folio_queue *buffer)
+static bool cifs_copy_iter_to_bvecq(struct iov_iter *iter, size_t size,
+				    struct bvecq *buffer)
 {
 	for (; buffer; buffer =3D buffer->next) {
-		for (int s =3D 0; s < folioq_count(buffer); s++) {
-			struct folio *folio =3D folioq_folio(buffer, s);
-			size_t part =3D folioq_folio_size(buffer, s);
+		for (int s =3D 0; s < buffer->nr_slots; s++) {
+			struct bio_vec *bv =3D &buffer->bv[s];
+			size_t part =3D umin(bv->bv_len, size);
=20
-			part =3D umin(part, size);
-
-			if (copy_folio_from_iter(folio, 0, part, iter) !=3D part)
+			if (copy_page_from_iter(bv->bv_page, bv->bv_offset,
+						part, iter) !=3D part)
 				return false;
 			size -=3D part;
 		}
@@ -4566,7 +4565,7 @@ void
 smb3_free_compound_rqst(int num_rqst, struct smb_rqst *rqst)
 {
 	for (int i =3D 0; i < num_rqst; i++)
-		netfs_free_folioq_buffer(rqst[i].rq_buffer);
+		bvecq_put(rqst[i].rq_buffer);
 }
=20
 /*
@@ -4593,7 +4592,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server=
, int num_rqst,
 	for (int i =3D 1; i < num_rqst; i++) {
 		struct smb_rqst *old =3D &old_rq[i - 1];
 		struct smb_rqst *new =3D &new_rq[i];
-		struct folio_queue *buffer =3D NULL;
+		struct bvecq *buffer =3D NULL;
 		size_t size =3D iov_iter_count(&old->rq_iter);
=20
 		orig_len +=3D smb_rqst_len(server, old);
@@ -4601,17 +4600,16 @@ smb3_init_transform_rq(struct TCP_Server_Info *serv=
er, int num_rqst,
 		new->rq_nvec =3D old->rq_nvec;
=20
 		if (size > 0) {
-			size_t cur_size =3D 0;
-			rc =3D netfs_alloc_folioq_buffer(NULL, &buffer, &cur_size,
-						       size, GFP_NOFS);
-			if (rc < 0)
+			rc =3D -ENOMEM;
+			buffer =3D bvecq_alloc_buffer(size, GFP_NOFS);
+			if (!buffer)
 				goto err_free;
=20
 			new->rq_buffer =3D buffer;
-			iov_iter_folio_queue(&new->rq_iter, ITER_SOURCE,
-					     buffer, 0, 0, size);
+			iov_iter_bvec_queue(&new->rq_iter, ITER_SOURCE,
+					    buffer, 0, 0, size);
=20
-			if (!cifs_copy_iter_to_folioq(&old->rq_iter, size, buffer)) {
+			if (!cifs_copy_iter_to_bvecq(&old->rq_iter, size, buffer)) {
 				rc =3D smb_EIO1(smb_eio_trace_tx_copy_iter_to_buf, size);
 				goto err_free;
 			}
@@ -4701,16 +4699,15 @@ decrypt_raw_data(struct TCP_Server_Info *server, ch=
ar *buf,
 }
=20
 static int
-cifs_copy_folioq_to_iter(struct folio_queue *folioq, size_t data_size,
-			 size_t skip, struct iov_iter *iter)
+cifs_copy_bvecq_to_iter(struct bvecq *bq, size_t data_size,
+			size_t skip, struct iov_iter *iter)
 {
-	for (; folioq; folioq =3D folioq->next) {
-		for (int s =3D 0; s < folioq_count(folioq); s++) {
-			struct folio *folio =3D folioq_folio(folioq, s);
-			size_t fsize =3D folio_size(folio);
-			size_t n, len =3D umin(fsize - skip, data_size);
+	for (; bq; bq =3D bq->next) {
+		for (int s =3D 0; s < bq->nr_slots; s++) {
+			struct bio_vec *bv =3D &bq->bv[s];
+			size_t n, len =3D umin(bv->bv_len - skip, data_size);
=20
-			n =3D copy_folio_to_iter(folio, skip, len, iter);
+			n =3D copy_page_to_iter(bv->bv_page, bv->bv_offset + skip, len, iter);
 			if (n !=3D len) {
 				cifs_dbg(VFS, "%s: something went wrong\n", __func__);
 				return smb_EIO2(smb_eio_trace_rx_copy_to_iter,
@@ -4726,7 +4723,7 @@ cifs_copy_folioq_to_iter(struct folio_queue *folioq, =
size_t data_size,
=20
 static int
 handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
-		 char *buf, unsigned int buf_len, struct folio_queue *buffer,
+		 char *buf, unsigned int buf_len, struct bvecq *buffer,
 		 unsigned int buffer_len, bool is_offloaded)
 {
 	unsigned int data_offset;
@@ -4836,8 +4833,8 @@ handle_read_data(struct TCP_Server_Info *server, stru=
ct mid_q_entry *mid,
 		}
=20
 		/* Copy the data to the output I/O iterator. */
-		rdata->result =3D cifs_copy_folioq_to_iter(buffer, buffer_len,
-							 cur_off, &rdata->subreq.io_iter);
+		rdata->result =3D cifs_copy_bvecq_to_iter(buffer, buffer_len,
+							cur_off, &rdata->subreq.io_iter);
 		if (rdata->result !=3D 0) {
 			if (is_offloaded)
 				mid->mid_state =3D MID_RESPONSE_MALFORMED;
@@ -4876,7 +4873,7 @@ handle_read_data(struct TCP_Server_Info *server, stru=
ct mid_q_entry *mid,
 struct smb2_decrypt_work {
 	struct work_struct decrypt;
 	struct TCP_Server_Info *server;
-	struct folio_queue *buffer;
+	struct bvecq *buffer;
 	char *buf;
 	unsigned int len;
 };
@@ -4890,7 +4887,7 @@ static void smb2_decrypt_offload(struct work_struct *=
work)
 	struct mid_q_entry *mid;
 	struct iov_iter iter;
=20
-	iov_iter_folio_queue(&iter, ITER_DEST, dw->buffer, 0, 0, dw->len);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dw->buffer, 0, 0, dw->len);
 	rc =3D decrypt_raw_data(dw->server, dw->buf, dw->server->vals->read_rsp_s=
ize,
 			      &iter, true);
 	if (rc) {
@@ -4939,7 +4936,7 @@ static void smb2_decrypt_offload(struct work_struct *=
work)
 	}
=20
 free_pages:
-	netfs_free_folioq_buffer(dw->buffer);
+	bvecq_put(dw->buffer);
 	cifs_small_buf_release(dw->buf);
 	kfree(dw);
 }
@@ -4985,12 +4982,12 @@ receive_encrypted_read(struct TCP_Server_Info *serv=
er, struct mid_q_entry **mid,
 	dw->len =3D len;
 	len =3D round_up(dw->len, PAGE_SIZE);
=20
-	size_t cur_size =3D 0;
-	rc =3D netfs_alloc_folioq_buffer(NULL, &dw->buffer, &cur_size, len, GFP_N=
OFS);
-	if (rc < 0)
+	rc =3D -ENOMEM;
+	dw->buffer =3D bvecq_alloc_buffer(len, GFP_NOFS);
+	if (!dw->buffer)
 		goto discard_data;
=20
-	iov_iter_folio_queue(&iter, ITER_DEST, dw->buffer, 0, 0, len);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dw->buffer, 0, 0, len);
=20
 	/* Read the data into the buffer and clear excess bufferage. */
 	rc =3D cifs_read_iter_from_socket(server, &iter, dw->len);
@@ -5048,7 +5045,7 @@ receive_encrypted_read(struct TCP_Server_Info *server=
, struct mid_q_entry **mid,
 	}
=20
 free_pages:
-	netfs_free_folioq_buffer(dw->buffer);
+	bvecq_put(dw->buffer);
 free_dw:
 	kfree(dw);
 	return rc;
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Shyam Prasad N <sprasad@microsoft.com>,
	Tom Talpey <tom@talpey.com>
Subject: [PATCH v2 12/21] cifs: Support ITER_BVECQ in
 smb_extract_iter_to_rdma()
Date: Mon, 18 May 2026 23:29:44 +0100
Message-ID: <20260518222959.488126-13-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Type: text/plain; charset="utf-8"

Add support for ITER_BVECQ to smb_extract_iter_to_rdma().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Acked-by: Stefan Metzmacher <metze@samba.org>
---
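Note for readers: the extraction loop added below keeps a cursor of
(queue node, slot, offset within slot), emits one fragment per slot
until either the size limit or the fragment limit is reached, and then
writes the cursor back.  The following is only a userspace sketch of
that shape under stated assumptions: struct seg, struct segq,
struct cursor, struct frag and extract_frags() are all hypothetical
names, and recording an address/length pair stands in for mapping a
page into an SGE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel or smbdirect types. */
struct seg { unsigned char *base; size_t len; };
struct segq { struct segq *next; unsigned int nr_slots; struct seg sv[4]; };

/* Position in a chain: node, slot in that node, byte offset in the slot. */
struct cursor {
	const struct segq *q;
	unsigned int slot;
	size_t offset;
	size_t count;		/* Bytes remaining from the cursor onwards */
};

struct frag { unsigned char *addr; size_t len; };

/*
 * Take up to maxsize bytes from the cursor position, emitting at most
 * max_frags fragments, and advance the cursor past what was taken.
 * Returns the number of bytes extracted, or -1 if the chain ends while
 * the cursor still claims that bytes remain.
 */
long extract_frags(struct cursor *cur, size_t maxsize,
		   struct frag *frags, unsigned int max_frags,
		   unsigned int *nr_frags)
{
	assert(cur && frags && nr_frags);

	const struct segq *q = cur->q;
	unsigned int slot = cur->slot;
	size_t offset = cur->offset;
	size_t extracted = 0;

	*nr_frags = 0;
	if (maxsize > cur->count)
		maxsize = cur->count;

	while (maxsize > 0 && *nr_frags < max_frags) {
		/* Step to the next node once the current one is used up. */
		while (q && slot >= q->nr_slots) {
			q = q->next;
			slot = 0;
		}
		if (!q)
			return -1;

		const struct seg *sg = &q->sv[slot];

		if (offset < sg->len) {
			size_t part = sg->len - offset;

			if (part > maxsize)
				part = maxsize;
			frags[*nr_frags].addr = sg->base + offset;
			frags[*nr_frags].len = part;
			(*nr_frags)++;
			offset += part;
			extracted += part;
			maxsize -= part;
		}
		if (offset >= sg->len) {
			offset = 0;
			slot++;
		}
	}

	/* Write the cursor back so that the next call resumes from here. */
	cur->q = q;
	cur->slot = slot;
	cur->offset = offset;
	cur->count -= extracted;
	return (long)extracted;
}
```

Because the cursor is only written back on success, a caller of the
sketch can retry or abandon the operation after an error without the
position having moved.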
 fs/smb/smbdirect/connection.c | 66 +++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/fs/smb/smbdirect/connection.c b/fs/smb/smbdirect/connection.c
index 8adf58097534..4d2a1700104e 100644
--- a/fs/smb/smbdirect/connection.c
+++ b/fs/smb/smbdirect/connection.c
@@ -5,6 +5,7 @@
  */
=20
 #include "internal.h"
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
=20
 struct smbdirect_map_sges {
@@ -2006,6 +2007,68 @@ static ssize_t smbdirect_map_sges_from_bvec(struct i=
ov_iter *iter,
 	return ret;
 }
=20
+/*
+ * Extract memory fragments from a BVECQ-class iterator and add them to an=
 RDMA
+ * list.  The fragments are not pinned.
+ */
+static ssize_t smbdirect_map_sges_from_bvecq(struct iov_iter *iter,
+					     struct smbdirect_map_sges *state,
+					     ssize_t maxsize)
+{
+	const struct bvecq *bq =3D iter->bvecq;
+	unsigned int slot =3D iter->bvecq_slot;
+	ssize_t extracted =3D 0;
+	size_t offset =3D iter->iov_offset;
+
+	maxsize =3D umin(maxsize, iov_iter_count(iter));
+
+	do {
+		struct bio_vec *bv;
+		size_t bsize;
+
+		while (slot >=3D bq->nr_slots) {
+			if (!bq->next) {
+				if (WARN_ON_ONCE(maxsize > 0))
+					return -EIO;
+				goto out;
+			}
+			bq =3D bq->next;
+			slot =3D 0;
+		}
+
+		bv =3D &bq->bv[slot];
+		bsize =3D bv->bv_len;
+
+		if (offset < bsize) {
+			size_t part =3D umin(maxsize, bsize - offset);
+			bool ok;
+
+			ok =3D smbdirect_map_sges_single_page(state,
+							    bv->bv_page,
+							    bv->bv_offset + offset,
+							    part);
+			if (!ok)
+				return -EIO;
+
+			offset +=3D part;
+			extracted +=3D part;
+			maxsize -=3D part;
+		}
+
+		if (offset >=3D bsize) {
+			offset =3D 0;
+			slot++;
+		}
+	} while (state->num_sge < state->max_sge && maxsize > 0);
+
+out:
+	iter->bvecq =3D bq;
+	iter->bvecq_slot =3D slot;
+	iter->iov_offset =3D offset;
+	iter->count -=3D extracted;
+	return extracted;
+}
+
 /*
  * Extract fragments from a KVEC-class iterator and add them to an ib_sge =
list.
  * This can deal with vmalloc'd buffers as well as kmalloc'd or static buf=
fers.
@@ -2155,6 +2218,9 @@ static ssize_t smbdirect_map_sges_from_iter(struct io=
v_iter *iter, size_t len,
 	case ITER_BVEC:
 		ret =3D smbdirect_map_sges_from_bvec(iter, state, len);
 		break;
+	case ITER_BVECQ:
+		ret =3D smbdirect_map_sges_from_bvecq(iter, state, len);
+		break;
 	case ITER_KVEC:
 		ret =3D smbdirect_map_sges_from_kvec(iter, state, len);
 		break;
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5162B3AF66E
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:12 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143537; cv=none;
 b=Mr9Fpu63RPVuCA1snfgFzEdjVlvLVDBSaWzcCeY4zhGTAWbyztsY1WIlPJ/qwU1XldHJQ3g8LhSvJ9pu1Sph68yVYMJhfki++n5VqK5rkHZ36KB7k/Z7edEWzQ5/bly7WPI9XVSGh9lt4RlkZVJAQJTBERKRXN6zyThFh6ksQBA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143537; c=relaxed/simple;
	bh=hGeySUJ1PwPWGTytprzAYxLrWILKPAkSOMUV4V2M3rM=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=Xz6bPGeB0iuBIjCQ3DzEfHQgJbtZJjNLARZov5lxAwKqF12iwTvZ6ArODPw01H0frEelQ4qSAC4F+JTAM6jlp3BUFOhRFhucIkDOaOlu2gZdPJS7xprc9NFsdrDY/TNwYHxQQVgY7zZ2ltN/TjN2Xqim4se8r/hZsKlVk61ZAjU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=NrXM6c2S; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="NrXM6c2S"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143531;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=yOwGNFx6hUOV6y7BFiHgopb3Z5ZAbiINN8nrr53uSQM=;
	b=NrXM6c2SnyKSS7ldD4bryVr+Ysz1UBblBm5JVL4PrO7MrtR/AemMH356tsHyFdqCrypRX9
	UNdT2lXqXlCxcGHb0mju9z6XuHA5EyaDlQxtyaph9AueUFOCqOqXGY/R5HctJYyHRtlCU3
	6M6F5sdoeIscK/eDYLzGsymw+5KvmrQ=
Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-563-IcjxRAPJMnOPgPRm0Ga3IA-1; Mon,
 18 May 2026 18:32:06 -0400
X-MC-Unique: IcjxRAPJMnOPgPRm0Ga3IA-1
X-Mimecast-MFC-AGG-ID: IcjxRAPJMnOPgPRm0Ga3IA_1779143524
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id E1CB71800578;
	Mon, 18 May 2026 22:32:03 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 9DA1E19560A2;
	Mon, 18 May 2026 22:31:56 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Shyam Prasad N <sprasad@microsoft.com>,
	Tom Talpey <tom@talpey.com>
Subject: [PATCH v2 13/21] netfs: Switch to using bvecq rather than folio_queue
 and rolling_buffer
Date: Mon, 18 May 2026 23:29:45 +0100
Message-ID: <20260518222959.488126-14-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Content-Type: text/plain; charset="utf-8"

Switch netfslib to using bvecq, a segmented bio_vec[] queue, instead of the
folio_queue and rolling_buffer constructs, to keep track of the regions of
memory it is performing I/O upon.  Each bvecq struct in the chain is marked
with the starting file position of that sequence so that discontiguities
can be handled (the contents of each individual bvecq must be contiguous).

For unbuffered/direct I/O, the iterator is extracted into the queue up
front.  For buffered I/O, the folios are added to the queue as the
operation proceeds, much as it does now with folio_queues.  There is one
important change for buffered writes: only the relevant part of the folio
is included; this is expanded for writes to the cache in a copy of the
bvecq segment (it is known that each bio_vec corresponds to part of a
folio in this case).

The bvecq structs are marked with information as to how the regions
contained therein should be disposed of (unlock-only, free, unpin).

When setting up a subrequest, netfslib will furnish it with a slice of the
main buffer queue as a pointer to the starting bvecq, slot and offset and,
for the moment, an ITER_BVECQ iterator is set to cover the slice in
subreq->io_iter.

Notes on the implementation:

 (1) This patch uses the concept of a 'bvecq position', which is a tuple of
     { bvecq, slot, offset }.  This is lighter weight than using a full
     iov_iter, though that would also suffice.  If not NULL, the position
     also holds a reference on the bvecq it is pointing to.  This is
     probably overkill as only the hindmost position (that of collection)
     needs to hold a reference.

 (2) There are three positions on the netfs_io_request struct.  Not all are
     used by every request type.

     Firstly, there's ->load_cursor, which is used by buffered read and
     write to point to the next slot to have a folio inserted into it
     (either loaded from the readahead_control or from writeback_iter()).

     Secondly, there's ->dispatch_cursor, which is used to provide the
     position in the buffer from which we start dispatching a subrequest.

     Thirdly, there's the ->collect_cursor, which is used by the collection
     routines to point to the next memory region to be cleaned up.

 (3) There are two positions on the netfs_io_subrequest struct.

     Firstly, there's ->dispatch_pos, which indicates the position from
     which a subrequest's buffer begins.  This is used as the base of the
     position from which to retry (advanced by ->transfer).

     Secondly, there's ->content, which is normally the same as
     ->dispatch_pos, but if the bvecq chain got duplicated or the content
     got copied, then this will point to the copy, which will be disposed
     of on retry.

 (4) Maintenance of the position structs is done with helper functions,
     such as bvecq_pos_attach() to hide the refcounting.

 (5) When sending a write to the cache, the bvecq will be duplicated and
     the ends rounded up/down to the backing file's DIO block alignment.

 (6) bvecq_slice() is used to select a slice of the source buffer and
     assign it to a subrequest.  The source buffer position is advanced.

 (7) netfs_extract_iter() is used by unbuffered/direct I/O API functions to
     decant a chunk of the iov_iter supplied by the VFS into a bvecq chain
     - and to label the bvecqs with appropriate disposal information
     (e.g. unpin, free, nothing).

There are further options that can be explored in the future:

 (1) Allow the provision of a duplicated bvecq chain for just that region
     so that the filesystem can add bits on either end (such as adding
     protocol headers and trailers and gluing several things together into
     a compound operation).

 (2) If a filesystem supports vectored/sparse read and write ops, it can be
     given a chain with discontiguities in it to perform in a single op
     (Ceph, for example, can do this).

 (3) Because each bvecq notes the start file position of the regions
     contained therein, there's no need to translate the info in the
     bio_vec into folio pointers in order to unlock the page after I/O.
     Instead, the inode's pagecache can be iterated over and the xarray
     marks cleared en masse.

 (4) Make MSG_SPLICE_PAGES handling read the disposal info in the bvecq and
     use that to indicate how it should get rid of the stuff it pasted into
     a sk_buff.

 (5) If a bounce buffer is needed (encryption, for example), the bounce
     buffer can be held in a bvecq and sliced up instead of the main buffer
     queue.

 (6) Get rid of subreq->io_iter and move the iov_iter stuff down into the
     filesystem.  The I/O iterators are normally only needed transitorily,
     and the one currently in netfs_io_subrequest is unnecessary most of
     the time.

folio_queue and rolling_buffer will be removed in a follow-up patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/cachefiles/io.c           |  12 ---
 fs/netfs/Makefile            |   1 -
 fs/netfs/buffered_read.c     | 116 +++++++++++---------
 fs/netfs/direct_read.c       |  77 +++++---------
 fs/netfs/direct_write.c      |  86 +++++++--------
 fs/netfs/internal.h          |  10 +-
 fs/netfs/iterator.c          |  12 ++-
 fs/netfs/misc.c              |  20 +---
 fs/netfs/objects.c           |  17 ++-
 fs/netfs/read_collect.c      |  97 +++++++++--------
 fs/netfs/read_pgpriv2.c      |  89 +++++++++++-----
 fs/netfs/read_retry.c        | 102 ++++++++++--------
 fs/netfs/read_single.c       |  12 ++-
 fs/netfs/stats.c             |   4 +-
 fs/netfs/write_collect.c     |  51 ++++-----
 fs/netfs/write_issue.c       | 198 +++++++++++++++++++++++++++--------
 fs/netfs/write_retry.c       |  54 ++++++----
 include/linux/netfs.h        |  25 ++---
 include/trace/events/netfs.h |  46 ++++----
 19 files changed, 585 insertions(+), 444 deletions(-)

diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 7e32b1caf6fe..eebebda46a09 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -659,7 +659,6 @@ static void cachefiles_issue_write(struct netfs_io_subr=
equest *subreq)
 	struct netfs_cache_resources *cres =3D &wreq->cache_resources;
 	struct cachefiles_object *object =3D cachefiles_cres_object(cres);
 	struct cachefiles_cache *cache =3D object->volume->cache;
-	struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr];
 	const struct cred *saved_cred;
 	size_t off, pre, post, len =3D subreq->len;
 	loff_t start =3D subreq->start;
@@ -684,17 +683,6 @@ static void cachefiles_issue_write(struct netfs_io_sub=
request *subreq)
 	}
=20
 	/* We also need to end on the cache granularity boundary */
-	if (start + len =3D=3D wreq->i_size) {
-		size_t part =3D len & (cache->bsize - 1);
-		size_t need =3D cache->bsize - part;
-
-		if (part && stream->submit_extendable_to >=3D need) {
-			len +=3D need;
-			subreq->len +=3D need;
-			subreq->io_iter.count +=3D need;
-		}
-	}
-
 	post =3D len & (cache->bsize - 1);
 	if (post) {
 		len -=3D post;
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index e1f12ecb5abf..0621e6870cbd 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -15,7 +15,6 @@ netfs-y :=3D \
 	read_pgpriv2.o \
 	read_retry.o \
 	read_single.o \
-	rolling_buffer.o \
 	write_collect.o \
 	write_issue.o \
 	write_retry.o
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 146a2cf64af0..92716a6c9133 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -114,26 +114,21 @@ static int netfs_begin_cache_read(struct netfs_io_req=
uest *rreq, struct netfs_in
 static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub=
req)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
+	ssize_t extracted;
 	size_t rsize =3D subreq->len;
=20
 	if (subreq->source =3D=3D NETFS_DOWNLOAD_FROM_SERVER)
-		rsize =3D umin(rsize, rreq->io_streams[0].sreq_max_len);
-
-	subreq->len =3D rsize;
-	if (unlikely(rreq->io_streams[0].sreq_max_segs)) {
-		size_t limit =3D netfs_limit_iter(&rreq->buffer.iter, 0, rsize,
-						rreq->io_streams[0].sreq_max_segs);
-
-		if (limit < rsize) {
-			subreq->len =3D limit;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
-		}
+		rsize =3D umin(rsize, stream->sreq_max_len);
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+	extracted =3D bvecq_slice(&rreq->dispatch_cursor, subreq->len,
+				stream->sreq_max_segs, &subreq->nr_segs);
+	if (extracted < rsize) {
+		subreq->len =3D extracted;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
 	}
=20
-	subreq->io_iter	=3D rreq->buffer.iter;
-
-	iov_iter_truncate(&subreq->io_iter, subreq->len);
-	rolling_buffer_advance(&rreq->buffer, subreq->len);
 	return subreq->len;
 }
=20
@@ -192,6 +187,10 @@ void netfs_queue_read(struct netfs_io_request *rreq,
 static void netfs_issue_read(struct netfs_io_request *rreq,
 			     struct netfs_io_subrequest *subreq)
 {
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
 	switch (subreq->source) {
 	case NETFS_DOWNLOAD_FROM_SERVER:
 		rreq->netfs_ops->issue_read(subreq);
@@ -200,7 +199,8 @@ static void netfs_issue_read(struct netfs_io_request *r=
req,
 		netfs_read_cache_to_pagecache(rreq, subreq);
 		break;
 	default:
-		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
+		bvecq_zero(&rreq->dispatch_cursor, subreq->len);
+		subreq->transferred =3D subreq->len;
 		subreq->error =3D 0;
 		iov_iter_zero(subreq->len, &subreq->io_iter);
 		subreq->transferred =3D subreq->len;
@@ -229,6 +229,11 @@ static void netfs_read_to_pagecache(struct netfs_io_re=
quest *rreq)
 	ssize_t size =3D rreq->len;
 	int ret =3D 0;
=20
+	_enter("R=3D%08x", rreq->debug_id);
+
+	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_set(&rreq->collect_cursor, &rreq->dispatch_cursor);
+
 	do {
 		int (*prepare_read)(struct netfs_io_subrequest *subreq) =3D NULL;
 		struct netfs_io_subrequest *subreq;
@@ -376,6 +381,9 @@ static void netfs_read_to_pagecache(struct netfs_io_req=
uest *rreq)
=20
 	/* Defer error return as we may need to wait for outstanding I/O. */
 	cmpxchg(&rreq->error, 0, ret);
+
+	bvecq_pos_unset(&rreq->load_cursor);
+	bvecq_pos_unset(&rreq->dispatch_cursor);
 }
=20
 /**
@@ -423,7 +431,7 @@ void netfs_readahead(struct readahead_control *ractl)
 	 * acquires a ref on each folio that we will need to release later -
 	 * but we don't want to do that until after we've started the I/O.
 	 */
-	added =3D rolling_buffer_bulk_load_from_ra(&rreq->buffer, ractl, rreq->de=
bug_id);
+	added =3D bvecq_load_from_ra(&rreq->load_cursor, ractl);
 	if (added < 0) {
 		ret =3D added;
 		goto cleanup_free;
@@ -432,7 +440,7 @@ void netfs_readahead(struct readahead_control *ractl)
=20
 	rreq->submitted =3D rreq->start + added;
 	rreq->cleaned_to =3D rreq->start;
-	rreq->front_folio_order =3D folio_order(rreq->buffer.tail->vec.folios[0]);
+	rreq->front_folio_order =3D get_order(rreq->load_cursor.bvecq->bv[0].bv_l=
en);
=20
 	netfs_read_to_pagecache(rreq);
 	netfs_maybe_bulk_drop_ra_refs(rreq);
@@ -444,20 +452,20 @@ void netfs_readahead(struct readahead_control *ractl)
 EXPORT_SYMBOL(netfs_readahead);
=20
 /*
- * Create a rolling buffer with a single occupying folio.
+ * Create a buffer queue with a single occupying folio.
  */
-static int netfs_create_singular_buffer(struct netfs_io_request *rreq, str=
uct folio *folio,
-					unsigned int rollbuf_flags)
+static int netfs_create_singular_buffer(struct netfs_io_request *rreq, str=
uct folio *folio)
 {
-	ssize_t added;
+	struct bvecq *bq;
+	size_t fsize =3D folio_size(folio);
=20
-	if (rolling_buffer_init(&rreq->buffer, rreq->debug_id, ITER_DEST) < 0)
+	if (bvecq_buffer_init(&rreq->load_cursor, GFP_KERNEL) < 0)
 		return -ENOMEM;
=20
-	added =3D rolling_buffer_append(&rreq->buffer, folio, rollbuf_flags);
-	if (added < 0)
-		return added;
-	rreq->submitted =3D rreq->start + added;
+	bq =3D rreq->load_cursor.bvecq;
+	bvec_set_folio(&bq->bv[0], folio, fsize, 0);
+	bvecq_filled_to(bq, 1);
+	rreq->submitted =3D rreq->start + fsize;
 	return 0;
 }
=20
@@ -471,11 +479,11 @@ static int netfs_read_gaps(struct file *file, struct =
folio *folio)
 	struct netfs_group *group =3D netfs_folio_group(folio);
 	struct netfs_folio *finfo =3D netfs_folio_info(folio);
 	struct netfs_inode *ctx =3D netfs_inode(mapping->host);
-	struct folio *sink =3D NULL;
-	struct bio_vec *bvec;
+	struct bvecq *bq =3D NULL;
+	struct page *sink =3D NULL;
 	unsigned int from =3D finfo->dirty_offset;
 	unsigned int to =3D from + finfo->dirty_len;
-	unsigned int off =3D 0, i =3D 0;
+	unsigned int off =3D 0, slot =3D 0;
 	size_t flen =3D folio_size(folio);
 	size_t nr_bvec =3D flen / PAGE_SIZE + 2;
 	size_t part;
@@ -500,32 +508,41 @@ static int netfs_read_gaps(struct file *file, struct =
folio *folio)
 	 * end get copied to, but the middle is discarded.
 	 */
 	ret =3D -ENOMEM;
-	bvec =3D kmalloc_objs(*bvec, nr_bvec);
-	if (!bvec)
+	bq =3D bvecq_alloc_one(nr_bvec, GFP_KERNEL);
+	if (!bq)
 		goto discard;
+	rreq->load_cursor.bvecq =3D bq;
=20
-	sink =3D folio_alloc(GFP_KERNEL, 0);
-	if (!sink) {
-		kfree(bvec);
+	sink =3D alloc_page(GFP_KERNEL);
+	if (!sink)
 		goto discard;
-	}
=20
 	trace_netfs_folio(folio, netfs_folio_trace_read_gaps);
=20
-	rreq->direct_bv =3D bvec;
-	rreq->direct_bv_count =3D nr_bvec;
+	for (struct bvecq *p =3D bq; p; p =3D p->next)
+		p->mem_type =3D BVECQ_MEM_PAGECACHE;
+
 	if (from > 0) {
-		bvec_set_folio(&bvec[i++], folio, from, 0);
+		folio_get(folio);
+		bvec_set_folio(&bq->bv[slot++], folio, from, 0);
 		off =3D from;
 	}
 	while (off < to) {
-		part =3D min_t(size_t, to - off, PAGE_SIZE);
-		bvec_set_folio(&bvec[i++], sink, part, 0);
+		if (bvecq_is_full(bq))
+			bq =3D bq->next;
+		part =3D umin(to - off, PAGE_SIZE);
+		get_page(sink);
+		bvec_set_page(&bq->bv[slot++], sink, part, 0);
 		off +=3D part;
 	}
-	if (to < flen)
-		bvec_set_folio(&bvec[i++], folio, flen - to, to);
-	iov_iter_bvec(&rreq->buffer.iter, ITER_DEST, bvec, i, rreq->len);
+	if (to < flen) {
+		if (bvecq_is_full(bq))
+			bq =3D bq->next;
+		folio_get(folio);
+		bvec_set_folio(&bq->bv[slot++], folio, flen - to, to);
+	}
+	bvecq_filled_to(bq, slot);
+
 	rreq->submitted =3D rreq->start + flen;
=20
 	netfs_read_to_pagecache(rreq);
@@ -542,13 +559,14 @@ static int netfs_read_gaps(struct file *file, struct =
folio *folio)
 		folio_mark_uptodate(folio);
 	}
=20
-	if (sink)
-		folio_put(sink);
+	put_page(sink);
 	folio_unlock(folio);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret < 0 ? ret : 0;
=20
 discard:
+	if (sink)
+		put_page(sink);
 	netfs_put_failed_request(rreq);
 alloc_error:
 	folio_unlock(folio);
@@ -599,7 +617,7 @@ int netfs_read_folio(struct file *file, struct folio *f=
olio)
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
=20
 	/* Set up the output buffer */
-	ret =3D netfs_create_singular_buffer(rreq, folio, 0);
+	ret =3D netfs_create_singular_buffer(rreq, folio);
 	if (ret < 0)
 		goto discard;
=20
@@ -756,7 +774,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
 	trace_netfs_read(rreq, pos, len, netfs_read_trace_write_begin);
=20
 	/* Set up the output buffer */
-	ret =3D netfs_create_singular_buffer(rreq, folio, 0);
+	ret =3D netfs_create_singular_buffer(rreq, folio);
 	if (ret < 0)
 		goto error_put;
=20
@@ -821,7 +839,7 @@ int netfs_prefetch_for_write(struct file *file, struct =
folio *folio,
 	trace_netfs_read(rreq, start, flen, netfs_read_trace_prefetch_for_write);
=20
 	/* Set up the output buffer */
-	ret =3D netfs_create_singular_buffer(rreq, folio, NETFS_ROLLBUF_PAGECACHE=
_MARK);
+	ret =3D netfs_create_singular_buffer(rreq, folio);
 	if (ret < 0)
 		goto error_put;
=20
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 6a8fb0d55e04..3c52c7584489 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -16,44 +16,21 @@
 #include <linux/netfs.h>
 #include "internal.h"
=20
-static void netfs_prepare_dio_read_iterator(struct netfs_io_subrequest *su=
breq)
-{
-	struct netfs_io_request *rreq =3D subreq->rreq;
-	size_t rsize;
-
-	rsize =3D umin(subreq->len, rreq->io_streams[0].sreq_max_len);
-	subreq->len =3D rsize;
-
-	if (unlikely(rreq->io_streams[0].sreq_max_segs)) {
-		size_t limit =3D netfs_limit_iter(&rreq->buffer.iter, 0, rsize,
-						rreq->io_streams[0].sreq_max_segs);
-
-		if (limit < rsize) {
-			subreq->len =3D limit;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
-		}
-	}
-
-	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-
-	subreq->io_iter	=3D rreq->buffer.iter;
-	iov_iter_truncate(&subreq->io_iter, subreq->len);
-	iov_iter_advance(&rreq->buffer.iter, subreq->len);
-}
-
 /*
  * Perform a read to a buffer from the server, slicing up the region to be=
 read
  * according to the network rsize.
  */
 static void netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 {
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
 	unsigned long long start =3D rreq->start;
 	ssize_t size =3D rreq->len;
 	int ret;
=20
+	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+
 	do {
 		struct netfs_io_subrequest *subreq;
-		ssize_t slice;
=20
 		subreq =3D netfs_alloc_subrequest(rreq);
 		if (!subreq) {
@@ -79,16 +56,24 @@ static void netfs_dispatch_unbuffered_reads(struct netf=
s_io_request *rreq)
 			}
 		}
=20
-		netfs_prepare_dio_read_iterator(subreq);
-		slice =3D subreq->len;
-		size -=3D slice;
-		start +=3D slice;
-		rreq->submitted +=3D slice;
+		bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+		bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
+		subreq->len =3D bvecq_slice(&rreq->dispatch_cursor,
+					  umin(size, stream->sreq_max_len),
+					  stream->sreq_max_segs,
+					  &subreq->nr_segs);
+
+		size -=3D subreq->len;
+		start +=3D subreq->len;
+		rreq->submitted +=3D subreq->len;
 		if (size <=3D 0) {
 			smp_wmb(); /* Write lists before ALL_QUEUED. */
 			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		}
=20
+		iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+				    subreq->content.slot, subreq->content.offset, subreq->len);
+
 		rreq->netfs_ops->issue_read(subreq);
=20
 		if (test_bit(NETFS_RREQ_PAUSE, &rreq->flags))
@@ -103,6 +88,8 @@ static void netfs_dispatch_unbuffered_reads(struct netfs=
_io_request *rreq)
 		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		netfs_wake_collector(rreq);
 	}
+
+	bvecq_pos_unset(&rreq->dispatch_cursor);
 }
=20
 /*
@@ -181,25 +168,17 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kioc=
b *iocb, struct iov_iter *i
 	 * buffer for ourselves as the caller's iterator will be trashed when
 	 * we return.
 	 *
-	 * In such a case, extract an iterator to represent as much of the the
-	 * output buffer as we can manage.  Note that the extraction might not
-	 * be able to allocate a sufficiently large bvec array and may shorten
-	 * the request.
+	 * Extract a buffer queue to represent as much of the output buffer as
+	 * we can manage.  The fragments are extracted into a bvecq which will
+	 * have sufficient nodes allocated to hold all the data, though this
+	 * may end up truncated if ENOMEM is encountered.
 	 */
-	if (user_backed_iter(iter)) {
-		ret =3D netfs_extract_user_iter(iter, rreq->len, &rreq->buffer.iter, 0);
-		if (ret < 0)
-			goto error_put;
-		rreq->direct_bv =3D (struct bio_vec *)rreq->buffer.iter.bvec;
-		rreq->direct_bv_count =3D ret;
-		rreq->direct_bv_unpin =3D iov_iter_extract_will_pin(iter);
-		rreq->len =3D iov_iter_count(&rreq->buffer.iter);
-	} else {
-		rreq->buffer.iter =3D *iter;
-		rreq->len =3D orig_count;
-		rreq->direct_bv_unpin =3D false;
-		iov_iter_advance(iter, orig_count);
-	}
+	ret =3D netfs_extract_iter(iter, rreq->len, INT_MAX, iocb->ki_pos,
+				 &rreq->load_cursor.bvecq, 0);
+	if (ret < 0)
+		goto error_put;
+
+	rreq->len =3D ret;
=20
 	// TODO: Set up bounce buffer if needed
=20
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 25f8ceb15fad..0309dd3c37d2 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -73,7 +73,11 @@ static void netfs_unbuffered_write_collect(struct netfs_=
io_request *wreq,
 	spin_unlock(&wreq->lock);
=20
 	wreq->transferred +=3D subreq->transferred;
-	iov_iter_advance(&wreq->buffer.iter, subreq->transferred);
+	if (subreq->transferred < subreq->len) {
+		bvecq_pos_unset(&wreq->dispatch_cursor);
+		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+	}
=20
 	stream->collected_to =3D subreq->start + subreq->transferred;
 	wreq->collected_to =3D stream->collected_to;
@@ -99,6 +103,9 @@ static int netfs_unbuffered_write(struct netfs_io_reques=
t *wreq)
=20
 	_enter("%llx", wreq->len);
=20
+	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
 	if (wreq->origin =3D=3D NETFS_DIO_WRITE)
 		inode_dio_begin(wreq->inode);
=20
@@ -111,6 +118,8 @@ static int netfs_unbuffered_write(struct netfs_io_reque=
st *wreq)
 			netfs_prepare_write(wreq, stream, wreq->start + wreq->transferred);
 			subreq =3D stream->construct;
 			stream->construct =3D NULL;
+		} else {
+			bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
 		}
=20
 		/* Check if (re-)preparation failed. */
@@ -120,16 +129,18 @@ static int netfs_unbuffered_write(struct netfs_io_req=
uest *wreq)
 			break;
 		}
=20
-		iov_iter_truncate(&subreq->io_iter, wreq->len - wreq->transferred);
+		subreq->len =3D bvecq_slice(&wreq->dispatch_cursor, stream->sreq_max_len,
+					  stream->sreq_max_segs, &subreq->nr_segs);
+		bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+		iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+				    subreq->content.bvecq, subreq->content.slot,
+				    subreq->content.offset,
+				    subreq->len);
+
 		if (!iov_iter_count(&subreq->io_iter))
 			break;
=20
-		subreq->len =3D netfs_limit_iter(&subreq->io_iter, 0,
-					       stream->sreq_max_len,
-					       stream->sreq_max_segs);
-		iov_iter_truncate(&subreq->io_iter, subreq->len);
-		stream->submit_extendable_to =3D subreq->len;
-
 		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 		stream->issue_write(subreq);
=20
@@ -166,8 +177,15 @@ static int netfs_unbuffered_write(struct netfs_io_requ=
est *wreq)
 		 */
 		subreq->error =3D -EAGAIN;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-		if (subreq->transferred > 0)
-			iov_iter_advance(&wreq->buffer.iter, subreq->transferred);
+
+		bvecq_pos_unset(&subreq->content);
+		bvecq_pos_unset(&wreq->dispatch_cursor);
+		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+
+		if (subreq->transferred > 0) {
+			wreq->transferred +=3D subreq->transferred;
+			bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+		}
=20
 		if (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER &&
 		    wreq->netfs_ops->retry_request)
@@ -176,7 +194,6 @@ static int netfs_unbuffered_write(struct netfs_io_reque=
st *wreq)
 		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
-		subreq->io_iter		=3D wreq->buffer.iter;
 		subreq->start		=3D wreq->start + wreq->transferred;
 		subreq->len		=3D wreq->len   - wreq->transferred;
 		subreq->transferred	=3D 0;
@@ -186,19 +203,14 @@ static int netfs_unbuffered_write(struct netfs_io_req=
uest *wreq)
=20
 		netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
=20
-		if (stream->prepare_write) {
+		if (stream->prepare_write)
 			stream->prepare_write(subreq);
-			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-			netfs_stat(&netfs_n_wh_retry_write_subreq);
-		} else {
-			struct iov_iter source;
-
-			netfs_reset_iter(subreq);
-			source =3D subreq->io_iter;
-			netfs_reissue_write(stream, subreq, &source);
-		}
+		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+		netfs_stat(&netfs_n_wh_retry_write_subreq);
 	}
=20
+	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&wreq->load_cursor);
 	netfs_unbuffered_write_done(wreq);
 	_leave(" =3D %d", ret);
 	return ret;
@@ -217,12 +229,12 @@ static void netfs_unbuffered_write_async(struct work_=
struct *work)
  * encrypted file.  This can also be used for direct I/O writes.
  */
 ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_=
iter *iter,
-						  struct netfs_group *netfs_group)
+					   struct netfs_group *netfs_group)
 {
 	struct netfs_io_request *wreq;
 	unsigned long long start =3D iocb->ki_pos;
 	unsigned long long end =3D start + iov_iter_count(iter);
-	ssize_t ret, n;
+	ssize_t ret;
 	size_t len =3D iov_iter_count(iter);
 	bool async =3D !is_sync_kiocb(iocb);
=20
@@ -256,25 +268,17 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kio=
cb *iocb, struct iov_iter *
 		 * allocate a sufficiently large bvec array and may shorten the
 		 * request.
 		 */
-		if (user_backed_iter(iter)) {
-			n =3D netfs_extract_user_iter(iter, len, &wreq->buffer.iter, 0);
-			if (n < 0) {
-				ret =3D n;
-				goto error_put;
-			}
-			wreq->direct_bv =3D (struct bio_vec *)wreq->buffer.iter.bvec;
-			wreq->direct_bv_count =3D n;
-			wreq->direct_bv_unpin =3D iov_iter_extract_will_pin(iter);
-		} else {
-			/* If this is a kernel-generated async DIO request,
-			 * assume that any resources the iterator points to
-			 * (eg. a bio_vec array) will persist till the end of
-			 * the op.
-			 */
-			wreq->buffer.iter =3D *iter;
-		}
+		ssize_t n =3D netfs_extract_iter(iter, len, INT_MAX, iocb->ki_pos,
+					       &wreq->load_cursor.bvecq, 0);
=20
-		wreq->len =3D iov_iter_count(&wreq->buffer.iter);
+		if (n < 0) {
+			ret =3D n;
+			goto error_put;
+		}
+		wreq->len =3D n;
+		_debug("dio-write %zx/%zx %u/%u",
+		       n, len, wreq->load_cursor.bvecq->nr_slots,
+		       wreq->load_cursor.bvecq->max_slots);
 	}
=20
 	__set_bit(NETFS_RREQ_USE_IO_ITER, &wreq->flags);
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 53e1fcc42a19..5674a57f2e22 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -7,7 +7,6 @@
=20
 #include <linux/slab.h>
 #include <linux/seq_file.h>
-#include <linux/folio_queue.h>
 #include <linux/netfs.h>
 #include <linux/fscache.h>
 #include <linux/fscache-cache.h>
@@ -69,9 +68,8 @@ static inline void netfs_proc_del_rreq(struct netfs_io_re=
quest *rreq) {}
 /*
  * misc.c
  */
-struct folio_queue *netfs_buffer_make_space(struct netfs_io_request *rreq,
-					    enum netfs_folioq_trace trace);
-void netfs_reset_iter(struct netfs_io_subrequest *subreq);
+struct bvecq *netfs_buffer_make_space(struct netfs_io_request *rreq,
+				      enum netfs_bvecq_trace trace);
 void netfs_wake_collector(struct netfs_io_request *rreq);
 void netfs_subreq_clear_in_progress(struct netfs_io_subrequest *subreq);
 void netfs_wait_for_in_progress_stream(struct netfs_io_request *rreq,
@@ -171,7 +169,6 @@ extern atomic_t netfs_n_wh_retry_write_req;
 extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
-extern atomic_t netfs_n_folioq;
 extern atomic_t netfs_n_bvecq;
=20
 int netfs_stats_show(struct seq_file *m, void *v);
@@ -209,8 +206,7 @@ void netfs_prepare_write(struct netfs_io_request *wreq,
 			 struct netfs_io_stream *stream,
 			 loff_t start);
 void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq,
-			 struct iov_iter *source);
+			 struct netfs_io_subrequest *subreq);
 void netfs_issue_write(struct netfs_io_request *wreq,
 		       struct netfs_io_stream *stream);
 size_t netfs_advance_write(struct netfs_io_request *wreq,
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index d2c3055a488c..10a25a618712 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -80,6 +80,7 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, size_t =
max_len, size_t max_pag
 		struct bio_vec *bv =3D bq->bv;
 		do {
 			struct page **pages;
+			unsigned int slot =3D 0;
 			ssize_t got;
 			size_t offset;
 			size_t space =3D bq->max_slots - bq->nr_slots;
@@ -120,14 +121,15 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, siz=
e_t max_len, size_t max_pag
 			do {
 				size_t len =3D umin(got, PAGE_SIZE - offset);
=20
-				BUG_ON(bq->nr_slots >=3D bq->max_slots);
+				BUG_ON(slot >=3D bq->max_slots);
=20
-				bvec_set_page(&bq->bv[bq->nr_slots],
-					      *pages++, len, offset);
-				bq->nr_slots++;
+				bvec_set_page(&bq->bv[slot], *pages++, len, offset);
+				slot++;
 				got -=3D len;
 				offset =3D 0;
 			} while (got > 0);
+
+			bvecq_filled_to(bq, slot);
 		} while (max_len > 0 && !bvecq_is_full(bq));
=20
 		max_pages -=3D bq->nr_slots;
@@ -138,6 +140,7 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, size_=
t max_len, size_t max_pag
 }
 EXPORT_SYMBOL_GPL(netfs_extract_iter);
=20
+#if 0
 /**
  * netfs_extract_user_iter - Extract the pages from a user iterator into a=
 bvec
  * @orig: The original iterator
@@ -431,3 +434,4 @@ size_t netfs_limit_iter(const struct iov_iter *iter, si=
ze_t start_offset,
 	BUG();
 }
 EXPORT_SYMBOL(netfs_limit_iter);
+#endif
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index f5c1c463f4ff..ee67a0681784 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -8,6 +8,7 @@
 #include <linux/swap.h>
 #include "internal.h"
=20
+#if 0
 /**
  * netfs_alloc_folioq_buffer - Allocate buffer space into a folio queue
  * @mapping: Address space to set on the folio (or NULL).
@@ -103,24 +104,7 @@ void netfs_free_folioq_buffer(struct folio_queue *fq)
 	folio_batch_release(&fbatch);
 }
 EXPORT_SYMBOL(netfs_free_folioq_buffer);
-
-/*
- * Reset the subrequest iterator to refer just to the region remaining to =
be
- * read.  The iterator may or may not have been advanced by socket ops or
- * extraction ops to an extent that may or may not match the amount actual=
ly
- * read.
- */
-void netfs_reset_iter(struct netfs_io_subrequest *subreq)
-{
-	struct iov_iter *io_iter =3D &subreq->io_iter;
-	size_t remain =3D subreq->len - subreq->transferred;
-
-	if (io_iter->count > remain)
-		iov_iter_advance(io_iter, io_iter->count - remain);
-	else if (io_iter->count < remain)
-		iov_iter_revert(io_iter, remain - io_iter->count);
-	iov_iter_truncate(&subreq->io_iter, remain);
-}
+#endif
=20
 /**
  * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeba=
ck
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index b8c4918d3dcd..7f5187c64ae9 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -119,7 +119,6 @@ static void netfs_free_request_rcu(struct rcu_head *rcu)
 static void netfs_deinit_request(struct netfs_io_request *rreq)
 {
 	struct netfs_inode *ictx =3D netfs_inode(rreq->inode);
-	unsigned int i;
=20
 	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
=20
@@ -134,16 +133,10 @@ static void netfs_deinit_request(struct netfs_io_requ=
est *rreq)
 		rreq->netfs_ops->free_request(rreq);
 	if (rreq->cache_resources.ops)
 		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
-	if (rreq->direct_bv) {
-		for (i =3D 0; i < rreq->direct_bv_count; i++) {
-			if (rreq->direct_bv[i].bv_page) {
-				if (rreq->direct_bv_unpin)
-					unpin_user_page(rreq->direct_bv[i].bv_page);
-			}
-		}
-		kvfree(rreq->direct_bv);
-	}
-	rolling_buffer_clear(&rreq->buffer);
+	bvecq_pos_unset(&rreq->load_cursor);
+	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&rreq->collect_cursor);
+	bvecq_put(rreq->spare);
=20
 	if (atomic_dec_and_test(&ictx->io_count))
 		wake_up_var(&ictx->io_count);
@@ -236,6 +229,8 @@ static void netfs_free_subrequest(struct netfs_io_subre=
quest *subreq)
 	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
 	if (rreq->netfs_ops->free_subrequest)
 		rreq->netfs_ops->free_subrequest(subreq);
+	bvecq_pos_unset(&subreq->dispatch_pos);
+	bvecq_pos_unset(&subreq->content);
 	mempool_free(subreq, rreq->netfs_ops->subrequest_pool ?: &netfs_subreques=
t_pool);
 	netfs_stat_d(&netfs_n_rh_sreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_subreq);
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index edf7cea7e2f9..977b69ac8725 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -27,9 +27,13 @@
  */
 static void netfs_clear_unread(struct netfs_io_subrequest *subreq)
 {
-	netfs_reset_iter(subreq);
-	WARN_ON_ONCE(subreq->len - subreq->transferred !=3D iov_iter_count(&subre=
q->io_iter));
-	iov_iter_zero(iov_iter_count(&subreq->io_iter), &subreq->io_iter);
+	struct iov_iter iter;
+
+	iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+	iov_iter_advance(&iter, subreq->transferred);
+	iov_iter_zero(iov_iter_count(&iter), &iter);
+
 	if (subreq->start + subreq->transferred >=3D subreq->rreq->i_size)
 		__set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
 }
@@ -40,11 +44,11 @@ static void netfs_clear_unread(struct netfs_io_subreque=
st *subreq)
  * dirty and let writeback handle it.
  */
 static void netfs_unlock_read_folio(struct netfs_io_request *rreq,
-				    struct folio_queue *folioq,
+				    struct bvecq *bvecq,
 				    int slot)
 {
 	struct netfs_folio *finfo;
-	struct folio *folio =3D folioq_folio(folioq, slot);
+	struct folio *folio =3D page_folio(bvecq->bv[slot].bv_page);
=20
 	if (unlikely(folio_pos(folio) < rreq->abandon_to)) {
 		trace_netfs_folio(folio, netfs_folio_trace_abandon);
@@ -75,7 +79,7 @@ static void netfs_unlock_read_folio(struct netfs_io_reque=
st *rreq,
 			trace_netfs_folio(folio, netfs_folio_trace_read_done);
 		}
=20
-		folioq_clear(folioq, slot);
+		bvecq->bv[slot].bv_page =3D NULL;
 	} else {
 		// TODO: Use of PG_private_2 is deprecated.
 		if (test_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags))
@@ -91,7 +95,7 @@ static void netfs_unlock_read_folio(struct netfs_io_reque=
st *rreq,
 		folio_unlock(folio);
 	}
=20
-	folioq_clear(folioq, slot);
+	bvecq->bv[slot].bv_page =3D NULL;
 }
=20
 /*
@@ -100,24 +104,15 @@ static void netfs_unlock_read_folio(struct netfs_io_r=
equest *rreq,
 static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
 				     unsigned int *notes)
 {
-	struct folio_queue *folioq =3D rreq->buffer.tail;
+	struct bvecq *bvecq =3D rreq->collect_cursor.bvecq;
 	unsigned long long collected_to =3D rreq->collected_to;
-	unsigned int slot =3D rreq->buffer.first_tail_slot;
+	unsigned int slot =3D rreq->collect_cursor.slot;
=20
 	if (rreq->cleaned_to >=3D rreq->collected_to)
 		return;
=20
 	// TODO: Begin decryption
=20
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D rolling_buffer_delete_spent(&rreq->buffer);
-		if (!folioq) {
-			rreq->front_folio_order =3D 0;
-			return;
-		}
-		slot =3D 0;
-	}
-
 	/* We have to wait for readahead refs to have been released before we
 	 * can unlock any folios as the ref-dropper walks i_pages and the only
 	 * thing preventing these folios from being removed is the folio lock.
@@ -131,16 +126,29 @@ static void netfs_read_unlock_folios(struct netfs_io_=
request *rreq,
 		unsigned int order;
 		size_t fsize;
=20
+		/* Clean up the head bvecq segment.  If we clear an entire
+		 * segment, then we can get rid of it provided it's not also
+		 * the tail segment being filled by the issuer.
+		 */
+		if (!bvecq_acquire_slot(bvecq, slot)) {
+			if (!bvecq_delete_spent(&rreq->collect_cursor, slot)) {
+				rreq->front_folio_order =3D 0;
+				return;
+			}
+			bvecq =3D rreq->collect_cursor.bvecq;
+			slot  =3D rreq->collect_cursor.slot;
+		}
+
 		if (*notes & COPY_TO_CACHE)
 			set_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
=20
-		folio =3D folioq_folio(folioq, slot);
+		folio =3D page_folio(bvecq->bv[slot].bv_page);
 		if (WARN_ONCE(!folio_test_locked(folio),
 			      "R=3D%08x: folio %lx is not locked\n",
 			      rreq->debug_id, folio->index))
 			trace_netfs_folio(folio, netfs_folio_trace_not_locked);
=20
-		order =3D folioq_folio_order(folioq, slot);
+		order =3D folio_order(folio);
 		rreq->front_folio_order =3D order;
 		fsize =3D PAGE_SIZE << order;
 		fpos =3D folio_pos(folio);
@@ -152,33 +160,19 @@ static void netfs_read_unlock_folios(struct netfs_io_=
request *rreq,
 		if (collected_to < fend)
 			break;
=20
-		netfs_unlock_read_folio(rreq, folioq, slot);
+		netfs_unlock_read_folio(rreq, bvecq, slot);
+		slot++;
 		WRITE_ONCE(rreq->cleaned_to, fpos + fsize);
 		*notes |=3D MADE_PROGRESS;
=20
 		clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
=20
-		/* Clean up the head folioq.  If we clear an entire folioq, then
-		 * we can get rid of it provided it's not also the tail folioq
-		 * being filled by the issuer.
-		 */
-		folioq_clear(folioq, slot);
-		slot++;
-		if (slot >=3D folioq_nr_slots(folioq)) {
-			folioq =3D rolling_buffer_delete_spent(&rreq->buffer);
-			if (!folioq)
-				goto done;
-			slot =3D 0;
-			trace_netfs_folioq(folioq, netfs_trace_folioq_read_progress);
-		}
-
 		if (fpos + fsize >=3D collected_to)
 			break;
 	}
=20
-	rreq->buffer.tail =3D folioq;
-done:
-	rreq->buffer.first_tail_slot =3D slot;
+	bvecq_pos_move(&rreq->collect_cursor, bvecq);
+	rreq->collect_cursor.slot =3D slot;
 }
=20
 /*
@@ -355,12 +349,17 @@ static void netfs_rreq_assess_dio(struct netfs_io_req=
uest *rreq)
=20
 	if (rreq->origin =3D=3D NETFS_UNBUFFERED_READ ||
 	    rreq->origin =3D=3D NETFS_DIO_READ) {
-		for (i =3D 0; i < rreq->direct_bv_count; i++) {
-			flush_dcache_page(rreq->direct_bv[i].bv_page);
-			// TODO: cifs marks pages in the destination buffer
-			// dirty under some circumstances after a read.  Do we
-			// need to do that too?
-			set_page_dirty(rreq->direct_bv[i].bv_page);
+		for (struct bvecq *bq =3D rreq->collect_cursor.bvecq; bq; bq =3D bq->nex=
t) {
+			/* Read the slot count before the slots. */
+			unsigned int nr_slots =3D bvecq_nr_slots_acquire(bq);
+
+			for (i =3D 0; i < nr_slots; i++) {
+				flush_dcache_page(bq->bv[i].bv_page);
+				// TODO: cifs marks pages in the destination buffer
+				// dirty under some circumstances after a read.  Do we
+				// need to do that too?
+				set_page_dirty(bq->bv[i].bv_page);
+			}
 		}
 	}
=20
@@ -451,7 +450,15 @@ bool netfs_read_collection(struct netfs_io_request *rr=
eq)
=20
 	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
 	netfs_clear_subrequests(rreq);
-	netfs_unlock_abandoned_read_pages(rreq);
+	switch (rreq->origin) {
+	case NETFS_READAHEAD:
+	case NETFS_READPAGE:
+	case NETFS_READ_FOR_WRITE:
+		netfs_unlock_abandoned_read_pages(rreq);
+		break;
+	default:
+		break;
+	}
 	if (unlikely(rreq->copy_to_cache))
 		netfs_pgpriv2_end_copy_to_cache(rreq);
 	return true;
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index a1489aa29f78..f9a0fb3e89e3 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -19,6 +19,9 @@
 static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct=
 folio *folio)
 {
 	struct netfs_io_stream *cache =3D &creq->io_streams[1];
+	struct bvecq *queue;
+	unsigned int slot;
+	size_t dio_size =3D PAGE_SIZE;
 	size_t fsize =3D folio_size(folio), flen =3D fsize;
 	loff_t fpos =3D folio_pos(folio), i_size;
 	bool to_eof =3D false;
@@ -48,17 +51,39 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_re=
quest *creq, struct folio
 		to_eof =3D true;
 	}
=20
+	flen =3D round_up(flen, dio_size);
+
 	_debug("folio %zx %zx", flen, fsize);
=20
 	trace_netfs_folio(folio, netfs_folio_trace_store_copy);
=20
-	/* Attach the folio to the rolling buffer. */
-	if (rolling_buffer_append(&creq->buffer, folio, 0) < 0) {
-		clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &creq->flags);
-		return;
+	/* Institute a new bvec queue segment if the current one is full or if
+	 * we encounter a discontiguity.  The discontiguity break is important
+	 * when it comes to bulk unlocking folios by file range.
+	 */
+	queue =3D creq->load_cursor.bvecq;
+	if (bvecq_is_full(queue) ||
+	    (fpos !=3D creq->last_end && creq->last_end > 0 && queue->nr_slots > =
0)) {
+		bvecq_buffer_append(&creq->load_cursor, creq->spare);
+		creq->spare =3D NULL;
+
+		queue =3D creq->load_cursor.bvecq;
+		queue->fpos =3D fpos;
+		if (fpos !=3D creq->last_end)
+			queue->discontig =3D true;
 	}
=20
-	cache->submit_extendable_to =3D fsize;
+	/* Attach the folio to the bvec queue. */
+	slot =3D queue->nr_slots;
+	bvec_set_folio(&queue->bv[slot], folio, fsize, 0);
+	trace_netfs_bv_slot(queue, slot);
+	slot++;
+	bvecq_filled_to(queue, slot);
+	creq->load_cursor.slot =3D slot;
+	creq->load_cursor.offset =3D 0;
+
+	bvecq_pos_nudge(&creq->dispatch_cursor);
+
 	cache->submit_off =3D 0;
 	cache->submit_len =3D flen;
=20
@@ -70,10 +95,9 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_req=
uest *creq, struct folio
 	do {
 		ssize_t part;
=20
-		creq->buffer.iter.iov_offset =3D cache->submit_off;
+		creq->dispatch_cursor.offset =3D cache->submit_off;
=20
 		atomic64_set(&creq->issued_to, fpos + cache->submit_off);
-		cache->submit_extendable_to =3D fsize - cache->submit_off;
 		part =3D netfs_advance_write(creq, cache, fpos + cache->submit_off,
 					   cache->submit_len, to_eof);
 		cache->submit_off +=3D part;
@@ -83,8 +107,7 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_req=
uest *creq, struct folio
 			cache->submit_len -=3D part;
 	} while (cache->submit_len > 0);
=20
-	creq->buffer.iter.iov_offset =3D 0;
-	rolling_buffer_advance(&creq->buffer, fsize);
+	bvecq_pos_step(&creq->dispatch_cursor);
 	atomic64_set(&creq->issued_to, fpos + fsize);
=20
 	if (flen < fsize)
@@ -110,6 +133,10 @@ static struct netfs_io_request *netfs_pgpriv2_begin_co=
py_to_cache(
 	if (!creq->io_streams[1].avail)
 		goto cancel_put;
=20
+	bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL);
+	bvecq_pos_set(&creq->dispatch_cursor, &creq->load_cursor);
+	bvecq_pos_set(&creq->collect_cursor, &creq->dispatch_cursor);
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
 	trace_netfs_copy2cache(rreq, creq);
 	trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
@@ -138,6 +165,14 @@ void netfs_pgpriv2_copy_to_cache(struct netfs_io_reque=
st *rreq, struct folio *fo
 	if (IS_ERR(creq))
 		return;
=20
+	if (!creq->spare) {
+		creq->spare =3D bvecq_alloc_one(BVECQ_STD_SLOTS, GFP_NOFS);
+		if (!creq->spare) {
+			clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &creq->flags);
+			return;
+		}
+	}
+
 	trace_netfs_folio(folio, netfs_folio_trace_copy_to_cache);
 	folio_start_private_2(folio);
 	netfs_pgpriv2_copy_folio(creq, folio);
@@ -170,22 +205,26 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_=
request *rreq)
  */
 bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *creq)
 {
-	struct folio_queue *folioq =3D creq->buffer.tail;
+	struct bvecq *bq =3D creq->collect_cursor.bvecq;
 	unsigned long long collected_to =3D creq->collected_to;
-	unsigned int slot =3D creq->buffer.first_tail_slot;
+	unsigned int slot;
 	bool made_progress =3D false;
=20
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D rolling_buffer_delete_spent(&creq->buffer);
-		slot =3D 0;
-	}
+	slot =3D creq->collect_cursor.slot;
=20
 	for (;;) {
 		struct folio *folio;
 		unsigned long long fpos, fend;
 		size_t fsize, flen;
=20
-		folio =3D folioq_folio(folioq, slot);
+		if (!bvecq_acquire_slot(bq, slot)) {
+			if (!bvecq_delete_spent(&creq->collect_cursor, slot))
+				return false;
+			bq   =3D creq->collect_cursor.bvecq;
+			slot =3D creq->collect_cursor.slot;
+		}
+
+		folio =3D page_folio(bq->bv[slot].bv_page);
 		if (WARN_ONCE(!folio_test_private_2(folio),
 			      "R=3D%08x: folio %lx is not marked private_2\n",
 			      creq->debug_id, folio->index))
@@ -208,25 +247,17 @@ bool netfs_pgpriv2_unlock_copied_folios(struct netfs_=
io_request *creq)
 		creq->cleaned_to =3D fpos + fsize;
 		made_progress =3D true;
=20
-		/* Clean up the head folioq.  If we clear an entire folioq, then
-		 * we can get rid of it provided it's not also the tail folioq
-		 * being filled by the issuer.
+		/* Clean up the head segment.  If we clear an entire segment,
+		 * then we can get rid of it provided it's not also the tail
+		 * segment being filled by the issuer.
 		 */
-		folioq_clear(folioq, slot);
+		bq->bv[slot].bv_page =3D NULL;
 		slot++;
-		if (slot >=3D folioq_nr_slots(folioq)) {
-			folioq =3D rolling_buffer_delete_spent(&creq->buffer);
-			if (!folioq)
-				goto done;
-			slot =3D 0;
-		}
=20
 		if (fpos + fsize >=3D collected_to)
 			break;
 	}
=20
-	creq->buffer.tail =3D folioq;
-done:
-	creq->buffer.first_tail_slot =3D slot;
+	creq->collect_cursor.slot =3D slot;
 	return made_progress;
 }
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index bf45b1f5f3e0..c45aef8dc03c 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -12,6 +12,12 @@
 static void netfs_reissue_read(struct netfs_io_request *rreq,
 			       struct netfs_io_subrequest *subreq)
 {
+	bvecq_pos_unset(&subreq->content);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+	iov_iter_advance(&subreq->io_iter, subreq->transferred);
+
 	subreq->error =3D 0;
 	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
@@ -27,6 +33,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_=
request *rreq)
 {
 	struct netfs_io_subrequest *subreq;
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
+	struct bvecq_pos dispatch_cursor =3D {};
 	struct list_head *next;
=20
 	_enter("R=3D%x", rreq->debug_id);
@@ -46,9 +53,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_=
request *rreq)
 			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
 				break;
 			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 				subreq->retry_count++;
-				netfs_reset_iter(subreq);
 				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
 				netfs_reissue_read(rreq, subreq);
 			}
@@ -74,11 +79,12 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
=20
 	do {
 		struct netfs_io_subrequest *from, *to, *tmp;
-		struct iov_iter source;
 		unsigned long long start, len;
 		size_t part;
 		bool boundary =3D false, subreq_superfluous =3D false;
=20
+		bvecq_pos_unset(&dispatch_cursor);
+
 		/* Go through the subreqs and find the next span of contiguous
 		 * buffer that we then rejig (cifs, for example, needs the
 		 * rsize renegotiating) and reissue.
@@ -100,7 +106,8 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
=20
 		list_for_each_continue(next, &stream->subrequests) {
 			subreq =3D list_entry(next, struct netfs_io_subrequest, rreq_link);
-			if (subreq->start + subreq->transferred !=3D start + len ||
+			if (subreq->start !=3D start + len ||
+			    subreq->transferred > 0 ||
 			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
 			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
@@ -113,11 +120,14 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 		/* Determine the set of buffers we're going to use.  Each
 		 * subreq gets a subset of a single overall contiguous buffer.
 		 */
-		netfs_reset_iter(from);
-		source =3D from->io_iter;
-		source.count =3D len;
+		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&dispatch_cursor, from->transferred);
+		from->transferred =3D 0;
=20
-		/* Work through the sublist. */
+		/* Work through the sublist.  The chain of buffers we're going
+		 * to fill is attached to dispatch_cursor and we need to read
+		 * 'len' amount of data from 'start'.
+		 */
 		subreq =3D from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
 			if (!len) {
@@ -125,16 +135,22 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 				break;
 			}
 			subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
-			subreq->start	=3D start - subreq->transferred;
-			subreq->len	=3D len   + subreq->transferred;
+			subreq->start	=3D start;
+			subreq->len	=3D len;
 			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 			__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 			subreq->retry_count++;
+			subreq->transferred =3D 0;
+
+			bvecq_pos_unset(&subreq->content);
+			bvecq_pos_unset(&subreq->dispatch_pos);
+			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
=20
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
=20
 			/* Renegotiate max_len (rsize) */
-			stream->sreq_max_len =3D subreq->len;
+			stream->sreq_max_len =3D len;
+			stream->sreq_max_segs =3D INT_MAX;
 			if (rreq->netfs_ops->prepare_read &&
 			    rreq->netfs_ops->prepare_read(subreq) < 0) {
 				trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
@@ -142,13 +158,12 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 				goto abandon;
 			}
=20
-			part =3D umin(len, stream->sreq_max_len);
-			if (unlikely(stream->sreq_max_segs))
-				part =3D netfs_limit_iter(&source, 0, part, stream->sreq_max_segs);
-			subreq->len =3D subreq->transferred + part;
-			subreq->io_iter =3D source;
-			iov_iter_truncate(&subreq->io_iter, part);
-			iov_iter_advance(&source, part);
+			part =3D bvecq_slice(&dispatch_cursor,
+					   umin(len, stream->sreq_max_len),
+					   stream->sreq_max_segs,
+					   &subreq->nr_segs);
+			subreq->len =3D part;
+
 			len -=3D part;
 			start +=3D part;
 			if (!len) {
@@ -212,9 +227,7 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
=20
 			stream->sreq_max_len	=3D umin(len, rreq->rsize);
-			stream->sreq_max_segs	=3D 0;
-			if (unlikely(stream->sreq_max_segs))
-				part =3D netfs_limit_iter(&source, 0, part, stream->sreq_max_segs);
+			stream->sreq_max_segs	=3D INT_MAX;
=20
 			netfs_stat(&netfs_n_rh_download);
 			if (rreq->netfs_ops->prepare_read(subreq) < 0) {
@@ -223,11 +236,12 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 				goto abandon;
 			}
=20
-			part =3D umin(len, stream->sreq_max_len);
-			subreq->len =3D subreq->transferred + part;
-			subreq->io_iter =3D source;
-			iov_iter_truncate(&subreq->io_iter, part);
-			iov_iter_advance(&source, part);
+			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
+			part =3D bvecq_slice(&dispatch_cursor,
+					   umin(len, stream->sreq_max_len),
+					   stream->sreq_max_segs,
+					   &subreq->nr_segs);
+			subreq->len =3D part;
=20
 			len -=3D part;
 			start +=3D part;
@@ -241,6 +255,8 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
=20
 	} while (!list_is_head(next, &stream->subrequests));
=20
+out:
+	bvecq_pos_unset(&dispatch_cursor);
 	return;
=20
 	/* If we hit an error, fail all remaining incomplete subrequests */
@@ -257,6 +273,7 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 		__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
 		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 	}
+	goto out;
 }
=20
 /*
@@ -287,23 +304,24 @@ void netfs_retry_reads(struct netfs_io_request *rreq)
  */
 void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
 {
-	struct folio_queue *p;
-
-	for (p =3D rreq->buffer.tail; p; p =3D p->next) {
-		for (int slot =3D 0; slot < folioq_count(p); slot++) {
-			struct folio *folio =3D folioq_folio(p, slot);
-
-			if (folio && !folioq_is_marked2(p, slot)) {
-				if (folio =3D=3D rreq->no_unlock_folio &&
-				    test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO,
-					     &rreq->flags)) {
-					_debug("no unlock");
-				} else {
-					trace_netfs_folio(folio,
-						netfs_folio_trace_abandon);
-					folio_unlock(folio);
-				}
+	struct bvecq *p;
+
+	for (p =3D rreq->collect_cursor.bvecq; p; p =3D p->next) {
+		unsigned int nr_slots =3D bvecq_nr_slots_acquire(p);
+
+		for (int slot =3D 0; slot < nr_slots; slot++) {
+			if (!p->bv[slot].bv_page)
+				continue;
+
+			struct folio *folio =3D page_folio(p->bv[slot].bv_page);
+
+			if (folio =3D=3D rreq->no_unlock_folio &&
+			    test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO, &rreq->flags)) {
+				_debug("no unlock");
+				continue;
 			}
+			trace_netfs_folio(folio, netfs_folio_trace_abandon);
+			folio_unlock(folio);
 		}
 	}
 }
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index af16c91947b5..98938a54810e 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -93,7 +93,12 @@ static int netfs_single_dispatch_read(struct netfs_io_re=
quest *rreq)
 	subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
 	subreq->start	=3D 0;
 	subreq->len	=3D rreq->len;
-	subreq->io_iter	=3D rreq->buffer.iter;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
+
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
=20
 	netfs_queue_read(rreq, subreq);
=20
@@ -177,6 +182,10 @@ ssize_t netfs_read_single(struct inode *inode, struct =
file *file, struct iov_ite
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
=20
+	ret =3D netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->dispatch_c=
ursor.bvecq, 0);
+	if (ret < 0)
+		goto cleanup_free;
+
 	ret =3D netfs_single_begin_cache_read(rreq, ictx);
 	if (ret =3D=3D -ENOMEM || ret =3D=3D -EINTR || ret =3D=3D -ERESTARTSYS)
 		goto cleanup_free;
@@ -184,7 +193,6 @@ ssize_t netfs_read_single(struct inode *inode, struct f=
ile *file, struct iov_ite
 	netfs_stat(&netfs_n_rh_read_single);
 	trace_netfs_read(rreq, 0, rreq->len, netfs_read_trace_read_single);
=20
-	rreq->buffer.iter =3D *iter;
 	netfs_single_dispatch_read(rreq);
=20
 	ret =3D netfs_wait_for_read(rreq);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 84c2a4bcc762..1dfb5667b931 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -47,7 +47,6 @@ atomic_t netfs_n_wh_retry_write_req;
 atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
-atomic_t netfs_n_folioq;
 atomic_t netfs_n_bvecq;
=20
 int netfs_stats_show(struct seq_file *m, void *v)
@@ -91,11 +90,10 @@ int netfs_stats_show(struct seq_file *m, void *v)
 		   atomic_read(&netfs_n_rh_retry_read_subreq),
 		   atomic_read(&netfs_n_wh_retry_write_req),
 		   atomic_read(&netfs_n_wh_retry_write_subreq));
-	seq_printf(m, "Objs   : rr=3D%u sr=3D%u bq=3D%u foq=3D%u wsc=3D%u\n",
+	seq_printf(m, "Objs   : rr=3D%u sr=3D%u bq=3D%u wsc=3D%u\n",
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq),
 		   atomic_read(&netfs_n_bvecq),
-		   atomic_read(&netfs_n_folioq),
 		   atomic_read(&netfs_n_wh_wstream_conflict));
 	seq_printf(m, "WbLock : skip=3D%u wait=3D%u\n",
 		   atomic_read(&netfs_n_wb_lock_skip),
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index 9e837cf0eb8f..a91b34cf01f5 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -114,12 +114,12 @@ int netfs_folio_written_back(struct folio *folio)
 static void netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
 					  unsigned int *notes)
 {
-	struct folio_queue *folioq =3D wreq->buffer.tail;
+	struct bvecq *bvecq =3D wreq->collect_cursor.bvecq;
 	unsigned long long collected_to =3D wreq->collected_to;
-	unsigned int slot =3D wreq->buffer.first_tail_slot;
+	unsigned int slot =3D wreq->collect_cursor.slot;
=20
-	if (WARN_ON_ONCE(!folioq)) {
-		pr_err("[!] Writeback unlock found empty rolling buffer!\n");
+	if (WARN_ON_ONCE(!bvecq)) {
+		pr_err("[!] Writeback unlock found empty buffer!\n");
 		netfs_dump_request(wreq);
 		return;
 	}
@@ -130,20 +130,27 @@ static void netfs_writeback_unlock_folios(struct netf=
s_io_request *wreq,
 		return;
 	}
=20
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D rolling_buffer_delete_spent(&wreq->buffer);
-		if (!folioq)
-			return;
-		slot =3D 0;
-	}
-
 	for (;;) {
 		struct folio *folio;
 		struct netfs_folio *finfo;
 		unsigned long long fpos, fend;
 		size_t fsize, flen;
=20
-		folio =3D folioq_folio(folioq, slot);
+		/* Try to clean up the head of the queue if it appears to be
+		 * used up, but we need to be very careful - the cleanup can
+		 * catch up with the dispatcher and leave us with nothing
+		 * left in the queue, causing the front and back pointers to
+		 * end up on different tracks.  To avoid this, we must always
+		 * keep at least one segment in the queue.
+		 */
+		if (!bvecq_acquire_slot(bvecq, slot)) {
+			if (!bvecq_delete_spent(&wreq->collect_cursor, slot))
+				return;
+			bvecq =3D wreq->collect_cursor.bvecq;
+			slot  =3D wreq->collect_cursor.slot;
+		}
+
+		folio =3D page_folio(bvecq->bv[slot].bv_page);
 		if (WARN_ONCE(!folio_test_writeback(folio),
 			      "R=3D%08x: folio %lx is not under writeback\n",
 			      wreq->debug_id, folio->index))
@@ -166,26 +173,13 @@ static void netfs_writeback_unlock_folios(struct netf=
s_io_request *wreq,
 		wreq->cleaned_to =3D fpos + fsize;
 		*notes |=3D MADE_PROGRESS;
=20
-		/* Clean up the head folioq.  If we clear an entire folioq, then
-		 * we can get rid of it provided it's not also the tail folioq
-		 * being filled by the issuer.
-		 */
-		folioq_clear(folioq, slot);
+		bvecq->bv[slot].bv_page =3D NULL;
 		slot++;
-		if (slot >=3D folioq_nr_slots(folioq)) {
-			folioq =3D rolling_buffer_delete_spent(&wreq->buffer);
-			if (!folioq)
-				goto done;
-			slot =3D 0;
-		}
-
 		if (fpos + fsize >=3D collected_to)
 			break;
 	}
=20
-	wreq->buffer.tail =3D folioq;
-done:
-	wreq->buffer.first_tail_slot =3D slot;
+	wreq->collect_cursor.slot =3D slot;
 }
=20
 /*
@@ -230,7 +224,8 @@ static void netfs_collect_write_results(struct netfs_io=
_request *wreq)
 	trace_netfs_rreq(wreq, netfs_rreq_trace_collect);
=20
 reassess_streams:
-	issued_to =3D atomic64_read(&wreq->issued_to);
+	/* Order reading the issued_to point before reading the queue it refers t=
o. */
+	issued_to =3D atomic64_read_acquire(&wreq->issued_to);
 	smp_rmb();
 	collected_to =3D ULLONG_MAX;
 	if (wreq->origin =3D=3D NETFS_WRITEBACK ||
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index b2f626568fe5..986a578fd0da 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -108,8 +108,6 @@ struct netfs_io_request *netfs_create_write_req(struct =
address_space *mapping,
 	ictx =3D netfs_inode(wreq->inode);
 	if (is_cacheable && netfs_is_cache_enabled(ictx))
 		fscache_begin_write_operation(&wreq->cache_resources, netfs_i_cookie(ict=
x));
-	if (rolling_buffer_init(&wreq->buffer, wreq->debug_id, ITER_SOURCE) < 0)
-		goto nomem;
=20
 	wreq->cleaned_to =3D wreq->start;
 	if (wreq->cache_resources.dio_size > 1)
@@ -134,9 +132,6 @@ struct netfs_io_request *netfs_create_write_req(struct =
address_space *mapping,
 	}
=20
 	return wreq;
-nomem:
-	netfs_put_failed_request(wreq);
-	return ERR_PTR(-ENOMEM);
 }
=20
 /**
@@ -161,21 +156,13 @@ void netfs_prepare_write(struct netfs_io_request *wre=
q,
 			 loff_t start)
 {
 	struct netfs_io_subrequest *subreq;
-	struct iov_iter *wreq_iter =3D &wreq->buffer.iter;
-
-	/* Make sure we don't point the iterator at a used-up folio_queue
-	 * struct being used as a placeholder to prevent the queue from
-	 * collapsing.  In such a case, extend the queue.
-	 */
-	if (iov_iter_is_folioq(wreq_iter) &&
-	    wreq_iter->folioq_slot >=3D folioq_nr_slots(wreq_iter->folioq))
-		rolling_buffer_make_space(&wreq->buffer);
=20
 	subreq =3D netfs_alloc_subrequest(wreq);
 	subreq->source		=3D stream->source;
 	subreq->start		=3D start;
 	subreq->stream_nr	=3D stream->stream_nr;
-	subreq->io_iter		=3D *wreq_iter;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
=20
 	_enter("R=3D%x[%x]", wreq->debug_id, subreq->debug_index);
=20
@@ -256,15 +243,15 @@ static void netfs_do_issue_write(struct netfs_io_stre=
am *stream,
 }
=20
 void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq,
-			 struct iov_iter *source)
+			 struct netfs_io_subrequest *subreq)
 {
-	size_t size =3D subreq->len - subreq->transferred;
-
 	// TODO: Use encrypted buffer
-	subreq->io_iter =3D *source;
-	iov_iter_advance(source, size);
-	iov_iter_truncate(&subreq->io_iter, size);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+			    subreq->content.bvecq, subreq->content.slot,
+			    subreq->content.offset,
+			    subreq->len);
+	iov_iter_advance(&subreq->io_iter, subreq->transferred);
=20
 	subreq->retry_count++;
 	subreq->error =3D 0;
@@ -282,8 +269,67 @@ void netfs_issue_write(struct netfs_io_request *wreq,
 	if (!subreq)
 		return;
=20
+	/* If we have a write to the cache, we need to round out the first and
+	 * last entries (only those as the data will be on virtually contiguous
+	 * folios) to cache DIO boundaries.
+	 */
+	if (subreq->source =3D=3D NETFS_WRITE_TO_CACHE) {
+		struct bvecq_pos tmp_pos;
+		struct bio_vec *bv;
+		struct bvecq *bq;
+		size_t dio_size =3D wreq->cache_resources.dio_size;
+		size_t disp, len;
+		int ret;
+
+		bvecq_pos_set(&tmp_pos, &subreq->dispatch_pos);
+		ret =3D bvecq_extract(&tmp_pos, subreq->len, INT_MAX, &subreq->content.b=
vecq);
+		bvecq_pos_unset(&tmp_pos);
+		if (ret < 0) {
+			netfs_write_subrequest_terminated(subreq, -ENOMEM);
+			return;
+		}
+
+		/* Round the first entry down.  We should be able to get away
+		 * with this as this path only happens for buffered reads and
+		 * writes.  As such, a bio_vec must always point to a whole
+		 * folio (or part thereof) in the pagecache with writeback set,
+		 * so presuming that dio_size < folio size, we should be able
+		 * to round out bv_offset and bv_len.
+		 *
+		 * Further, streaming-write pages don't get sent to the cache
+		 * (and aren't normally generated if there is a cache), so we
+		 * only see fully uptodate pages here.
+		 */
+		bq =3D subreq->content.bvecq;
+		bv =3D &bq->bv[0];
+		disp =3D bv->bv_offset & (dio_size - 1);
+		if (disp) {
+			bv->bv_offset -=3D disp;
+			bv->bv_len +=3D disp;
+			bq->fpos -=3D disp;
+			subreq->start -=3D disp;
+			subreq->len +=3D disp;
+		}
+
+		/* Round the end of the last entry up. */
+		while (bq->next)
+			bq =3D bq->next;
+		bv =3D &bq->bv[bq->nr_slots - 1];
+		len =3D round_up(bv->bv_len, dio_size);
+		if (len > bv->bv_len) {
+			subreq->len +=3D len - bv->bv_len;
+			bv->bv_len =3D len;
+		}
+	} else {
+		bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	}
+
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+			    subreq->content.bvecq, subreq->content.slot,
+			    subreq->content.offset,
+			    subreq->len);
+
 	stream->construct =3D NULL;
-	subreq->io_iter.count =3D subreq->len;
 	netfs_do_issue_write(stream, subreq);
 }
=20
@@ -320,7 +366,6 @@ size_t netfs_advance_write(struct netfs_io_request *wre=
q,
 	_debug("part %zx/%zx %zx/%zx", subreq->len, stream->sreq_max_len, part, l=
en);
 	subreq->len +=3D part;
 	subreq->nr_segs++;
-	stream->submit_extendable_to -=3D part;
=20
 	if (subreq->len >=3D stream->sreq_max_len ||
 	    subreq->nr_segs >=3D stream->sreq_max_segs ||
@@ -344,7 +389,8 @@ static int netfs_write_folio(struct netfs_io_request *w=
req,
 	struct netfs_io_stream *stream;
 	struct netfs_group *fgroup; /* TODO: Use this with ceph */
 	struct netfs_folio *finfo;
-	size_t iter_off =3D 0;
+	struct bvecq *queue =3D wreq->load_cursor.bvecq;
+	unsigned int slot;
 	size_t fsize =3D folio_size(folio), flen =3D fsize, foff =3D 0;
 	loff_t fpos =3D folio_pos(folio), i_size;
 	bool to_eof =3D false, streamw =3D false;
@@ -352,8 +398,13 @@ static int netfs_write_folio(struct netfs_io_request *=
wreq,
=20
 	_enter("");
=20
-	if (rolling_buffer_make_space(&wreq->buffer) < 0)
-		return -ENOMEM;
+	if (!wreq->spare) {
+		wreq->spare =3D bvecq_alloc_one(BVECQ_STD_SLOTS, GFP_NOFS);
+		if (!wreq->spare) {
+			folio_unlock(folio);
+			return -ENOMEM;
+		}
+	}
=20
 	/* netfs_perform_write() may shift i_size around the page or from out
 	 * of the page to beyond it, but cannot move i_size into or through the
@@ -453,8 +504,32 @@ static int netfs_write_folio(struct netfs_io_request *=
wreq,
 		trace_netfs_folio(folio, netfs_folio_trace_store_plus);
 	}
=20
+	/* Institute a new bvec queue segment if the current one is full or if
+	 * we encounter a discontiguity.  The discontiguity break is important
+	 * when it comes to bulk unlocking folios by file range.
+	 */
+	if (bvecq_is_full(queue) ||
+	    (fpos !=3D wreq->last_end && wreq->last_end > 0)) {
+		bvecq_buffer_append(&wreq->load_cursor, wreq->spare);
+		wreq->spare =3D NULL;
+
+		queue =3D wreq->load_cursor.bvecq;
+		queue->fpos =3D fpos;
+		if (fpos !=3D wreq->last_end)
+			queue->discontig =3D true;
+		bvecq_pos_move(&wreq->dispatch_cursor, queue);
+		wreq->dispatch_cursor.slot =3D 0;
+	}
+
 	/* Attach the folio to the rolling buffer. */
-	rolling_buffer_append(&wreq->buffer, folio, 0);
+	slot =3D queue->nr_slots;
+	bvec_set_folio(&queue->bv[slot], folio, flen, 0);
+	trace_netfs_bv_slot(queue, slot);
+	slot++;
+	bvecq_filled_to(queue, slot);
+	wreq->load_cursor.slot =3D slot;
+	wreq->load_cursor.offset =3D 0;
+	wreq->last_end =3D fpos + foff + flen;
=20
 	/* Move the submission point forward to allow for write-streaming data
 	 * not starting at the front of the page.  We don't do write-streaming
@@ -463,9 +538,11 @@ static int netfs_write_folio(struct netfs_io_request *=
wreq,
 	 * Also skip uploading for data that's been read and just needs copying
 	 * to the cache.
 	 */
+	bvecq_pos_nudge(&wreq->dispatch_cursor);
+
 	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
 		stream =3D &wreq->io_streams[s];
-		stream->submit_off =3D foff;
+		stream->submit_off =3D 0;
 		stream->submit_len =3D flen;
 		if (!stream->avail ||
 		    (stream->source =3D=3D NETFS_WRITE_TO_CACHE && streamw) ||
@@ -500,15 +577,11 @@ static int netfs_write_folio(struct netfs_io_request =
*wreq,
 			break;
 		stream =3D &wreq->io_streams[choose_s];
=20
-		/* Advance the iterator(s). */
-		if (stream->submit_off > iter_off) {
-			rolling_buffer_advance(&wreq->buffer, stream->submit_off - iter_off);
-			iter_off =3D stream->submit_off;
-		}
+		/* Advance the cursor. */
+		wreq->dispatch_cursor.offset =3D stream->submit_off;
=20
-		atomic64_set(&wreq->issued_to, fpos + stream->submit_off);
-		stream->submit_extendable_to =3D fsize - stream->submit_off;
-		part =3D netfs_advance_write(wreq, stream, fpos + stream->submit_off,
+		atomic64_set(&wreq->issued_to, fpos + foff + stream->submit_off);
+		part =3D netfs_advance_write(wreq, stream, fpos + foff + stream->submit_=
off,
 					   stream->submit_len, to_eof);
 		stream->submit_off +=3D part;
 		if (part > stream->submit_len)
@@ -519,9 +592,9 @@ static int netfs_write_folio(struct netfs_io_request *w=
req,
 			debug =3D true;
 	}
=20
-	if (fsize > iter_off)
-		rolling_buffer_advance(&wreq->buffer, fsize - iter_off);
-	atomic64_set(&wreq->issued_to, fpos + fsize);
+	bvecq_pos_step(&wreq->dispatch_cursor);
+	/* Order loading the queue before updating the issued_to point */
+	atomic64_set_release(&wreq->issued_to, fpos + fsize);
=20
 	if (!debug)
 		kdebug("R=3D%x: No submit", wreq->debug_id);
@@ -589,6 +662,11 @@ int netfs_writepages(struct address_space *mapping,
 		goto couldnt_start;
 	}
=20
+	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0)
+		goto nomem;
+	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback);
 	netfs_stat(&netfs_n_wh_writepages);
@@ -613,12 +691,17 @@ int netfs_writepages(struct address_space *mapping,
 	netfs_end_issue_write(wreq);
=20
 	mutex_unlock(&ictx->wb_lock);
+	bvecq_pos_unset(&wreq->load_cursor);
+	bvecq_pos_unset(&wreq->dispatch_cursor);
 	netfs_wake_collector(wreq);
=20
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
 	_leave(" =3D %d", error);
 	return error;
=20
+nomem:
+	error =3D -ENOMEM;
+	netfs_put_failed_request(wreq);
 couldnt_start:
 	netfs_kill_dirty_pages(mapping, wbc, folio);
 out:
@@ -645,6 +728,15 @@ struct netfs_io_request *netfs_begin_writethrough(stru=
ct kiocb *iocb, size_t len
 		return wreq;
 	}
=20
+	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0) {
+		netfs_put_failed_request(wreq);
+		mutex_unlock(&ictx->wb_lock);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
 	wreq->io_streams[0].avail =3D true;
 	trace_netfs_write(wreq, netfs_write_trace_writethrough);
 	return wreq;
@@ -662,8 +754,8 @@ int netfs_advance_writethrough(struct netfs_io_request =
*wreq, struct writeback_c
 {
 	int ret;
=20
-	_enter("R=3D%x ic=3D%zu ws=3D%u cp=3D%zu tp=3D%u",
-	       wreq->debug_id, wreq->buffer.iter.count, wreq->wsize, copied, to_p=
age_end);
+	_enter("R=3D%x ws=3D%u cp=3D%zu tp=3D%u",
+	       wreq->debug_id, wreq->wsize, copied, to_page_end);
=20
 	/* The folio is locked. */
=20
@@ -719,6 +811,9 @@ ssize_t netfs_end_writethrough(struct netfs_io_request =
*wreq, struct writeback_c
=20
 	mutex_unlock(&ictx->wb_lock);
=20
+	bvecq_pos_unset(&wreq->load_cursor);
+	bvecq_pos_unset(&wreq->dispatch_cursor);
+
 	if (wreq->iocb)
 		ret =3D -EIOCBQUEUED;
 	else
@@ -734,10 +829,10 @@ ssize_t netfs_end_writethrough(struct netfs_io_reques=
t *wreq, struct writeback_c
  * @iter: Data to write.
  *
  * Write a monolithic, non-pagecache object back to the server and/or
- * the cache.
+ * the cache.  There's a maximum of one subrequest per stream.
  *
  * Return: 0 if successful; 1 if skipped due to lock conflict and WB_SYNC_=
NONE;
  * or a negative error code.
  */
 int netfs_writeback_single(struct address_space *mapping,
 			   struct writeback_control *wbc,
@@ -763,10 +859,18 @@ int netfs_writeback_single(struct address_space *mapp=
ing,
 		ret =3D PTR_ERR(wreq);
 		goto couldnt_start;
 	}
-
-	wreq->buffer.iter =3D *iter;
 	wreq->len =3D iov_iter_count(iter);
=20
+	ret =3D netfs_extract_iter(iter, wreq->len, INT_MAX, 0, &wreq->dispatch_c=
ursor.bvecq, 0);
+	if (ret < 0)
+		goto cleanup_free;
+	if (ret < wreq->len) {
+		ret =3D -EIO;
+		goto cleanup_free;
+	}
+
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback_single);
 	netfs_stat(&netfs_n_wh_writepages);
@@ -786,11 +890,11 @@ int netfs_writeback_single(struct address_space *mapp=
ing,
 		subreq =3D stream->construct;
 		subreq->len =3D wreq->len;
 		stream->submit_len =3D subreq->len;
-		stream->submit_extendable_to =3D round_up(wreq->len, PAGE_SIZE);
=20
 		netfs_issue_write(wreq, stream);
 	}
=20
+	wreq->submitted =3D wreq->len;
 	smp_wmb(); /* Write lists before ALL_QUEUED. */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
=20
@@ -806,6 +910,8 @@ int netfs_writeback_single(struct address_space *mappin=
g,
 	_leave(" =3D %d", ret);
 	return ret;
=20
+cleanup_free:
+	netfs_put_failed_request(wreq);
 couldnt_start:
 	mutex_unlock(&ictx->wb_lock);
 	_leave(" =3D %d", ret);
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 32e1058bf252..de2f9b196fa5 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -17,6 +17,7 @@
 static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 				     struct netfs_io_stream *stream)
 {
+	struct bvecq_pos dispatch_cursor =3D {};
 	struct list_head *next;
=20
 	_enter("R=3D%x[%x:]", wreq->debug_id, stream->stream_nr);
@@ -39,12 +40,8 @@ static void netfs_retry_write_stream(struct netfs_io_req=
uest *wreq,
 			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
 				break;
 			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				struct iov_iter source;
-
-				netfs_reset_iter(subreq);
-				source =3D subreq->io_iter;
 				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-				netfs_reissue_write(stream, subreq, &source);
+				netfs_reissue_write(stream, subreq);
 			}
 		}
 		return;
@@ -54,11 +51,12 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
=20
 	do {
 		struct netfs_io_subrequest *subreq =3D NULL, *from, *to, *tmp;
-		struct iov_iter source;
 		unsigned long long start, len;
 		size_t part;
 		bool boundary =3D false;
=20
+		bvecq_pos_unset(&dispatch_cursor);
+
 		/* Go through the stream and find the next span of contiguous
 		 * data that we then rejig (cifs, for example, needs the wsize
 		 * renegotiating) and reissue.
@@ -70,11 +68,12 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
=20
 		if (test_bit(NETFS_SREQ_FAILED, &from->flags) ||
 		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags))
-			return;
+			goto out;
=20
 		list_for_each_continue(next, &stream->subrequests) {
 			subreq =3D list_entry(next, struct netfs_io_subrequest, rreq_link);
-			if (subreq->start + subreq->transferred !=3D start + len ||
+			if (subreq->start !=3D start + len ||
+			    subreq->transferred > 0 ||
 			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
 			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
@@ -85,11 +84,13 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
 		/* Determine the set of buffers we're going to use.  Each
 		 * subreq gets a subset of a single overall contiguous buffer.
 		 */
-		netfs_reset_iter(from);
-		source =3D from->io_iter;
-		source.count =3D len;
+		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&dispatch_cursor, from->transferred);
=20
-		/* Work through the sublist. */
+		/* Work through the sublist.  The chain of buffers we're going
+		 * to fill is attached to dispatch_cursor and we need to read
+		 * 'len' amount of data from 'start'.
+		 */
 		subreq =3D from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
 			if (!len)
@@ -99,16 +100,23 @@ static void netfs_retry_write_stream(struct netfs_io_r=
equest *wreq,
 			subreq->len	=3D len;
 			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
+			subreq->transferred =3D 0;
+
+			bvecq_pos_unset(&subreq->dispatch_pos);
+			bvecq_pos_unset(&subreq->content);
=20
 			/* Renegotiate max_len (wsize) */
 			stream->sreq_max_len =3D len;
+			stream->sreq_max_segs =3D INT_MAX;
 			stream->prepare_write(subreq);
=20
-			part =3D umin(len, stream->sreq_max_len);
-			if (unlikely(stream->sreq_max_segs))
-				part =3D netfs_limit_iter(&source, 0, part, stream->sreq_max_segs);
+			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
+			part =3D bvecq_slice(&dispatch_cursor,
+					   umin(len, stream->sreq_max_len),
+					   stream->sreq_max_segs,
+					   &subreq->nr_segs);
 			subreq->len =3D part;
-			subreq->transferred =3D 0;
+
 			len -=3D part;
 			start +=3D part;
 			if (len && subreq =3D=3D to &&
@@ -116,7 +124,7 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
 				boundary =3D true;
=20
 			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			netfs_reissue_write(stream, subreq, &source);
+			netfs_reissue_write(stream, subreq);
 			if (subreq =3D=3D to)
 				break;
 		}
@@ -177,8 +185,13 @@ static void netfs_retry_write_stream(struct netfs_io_r=
equest *wreq,
=20
 			stream->prepare_write(subreq);
=20
-			part =3D umin(len, stream->sreq_max_len);
+			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
+			part =3D bvecq_slice(&dispatch_cursor,
+					   umin(len, stream->sreq_max_len),
+					   stream->sreq_max_segs,
+					   &subreq->nr_segs);
 			subreq->len =3D subreq->transferred + part;
+
 			len -=3D part;
 			start +=3D part;
 			if (!len && boundary) {
@@ -186,13 +199,16 @@ static void netfs_retry_write_stream(struct netfs_io_=
request *wreq,
 				boundary =3D false;
 			}
=20
-			netfs_reissue_write(stream, subreq, &source);
+			netfs_reissue_write(stream, subreq);
 			if (!len)
 				break;
=20
 		} while (len);
=20
 	} while (!list_is_head(next, &stream->subrequests));
+
+out:
+	bvecq_pos_unset(&dispatch_cursor);
 }
=20
 /*
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 40f45ecf1db8..15a1c3026733 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -19,12 +19,13 @@
 #include <linux/pagemap.h>
 #include <linux/bvecq.h>
 #include <linux/uio.h>
-#include <linux/rolling_buffer.h>
=20
 enum netfs_sreq_ref_trace;
 typedef struct mempool mempool_t;
+struct readahead_control;
+struct netfs_io_request;
+struct netfs_io_subrequest;
 struct fscache_occupancy;
-struct folio_queue;
=20
 /**
  * folio_start_private_2 - Start an fscache write on a folio.  [DEPRECATED]
@@ -137,7 +138,6 @@ struct netfs_io_stream {
 	unsigned int		sreq_max_segs;	/* 0 or max number of segments in an iterato=
r */
 	unsigned int		submit_off;	/* Folio offset we're submitting from */
 	unsigned int		submit_len;	/* Amount of data left to submit */
-	unsigned int		submit_extendable_to; /* Amount I/O can be rounded up to */
 	void (*prepare_write)(struct netfs_io_subrequest *subreq);
 	void (*issue_write)(struct netfs_io_subrequest *subreq);
 	/* Collection tracking */
@@ -180,6 +180,8 @@ struct netfs_io_subrequest {
 	struct netfs_io_request *rreq;		/* Supervising I/O request */
 	struct work_struct	work;
 	struct list_head	rreq_link;	/* Link in rreq->subrequests */
+	struct bvecq_pos	dispatch_pos;	/* Bookmark in the combined queue of the s=
tart */
+	struct bvecq_pos	content;	/* The (copied) content of the subrequest */
 	struct iov_iter		io_iter;	/* Iterator for this subrequest */
 	unsigned long long	start;		/* Where to start the I/O */
 	size_t			len;		/* Size of the I/O */
@@ -242,13 +244,14 @@ struct netfs_io_request {
 	struct netfs_io_stream	io_streams[2];	/* Streams of parallel I/O operatio=
ns */
 #define NR_IO_STREAMS 2 //wreq->nr_io_streams
 	struct netfs_group	*group;		/* Writeback group being written back */
-	struct rolling_buffer	buffer;		/* Unencrypted buffer */
-#define NETFS_ROLLBUF_PUT_MARK		ROLLBUF_MARK_1
-#define NETFS_ROLLBUF_PAGECACHE_MARK	ROLLBUF_MARK_2
+	struct bvecq		*spare;		/* Advance allocation of bvecq */
+	struct bvecq_pos	load_cursor;	/* Point at which new folios are loaded in =
*/
+	struct bvecq_pos	dispatch_cursor; /* Point from which buffers are dispatc=
hed */
+	struct bvecq_pos	collect_cursor;	/* Clear-up point of I/O buffer */
 	wait_queue_head_t	waitq;		/* Processor waiter */
 	void			*netfs_priv;	/* Private data for the netfs */
 	void			*netfs_priv2;	/* Private data for the netfs */
-	struct bio_vec		*direct_bv;	/* DIO buffer list (when handling iovec-iter)=
 */
+	unsigned long long	last_end;	/* End pos of last folio submitted */
 	unsigned long long	submitted;	/* Amount submitted for I/O so far */
 	unsigned long long	len;		/* Length of the request */
 	size_t			transferred;	/* Amount to be indicated as transferred */
@@ -261,7 +264,6 @@ struct netfs_io_request {
 	unsigned long long	cleaned_to;	/* Position we've cleaned folios to */
 	unsigned long long	abandon_to;	/* Position to abandon folios to */
 	const struct folio	*no_unlock_folio; /* Don't unlock this folio after rea=
d */
-	unsigned int		direct_bv_count; /* Number of elements in direct_bv[] */
 	unsigned int		debug_id;
 	unsigned int		rsize;		/* Maximum read size (0 for none) */
 	unsigned int		wsize;		/* Maximum write size (0 for none) */
@@ -270,7 +272,6 @@ struct netfs_io_request {
 	spinlock_t		lock;		/* Lock for queuing subreqs */
 	unsigned char		front_folio_order; /* Order (size) of front folio */
 	enum netfs_io_origin	origin;		/* Origin of the request */
-	bool			direct_bv_unpin; /* T if direct_bv[] must be unpinned */
 	refcount_t		ref;
 	unsigned long		flags;
 #define NETFS_RREQ_IN_PROGRESS		0	/* Unlocked when the request completes (=
has ref) */
@@ -478,12 +479,6 @@ void netfs_end_io_write(struct inode *inode);
 int netfs_start_io_direct(struct inode *inode);
 void netfs_end_io_direct(struct inode *inode);
=20
-/* Miscellaneous APIs. */
-struct folio_queue *netfs_folioq_alloc(unsigned int rreq_id, gfp_t gfp,
-				       unsigned int trace /*enum netfs_folioq_trace*/);
-void netfs_folioq_free(struct folio_queue *folioq,
-		       unsigned int trace /*enum netfs_trace_folioq*/);
-
 /* Buffer wrangling helpers API. */
 int netfs_alloc_folioq_buffer(struct address_space *mapping,
 			      struct folio_queue **_buffer,
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index d5723ce18cbb..59f330003d02 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -230,7 +230,9 @@
 	EM(netfs_folio_trace_store_copy,	"store-copy")	\
 	EM(netfs_folio_trace_store_plus,	"store+")	\
 	EM(netfs_folio_trace_wthru,		"wthru")	\
-	E_(netfs_folio_trace_wthru_plus,	"wthru+")
+	EM(netfs_folio_trace_wthru_plus,	"wthru+")	\
+	EM(netfs_folio_trace_zero,		"zero")		\
+	E_(netfs_folio_trace_zero_ra,		"zero-ra")
=20
 #define netfs_collect_contig_traces				\
 	EM(netfs_contig_trace_collect,		"Collect")	\
@@ -243,13 +245,13 @@
 	EM(netfs_trace_donate_to_next,		"to-next")	\
 	E_(netfs_trace_donate_to_deferred_next,	"defer-next")
=20
-#define netfs_folioq_traces					\
-	EM(netfs_trace_folioq_alloc_buffer,	"alloc-buf")	\
-	EM(netfs_trace_folioq_clear,		"clear")	\
-	EM(netfs_trace_folioq_delete,		"delete")	\
-	EM(netfs_trace_folioq_make_space,	"make-space")	\
-	EM(netfs_trace_folioq_rollbuf_init,	"roll-init")	\
-	E_(netfs_trace_folioq_read_progress,	"r-progress")
+#define netfs_bvecq_traces					\
+	EM(netfs_trace_bvecq_alloc_buffer,	"alloc-buf")	\
+	EM(netfs_trace_bvecq_clear,		"clear")	\
+	EM(netfs_trace_bvecq_delete,		"delete")	\
+	EM(netfs_trace_bvecq_make_space,	"make-space")	\
+	EM(netfs_trace_bvecq_rollbuf_init,	"roll-init")	\
+	E_(netfs_trace_bvecq_read_progress,	"r-progress")
=20
 #ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
 #define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY
@@ -269,7 +271,7 @@ enum netfs_sreq_ref_trace { netfs_sreq_ref_traces } __m=
ode(byte);
 enum netfs_folio_trace { netfs_folio_traces } __mode(byte);
 enum netfs_collect_contig_trace { netfs_collect_contig_traces } __mode(byt=
e);
 enum netfs_donate_trace { netfs_donate_traces } __mode(byte);
-enum netfs_folioq_trace { netfs_folioq_traces } __mode(byte);
+enum netfs_bvecq_trace { netfs_bvecq_traces } __mode(byte);
=20
 #endif
=20
@@ -293,7 +295,7 @@ netfs_sreq_ref_traces;
 netfs_folio_traces;
 netfs_collect_contig_traces;
 netfs_donate_traces;
-netfs_folioq_traces;
+netfs_bvecq_traces;
=20
 /*
  * Now redefine the EM() and E_() macros to map the enums to the strings t=
hat
@@ -397,10 +399,10 @@ TRACE_EVENT(netfs_sreq,
 		    __entry->len	=3D sreq->len;
 		    __entry->transferred =3D sreq->transferred;
 		    __entry->start	=3D sreq->start;
-		    __entry->slot	=3D sreq->io_iter.folioq_slot;
+		    __entry->slot	=3D sreq->dispatch_pos.slot;
 			   ),
=20
-	    TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx s=3D%u e=3D%d=
",
+	    TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx qs=3D%u e=3D%=
d",
 		      __entry->rreq, __entry->index,
 		      __print_symbolic(__entry->source, netfs_sreq_sources),
 		      __print_symbolic(__entry->what, netfs_sreq_traces),
@@ -776,27 +778,25 @@ TRACE_EVENT(netfs_collect_stream,
 		      __entry->collected_to, __entry->issued_to)
 	    );
=20
-TRACE_EVENT(netfs_folioq,
-	    TP_PROTO(const struct folio_queue *fq,
-		     enum netfs_folioq_trace trace),
+TRACE_EVENT(netfs_bvecq,
+	    TP_PROTO(const struct bvecq *bq,
+		     enum netfs_bvecq_trace trace),
=20
-	    TP_ARGS(fq, trace),
+	    TP_ARGS(bq, trace),
=20
 	    TP_STRUCT__entry(
-		    __field(unsigned int,		rreq)
 		    __field(unsigned int,		id)
-		    __field(enum netfs_folioq_trace,	trace)
+		    __field(enum netfs_bvecq_trace,	trace)
 			     ),
=20
 	    TP_fast_assign(
-		    __entry->rreq	=3D fq ? fq->rreq_id : 0;
-		    __entry->id		=3D fq ? fq->debug_id : 0;
+		    __entry->id		=3D bq ? bq->priv : 0;
 		    __entry->trace	=3D trace;
 			   ),
=20
-	    TP_printk("R=3D%08x fq=3D%x %s",
-		      __entry->rreq, __entry->id,
-		      __print_symbolic(__entry->trace, netfs_folioq_traces))
+	    TP_printk("fq=3D%x %s",
+		      __entry->id,
+		      __print_symbolic(__entry->trace, netfs_bvecq_traces))
 	    );
=20
 TRACE_EVENT(netfs_bv_slot,
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id CB0B73B9DAB
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:36 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143559; cv=none;
 b=taY++aEgpqkLfzzK9mHbOOkMRJG6MVpdoYrWoaRbgT51f94EVaeVAyfVrLaIHm3moUPxgabRMpVJl63Oc0mh5SSlKdAyfkXuuoVHi1BelHJN/VS5MYPf5XOnZ8NpHUZWKLh1CsyH6aDZoM2AOQdz+pNGYN0BoIqwrj0axvp+AzE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143559; c=relaxed/simple;
	bh=RDoZM2bAZ4Qsft6qMwShgl3rUgUZosExTNOAqAchX3Y=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=Hqy6VlTqMBpr0oGNncHfO9AI4tXpq/hmHut0SLsYHhAzS97X7MJjk3DiuiULZLJ5CNulCekEbFOxNZlPc0zKfipJ7zUdPZ0ohB/jPzgjc0NK+ljH5KWBV7TNTYtpYpTPAwiS2V6P+6C5X4vKK7ZNI3gzF9X9S9vmekG5qTmd/iE=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=Hso4NROZ; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="Hso4NROZ"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143554;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=qGlfupTHkD4ku9IDSUoOP8kdkJKy1we0jMDsuXnaNZs=;
	b=Hso4NROZxr979HPJtE5yFEhlhBvbpJ7RR1UZG5cmLtoO/ehh+fKW3Hzg1tJdJshYdVd9zE
	HhV42cpA684CkAosLdvPGC5oVmuHlAwbUubD/BYIsEGP82qolbD0OpXwmUtQVDdtJxP4yu
	6kz+MYxt6OLpBLNWhCuCB1s00nt8oFU=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-450-4qyYzh9BMSurSxqG4zREXg-1; Mon,
 18 May 2026 18:32:24 -0400
X-MC-Unique: 4qyYzh9BMSurSxqG4zREXg-1
X-Mimecast-MFC-AGG-ID: 4qyYzh9BMSurSxqG4zREXg_1779143532
Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 7393319560B7;
	Mon, 18 May 2026 22:32:12 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 9A0151956053;
	Mon, 18 May 2026 22:32:05 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	Stefan Metzmacher <metze@samba.org>,
	Shyam Prasad N <sprasad@microsoft.com>,
	Tom Talpey <tom@talpey.com>
Subject: [PATCH v2 14/21] cifs: Remove support for ITER_FOLIOQ from
 smb_extract_iter_to_rdma()
Date: Mon, 18 May 2026 23:29:46 +0100
Message-ID: <20260518222959.488126-15-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Type: text/plain; charset="utf-8"

Netfslib now presents only a bvecq queue and an associated ITER_BVECQ
iterator to the filesystem, so the filesystem will no longer see an
ITER_FOLIOQ iterator.  Remove the code that handles it.

Netfslib also won't supply ITER_BVEC/KVEC iterators, though smbdirect
might.  Further in the future, netfslib won't supply iterators at all,
but rather a bvecq slice (which can be used to construct an iterator).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Stefan Metzmacher <metze@samba.org>
cc: Shyam Prasad N <sprasad@microsoft.com>
cc: Tom Talpey <tom@talpey.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Acked-by: Stefan Metzmacher <metze@samba.org>
---
 fs/smb/smbdirect/connection.c | 68 -----------------------------------
 1 file changed, 68 deletions(-)

diff --git a/fs/smb/smbdirect/connection.c b/fs/smb/smbdirect/connection.c
index 4d2a1700104e..8858e1dfbc25 100644
--- a/fs/smb/smbdirect/connection.c
+++ b/fs/smb/smbdirect/connection.c
@@ -6,7 +6,6 @@
=20
 #include "internal.h"
 #include <linux/bvecq.h>
-#include <linux/folio_queue.h>
=20
 struct smbdirect_map_sges {
 	struct ib_sge *sge;
@@ -2130,70 +2129,6 @@ static ssize_t smbdirect_map_sges_from_kvec(struct i=
ov_iter *iter,
 	return ret;
 }
=20
-/*
- * Extract folio fragments from a FOLIOQ-class iterator and add them to an
- * ib_sge list.  The folios are not pinned.
- */
-static ssize_t smbdirect_map_sges_from_folioq(struct iov_iter *iter,
-					      struct smbdirect_map_sges *state,
-					      ssize_t maxsize)
-{
-	const struct folio_queue *folioq =3D iter->folioq;
-	unsigned int slot =3D iter->folioq_slot;
-	ssize_t ret =3D 0;
-	size_t offset =3D iter->iov_offset;
-
-	if (WARN_ON_ONCE(!folioq))
-		return -EIO;
-
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		if (WARN_ON_ONCE(!folioq))
-			return -EIO;
-		slot =3D 0;
-	}
-
-	do {
-		struct folio *folio =3D folioq_folio(folioq, slot);
-		size_t fsize =3D folioq_folio_size(folioq, slot);
-
-		if (offset < fsize) {
-			size_t part =3D umin(maxsize, fsize - offset);
-			bool ok;
-
-			ok =3D smbdirect_map_sges_single_page(state,
-							    folio_page(folio, 0),
-							    offset,
-							    part);
-			if (!ok)
-				return -EIO;
-
-			offset +=3D part;
-			ret +=3D part;
-			maxsize -=3D part;
-		}
-
-		if (offset >=3D fsize) {
-			offset =3D 0;
-			slot++;
-			if (slot >=3D folioq_nr_slots(folioq)) {
-				if (!folioq->next) {
-					WARN_ON_ONCE(ret < iter->count);
-					break;
-				}
-				folioq =3D folioq->next;
-				slot =3D 0;
-			}
-		}
-	} while (state->num_sge < state->max_sge && maxsize > 0);
-
-	iter->folioq =3D folioq;
-	iter->folioq_slot =3D slot;
-	iter->iov_offset =3D offset;
-	iter->count -=3D ret;
-	return ret;
-}
-
 /*
  * Extract page fragments from up to the given amount of the source iterat=
or
  * and build up an ib_sge list that refers to all of those bits.  The ib_s=
ge list
@@ -2224,9 +2159,6 @@ static ssize_t smbdirect_map_sges_from_iter(struct io=
v_iter *iter, size_t len,
 	case ITER_KVEC:
 		ret =3D smbdirect_map_sges_from_kvec(iter, state, len);
 		break;
-	case ITER_FOLIOQ:
-		ret =3D smbdirect_map_sges_from_folioq(iter, state, len);
-		break;
 	default:
 		WARN_ONCE(1, "iov_iter_type[%u]\n", iov_iter_type(iter));
 		return -EIO;
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4FA953B47CF
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:30 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143551; cv=none;
 b=X2motDviTIc22VW1DYv7YAIbHzHbVBBPoEzTP82cU0ctzA0GrG6y3V2US/2zqHeeoCbygNXQ6N5lIGiNaGj5ptgiUFZTw55b19uiafTbjCDWDWflSjsG0DzWKMFv6nYE2VxKcVdZXFaUEmty8+U/h+hO5xRuIWoAinrqicMj/8U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143551; c=relaxed/simple;
	bh=K9C0fom2VN7NJca6HBsiNjrk7q6GwJ1tesue3bEeH3o=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=tDWZiDjSZQw5RxRJZrVjastq1eCt200y9ZTVQhDvDf00vQ7eK5ZP1fxbT6wiEfW7k3gUP/EN2E9auD1v499ZXonVviGUrPg711LM9sS2I2n/YC9VHSmemkbUdDmdfkEGDsi8N70UDbN3jY0U5TpkrKl2+y0YBoakWCrKKhFTmRo=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=gq6fWjZI; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="gq6fWjZI"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143549;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=2GdqhlTTyPpgPeJEKArfYZH3i6xjRZ8jMxV7l6pWvfU=;
	b=gq6fWjZIhp9bEGEs4Hbo15V+6FMEe3tNLu9DIYupdIXkuH9QAPAWgSDc56Wwc1vS8sd4Np
	N7wlNL6soLtVBAt8oncwxwQrByHcl5Xcr+f552c2WNYfDKl7/SKQKz5sekONMirUqfFrJc
	ax0NuOT44ctzhPy3DO6km/9duPUM4m8=
Received: from mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-60-y65m8WQnNSWCwrjzx4UfHA-1; Mon,
 18 May 2026 18:32:23 -0400
X-MC-Unique: y65m8WQnNSWCwrjzx4UfHA-1
X-Mimecast-MFC-AGG-ID: y65m8WQnNSWCwrjzx4UfHA_1779143540
Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 79485180047F;
	Mon, 18 May 2026 22:32:20 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 39AEC1956053;
	Mon, 18 May 2026 22:32:13 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 15/21] netfs: Remove netfs_alloc/free_folioq_buffer()
Date: Mon, 18 May 2026 23:29:47 +0100
Message-ID: <20260518222959.488126-16-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Type: text/plain; charset="utf-8"

Remove netfs_alloc/free_folioq_buffer() as these have been replaced with
netfs_alloc/free_bvecq_buffer().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/afs/dir_edit.c       |  1 -
 fs/netfs/misc.c         | 98 -----------------------------------------
 fs/smb/client/smb2ops.c |  1 -
 include/linux/netfs.h   |  6 ---
 4 files changed, 106 deletions(-)

diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c
index fc918b3d8f68..2c655cd6a8e4 100644
--- a/fs/afs/dir_edit.c
+++ b/fs/afs/dir_edit.c
@@ -10,7 +10,6 @@
 #include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/iversion.h>
-#include <linux/folio_queue.h>
 #include "internal.h"
 #include "xdr_fs.h"
=20
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index ee67a0681784..8fc4e5ef2152 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -8,104 +8,6 @@
 #include <linux/swap.h>
 #include "internal.h"
=20
-#if 0
-/**
- * netfs_alloc_folioq_buffer - Allocate buffer space into a folio queue
- * @mapping: Address space to set on the folio (or NULL).
- * @_buffer: Pointer to the folio queue to add to (may point to a NULL; up=
dated).
- * @_cur_size: Current size of the buffer (updated).
- * @size: Target size of the buffer.
- * @gfp: The allocation constraints.
- */
-int netfs_alloc_folioq_buffer(struct address_space *mapping,
-			      struct folio_queue **_buffer,
-			      size_t *_cur_size, ssize_t size, gfp_t gfp)
-{
-	struct folio_queue *tail =3D *_buffer, *p;
-
-	size =3D round_up(size, PAGE_SIZE);
-	if (*_cur_size >=3D size)
-		return 0;
-
-	if (tail)
-		while (tail->next)
-			tail =3D tail->next;
-
-	do {
-		struct folio *folio;
-		int order =3D 0, slot;
-
-		if (!tail || folioq_full(tail)) {
-			p =3D netfs_folioq_alloc(0, GFP_NOFS, netfs_trace_folioq_alloc_buffer);
-			if (!p)
-				return -ENOMEM;
-			if (tail) {
-				tail->next =3D p;
-				p->prev =3D tail;
-			} else {
-				*_buffer =3D p;
-			}
-			tail =3D p;
-		}
-
-		if (size - *_cur_size > PAGE_SIZE)
-			order =3D umin(ilog2(size - *_cur_size) - PAGE_SHIFT,
-				     MAX_PAGECACHE_ORDER);
-
-		folio =3D folio_alloc(gfp, order);
-		if (!folio && order > 0)
-			folio =3D folio_alloc(gfp, 0);
-		if (!folio)
-			return -ENOMEM;
-
-		folio->mapping =3D mapping;
-		folio->index =3D *_cur_size / PAGE_SIZE;
-		trace_netfs_folio(folio, netfs_folio_trace_alloc_buffer);
-		slot =3D folioq_append_mark(tail, folio);
-		*_cur_size +=3D folioq_folio_size(tail, slot);
-	} while (*_cur_size < size);
-
-	return 0;
-}
-EXPORT_SYMBOL(netfs_alloc_folioq_buffer);
-
-/**
- * netfs_free_folioq_buffer - Free a folio queue.
- * @fq: The start of the folio queue to free
- *
- * Free up a chain of folio_queues and, if marked, the marked folios they =
point
- * to.
- */
-void netfs_free_folioq_buffer(struct folio_queue *fq)
-{
-	struct folio_queue *next;
-	struct folio_batch fbatch;
-
-	folio_batch_init(&fbatch);
-
-	for (; fq; fq =3D next) {
-		for (int slot =3D 0; slot < folioq_count(fq); slot++) {
-			struct folio *folio =3D folioq_folio(fq, slot);
-
-			if (!folio ||
-			    !folioq_is_marked(fq, slot))
-				continue;
-
-			trace_netfs_folio(folio, netfs_folio_trace_put);
-			if (folio_batch_add(&fbatch, folio))
-				folio_batch_release(&fbatch);
-		}
-
-		netfs_stat_d(&netfs_n_folioq);
-		next =3D fq->next;
-		kfree(fq);
-	}
-
-	folio_batch_release(&fbatch);
-}
-EXPORT_SYMBOL(netfs_free_folioq_buffer);
-#endif
-
 /**
  * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeba=
ck
  * @mapping: The mapping the folio belongs to.
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index 230102f2e411..9199baa5c315 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -13,7 +13,6 @@
 #include <linux/sort.h>
 #include <crypto/aead.h>
 #include <linux/fiemap.h>
-#include <linux/folio_queue.h>
 #include <uapi/linux/magic.h>
 #include "cifsfs.h"
 #include "cifsglob.h"
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 15a1c3026733..9e551e09054f 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -479,12 +479,6 @@ void netfs_end_io_write(struct inode *inode);
 int netfs_start_io_direct(struct inode *inode);
 void netfs_end_io_direct(struct inode *inode);
=20
-/* Buffer wrangling helpers API. */
-int netfs_alloc_folioq_buffer(struct address_space *mapping,
-			      struct folio_queue **_buffer,
-			      size_t *_cur_size, ssize_t size, gfp_t gfp);
-void netfs_free_folioq_buffer(struct folio_queue *fq);
-
 /**
  * netfs_inode - Get the netfs inode context from the inode
  * @inode: The inode to query
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 921033B4EA3
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143557; cv=none;
 b=lYvLUFxZ5N495WEpurSiFrlMKrTtUyN5Z8Nh/0yugO4dpY4jZW1OHEcs6Rs4HeZHLmq/AhQ87Zjjd8KJYfT29JwDiXT61aWQVmgXjC2Fg6Iv5Ci9ia7xlTrWVBKRAOJJZ+yrd0Fm1k50bJryylsTooRnJbIPkIdXzHs9gzpVnIg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143557; c=relaxed/simple;
	bh=3Up+gMHFhETQ2bAHM7fbeYKi9N+C2mVZ3p56GFx2j2A=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=hGN1ZifYeS5TpDB2iXgB6uwkM9mDyV0Pl7cDfplT8rfrazIcRCMT7RQ8L64l3LowrUGdydAGccNsXanziIMX2u79jJUMfR2JcitwiStL6Hx5G6pIV/b0Qau/xLBj6zBF95aRlbqt6JcMlznC5Q5jZpVhxu9rw1sYhsgVPCrWpCA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=Mu5U6ZZQ; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="Mu5U6ZZQ"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143554;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=dDpeP38QOsfxeIaJNWSXWtmDpOF3Fp9FCzxzSp2z87g=;
	b=Mu5U6ZZQAKDkYQutNh6yuUPqQ5RERITl/6okdk741oSlojLquvuOCD+Cne6w54sASO9FaB
	RlFhzqU+qlGZbzIFlgzXEOi4rNVfVpSUCqzySbgXR1Gy+eI/+xLXSJm+93L7Z/vM5DMc4r
	42oE/YSMeK70yIzHbiGLkr8z1Yq26Jg=
Received: from mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-43-brn1cdqUMFeJu1NB6nTX_g-1; Mon,
 18 May 2026 18:32:31 -0400
X-MC-Unique: brn1cdqUMFeJu1NB6nTX_g-1
X-Mimecast-MFC-AGG-ID: brn1cdqUMFeJu1NB6nTX_g_1779143548
Received: from mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.111])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 41C7D19560AA;
	Mon, 18 May 2026 22:32:28 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3334B180058A;
	Mon, 18 May 2026 22:32:21 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 16/21] netfs: Remove netfs_extract_user_iter()
Date: Mon, 18 May 2026 23:29:48 +0100
Message-ID: <20260518222959.488126-17-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.111
Content-Type: text/plain; charset="utf-8"

Remove netfs_extract_user_iter() as it has been replaced with
netfs_extract_iter().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/iterator.c   | 104 ------------------------------------------
 include/linux/netfs.h |   3 --
 2 files changed, 107 deletions(-)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 10a25a618712..566693ac47ef 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -141,110 +141,6 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, siz=
e_t max_len, size_t max_pag
 EXPORT_SYMBOL_GPL(netfs_extract_iter);
=20
 #if 0
-/**
- * netfs_extract_user_iter - Extract the pages from a user iterator into a=
 bvec
- * @orig: The original iterator
- * @orig_len: The amount of iterator to copy
- * @new: The iterator to be set up
- * @extraction_flags: Flags to qualify the request
- *
- * Extract the page fragments from the given amount of the source iterator=
 and
- * build up a second iterator that refers to all of those bits.  This allo=
ws
- * the original iterator to be disposed of.
- *
- * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-pee=
r DMA be
- * allowed on the pages extracted.
- *
- * On success, the number of elements in the bvec is returned, the original
- * iterator will have been advanced by the amount extracted.
- *
- * The iov_iter_extract_mode() function should be used to query how cleanup
- * should be performed.
- */
-ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
-				struct iov_iter *new,
-				iov_iter_extraction_t extraction_flags)
-{
-	struct bio_vec *bv =3D NULL;
-	struct page **pages;
-	unsigned int cur_npages;
-	unsigned int max_pages;
-	unsigned int npages =3D 0;
-	unsigned int i;
-	ssize_t ret =3D 0;
-	size_t count =3D orig_len, offset, len;
-	size_t bv_size, pg_size;
-
-	if (WARN_ON_ONCE(!iter_is_ubuf(orig) && !iter_is_iovec(orig)))
-		return -EIO;
-
-	max_pages =3D iov_iter_npages(orig, INT_MAX);
-	bv_size =3D array_size(max_pages, sizeof(*bv));
-	bv =3D kvmalloc(bv_size, GFP_KERNEL);
-	if (!bv)
-		return -ENOMEM;
-
-	/* Put the page list at the end of the bvec list storage.  bvec
-	 * elements are larger than page pointers, so as long as we work
-	 * 0->last, we should be fine.
-	 */
-	pg_size =3D array_size(max_pages, sizeof(*pages));
-	pages =3D (void *)bv + bv_size - pg_size;
-
-	while (count && npages < max_pages) {
-		ret =3D iov_iter_extract_pages(orig, &pages, count,
-					     max_pages - npages, extraction_flags,
-					     &offset);
-		if (unlikely(ret <=3D 0)) {
-			ret =3D ret ?: -EIO;
-			break;
-		}
-
-		if (WARN(ret > count,
-			 "%s: extract_pages overrun %zd > %zu bytes\n",
-			 __func__, ret, count)) {
-			ret =3D -EIO;
-			break;
-		}
-
-		cur_npages =3D DIV_ROUND_UP(offset + ret, PAGE_SIZE);
-		if (WARN(cur_npages > max_pages - npages,
-			 "%s: extract_pages overrun %u > %u pages\n",
-			 __func__, npages + cur_npages, max_pages)) {
-			ret =3D -EIO;
-			break;
-		}
-
-		count -=3D ret;
-		ret +=3D offset;
-
-		for (i =3D 0; i < cur_npages; i++) {
-			len =3D ret > PAGE_SIZE ? PAGE_SIZE : ret;
-			bvec_set_page(bv + npages + i, *pages++, len - offset, offset);
-			ret -=3D len;
-			offset =3D 0;
-		}
-
-		npages +=3D cur_npages;
-	}
-
-	/* Note: Don't try to clean up after EIO.  Either we got no pages, so
-	 * nothing to clean up, or we got a buffer overrun, memory corruption
-	 * and can't trust the stuff in the buffer (a WARN was emitted).
-	 */
-
-	if (ret < 0 && (ret =3D=3D -ENOMEM || npages =3D=3D 0)) {
-		for (i =3D 0; i < npages; i++)
-			unpin_user_page(bv[i].bv_page);
-		kvfree(bv);
-		return ret;
-	}
-
-	iov_iter_bvec(new, orig->data_source, bv, npages, orig_len - count);
-	return npages;
-}
-EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
-
 /*
  * Select the span of a bvec iterator we're going to use.  Limit it by bot=
h maximum
  * size and maximum number of segments.  Returns the size of the span in b=
ytes.
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9e551e09054f..d0b1408bd02f 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -464,9 +464,6 @@ void netfs_put_subrequest(struct netfs_io_subrequest *s=
ubreq,
 ssize_t netfs_extract_iter(struct iov_iter *orig, size_t max_len, size_t m=
ax_pages,
 			   unsigned long long fpos, struct bvecq **_bvecq_head,
 			   iov_iter_extraction_t extraction_flags);
-ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
-				struct iov_iter *new,
-				iov_iter_extraction_t extraction_flags);
 size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
 			size_t max_size, size_t max_segs);
 void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8BB933BED2B
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:44 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143567; cv=none;
 b=CETytb1rkfQQw8erhzgE250vpY1jh4v0aqnOFzpTWM1JK1/aYZWqa61mpxChGrBoYZUANiu1v/26SMrcvhs/9w/8A6f8cjQ+aYSL71kY0uKAZG4Gv1nVcAsmG5KrfgDe5ZaDTzNF3oyinjdIr11J7fg14aAiVqnd64C8ovTDD1U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143567; c=relaxed/simple;
	bh=g287tUcRJzBAbxQnQEtMvwvGgBkyK7cCXnENPKdTcxw=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=hCQ8Bfavzxfy1FZ492KhiRFjvgTHP3oc/F7iUc0hXe6SK00RrlaKIZ4YWO/4sSN+zUrCFdIKF8itkjYusdnzS+qZQMzSJnRi1clMmDy1iE7vg7km1QiMAAh/gHtZ4YVSwwF1/QoxbCqCFlEEpYmHTME+G/elS6YcZBOftElRAv8=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=TDvYyAzX; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="TDvYyAzX"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143563;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=OqB7zTnEkz7fj/UldtQdiNseeNCWZ4Z79VJ3y6c81UM=;
	b=TDvYyAzXET+aamfHiO1rUEdsQt3Wb/lDG+k1wAy8bk4XQHzQ2sHFvT3k/nu62W9egq6b9K
	OX1KCFyBZ2wcXzskgpBCf4TL97o5575A+kOeSSPmAUaxIOCBlWuJFQ+91MvlRwJbgk9gW1
	BJ8gIDNTthD3hnKLzs3bPLbzLUNlkh0=
Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-629-bzQnsAUrN1GvQDwmgwHz9g-1; Mon,
 18 May 2026 18:32:39 -0400
X-MC-Unique: bzQnsAUrN1GvQDwmgwHz9g-1
X-Mimecast-MFC-AGG-ID: bzQnsAUrN1GvQDwmgwHz9g_1779143556
Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 4CC221800451;
	Mon, 18 May 2026 22:32:36 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id F22781800357;
	Mon, 18 May 2026 22:32:29 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 17/21] iov_iter: Remove ITER_FOLIOQ
Date: Mon, 18 May 2026 23:29:49 +0100
Message-ID: <20260518222959.488126-18-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93
Content-Type: text/plain; charset="utf-8"

Remove ITER_FOLIOQ as it's no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/iov_iter.h   |  65 +--------
 include/linux/uio.h        |  12 --
 lib/iov_iter.c             | 175 +-----------------------
 lib/scatterlist.c          |  69 +---------
 lib/tests/kunit_iov_iter.c | 271 -------------------------------------
 5 files changed, 7 insertions(+), 585 deletions(-)

diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index c19a4c561ab4..c4ed8dafa92f 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -11,7 +11,6 @@
 #include <linux/uio.h>
 #include <linux/bvec.h>
 #include <linux/bvecq.h>
-#include <linux/folio_queue.h>
=20
 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
 			     void *priv, void *priv2);
@@ -202,62 +201,6 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len=
, void *priv, void *priv2,
 	return progress;
 }
=20
-/*
- * Handle ITER_FOLIOQ.
- */
-static __always_inline
-size_t iterate_folioq(struct iov_iter *iter, size_t len, void *priv, void =
*priv2,
-		      iov_step_f step)
-{
-	const struct folio_queue *folioq =3D iter->folioq;
-	unsigned int slot =3D iter->folioq_slot;
-	size_t progress =3D 0, skip =3D iter->iov_offset;
-
-	if (slot =3D=3D folioq_nr_slots(folioq)) {
-		/* The iterator may have been extended. */
-		folioq =3D folioq->next;
-		slot =3D 0;
-	}
-
-	do {
-		struct folio *folio =3D folioq_folio(folioq, slot);
-		size_t part, remain =3D 0, consumed;
-		size_t fsize;
-		void *base;
-
-		if (!folio)
-			break;
-
-		fsize =3D folioq_folio_size(folioq, slot);
-		if (skip < fsize) {
-			base =3D kmap_local_folio(folio, skip);
-			part =3D umin(len, PAGE_SIZE - skip % PAGE_SIZE);
-			remain =3D step(base, progress, part, priv, priv2);
-			kunmap_local(base);
-			consumed =3D part - remain;
-			len -=3D consumed;
-			progress +=3D consumed;
-			skip +=3D consumed;
-		}
-		if (skip >=3D fsize) {
-			skip =3D 0;
-			slot++;
-			if (slot =3D=3D folioq_nr_slots(folioq) && folioq->next) {
-				folioq =3D folioq->next;
-				slot =3D 0;
-			}
-		}
-		if (remain)
-			break;
-	} while (len);
-
-	iter->folioq_slot =3D slot;
-	iter->folioq =3D folioq;
-	iter->iov_offset =3D skip;
-	iter->count -=3D progress;
-	return progress;
-}
-
 /*
  * Handle ITER_XARRAY.
  */
@@ -369,8 +312,6 @@ size_t iterate_and_advance2(struct iov_iter *iter, size=
_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_bvecq(iter))
 		return iterate_bvecq(iter, len, priv, priv2, step);
-	if (iov_iter_is_folioq(iter))
-		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
 		return iterate_xarray(iter, len, priv, priv2, step);
 	return iterate_discard(iter, len, priv, priv2, step);
@@ -405,8 +346,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_=
t len, void *priv,
  * buffer is presented in segments, which for kernel iteration are broken =
up by
  * physical pages and mapped, with the mapped address being presented.
  *
- * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
- * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, XARRAY and DISCARD-ty=
pe
+ * iterators; it will not handle UBUF or IOVEC-type iterators.
  *
  * A step functions, @step, must be provided, one for handling mapped kern=
el
  * addresses and the other is given user addresses which have the potentia=
l to
@@ -435,8 +376,6 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter=
, size_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_bvecq(iter))
 		return iterate_bvecq(iter, len, priv, priv2, step);
-	if (iov_iter_is_folioq(iter))
-		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
 		return iterate_xarray(iter, len, priv, priv2, step);
 	return iterate_discard(iter, len, priv, priv2, step);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index f7cfa6ea8213..e84a0c4f28c6 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,7 +11,6 @@
 #include <uapi/linux/uio.h>
=20
 struct page;
-struct folio_queue;
=20
 typedef unsigned int __bitwise iov_iter_extraction_t;
=20
@@ -27,7 +26,6 @@ enum iter_type {
 	ITER_BVEC,
 	ITER_KVEC,
 	ITER_BVECQ,
-	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
 };
@@ -70,7 +68,6 @@ struct iov_iter {
 				const struct kvec *kvec;
 				const struct bio_vec *bvec;
 				const struct bvecq *bvecq;
-				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
 			};
@@ -80,7 +77,6 @@ struct iov_iter {
 	union {
 		unsigned long nr_segs;
 		u16 bvecq_slot;
-		u8 folioq_slot;
 		loff_t xarray_start;
 	};
 };
@@ -153,11 +149,6 @@ static inline bool iov_iter_is_bvecq(const struct iov_=
iter *i)
 	return iov_iter_type(i) =3D=3D ITER_BVECQ;
 }
=20
-static inline bool iov_iter_is_folioq(const struct iov_iter *i)
-{
-	return iov_iter_type(i) =3D=3D ITER_FOLIOQ;
-}
-
 static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 {
 	return iov_iter_type(i) =3D=3D ITER_XARRAY;
@@ -306,9 +297,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int =
direction, size_t count);
 void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
 			 const struct bvecq *bvecq,
 			 unsigned int first_slot, unsigned int offset, size_t count);
-void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-			  const struct folio_queue *folioq,
-			  unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xa=
rray *xarray,
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 63fc75c2bc48..f3626a640a4c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -571,39 +571,6 @@ static void iov_iter_bvecq_advance(struct iov_iter *i,=
 size_t by)
 	i->bvecq =3D bq;
 }
=20
-static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
-{
-	const struct folio_queue *folioq =3D i->folioq;
-	unsigned int slot =3D i->folioq_slot;
-
-	if (!i->count)
-		return;
-	i->count -=3D size;
-
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		slot =3D 0;
-	}
-
-	size +=3D i->iov_offset; /* From beginning of current segment. */
-	do {
-		size_t fsize =3D folioq_folio_size(folioq, slot);
-
-		if (likely(size < fsize))
-			break;
-		size -=3D fsize;
-		slot++;
-		if (slot >=3D folioq_nr_slots(folioq) && folioq->next) {
-			folioq =3D folioq->next;
-			slot =3D 0;
-		}
-	} while (size);
-
-	i->iov_offset =3D size;
-	i->folioq_slot =3D slot;
-	i->folioq =3D folioq;
-}
-
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
 	if (unlikely(i->count < size))
@@ -618,8 +585,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_bvec_advance(i, size);
 	} else if (iov_iter_is_bvecq(i)) {
 		iov_iter_bvecq_advance(i, size);
-	} else if (iov_iter_is_folioq(i)) {
-		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -=3D size;
 	}
@@ -652,32 +617,6 @@ static void iov_iter_bvecq_revert(struct iov_iter *i, =
size_t unroll)
 	i->bvecq =3D bq;
 }
=20
-static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
-{
-	const struct folio_queue *folioq =3D i->folioq;
-	unsigned int slot =3D i->folioq_slot;
-
-	for (;;) {
-		size_t fsize;
-
-		if (slot =3D=3D 0) {
-			folioq =3D folioq->prev;
-			slot =3D folioq_nr_slots(folioq);
-		}
-		slot--;
-
-		fsize =3D folioq_folio_size(folioq, slot);
-		if (unroll <=3D fsize) {
-			i->iov_offset =3D fsize - unroll;
-			break;
-		}
-		unroll -=3D fsize;
-	}
-
-	i->folioq_slot =3D slot;
-	i->folioq =3D folioq;
-}
-
 void iov_iter_revert(struct iov_iter *i, size_t unroll)
 {
 	if (!unroll)
@@ -712,9 +651,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	} else if (iov_iter_is_bvecq(i)) {
 		i->iov_offset =3D 0;
 		iov_iter_bvecq_revert(i, unroll);
-	} else if (iov_iter_is_folioq(i)) {
-		i->iov_offset =3D 0;
-		iov_iter_folioq_revert(i, unroll);
 	} else { /* same logics for iovec and kvec */
 		const struct iovec *iov =3D iter_iov(i);
 		while (1) {
@@ -758,8 +694,6 @@ size_t iov_iter_single_seg_count(const struct iov_iter =
*i)
 		}
 		return umin(i->count, bq->bv[slot].bv_len - offset);
 	}
-	if (unlikely(iov_iter_is_folioq(i)))
-		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
 	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -825,36 +759,6 @@ void iov_iter_bvec_queue(struct iov_iter *i, unsigned =
int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec_queue);
=20
-/**
- * iov_iter_folio_queue - Initialise an I/O iterator to use the folios in =
a folio queue
- * @i: The iterator to initialise.
- * @direction: The direction of the transfer.
- * @folioq: The starting point in the folio queue.
- * @first_slot: The first slot in the folio queue to use
- * @offset: The offset into the folio in the first slot to start at
- * @count: The size of the I/O buffer in bytes.
- *
- * Set up an I/O iterator to either draw data out of the pages attached to=
 an
- * inode or to inject data into those pages.  The pages *must* be prevented
- * from evaporation, either by taking a ref on them or locking them by the
- * caller.
- */
-void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-			  const struct folio_queue *folioq, unsigned int first_slot,
-			  unsigned int offset, size_t count)
-{
-	BUG_ON(direction & ~1);
-	*i =3D (struct iov_iter) {
-		.iter_type =3D ITER_FOLIOQ,
-		.data_source =3D direction,
-		.folioq =3D folioq,
-		.folioq_slot =3D first_slot,
-		.count =3D count,
-		.iov_offset =3D offset,
-	};
-}
-EXPORT_SYMBOL(iov_iter_folio_queue);
-
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xar=
ray
  * @i: The iterator to initialise.
@@ -996,9 +900,7 @@ unsigned long iov_iter_alignment(const struct iov_iter =
*i)
 	if (iov_iter_is_bvecq(i))
 		return iov_iter_alignment_bvecq(i);
=20
-	/* With both xarray and folioq types, we're dealing with whole folios. */
-	if (iov_iter_is_folioq(i))
-		return i->iov_offset | i->count;
+	/* With the xarray type, we're dealing with whole folios. */
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
=20
@@ -1253,11 +1155,6 @@ int iov_iter_npages(const struct iov_iter *i, int ma=
xpages)
 		return bvec_npages(i, maxpages);
 	if (iov_iter_is_bvecq(i))
 		return iov_npages_bvecq(i, maxpages);
-	if (iov_iter_is_folioq(i)) {
-		unsigned offset =3D i->iov_offset % PAGE_SIZE;
-		int npages =3D DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
-		return min(npages, maxpages);
-	}
 	if (iov_iter_is_xarray(i)) {
 		unsigned offset =3D (i->xarray_start + i->iov_offset) % PAGE_SIZE;
 		int npages =3D DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1680,68 +1577,6 @@ static ssize_t iov_iter_extract_bvecq_pages(struct i=
ov_iter *iter,
 	return extracted;
 }
=20
-/*
- * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This =
does
- * not get references on the pages, nor does it get a pin on them.
- */
-static ssize_t iov_iter_extract_folioq_pages(struct iov_iter *i,
-					     struct page ***pages, size_t maxsize,
-					     unsigned int maxpages,
-					     iov_iter_extraction_t extraction_flags,
-					     size_t *offset0)
-{
-	const struct folio_queue *folioq =3D i->folioq;
-	struct page **p;
-	unsigned int nr =3D 0;
-	size_t extracted =3D 0, offset, slot =3D i->folioq_slot;
-
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		slot =3D 0;
-		if (WARN_ON(i->iov_offset !=3D 0))
-			return -EIO;
-	}
-
-	offset =3D i->iov_offset & ~PAGE_MASK;
-	*offset0 =3D offset;
-
-	maxpages =3D want_pages_array(pages, maxsize, offset, maxpages);
-	if (!maxpages)
-		return -ENOMEM;
-	p =3D *pages;
-
-	for (;;) {
-		struct folio *folio =3D folioq_folio(folioq, slot);
-		size_t offset =3D i->iov_offset, fsize =3D folioq_folio_size(folioq, slo=
t);
-		size_t part =3D PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part =3D umin(part, umin(maxsize - extracted, fsize - offset));
-			i->count -=3D part;
-			i->iov_offset +=3D part;
-			extracted +=3D part;
-
-			p[nr++] =3D folio_page(folio, offset / PAGE_SIZE);
-		}
-
-		if (nr >=3D maxpages || extracted >=3D maxsize)
-			break;
-
-		if (i->iov_offset >=3D fsize) {
-			i->iov_offset =3D 0;
-			slot++;
-			if (slot =3D=3D folioq_nr_slots(folioq) && folioq->next) {
-				folioq =3D folioq->next;
-				slot =3D 0;
-			}
-		}
-	}
-
-	i->folioq =3D folioq;
-	i->folioq_slot =3D slot;
-	return extracted;
-}
-
 /*
  * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This =
does not
  * get references on the pages, nor does it get a pin on them.
@@ -1986,8 +1821,8 @@ static ssize_t iov_iter_extract_user_pages(struct iov=
_iter *i,
  *      added to the pages, but refs will not be taken.
  *      iov_iter_extract_will_pin() will return true.
  *
- *  (*) If the iterator is ITER_KVEC, ITER_BVEC, ITER_FOLIOQ or ITER_XARRA=
Y, the
- *      pages are merely listed; no extra refs or pins are obtained.
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages
+ *      are merely listed; no extra refs or pins are obtained.
  *      iov_iter_extract_will_pin() will return 0.
  *
  * Note also:
@@ -2026,10 +1861,6 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
 						    maxpages, extraction_flags,
 						    offset0);
-	if (iov_iter_is_folioq(i))
-		return iov_iter_extract_folioq_pages(i, pages, maxsize,
-						     maxpages, extraction_flags,
-						     offset0);
 	if (iov_iter_is_xarray(i))
 		return iov_iter_extract_xarray_pages(i, pages, maxsize,
 						     maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index b92144659543..11b6a890cf60 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -12,7 +12,6 @@
 #include <linux/bvec.h>
 #include <linux/bvecq.h>
 #include <linux/uio.h>
-#include <linux/folio_queue.h>
=20
 /**
  * sg_nents - return total count of entries in scatterlist
@@ -1330,67 +1329,6 @@ static ssize_t extract_bvecq_to_sg(struct iov_iter *=
iter,
 	return ret;
 }
=20
-/*
- * Extract up to sg_max folios from an FOLIOQ-type iterator and add them to
- * the scatterlist.  The pages are not pinned.
- */
-static ssize_t extract_folioq_to_sg(struct iov_iter *iter,
-				   ssize_t maxsize,
-				   struct sg_table *sgtable,
-				   unsigned int sg_max,
-				   iov_iter_extraction_t extraction_flags)
-{
-	const struct folio_queue *folioq =3D iter->folioq;
-	struct scatterlist *sg =3D sgtable->sgl + sgtable->nents;
-	unsigned int slot =3D iter->folioq_slot;
-	ssize_t ret =3D 0;
-	size_t offset =3D iter->iov_offset;
-
-	BUG_ON(!folioq);
-
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		if (WARN_ON_ONCE(!folioq))
-			return 0;
-		slot =3D 0;
-	}
-
-	do {
-		struct folio *folio =3D folioq_folio(folioq, slot);
-		size_t fsize =3D folioq_folio_size(folioq, slot);
-
-		if (offset < fsize) {
-			size_t part =3D umin(maxsize - ret, fsize - offset);
-
-			sg_set_page(sg, folio_page(folio, 0), part, offset);
-			sgtable->nents++;
-			sg++;
-			sg_max--;
-			offset +=3D part;
-			ret +=3D part;
-		}
-
-		if (offset >=3D fsize) {
-			offset =3D 0;
-			slot++;
-			if (slot >=3D folioq_nr_slots(folioq)) {
-				if (!folioq->next) {
-					WARN_ON_ONCE(ret < iter->count);
-					break;
-				}
-				folioq =3D folioq->next;
-				slot =3D 0;
-			}
-		}
-	} while (sg_max > 0 && ret < maxsize);
-
-	iter->folioq =3D folioq;
-	iter->folioq_slot =3D slot;
-	iter->iov_offset =3D offset;
-	iter->count -=3D ret;
-	return ret;
-}
-
 /*
  * Extract up to sg_max folios from an XARRAY-type iterator and add them to
  * the scatterlist.  The pages are not pinned.
@@ -1453,8 +1391,8 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *=
iter,
  * addition of @sg_max elements.
  *
  * The pages referred to by UBUF- and IOVEC-type iterators are extracted a=
nd
- * pinned; BVEC-, BVECQ-, KVEC-, FOLIOQ- and XARRAY-type are extracted but
- * aren't pinned; DISCARD-type is not supported.
+ * pinned; BVEC-, BVECQ-, KVEC- and XARRAY-type are extracted but aren't
+ * pinned; DISCARD-type is not supported.
  *
  * No end mark is placed on the scatterlist; that's left to the caller.
  *
@@ -1489,9 +1427,6 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, siz=
e_t maxsize,
 	case ITER_BVECQ:
 		return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max,
 					   extraction_flags);
-	case ITER_FOLIOQ:
-		return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max,
-					    extraction_flags);
 	case ITER_XARRAY:
 		return extract_xarray_to_sg(iter, maxsize, sgtable, sg_max,
 					    extraction_flags);
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index 1342487dd211..eac24f874aa1 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -13,7 +13,6 @@
 #include <linux/uio.h>
 #include <linux/bvec.h>
 #include <linux/bvecq.h>
-#include <linux/folio_queue.h>
 #include <linux/scatterlist.h>
 #include <linux/minmax.h>
 #include <linux/mman.h>
@@ -376,176 +375,6 @@ static void __init iov_kunit_copy_from_bvec(struct ku=
nit *test)
 	KUNIT_SUCCEED(test);
 }
=20
-static void iov_kunit_destroy_folioq(void *data)
-{
-	struct folio_queue *folioq, *next;
-
-	for (folioq =3D data; folioq; folioq =3D next) {
-		next =3D folioq->next;
-		kfree(folioq);
-	}
-}
-
-static void __init iov_kunit_load_folioq(struct kunit *test,
-					struct iov_iter *iter, int dir,
-					struct folio_queue *folioq,
-					struct page **pages, size_t npages)
-{
-	struct folio_queue *p =3D folioq;
-	size_t size =3D 0;
-	int i;
-
-	for (i =3D 0; i < npages; i++) {
-		if (folioq_full(p)) {
-			p->next =3D kzalloc_obj(struct folio_queue);
-			KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p->next);
-			folioq_init(p->next, 0);
-			p->next->prev =3D p;
-			p =3D p->next;
-		}
-		folioq_append(p, page_folio(pages[i]));
-		size +=3D PAGE_SIZE;
-	}
-	iov_iter_folio_queue(iter, dir, folioq, 0, 0, size);
-}
-
-static struct folio_queue *iov_kunit_create_folioq(struct kunit *test)
-{
-	struct folio_queue *folioq;
-
-	folioq =3D kzalloc_obj(struct folio_queue);
-	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, folioq);
-	kunit_add_action_or_reset(test, iov_kunit_destroy_folioq, folioq);
-	folioq_init(folioq, 0);
-	return folioq;
-}
-
-/*
- * Test copying to a ITER_FOLIOQ-type iterator.
- */
-static void __init iov_kunit_copy_to_folioq(struct kunit *test)
-{
-	const struct kvec_test_range *pr;
-	struct iov_iter iter;
-	struct folio_queue *folioq;
-	struct page **spages, **bpages;
-	u8 *scratch, *buffer;
-	size_t bufsize, npages, size, copied;
-	int i, patt;
-
-	bufsize =3D 0x100000;
-	npages =3D bufsize / PAGE_SIZE;
-
-	folioq =3D iov_kunit_create_folioq(test);
-
-	scratch =3D iov_kunit_create_buffer(test, &spages, npages);
-	for (i =3D 0; i < bufsize; i++)
-		scratch[i] =3D pattern(i);
-
-	buffer =3D iov_kunit_create_buffer(test, &bpages, npages);
-	memset(buffer, 0, bufsize);
-
-	iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages);
-
-	i =3D 0;
-	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
-		size =3D pr->to - pr->from;
-		KUNIT_ASSERT_LE(test, pr->to, bufsize);
-
-		iov_iter_folio_queue(&iter, READ, folioq, 0, 0, pr->to);
-		iov_iter_advance(&iter, pr->from);
-		copied =3D copy_to_iter(scratch + i, size, &iter);
-
-		KUNIT_EXPECT_EQ(test, copied, size);
-		KUNIT_EXPECT_EQ(test, iter.count, 0);
-		KUNIT_EXPECT_EQ(test, iter.iov_offset, pr->to % PAGE_SIZE);
-		i +=3D size;
-		if (test->status =3D=3D KUNIT_FAILURE)
-			goto stop;
-	}
-
-	/* Build the expected image in the scratch buffer. */
-	patt =3D 0;
-	memset(scratch, 0, bufsize);
-	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++)
-		for (i =3D pr->from; i < pr->to; i++)
-			scratch[i] =3D pattern(patt++);
-
-	/* Compare the images */
-	for (i =3D 0; i < bufsize; i++) {
-		KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=3D%x", i);
-		if (buffer[i] !=3D scratch[i])
-			return;
-	}
-
-stop:
-	KUNIT_SUCCEED(test);
-}
-
-/*
- * Test copying from a ITER_FOLIOQ-type iterator.
- */
-static void __init iov_kunit_copy_from_folioq(struct kunit *test)
-{
-	const struct kvec_test_range *pr;
-	struct iov_iter iter;
-	struct folio_queue *folioq;
-	struct page **spages, **bpages;
-	u8 *scratch, *buffer;
-	size_t bufsize, npages, size, copied;
-	int i, j;
-
-	bufsize =3D 0x100000;
-	npages =3D bufsize / PAGE_SIZE;
-
-	folioq =3D iov_kunit_create_folioq(test);
-
-	buffer =3D iov_kunit_create_buffer(test, &bpages, npages);
-	for (i =3D 0; i < bufsize; i++)
-		buffer[i] =3D pattern(i);
-
-	scratch =3D iov_kunit_create_buffer(test, &spages, npages);
-	memset(scratch, 0, bufsize);
-
-	iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages);
-
-	i =3D 0;
-	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
-		size =3D pr->to - pr->from;
-		KUNIT_ASSERT_LE(test, pr->to, bufsize);
-
-		iov_iter_folio_queue(&iter, WRITE, folioq, 0, 0, pr->to);
-		iov_iter_advance(&iter, pr->from);
-		copied =3D copy_from_iter(scratch + i, size, &iter);
-
-		KUNIT_EXPECT_EQ(test, copied, size);
-		KUNIT_EXPECT_EQ(test, iter.count, 0);
-		KUNIT_EXPECT_EQ(test, iter.iov_offset, pr->to % PAGE_SIZE);
-		i +=3D size;
-	}
-
-	/* Build the expected image in the main buffer. */
-	i =3D 0;
-	memset(buffer, 0, bufsize);
-	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
-		for (j =3D pr->from; j < pr->to; j++) {
-			buffer[i++] =3D pattern(j);
-			if (i >=3D bufsize)
-				goto stop;
-		}
-	}
-stop:
-
-	/* Compare the images */
-	for (i =3D 0; i < bufsize; i++) {
-		KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=3D%x", i);
-		if (scratch[i] !=3D buffer[i])
-			return;
-	}
-
-	KUNIT_SUCCEED(test);
-}
-
 static void iov_kunit_destroy_bvecq(void *data)
 {
 	struct bvecq *bq, *next;
@@ -1119,85 +948,6 @@ static void __init iov_kunit_extract_pages_bvecq(stru=
ct kunit *test)
 	KUNIT_SUCCEED(test);
 }
=20
-/*
- * Test the extraction of ITER_FOLIOQ-type iterators.
- */
-static void __init iov_kunit_extract_pages_folioq(struct kunit *test)
-{
-	const struct kvec_test_range *pr;
-	struct folio_queue *folioq;
-	struct iov_iter iter;
-	struct page **bpages, *pagelist[8], **pages =3D pagelist;
-	ssize_t len;
-	size_t bufsize, size =3D 0, npages;
-	int i, from;
-
-	bufsize =3D 0x100000;
-	npages =3D bufsize / PAGE_SIZE;
-
-	folioq =3D iov_kunit_create_folioq(test);
-
-	iov_kunit_create_buffer(test, &bpages, npages);
-	iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages);
-
-	for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) {
-		from =3D pr->from;
-		size =3D pr->to - from;
-		KUNIT_ASSERT_LE(test, pr->to, bufsize);
-
-		iov_iter_folio_queue(&iter, WRITE, folioq, 0, 0, pr->to);
-		iov_iter_advance(&iter, from);
-
-		do {
-			size_t offset0 =3D LONG_MAX;
-
-			for (i =3D 0; i < ARRAY_SIZE(pagelist); i++)
-				pagelist[i] =3D (void *)(unsigned long)0xaa55aa55aa55aa55ULL;
-
-			len =3D iov_iter_extract_pages(&iter, &pages, 100 * 1024,
-						     ARRAY_SIZE(pagelist), 0, &offset0);
-			KUNIT_EXPECT_GE(test, len, 0);
-			if (len < 0)
-				break;
-			KUNIT_EXPECT_LE(test, len, size);
-			KUNIT_EXPECT_EQ(test, iter.count, size - len);
-			if (len =3D=3D 0)
-				break;
-			size -=3D len;
-			KUNIT_EXPECT_GE(test, (ssize_t)offset0, 0);
-			KUNIT_EXPECT_LT(test, offset0, PAGE_SIZE);
-
-			for (i =3D 0; i < ARRAY_SIZE(pagelist); i++) {
-				struct page *p;
-				ssize_t part =3D min_t(ssize_t, len, PAGE_SIZE - offset0);
-				int ix;
-
-				KUNIT_ASSERT_GE(test, part, 0);
-				ix =3D from / PAGE_SIZE;
-				KUNIT_ASSERT_LT(test, ix, npages);
-				p =3D bpages[ix];
-				KUNIT_EXPECT_PTR_EQ(test, pagelist[i], p);
-				KUNIT_EXPECT_EQ(test, offset0, from % PAGE_SIZE);
-				from +=3D part;
-				len -=3D part;
-				KUNIT_ASSERT_GE(test, len, 0);
-				if (len =3D=3D 0)
-					break;
-				offset0 =3D 0;
-			}
-
-			if (test->status =3D=3D KUNIT_FAILURE)
-				goto stop;
-		} while (iov_iter_count(&iter) > 0);
-
-		KUNIT_EXPECT_EQ(test, size, 0);
-		KUNIT_EXPECT_EQ(test, iter.count, 0);
-	}
-
-stop:
-	KUNIT_SUCCEED(test);
-}
-
 /*
  * Test the extraction of ITER_XARRAY-type iterators.
  */
@@ -1425,23 +1175,6 @@ static void __init iov_kunit_iter_to_sg_bvec(struct =
kunit *test)
 	iov_kunit_iter_to_sg_check(test, &iter, bufsize, &data);
 }
=20
-static void __init iov_kunit_iter_to_sg_folioq(struct kunit *test)
-{
-	struct iov_kunit_iter_to_sg_data data;
-	struct folio_queue *folioq;
-	struct iov_iter iter;
-	size_t bufsize;
-
-	bufsize =3D 0x100000;
-	iov_kunit_iter_to_sg_init(test, bufsize, false, &data);
-
-	folioq =3D iov_kunit_create_folioq(test);
-	iov_kunit_load_folioq(test, &iter, READ, folioq, data.pages,
-			      data.npages);
-
-	iov_kunit_iter_to_sg_check(test, &iter, bufsize, &data);
-}
-
 static void __init iov_kunit_iter_to_sg_xarray(struct kunit *test)
 {
 	struct iov_kunit_iter_to_sg_data data;
@@ -1480,18 +1213,14 @@ static struct kunit_case __refdata iov_kunit_cases[=
] =3D {
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
 	KUNIT_CASE(iov_kunit_copy_to_bvecq),
 	KUNIT_CASE(iov_kunit_copy_from_bvecq),
-	KUNIT_CASE(iov_kunit_copy_to_folioq),
-	KUNIT_CASE(iov_kunit_copy_from_folioq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvecq),
-	KUNIT_CASE(iov_kunit_extract_pages_folioq),
 	KUNIT_CASE(iov_kunit_extract_pages_xarray),
 	KUNIT_CASE(iov_kunit_iter_to_sg_kvec),
 	KUNIT_CASE(iov_kunit_iter_to_sg_bvec),
-	KUNIT_CASE(iov_kunit_iter_to_sg_folioq),
 	KUNIT_CASE(iov_kunit_iter_to_sg_xarray),
 	KUNIT_CASE(iov_kunit_iter_to_sg_ubuf),
 	{}
From nobody Mon May 25 04:33:52 2026
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 18/21] netfs: Remove folio_queue and rolling_buffer
Date: Mon, 18 May 2026 23:29:50 +0100
Message-ID: <20260518222959.488126-19-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Remove folio_queue and rolling_buffer as they're no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 Documentation/core-api/folio_queue.rst      | 209 --------------
 Documentation/core-api/index.rst            |   1 -
 Documentation/filesystems/netfs_library.rst |   2 +-
 fs/netfs/iterator.c                         | 192 -------------
 fs/netfs/rolling_buffer.c                   | 297 --------------------
 include/linux/folio_queue.h                 | 282 -------------------
 include/linux/netfs.h                       |   2 -
 include/linux/rolling_buffer.h              |  64 -----
 kernel/bpf/btf.c                            |   2 -
 9 files changed, 1 insertion(+), 1050 deletions(-)
 delete mode 100644 Documentation/core-api/folio_queue.rst
 delete mode 100644 fs/netfs/rolling_buffer.c
 delete mode 100644 include/linux/folio_queue.h
 delete mode 100644 include/linux/rolling_buffer.h

diff --git a/Documentation/core-api/folio_queue.rst b/Documentation/core-ap=
i/folio_queue.rst
deleted file mode 100644
index b7628896d2b6..000000000000
--- a/Documentation/core-api/folio_queue.rst
+++ /dev/null
@@ -1,209 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-Folio Queue
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-:Author: David Howells <dhowells@redhat.com>
-
-.. Contents:
-
- * Overview
- * Initialisation
- * Adding and removing folios
- * Querying information about a folio
- * Querying information about a folio_queue
- * Folio queue iteration
- * Folio marks
- * Lockless simultaneous production/consumption issues
-
-
-Overview
-=3D=3D=3D=3D=3D=3D=3D=3D
-
-The folio_queue struct forms a single segment in a segmented list of folios
-that can be used to form an I/O buffer.  As such, the list can be iterated=
 over
-using the ITER_FOLIOQ iov_iter type.
-
-The publicly accessible members of the structure are::
-
-	struct folio_queue {
-		struct folio_queue *next;
-		struct folio_queue *prev;
-		...
-	};
-
-A pair of pointers are provided, ``next`` and ``prev``, that point to the
-segments on either side of the segment being accessed.  Whilst this is a
-doubly-linked list, it is intentionally not a circular list; the outward
-sibling pointers in terminal segments should be NULL.
-
-Each segment in the list also stores:
-
- * an ordered sequence of folio pointers,
- * the size of each folio and
- * three 1-bit marks per folio,
-
-but these should not be accessed directly as the underlying data structure=
 may
-change, but rather the access functions outlined below should be used.
-
-The facility can be made accessible by::
-
-	#include <linux/folio_queue.h>
-
-and to use the iterator::
-
-	#include <linux/uio.h>
-
-
-Initialisation
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-A segment should be initialised by calling::
-
-	void folioq_init(struct folio_queue *folioq);
-
-with a pointer to the segment to be initialised.  Note that this will not
-necessarily initialise all the folio pointers, so care must be taken to ch=
eck
-the number of folios added.
-
-
-Adding and removing folios
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
-
-Folios can be set in the next unused slot in a segment struct by calling o=
ne
-of::
-
-	unsigned int folioq_append(struct folio_queue *folioq,
-				   struct folio *folio);
-
-	unsigned int folioq_append_mark(struct folio_queue *folioq,
-					struct folio *folio);
-
-Both functions update the stored folio count, store the folio and note its
-size.  The second function also sets the first mark for the folio added.  =
Both
-functions return the number of the slot used.  [!] Note that no attempt is=
 made
-to check that the capacity wasn't overrun and the list will not be extended
-automatically.
-
-A folio can be excised by calling::
-
-	void folioq_clear(struct folio_queue *folioq, unsigned int slot);
-
-This clears the slot in the array and also clears all the marks for that f=
olio,
-but doesn't change the folio count - so future accesses of that slot must =
check
-if the slot is occupied.
-
-
-Querying information about a folio
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-Information about the folio in a particular slot may be queried by the
-following function::
-
-	struct folio *folioq_folio(const struct folio_queue *folioq,
-				   unsigned int slot);
-
-If a folio has not yet been set in that slot, this may yield an undefined
-pointer.  The size of the folio in a slot may be queried with either of::
-
-	unsigned int folioq_folio_order(const struct folio_queue *folioq,
-					unsigned int slot);
-
-	size_t folioq_folio_size(const struct folio_queue *folioq,
-				 unsigned int slot);
-
-The first function returns the size as an order and the second as a number=
 of
-bytes.
-
-
-Querying information about a folio_queue
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-Information may be retrieved about a particular segment with the following
-functions::
-
-	unsigned int folioq_nr_slots(const struct folio_queue *folioq);
-
-	unsigned int folioq_count(struct folio_queue *folioq);
-
-	bool folioq_full(struct folio_queue *folioq);
-
-The first function returns the maximum capacity of a segment.  It must not=
 be
-assumed that this won't vary between segments.  The second returns the num=
ber
-of folios added to a segments and the third is a shorthand to indicate if =
the
-segment has been filled to capacity.
-
-Not that the count and fullness are not affected by clearing folios from t=
he
-segment.  These are more about indicating how many slots in the array have=
 been
-initialised, and it assumed that slots won't get reused, but rather the se=
gment
-will get discarded as the queue is consumed.
-
-
-Folio marks
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-Folios within a queue can also have marks assigned to them.  These marks c=
an be
-used to note information such as if a folio needs folio_put() calling upon=
 it.
-There are three marks available to be set for each folio.
-
-The marks can be set by::
-
-	void folioq_mark(struct folio_queue *folioq, unsigned int slot);
-	void folioq_mark2(struct folio_queue *folioq, unsigned int slot);
-
-Cleared by::
-
-	void folioq_unmark(struct folio_queue *folioq, unsigned int slot);
-	void folioq_unmark2(struct folio_queue *folioq, unsigned int slot);
-
-And the marks can be queried by::
-
-	bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot=
);
-	bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slo=
t);
-
-The marks can be used for any purpose and are not interpreted by this API.
-
-
-Folio queue iteration
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-A list of segments may be iterated over using the I/O iterator facility us=
ing
-an ``iov_iter`` iterator of ``ITER_FOLIOQ`` type.  The iterator may be
-initialised with::
-
-	void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-				  const struct folio_queue *folioq,
-				  unsigned int first_slot, unsigned int offset,
-				  size_t count);
-
-This may be told to start at a particular segment, slot and offset within a
-queue.  The iov iterator functions will follow the next pointers when adva=
ncing
-and prev pointers when reverting when needed.
-
-
-Lockless simultaneous production/consumption issues
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
-
-If properly managed, the list can be extended by the producer at the head =
end
-and shortened by the consumer at the tail end simultaneously without the n=
eed
-to take locks.  The ITER_FOLIOQ iterator inserts appropriate barriers to a=
id
-with this.
-
-Care must be taken when simultaneously producing and consuming a list.  If=
 the
-last segment is reached and the folios it refers to are entirely consumed =
by
-the IOV iterators, an iov_iter struct will be left pointing to the last se=
gment
-with a slot number equal to the capacity of that segment.  The iterator wi=
ll
-try to continue on from this if there's another segment available when it =
is
-used again, but care must be taken lest the segment got removed and freed =
by
-the consumer before the iterator was advanced.
-
-It is recommended that the queue always contain at least one segment, even=
 if
-that segment has never been filled or is entirely spent.  This prevents the
-head and tail pointers from collapsing.
-
-
-API Function Reference
-=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
-
-.. kernel-doc:: include/linux/folio_queue.h
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/inde=
x.rst
index 13769d5c40bf..16c529a33ac4 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -39,7 +39,6 @@ Library functionality that is used throughout the kernel.
    kref
    cleanup
    assoc_array
-   folio_queue
    xarray
    maple_tree
    idr
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/fi=
lesystems/netfs_library.rst
index ddd799df6ce3..18e3c3aae57c 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -449,7 +449,7 @@ be called from the writeback code to write the data to =
the cache, if there is
 one.
=20
 The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is =
to be
-used.  The writeback function requires the buffer to be of ITER_FOLIOQ typ=
e.
+used.
=20
 High-Level VM API
 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 566693ac47ef..3040be52c293 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -139,195 +139,3 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, siz=
e_t max_len, size_t max_pag
 	return extracted ?: ret;
 }
 EXPORT_SYMBOL_GPL(netfs_extract_iter);
-
-#if 0
-/*
- * Select the span of a bvec iterator we're going to use.  Limit it by bot=
h maximum
- * size and maximum number of segments.  Returns the size of the span in b=
ytes.
- */
-static size_t netfs_limit_bvec(const struct iov_iter *iter, size_t start_o=
ffset,
-			       size_t max_size, size_t max_segs)
-{
-	const struct bio_vec *bvecs =3D iter->bvec;
-	unsigned int nbv =3D iter->nr_segs, ix =3D 0, nsegs =3D 0;
-	size_t len, span =3D 0, n =3D iter->count;
-	size_t skip =3D iter->iov_offset + start_offset;
-
-	if (WARN_ON(!iov_iter_is_bvec(iter)) ||
-	    WARN_ON(start_offset > n) ||
-	    n =3D=3D 0)
-		return 0;
-
-	while (n && ix < nbv && skip) {
-		len =3D bvecs[ix].bv_len;
-		if (skip < len)
-			break;
-		skip -=3D len;
-		n -=3D len;
-		ix++;
-	}
-
-	while (n && ix < nbv) {
-		len =3D min3(n, bvecs[ix].bv_len - skip, max_size);
-		span +=3D len;
-		nsegs++;
-		ix++;
-		if (span >=3D max_size || nsegs >=3D max_segs)
-			break;
-		skip =3D 0;
-		n -=3D len;
-	}
-
-	return min(span, max_size);
-}
-
-/*
- * Select the span of a kvec iterator we're going to use.  Limit it by both
- * maximum size and maximum number of segments.  Returns the size of the s=
pan
- * in bytes.
- */
-static size_t netfs_limit_kvec(const struct iov_iter *iter, size_t start_o=
ffset,
-			       size_t max_size, size_t max_segs)
-{
-	const struct kvec *kvecs =3D iter->kvec;
-	unsigned int nkv =3D iter->nr_segs, ix =3D 0, nsegs =3D 0;
-	size_t len, span =3D 0, n =3D iter->count;
-	size_t skip =3D iter->iov_offset + start_offset;
-
-	if (WARN_ON(!iov_iter_is_kvec(iter)) ||
-	    WARN_ON(start_offset > n) ||
-	    n =3D=3D 0)
-		return 0;
-
-	while (n && ix < nkv && skip) {
-		len =3D kvecs[ix].iov_len;
-		if (skip < len)
-			break;
-		skip -=3D len;
-		n -=3D len;
-		ix++;
-	}
-
-	while (n && ix < nkv) {
-		len =3D min3(n, kvecs[ix].iov_len - skip, max_size);
-		span +=3D len;
-		nsegs++;
-		ix++;
-		if (span >=3D max_size || nsegs >=3D max_segs)
-			break;
-		skip =3D 0;
-		n -=3D len;
-	}
-
-	return min(span, max_size);
-}
-
-/*
- * Select the span of an xarray iterator we're going to use.  Limit it by =
both
- * maximum size and maximum number of segments.  It is assumed that segmen=
ts
- * can be larger than a page in size, provided they're physically contiguo=
us.
- * Returns the size of the span in bytes.
- */
-static size_t netfs_limit_xarray(const struct iov_iter *iter, size_t start=
_offset,
-				 size_t max_size, size_t max_segs)
-{
-	struct folio *folio;
-	unsigned int nsegs =3D 0;
-	loff_t pos =3D iter->xarray_start + iter->iov_offset;
-	pgoff_t index =3D pos / PAGE_SIZE;
-	size_t span =3D 0, n =3D iter->count;
-
-	XA_STATE(xas, iter->xarray, index);
-
-	if (WARN_ON(!iov_iter_is_xarray(iter)) ||
-	    WARN_ON(start_offset > n) ||
-	    n =3D=3D 0)
-		return 0;
-	max_size =3D min(max_size, n - start_offset);
-
-	rcu_read_lock();
-	xas_for_each(&xas, folio, ULONG_MAX) {
-		size_t offset, flen, len;
-		if (xas_retry(&xas, folio))
-			continue;
-		if (WARN_ON(xa_is_value(folio)))
-			break;
-		if (WARN_ON(folio_test_hugetlb(folio)))
-			break;
-
-		flen =3D folio_size(folio);
-		offset =3D offset_in_folio(folio, pos);
-		len =3D min(max_size, flen - offset);
-		span +=3D len;
-		nsegs++;
-		if (span >=3D max_size || nsegs >=3D max_segs)
-			break;
-	}
-
-	rcu_read_unlock();
-	return min(span, max_size);
-}
-
-/*
- * Select the span of a folio queue iterator we're going to use.  Limit it=
 by
- * both maximum size and maximum number of segments.  Returns the size of =
the
- * span in bytes.
- */
-static size_t netfs_limit_folioq(const struct iov_iter *iter, size_t start=
_offset,
-				 size_t max_size, size_t max_segs)
-{
-	const struct folio_queue *folioq =3D iter->folioq;
-	unsigned int nsegs =3D 0;
-	unsigned int slot =3D iter->folioq_slot;
-	size_t span =3D 0, n =3D iter->count;
-
-	if (WARN_ON(!iov_iter_is_folioq(iter)) ||
-	    WARN_ON(start_offset > n) ||
-	    n =3D=3D 0)
-		return 0;
-	max_size =3D umin(max_size, n - start_offset);
-
-	if (slot >=3D folioq_nr_slots(folioq)) {
-		folioq =3D folioq->next;
-		slot =3D 0;
-	}
-
-	start_offset +=3D iter->iov_offset;
-	do {
-		size_t flen =3D folioq_folio_size(folioq, slot);
-
-		if (start_offset < flen) {
-			span +=3D flen - start_offset;
-			nsegs++;
-			start_offset =3D 0;
-		} else {
-			start_offset -=3D flen;
-		}
-		if (span >=3D max_size || nsegs >=3D max_segs)
-			break;
-
-		slot++;
-		if (slot >=3D folioq_nr_slots(folioq)) {
-			folioq =3D folioq->next;
-			slot =3D 0;
-		}
-	} while (folioq);
-
-	return umin(span, max_size);
-}
-
-size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
-			size_t max_size, size_t max_segs)
-{
-	if (iov_iter_is_folioq(iter))
-		return netfs_limit_folioq(iter, start_offset, max_size, max_segs);
-	if (iov_iter_is_bvec(iter))
-		return netfs_limit_bvec(iter, start_offset, max_size, max_segs);
-	if (iov_iter_is_xarray(iter))
-		return netfs_limit_xarray(iter, start_offset, max_size, max_segs);
-	if (iov_iter_is_kvec(iter))
-		return netfs_limit_kvec(iter, start_offset, max_size, max_segs);
-	BUG();
-}
-EXPORT_SYMBOL(netfs_limit_iter);
-#endif
diff --git a/fs/netfs/rolling_buffer.c b/fs/netfs/rolling_buffer.c
deleted file mode 100644
index 576b425a227d..000000000000
--- a/fs/netfs/rolling_buffer.c
+++ /dev/null
@@ -1,297 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/* Rolling buffer helpers
- *
- * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- */
-
-#include <linux/bitops.h>
-#include <linux/pagemap.h>
-#include <linux/rolling_buffer.h>
-#include <linux/slab.h>
-#include "internal.h"
-
-static atomic_t debug_ids;
-
-/**
- * netfs_folioq_alloc - Allocate a folio_queue struct
- * @rreq_id: Associated debugging ID for tracing purposes
- * @gfp: Allocation constraints
- * @trace: Trace tag to indicate the purpose of the allocation
- *
- * Allocate, initialise and account the folio_queue struct and log a trace=
 line
- * to mark the allocation.
- */
-struct folio_queue *netfs_folioq_alloc(unsigned int rreq_id, gfp_t gfp,
-				       unsigned int /*enum netfs_folioq_trace*/ trace)
-{
-	struct folio_queue *fq;
-
-	fq =3D kmalloc_obj(*fq, gfp);
-	if (fq) {
-		netfs_stat(&netfs_n_folioq);
-		folioq_init(fq, rreq_id);
-		fq->debug_id =3D atomic_inc_return(&debug_ids);
-		trace_netfs_folioq(fq, trace);
-	}
-	return fq;
-}
-EXPORT_SYMBOL(netfs_folioq_alloc);
-
-/**
- * netfs_folioq_free - Free a folio_queue struct
- * @folioq: The object to free
- * @trace: Trace tag to indicate which free
- *
- * Free and unaccount the folio_queue struct.
- */
-void netfs_folioq_free(struct folio_queue *folioq,
-		       unsigned int /*enum netfs_trace_folioq*/ trace)
-{
-	trace_netfs_folioq(folioq, trace);
-	netfs_stat_d(&netfs_n_folioq);
-	kfree(folioq);
-}
-EXPORT_SYMBOL(netfs_folioq_free);
-
-/*
- * Initialise a rolling buffer.  We allocate an empty folio queue struct t=
o so
- * that the pointers can be independently driven by the producer and the
- * consumer.
- */
-int rolling_buffer_init(struct rolling_buffer *roll, unsigned int rreq_id,
-			unsigned int direction)
-{
-	struct folio_queue *fq;
-
-	fq =3D netfs_folioq_alloc(rreq_id, GFP_NOFS, netfs_trace_folioq_rollbuf_i=
nit);
-	if (!fq)
-		return -ENOMEM;
-
-	roll->head =3D fq;
-	roll->tail =3D fq;
-	iov_iter_folio_queue(&roll->iter, direction, fq, 0, 0, 0);
-	return 0;
-}
-
-/*
- * Add another folio_queue to a rolling buffer if there's no space left.
- */
-int rolling_buffer_make_space(struct rolling_buffer *roll)
-{
-	struct folio_queue *fq, *head =3D roll->head;
-
-	if (!folioq_full(head))
-		return 0;
-
-	fq =3D netfs_folioq_alloc(head->rreq_id, GFP_NOFS, netfs_trace_folioq_mak=
e_space);
-	if (!fq)
-		return -ENOMEM;
-	fq->prev =3D head;
-
-	roll->head =3D fq;
-	if (folioq_full(head)) {
-		/* Make sure we don't leave the master iterator pointing to a
-		 * block that might get immediately consumed.
-		 */
-		if (roll->iter.folioq =3D=3D head &&
-		    roll->iter.folioq_slot =3D=3D folioq_nr_slots(head)) {
-			roll->iter.folioq =3D fq;
-			roll->iter.folioq_slot =3D 0;
-		}
-	}
-
-	/* Make sure the initialisation is stored before the next pointer.
-	 *
-	 * [!] NOTE: After we set head->next, the consumer is at liberty to
-	 * immediately delete the old head.
-	 */
-	smp_store_release(&head->next, fq);
-	return 0;
-}
-
-/*
- * Decant the list of folios to read into a rolling buffer.
- */
-ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
-				    struct readahead_control *ractl,
-				    struct folio_batch *put_batch)
-{
-	struct folio_queue *fq;
-	struct page **vec;
-	int nr, ix, to;
-	ssize_t size =3D 0;
-
-	if (rolling_buffer_make_space(roll) < 0)
-		return -ENOMEM;
-
-	fq =3D roll->head;
-	vec =3D (struct page **)fq->vec.folios;
-	nr =3D __readahead_batch(ractl, vec + folio_batch_count(&fq->vec),
-			       folio_batch_space(&fq->vec));
-	ix =3D fq->vec.nr;
-	to =3D ix + nr;
-	fq->vec.nr =3D to;
-	for (; ix < to; ix++) {
-		struct folio *folio =3D folioq_folio(fq, ix);
-		unsigned int order =3D folio_order(folio);
-
-		fq->orders[ix] =3D order;
-		size +=3D PAGE_SIZE << order;
-		trace_netfs_folio(folio, netfs_folio_trace_read);
-		if (!folio_batch_add(put_batch, folio))
-			folio_batch_release(put_batch);
-	}
-	WRITE_ONCE(roll->iter.count, roll->iter.count + size);
-
-	/* Store the counter after setting the slot. */
-	smp_store_release(&roll->next_head_slot, to);
-	return size;
-}
-
-/*
- * Decant the entire list of folios to read into a rolling buffer.
- */
-ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
-					 struct readahead_control *ractl,
-					 unsigned int rreq_id)
-{
-	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
-	struct folio_queue *fq;
-	struct folio *folio;
-	ssize_t loaded =3D 0;
-	int nr, slot =3D 0, npages =3D 0;
-
-	/* First allocate all the folioqs we're going to need to avoid having
-	 * to deal with ENOMEM later.
-	 */
-	nr =3D ractl->_nr_folios;
-	do {
-		fq =3D netfs_folioq_alloc(rreq_id, GFP_KERNEL,
-					netfs_trace_folioq_make_space);
-		if (!fq) {
-			rolling_buffer_clear(roll);
-			return -ENOMEM;
-		}
-		fq->prev =3D roll->head;
-		if (!roll->tail)
-			roll->tail =3D fq;
-		else
-			roll->head->next =3D fq;
-		roll->head =3D fq;
-
-		nr -=3D folioq_nr_slots(fq);
-	} while (nr > 0);
-
-	rcu_read_lock();
-
-	fq =3D roll->tail;
-	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
-		unsigned int order;
-
-		if (xas_retry(&xas, folio))
-			continue;
-		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-
-		order =3D folio_order(folio);
-		fq->orders[slot] =3D order;
-		fq->vec.folios[slot] =3D folio;
-		loaded +=3D PAGE_SIZE << order;
-		npages +=3D 1 << order;
-		trace_netfs_folio(folio, netfs_folio_trace_read);
-
-		slot++;
-		if (slot >=3D folioq_nr_slots(fq)) {
-			fq->vec.nr =3D slot;
-			fq =3D fq->next;
-			if (!fq) {
-				WARN_ON_ONCE(npages < readahead_count(ractl));
-				break;
-			}
-			slot =3D 0;
-		}
-	}
-
-	rcu_read_unlock();
-
-	if (fq)
-		fq->vec.nr =3D slot;
-
-	WRITE_ONCE(roll->iter.count, loaded);
-	iov_iter_folio_queue(&roll->iter, ITER_DEST, roll->tail, 0, 0, loaded);
-	ractl->_index    +=3D npages;
-	ractl->_nr_pages -=3D npages;
-	return loaded;
-}
-
-/*
- * Append a folio to the rolling buffer.
- */
-ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *f=
olio,
-			      unsigned int flags)
-{
-	ssize_t size =3D folio_size(folio);
-	int slot;
-
-	if (rolling_buffer_make_space(roll) < 0)
-		return -ENOMEM;
-
-	slot =3D folioq_append(roll->head, folio);
-	if (flags & ROLLBUF_MARK_1)
-		folioq_mark(roll->head, slot);
-	if (flags & ROLLBUF_MARK_2)
-		folioq_mark2(roll->head, slot);
-
-	WRITE_ONCE(roll->iter.count, roll->iter.count + size);
-
-	/* Store the counter after setting the slot. */
-	smp_store_release(&roll->next_head_slot, slot);
-	return size;
-}
-
-/*
- * Delete a spent buffer from a rolling queue and return the next in line.=
  We
- * don't return the last buffer to keep the pointers independent, but retu=
rn
- * NULL instead.
- */
-struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *rol=
l)
-{
-	struct folio_queue *spent =3D roll->tail, *next =3D READ_ONCE(spent->next=
);
-
-	if (!next)
-		return NULL;
-	next->prev =3D NULL;
-	netfs_folioq_free(spent, netfs_trace_folioq_delete);
-	roll->tail =3D next;
-	return next;
-}
-
-/*
- * Clear out a rolling queue.  Folios that have mark 1 set are put.
- */
-void rolling_buffer_clear(struct rolling_buffer *roll)
-{
-	struct folio_batch fbatch;
-	struct folio_queue *p;
-
-	folio_batch_init(&fbatch);
-
-	while ((p =3D roll->tail)) {
-		roll->tail =3D p->next;
-		for (int slot =3D 0; slot < folioq_count(p); slot++) {
-			struct folio *folio =3D folioq_folio(p, slot);
-
-			if (!folio)
-				continue;
-			if (folioq_is_marked(p, slot)) {
-				trace_netfs_folio(folio, netfs_folio_trace_put);
-				if (!folio_batch_add(&fbatch, folio))
-					folio_batch_release(&fbatch);
-			}
-		}
-
-		netfs_folioq_free(p, netfs_trace_folioq_clear);
-	}
-
-	folio_batch_release(&fbatch);
-}
diff --git a/include/linux/folio_queue.h b/include/linux/folio_queue.h
deleted file mode 100644
index f6d5f1f127c9..000000000000
--- a/include/linux/folio_queue.h
+++ /dev/null
@@ -1,282 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-/* Queue of folios definitions
- *
- * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * See:
- *
- *	Documentation/core-api/folio_queue.rst
- *
- * for a description of the API.
- */
-
-#ifndef _LINUX_FOLIO_QUEUE_H
-#define _LINUX_FOLIO_QUEUE_H
-
-#include <linux/folio_batch.h>
-#include <linux/mm.h>
-
-/*
- * Segment in a queue of running buffers.  Each segment can hold a number =
of
- * folios and a portion of the queue can be referenced with the ITER_FOLIOQ
- * iterator.  The possibility exists of inserting non-folio elements into =
the
- * queue (such as gaps).
- *
- * Explicit prev and next pointers are used instead of a list_head to make=
 it
- * easier to add segments to tail and remove them from the head without the
- * need for a lock.
- */
-struct folio_queue {
-	struct folio_batch	vec;		/* Folios in the queue segment */
-	u8			orders[FOLIO_BATCH_SIZE]; /* Order of each folio */
-	struct folio_queue	*next;		/* Next queue segment or NULL */
-	struct folio_queue	*prev;		/* Previous queue segment of NULL */
-	unsigned long		marks;		/* 1-bit mark per folio */
-	unsigned long		marks2;		/* Second 1-bit mark per folio */
-#if FOLIO_BATCH_SIZE > BITS_PER_LONG
-#error marks is not big enough
-#endif
-	unsigned int		rreq_id;
-	unsigned int		debug_id;
-};
-
-/**
- * folioq_init - Initialise a folio queue segment
- * @folioq: The segment to initialise
- * @rreq_id: The request identifier to use in tracelines.
- *
- * Initialise a folio queue segment and set an identifier to be used in tr=
aces.
- *
- * Note that the folio pointers are left uninitialised.
- */
-static inline void folioq_init(struct folio_queue *folioq, unsigned int rr=
eq_id)
-{
-	folio_batch_init(&folioq->vec);
-	folioq->next =3D NULL;
-	folioq->prev =3D NULL;
-	folioq->marks =3D 0;
-	folioq->marks2 =3D 0;
-	folioq->rreq_id =3D rreq_id;
-	folioq->debug_id =3D 0;
-}
-
-/**
- * folioq_nr_slots: Query the capacity of a folio queue segment
- * @folioq: The segment to query
- *
- * Query the number of folios that a particular folio queue segment might =
hold.
- * [!] NOTE: This must not be assumed to be the same for every segment!
- */
-static inline unsigned int folioq_nr_slots(const struct folio_queue *folio=
q)
-{
-	return FOLIO_BATCH_SIZE;
-}
-
-/**
- * folioq_count: Query the occupancy of a folio queue segment
- * @folioq: The segment to query
- *
- * Query the number of folios that have been added to a folio queue segmen=
t.
- * Note that this is not decreased as folios are removed from a segment.
- */
-static inline unsigned int folioq_count(struct folio_queue *folioq)
-{
-	return folio_batch_count(&folioq->vec);
-}
-
-/**
- * folioq_full: Query if a folio queue segment is full
- * @folioq: The segment to query
- *
- * Query if a folio queue segment is fully occupied.  Note that this does =
not
- * change if folios are removed from a segment.
- */
-static inline bool folioq_full(struct folio_queue *folioq)
-{
-	//return !folio_batch_space(&folioq->vec);
-	return folioq_count(folioq) >=3D folioq_nr_slots(folioq);
-}
-
-/**
- * folioq_is_marked: Check first folio mark in a folio queue segment
- * @folioq: The segment to query
- * @slot: The slot number of the folio to query
- *
- * Determine if the first mark is set for the folio in the specified slot =
in a
- * folio queue segment.
- */
-static inline bool folioq_is_marked(const struct folio_queue *folioq, unsi=
gned int slot)
-{
-	return test_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_mark: Set the first mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Set the first mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_mark(struct folio_queue *folioq, unsigned int sl=
ot)
-{
-	set_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_unmark: Clear the first mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Clear the first mark for the folio in the specified slot in a folio que=
ue
- * segment.
- */
-static inline void folioq_unmark(struct folio_queue *folioq, unsigned int =
slot)
-{
-	clear_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_is_marked2: Check second folio mark in a folio queue segment
- * @folioq: The segment to query
- * @slot: The slot number of the folio to query
- *
- * Determine if the second mark is set for the folio in the specified slot=
 in a
- * folio queue segment.
- */
-static inline bool folioq_is_marked2(const struct folio_queue *folioq, uns=
igned int slot)
-{
-	return test_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_mark2: Set the second mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Set the second mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_mark2(struct folio_queue *folioq, unsigned int s=
lot)
-{
-	set_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_unmark2: Clear the second mark on a folio in a folio queue segme=
nt
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Clear the second mark for the folio in the specified slot in a folio qu=
eue
- * segment.
- */
-static inline void folioq_unmark2(struct folio_queue *folioq, unsigned int=
 slot)
-{
-	clear_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_append: Add a folio to a folio queue segment
- * @folioq: The segment to add to
- * @folio: The folio to add
- *
- * Add a folio to the tail of the sequence in a folio queue segment, incre=
asing
- * the occupancy count and returning the slot number for the folio just ad=
ded.
- * The folio size is extracted and stored in the queue and the marks are l=
eft
- * unmodified.
- *
- * Note that it's left up to the caller to check that the segment capacity=
 will
- * not be exceeded and to extend the queue.
- */
-static inline unsigned int folioq_append(struct folio_queue *folioq, struc=
t folio *folio)
-{
-	unsigned int slot =3D folioq->vec.nr++;
-
-	folioq->vec.folios[slot] =3D folio;
-	folioq->orders[slot] =3D folio_order(folio);
-	return slot;
-}
-
-/**
- * folioq_append_mark: Add a folio to a folio queue segment
- * @folioq: The segment to add to
- * @folio: The folio to add
- *
- * Add a folio to the tail of the sequence in a folio queue segment, incre=
asing
- * the occupancy count and returning the slot number for the folio just ad=
ded.
- * The folio size is extracted and stored in the queue, the first mark is =
set
- * and and the second and third marks are left unmodified.
- *
- * Note that it's left up to the caller to check that the segment capacity=
 will
- * not be exceeded and to extend the queue.
- */
-static inline unsigned int folioq_append_mark(struct folio_queue *folioq, =
struct folio *folio)
-{
-	unsigned int slot =3D folioq->vec.nr++;
-
-	folioq->vec.folios[slot] =3D folio;
-	folioq->orders[slot] =3D folio_order(folio);
-	folioq_mark(folioq, slot);
-	return slot;
-}
-
-/**
- * folioq_folio: Get a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the folio in the specified slot from a folio queue segment.  N=
ote
- * that no bounds check is made and if the slot hasn't been added into yet=
, the
- * pointer will be undefined.  If the slot has been cleared, NULL will be
- * returned.
- */
-static inline struct folio *folioq_folio(const struct folio_queue *folioq,=
 unsigned int slot)
-{
-	return folioq->vec.folios[slot];
-}
-
-/**
- * folioq_folio_order: Get the order of a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the order of the folio in the specified slot from a folio queue
- * segment.  Note that no bounds check is made and if the slot hasn't been
- * added into yet, the order returned will be 0.
- */
-static inline unsigned int folioq_folio_order(const struct folio_queue *fo=
lioq, unsigned int slot)
-{
-	return folioq->orders[slot];
-}
-
-/**
- * folioq_folio_size: Get the size of a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the size of the folio in the specified slot from a folio queue
- * segment.  Note that no bounds check is made and if the slot hasn't been
- * added into yet, the size returned will be PAGE_SIZE.
- */
-static inline size_t folioq_folio_size(const struct folio_queue *folioq, u=
nsigned int slot)
-{
-	return PAGE_SIZE << folioq_folio_order(folioq, slot);
-}
-
-/**
- * folioq_clear: Clear a folio from a folio queue segment
- * @folioq: The segment to clear
- * @slot: The folio slot to clear
- *
- * Clear a folio from a sequence in a folio queue segment and clear its ma=
rks.
- * The occupancy count is left unchanged.
- */
-static inline void folioq_clear(struct folio_queue *folioq, unsigned int s=
lot)
-{
-	folioq->vec.folios[slot] =3D NULL;
-	folioq_unmark(folioq, slot);
-	folioq_unmark2(folioq, slot);
-}
-
-#endif /* _LINUX_FOLIO_QUEUE_H */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index d0b1408bd02f..7dca6a513509 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -464,8 +464,6 @@ void netfs_put_subrequest(struct netfs_io_subrequest *s=
ubreq,
 ssize_t netfs_extract_iter(struct iov_iter *orig, size_t max_len, size_t m=
ax_pages,
 			   unsigned long long fpos, struct bvecq **_bvecq_head,
 			   iov_iter_extraction_t extraction_flags);
-size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
-			size_t max_size, size_t max_segs);
 void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);
 void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_e=
rror);
=20
diff --git a/include/linux/rolling_buffer.h b/include/linux/rolling_buffer.h
deleted file mode 100644
index b35ef43f325f..000000000000
--- a/include/linux/rolling_buffer.h
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-/* Rolling buffer of folios
- *
- * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- */
-
-#ifndef _ROLLING_BUFFER_H
-#define _ROLLING_BUFFER_H
-
-#include <linux/folio_queue.h>
-#include <linux/uio.h>
-
-/*
- * Rolling buffer.  Whilst the buffer is live and in use, folios and folio
- * queue segments can be added to one end by one thread and removed from t=
he
- * other end by another thread.  The buffer isn't allowed to be empty; it =
must
- * always have at least one folio_queue in it so that neither side has to
- * modify both queue pointers.
- *
- * The iterator in the buffer is extended as buffers are inserted.  It can=
 be
- * snapshotted to use a segment of the buffer.
- */
-struct rolling_buffer {
-	struct folio_queue	*head;		/* Producer's insertion point */
-	struct folio_queue	*tail;		/* Consumer's removal point */
-	struct iov_iter		iter;		/* Iterator tracking what's left in the buffer */
-	u8			next_head_slot;	/* Next slot in ->head */
-	u8			first_tail_slot; /* First slot in ->tail */
-};
-
-/*
- * Snapshot of a rolling buffer.
- */
-struct rolling_buffer_snapshot {
-	struct folio_queue	*curr_folioq;	/* Queue segment in which current folio =
resides */
-	unsigned char		curr_slot;	/* Folio currently being read */
-	unsigned char		curr_order;	/* Order of folio */
-};
-
-/* Marks to store per-folio in the internal folio_queue structs. */
-#define ROLLBUF_MARK_1	BIT(0)
-#define ROLLBUF_MARK_2	BIT(1)
-
-int rolling_buffer_init(struct rolling_buffer *roll, unsigned int rreq_id,
-			unsigned int direction);
-int rolling_buffer_make_space(struct rolling_buffer *roll);
-ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
-				    struct readahead_control *ractl,
-				    struct folio_batch *put_batch);
-ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
-					 struct readahead_control *ractl,
-					 unsigned int rreq_id);
-ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *f=
olio,
-			      unsigned int flags);
-struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *rol=
l);
-void rolling_buffer_clear(struct rolling_buffer *roll);
-
-static inline void rolling_buffer_advance(struct rolling_buffer *roll, siz=
e_t amount)
-{
-	iov_iter_advance(&roll->iter, amount);
-}
-
-#endif /* _ROLLING_BUFFER_H */
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index a62d78581207..dface7d06b98 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6734,8 +6734,6 @@ static const struct bpf_raw_tp_null_args raw_tp_null_=
args[] =3D {
 	/* amdgpu */
 	{ "amdgpu_vm_bo_map", 0x1 },
 	{ "amdgpu_vm_bo_unmap", 0x1 },
-	/* netfs */
-	{ "netfs_folioq", 0x1 },
 	/* xfs from xfs_defer_pending_class */
 	{ "xfs_defer_create_intent", 0x1 },
 	{ "xfs_defer_cancel_list", 0x1 },
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 00BAE3C0619
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:32:58 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143580; cv=none;
 b=e7mtqBEc3931Amkrb3spmtEEF0lv0VlFiCEvWxR/SBY4zD9U96C1LH2xvl+TezYKUqFstH5jYG2hFFCvJLGkGyI/KSMihahIUttUWwSKMfBjV7pcdFHwxKlwQa3BoIPd6h6emY2gU6n5qIu6AE5GhlonliueEz9vpuEooVNhCOk=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143580; c=relaxed/simple;
	bh=cjJNffIha+PwqPaKI1opJkQpu7n2ZDMd7gGUtzCzekA=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=I+TSBKOc59PpXfppwmJod91unUEQzFvHbKzpmhq5s5W7zBmtghkEZJMzjXhxOF8/blsri5O5WLoOENPRq6A1N4SQ7L8eJ8Ta8yk84UvtuZ2djuQyUXUSLCvkBAOmb/hVK8XIypFxMI6crdDSHMXwueE1kov1XxuxhlzxVMNlrWU=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=AB4tKuQe; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="AB4tKuQe"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143578;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=nmIXKYZPaEoy+v0EUgnQEN1+1FR/WvjVVNeVAn9uPnk=;
	b=AB4tKuQe1vZNy50+3Po9ijxMYlS3hFLV+hPnm2AaLIkJn4RgP/bD1rghVSdTNWPBgVvvC4
	GX/6wtd8O1+ovBIC5ru/5DuvEo+Fq0j8sxVVU5nEYegarQ8VxJQnKbGpemsq5T2yS0YxLD
	bi0b/p63L3OOB4DJK9efBAGxLn0RcmQ=
Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-402-UXEWfVQiNji3yUZDr9JE1Q-1; Mon,
 18 May 2026 18:32:54 -0400
X-MC-Unique: UXEWfVQiNji3yUZDr9JE1Q-1
X-Mimecast-MFC-AGG-ID: UXEWfVQiNji3yUZDr9JE1Q_1779143572
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id DAB141800371;
	Mon, 18 May 2026 22:32:51 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id B2FAD19560A3;
	Mon, 18 May 2026 22:32:45 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 19/21] netfs: Check for too much data being read
Date: Mon, 18 May 2026 23:29:51 +0100
Message-ID: <20260518222959.488126-20-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Content-Type: text/plain; charset="utf-8"

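Add a check to read subrequest termination to detect when more data has
been read for a subrequest than was requested.  If that happens, the
subrequest is marked as failed with -EIO, the retry flag is cleared and the
transferred count is reset to zero.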
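
As a rough userspace illustration of the check (not the kernel code
itself), the logic can be sketched as below.  The struct and the function
name are hypothetical stand-ins; the two bool fields take the place of the
NETFS_SREQ_FAILED and NETFS_SREQ_NEED_RETRY flag bits.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct netfs_io_subrequest. */
struct sketch_subreq {
	size_t len;		/* Bytes requested */
	size_t transferred;	/* Bytes the filesystem reports as read */
	int error;		/* 0 or a negative errno */
	bool failed;		/* Stands in for NETFS_SREQ_FAILED */
	bool need_retry;	/* Stands in for NETFS_SREQ_NEED_RETRY */
};

/* Reject a completion that reports more data than was asked for.
 * Returns true if the subrequest was marked as failed.
 */
bool sketch_check_overrun(struct sketch_subreq *subreq)
{
	if (subreq->transferred <= subreq->len)
		return false;

	subreq->transferred = 0;	/* Discard the reported count */
	subreq->failed = true;
	subreq->need_retry = false;	/* The patch clears the retry flag */
	subreq->error = -EIO;
	return true;
}
```

A completion reporting exactly subreq->len bytes is left untouched; only a
strictly larger count trips the check.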

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/read_collect.c      | 8 ++++++++
 include/trace/events/netfs.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 977b69ac8725..fc62eaef6107 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -542,6 +542,14 @@ void netfs_read_subreq_terminated(struct netfs_io_subr=
equest *subreq)
 		break;
 	}
=20
+	if (subreq->transferred > subreq->len) {
+		subreq->transferred =3D 0;
+		__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+		trace_netfs_sreq(subreq, netfs_sreq_trace_too_much);
+		subreq->error =3D -EIO;
+	}
+
 	/* Deal with retry requests, short reads and errors.  If we retry
 	 * but don't make progress, we abandon the attempt.
 	 */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 59f330003d02..cc29582f6245 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -134,6 +134,7 @@
 	EM(netfs_sreq_trace_submit,		"SUBMT")	\
 	EM(netfs_sreq_trace_superfluous,	"SPRFL")	\
 	EM(netfs_sreq_trace_terminated,		"TERM ")	\
+	EM(netfs_sreq_trace_too_much,		"!TOOM")	\
 	EM(netfs_sreq_trace_wait_for,		"_WAIT")	\
 	EM(netfs_sreq_trace_write,		"WRITE")	\
 	EM(netfs_sreq_trace_write_skip,		"SKIP ")	\
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 620ED3B6C1C
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:33:08 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.129.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143589; cv=none;
 b=S18CSxX/ALgkUE3dHhByEZcdFTzKMcjCY+DTKfJpfdZ+sOYnHOatWj9sDuIhxbtGTHd3uNRs1qmDQrB9i/E6qNDHMjlttlwUQLUDPv45thJsN1KDbGtv3FxFwU5bmXoJ/RDZCUYgCOgu7jF5lGNYPeiCuPW+ht1vtyMYovwDvjE=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143589; c=relaxed/simple;
	bh=bwBanNfSNsMTX/vRcxGfGZnrBW8jMtLUm4wrRhxJ10g=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=FKm19UnwmCtAzzIDv4YJ2EPzfAB/p7yBIPfN+avowr3OBJbNTa19QBRaFe6FcAcvHDZtUdTMpo8UdEdmP6r5BtLDwNVHd3ZbHLhQKYZMKRx5u1vGjN1GneHtzmNExsDVksE/MER1Yylw6arLTlWLUYTduCQ71+nYTXCcWRrvHkw=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=RXqgX8FZ; arc=none smtp.client-ip=170.10.129.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="RXqgX8FZ"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143587;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=V6J+UkGP43uUoUVJ8yoJ6tN7aQvqskLe/EDUX+Vd4Sk=;
	b=RXqgX8FZcxoJNaAkcX9H/eZExwFeRPpVi+Xc18nfjYiaiWrBO+LXpbSrYTVwZdRpLw1x1p
	K9IggMtIdlDvXBsEdEPcg2k8jzDGgAeQ5rzpio6+gkNmXc2fLe8n4XJDWCbwH7F2ixsXYE
	txF4b/IGPcfqX+NKO6LLSTRDOhTmNtM=
Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-651-4XsFDnFJOImHYuNISJXP9A-1; Mon,
 18 May 2026 18:33:02 -0400
X-MC-Unique: 4XsFDnFJOImHYuNISJXP9A-1
X-Mimecast-MFC-AGG-ID: 4XsFDnFJOImHYuNISJXP9A_1779143579
Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id A3BB3195608C;
	Mon, 18 May 2026 22:32:59 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 7F3261955D84;
	Mon, 18 May 2026 22:32:53 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 20/21] netfs: Limit the minimum trigger for progress
 reporting
Date: Mon, 18 May 2026 23:29:52 +0100
Message-ID: <20260518222959.488126-21-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17
Content-Type: text/plain; charset="utf-8"

For really big read RPC ops that span multiple folios, netfslib allows the
filesystem to give progress notifications to wake up the collector thread
to do a collection of folios that have now been fetched, even if the RPC is
still ongoing, thereby allowing the application to make progress.

The trigger for this is that at least one folio has been downloaded since
the clean point.  If, however, the folios are small, this means the
collector thread is constantly being woken up - which has a negative
performance impact on the system.

Set a minimum trigger of 256KiB or the size of the folio at the front of
the queue, whichever is larger.

Also, fix the base to be the stream collection point, not the point at
which the collector has cleaned up to (which is currently 0 until something
has been collected).
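
The threshold arithmetic can be sketched in userspace roughly as follows.
The function name, the parameter list and the 4KiB page size are
assumptions made for illustration; the real function also checks the
request origin and other state that is left out here.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096UL		/* Assumed page size */
#define SKETCH_MIN_TRIGGER	(256UL * 1024)	/* 256KiB floor from the patch */

/* Decide whether a progress report should wake the collector.
 * @start:        file position at which the subrequest begins
 * @transferred:  bytes fetched so far by the subrequest
 * @collected_to: how far collection has got on the stream
 * @front_order:  order of the folio at the front of the queue
 */
bool sketch_should_wake(unsigned long long start, size_t transferred,
			unsigned long long collected_to,
			unsigned int front_order)
{
	size_t fsize = SKETCH_PAGE_SIZE << front_order;

	/* Use the larger of the front folio size and the 256KiB floor. */
	if (fsize < SKETCH_MIN_TRIGGER)
		fsize = SKETCH_MIN_TRIGGER;

	return start + transferred >= collected_to + fsize;
}
```

With order-0 folios, a wakeup then needs 256KiB of new data past the
collection point rather than a single page.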

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/read_collect.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index fc62eaef6107..fccc6c2d891e 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -491,15 +491,15 @@ void netfs_read_collection_worker(struct work_struct =
*work)
 void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
-	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
-	size_t fsize =3D PAGE_SIZE << rreq->front_folio_order;
+	struct netfs_io_stream *stream =3D &rreq->io_streams[subreq->stream_nr];
+	size_t fsize =3D umax(PAGE_SIZE << rreq->front_folio_order, 256 * 1024);
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_progress);
=20
 	/* If we are at the head of the queue, wake up the collector,
 	 * getting a ref to it if we were the ones to do so.
 	 */
-	if (subreq->start + subreq->transferred > rreq->cleaned_to + fsize &&
+	if (subreq->start + subreq->transferred >=3D stream->collected_to + fsize=
 &&
 	    (rreq->origin =3D=3D NETFS_READAHEAD ||
 	     rreq->origin =3D=3D NETFS_READPAGE ||
 	     rreq->origin =3D=3D NETFS_READ_FOR_WRITE) &&
From nobody Mon May 25 04:33:52 2026
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2AE263A1A56
	for <linux-kernel@vger.kernel.org>; Mon, 18 May 2026 22:33:16 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=170.10.133.124
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1779143604; cv=none;
 b=AYqpcX0iB8gphHzaxs9lcNthVC/zv48NnNbOfwO4bpKx16pmNG7a0/BuflrqDYgIZAGE2yMO0HYe4EmT4icsFhe1A977kUhJzdo7JzDK/STn4AZolycamt/nJCINBfcs0SYq/MkKdyGDZ/Qza6Z0jkQrw3eMpUqM6UlcIXDzIck=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1779143604; c=relaxed/simple;
	bh=A50HcT9gCVj7fVsnQ6LP0G26t/vgFyE95N3v9mPosl8=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=a07YGTW+6cJvgZl3gt98pskdr2lfV/vvpyOqdVyS+GxXHmzRmsI5HcJTK9iHcPrtQv8eXuEvFoeDCV9gfqonG4YdAHwHMSJ8n8yvLn/BMQBCA+eZMCOA/EwkXkLP89EuOFcQIMGliqc+2hQ9f1UCuKf1siWr1CfHQ+TUle5uryY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com;
 spf=pass smtp.mailfrom=redhat.com;
 dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b=Pcjy3pNJ; arc=none smtp.client-ip=170.10.133.124
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=redhat.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com
 header.b="Pcjy3pNJ"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779143595;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=JNrJdbduRML7S8tiJgLdSgyFac1oKvXWRckJSucSlDU=;
	b=Pcjy3pNJszM7Tq7xPfXzr7T3oFLaWlpY8Bm1YAfAklL+e0GFpUJBfkZl44D4IjNob5xxnX
	efSxAiLFs1+dVYHKbpTT3ybcbspNcRCUnJqWckvAr4/hfkmSpJxDCstr4nui9yDlIWQoSX
	1FyOpuiIcDWdXZZw/Iz8sEQA1CF/41k=
Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com
 (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by
 relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3,
 cipher=TLS_AES_256_GCM_SHA384) id us-mta-316-HjZm-zv3OPunJjoO7CcpvQ-1; Mon,
 18 May 2026 18:33:11 -0400
X-MC-Unique: HjZm-zv3OPunJjoO7CcpvQ-1
X-Mimecast-MFC-AGG-ID: HjZm-zv3OPunJjoO7CcpvQ_1779143588
Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com
 (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest
 SHA256)
	(No client certificate requested)
	by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS
 id 4510C195608B;
	Mon, 18 May 2026 22:33:08 +0000 (UTC)
Received: from warthog.procyon.org.com (unknown [10.44.48.33])
	by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP
 id 3635A19560A2;
	Mon, 18 May 2026 22:33:00 +0000 (UTC)
From: David Howells <dhowells@redhat.com>
To: Christian Brauner <christian@brauner.io>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>,
	Paulo Alcantara <pc@manguebit.org>,
	Jens Axboe <axboe@kernel.dk>,
	Leon Romanovsky <leon@kernel.org>,
	Steve French <sfrench@samba.org>,
	ChenXiaoSong <chenxiaosong@chenxiaosong.com>,
	Marc Dionne <marc.dionne@auristor.com>,
	Eric Van Hensbergen <ericvh@kernel.org>,
	Dominique Martinet <asmadeus@codewreck.org>,
	Ilya Dryomov <idryomov@gmail.com>,
	Trond Myklebust <trondmy@kernel.org>,
	netfs@lists.linux.dev,
	linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org,
	linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org,
	v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org,
	linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 21/21] netfs: Combine prepare and issue ops and grab the
 buffers on request
Date: Mon, 18 May 2026 23:29:53 +0100
Message-ID: <20260518222959.488126-22-dhowells@redhat.com>
In-Reply-To: <20260518222959.488126-1-dhowells@redhat.com>
References: <20260518222959.488126-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Content-Type: text/plain; charset="utf-8"

Modify the way subrequests are generated in netfslib to try and simplify
the code.  The issue, primarily, is in writeback: the code has to create
multiple streams of write requests to disparate targets with different
properties (e.g. server and fscache), where not every folio needs to go to
every target (e.g. data just read from the server may only need writing to
the cache).

The current model in writeback, at least, is to go carefully through every
folio, preparing a subrequest for each stream when it is detected that part
of the current folio needs to go to that stream, and repeating this within
and across contiguous folios; subrequests are then issued as they become
full or hit boundaries, after first setting up the buffer.  However, this
is quite difficult to follow and makes it tricky to handle discontiguous
folios in a request.

This is changed such that netfs now accumulates buffers and attaches them
to each stream when they become valid for that stream, then flushes the
stream when a limit or a boundary is hit.  The issuing code in netfs then
loops around creating and issuing subrequests without calling a separate
prepare stage (though a function is provided to get an estimate of when
flushing should occur).  The filesystem (or cache) then gets to take a
slice of the master bvec chain as its I/O buffer for each subrequest,
including discontiguities if it can support a sparse/vectored RPC (as Ceph
can).
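
The accumulate-then-flush model might be pictured with the following
userspace sketch.  All of the names here are hypothetical and the stream
state is heavily simplified; in particular, this sketch always consumes
everything that has been accumulated, which the real code is not required
to do.

```c
#include <stddef.h>

/* Hypothetical, simplified stream state; not the netfs structures. */
struct sketch_stream {
	unsigned long long issue_from;	/* Start of data not yet issued */
	unsigned long long buffered_to;	/* End of accumulated data */
	unsigned long long issue_at;	/* Estimated position at which to flush */
	size_t max_len;			/* Largest subrequest the target accepts */
	unsigned int nr_issued;		/* Subrequests issued so far */
};

/* Ask the target when accumulated data should next be flushed. */
void sketch_estimate(struct sketch_stream *s)
{
	s->issue_at = s->issue_from + s->max_len;
}

/* Issue subrequests until the accumulated data is used up. */
void sketch_flush(struct sketch_stream *s)
{
	while (s->issue_from < s->buffered_to) {
		unsigned long long len = s->buffered_to - s->issue_from;

		if (len > s->max_len)
			len = s->max_len;
		s->issue_from += len;	/* The target takes a slice this long */
		s->nr_issued++;
	}
	sketch_estimate(s);
}

/* Attach a newly valid buffer; flush at the estimate or on a boundary. */
void sketch_add_buffer(struct sketch_stream *s, size_t len, int boundary)
{
	s->buffered_to += len;
	if (boundary || s->buffered_to >= s->issue_at)
		sketch_flush(s);
}
```

Nothing is issued while buffers are merely being attached; subrequests are
only generated once the estimate is reached or a boundary forces a flush.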

Similar changes also apply to buffered read and to unbuffered read and
write, though in each of those cases there is only a single contiguous
stream.  For buffered read, however, that stream consists of interwoven
requests from multiple sources (server or cache).

To this end, netfslib is changed in the following ways:

 (1) ->prepare_xxx(), buffer selection and ->issue_xxx() are now collapsed
     together such that one ->issue_xxx() call is made with the subrequest
     defined to the maximum extent; the filesystem/cache then reduces the
     length of the subrequest and calls back to netfslib to grab a slice of
     the buffer, which may reduce the subrequest further if a maximum
     segment limit is set.  The filesystem/cache then dispatches the
     operation.

 (2) Retry buffer tracking is added to the netfs_io_request struct.  This
     is then selected by the subrequest retry counter being non-zero.

 (3) The use of iov_iter is pushed down to the filesystem.  Netfslib now
     provides the filesystem with a bvecq holding the buffer rather than an
     iov_iter.  The bvecq can be duplicated and headers/trailers attached
     to hold protocol data, and several bvecqs can be linked together to
     create a compound operation.

 (4) If the ->issue_xxx() functions terminate with -ENOMEM, a flag is set
     on the request to abort further subrequest generation/retrying.

 (5) During writeback, netfslib now builds up an accumulation of buffered
     data before issuing writes on each stream (one server, one cache).  It
     asks each stream for an estimate of how much data to accumulate before
     it next generates subrequests on the stream.  The filesystem or cache
     is not required to use up all the data accumulated on a stream at that
     time unless the end of the pagecache is hit.

 (6) During read-gaps, in which there are gaps at both ends of a dirty
     streaming write page that need to be filled, a buffer is constructed
     consisting of the two ends plus a sink page repeated to cover the
     middle portion.  This is passed to the server as a single read.  For
     something like Ceph, this should probably be done either as a
     vectored/sparse read or as two separate reads (if different Ceph
     objects are involved).

 (7) During unbuffered/DIO read/write, there is a single contiguous file
     region to be read or written as a single stream.  The dispatching
     function just creates subrequests and calls ->issue_xxx() repeatedly
     to eat through the bufferage.

 (8) At the start of buffered read, the entire set of folios allocated by
     VM readahead is loaded into a bvecq chain, rather than trying to do it
     piecemeal as-needed.  As the pages were already added and locked by
     the VM, this is slightly more efficient than loading piecemeal as only
     a single iteration of the xarray is required.

 (9) During buffered read, there is a single contiguous file region to be
     read as a single stream; however, this stream may be stitched
     together from subrequests to multiple sources.  Which sources are used
     where is now determined by querying the cache to find the next couple
     of extents in which it has data; netfslib uses this to direct the
     subrequests towards the appropriate sources.

     Each subrequest is given the maximum length in the current extent and
     then ->issue_read() is called.  The filesystem then limits the size
     and slices off a piece of the buffer for that extent.

(10) Cachefiles now provides an estimation function that indicates the
     standard maxima for doing DIO (MAX_RW_COUNT and BIO_MAX_VECS).
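
As an illustration of the combined issue step described in (1), the
following userspace sketch clamps a subrequest to the target's limit and
then slices off part of the buffer.  Every name in it is hypothetical, and
it only ever takes whole segments, whereas the real buffer slicing also
tracks an offset within a segment.

```c
#include <stddef.h>

/* Hypothetical buffer: an array of segment lengths with a cursor. */
struct sketch_buffer {
	const size_t *seg_len;	/* Length of each segment */
	unsigned int nr_segs;	/* Number of segments */
	unsigned int slot;	/* Next unused segment */
};

/* Take up to @len bytes of whole segments, using at most @max_segs
 * segments.  Returns the number of bytes actually sliced off.
 */
size_t sketch_take_slice(struct sketch_buffer *buf, size_t len,
			 unsigned int max_segs)
{
	size_t taken = 0;

	while (buf->slot < buf->nr_segs && max_segs > 0 &&
	       taken + buf->seg_len[buf->slot] <= len) {
		taken += buf->seg_len[buf->slot++];
		max_segs--;
	}
	return taken;
}

/* Issue one subrequest: clamp to the target's limit, then slice. */
size_t sketch_issue(struct sketch_buffer *buf, size_t len, size_t max_rpc,
		    unsigned int max_segs)
{
	if (len > max_rpc)
		len = max_rpc;
	return sketch_take_slice(buf, len, max_segs);
}
```

The caller offers the maximum extent each time and learns from the return
value how much the target actually accepted, so it can loop until the
buffer is consumed.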

Note that sparse cachefiles still rely on the backing filesystem for
content mapping.  That will need to be addressed in a future patch and is
not trivial to fix.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/9p/vfs_addr.c             |  49 +-
 fs/afs/dir.c                 |  11 +-
 fs/afs/file.c                |  27 +-
 fs/afs/fsclient.c            |   8 +-
 fs/afs/internal.h            |   6 +-
 fs/afs/symlink.c             |   5 +-
 fs/afs/write.c               |  32 +-
 fs/afs/yfsclient.c           |   6 +-
 fs/cachefiles/io.c           | 255 ++++++----
 fs/ceph/Kconfig              |   1 +
 fs/ceph/addr.c               | 119 ++---
 fs/netfs/Kconfig             |   3 +
 fs/netfs/Makefile            |   2 +-
 fs/netfs/buffered_read.c     | 237 +++++-----
 fs/netfs/buffered_write.c    |  27 +-
 fs/netfs/direct_read.c       |  78 +--
 fs/netfs/direct_write.c      | 127 +++--
 fs/netfs/fscache_io.c        |   6 -
 fs/netfs/internal.h          | 104 +++-
 fs/netfs/iterator.c          |   4 +-
 fs/netfs/misc.c              |  35 +-
 fs/netfs/objects.c           |   7 +-
 fs/netfs/read_collect.c      |  43 +-
 fs/netfs/read_pgpriv2.c      | 113 +++--
 fs/netfs/read_retry.c        | 199 ++++----
 fs/netfs/read_single.c       | 150 +++---
 fs/netfs/write_collect.c     |  58 ++-
 fs/netfs/write_issue.c       | 887 +++++++++++++++++++++--------------
 fs/netfs/write_retry.c       | 136 +++---
 fs/nfs/Kconfig               |   1 +
 fs/nfs/fscache.c             |  23 +-
 fs/smb/client/cifssmb.c      |  13 +-
 fs/smb/client/file.c         | 137 +++---
 fs/smb/client/smb2ops.c      |   9 +-
 fs/smb/client/smb2pdu.c      |  28 +-
 fs/smb/client/transport.c    |  15 +-
 include/linux/netfs.h        |  90 ++--
 include/trace/events/netfs.h |  51 +-
 net/9p/client.c              |   8 +-
 39 files changed, 1840 insertions(+), 1270 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index c21d33830f5f..e2f67853d74d 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -48,32 +48,71 @@ static void v9fs_begin_writeback(struct netfs_io_reques=
t *wreq)
 	wreq->io_streams[0].avail =3D true;
 }
=20
+/*
+ * Estimate how much data should be accumulated before we start issuing
+ * write subrequests.
+ */
+static int v9fs_estimate_write(struct netfs_io_request *wreq,
+			       struct netfs_io_stream *stream,
+			       struct netfs_write_estimate *estimate)
+{
+	struct p9_fid *fid =3D wreq->netfs_priv;
+	unsigned long long limit =3D ULLONG_MAX - stream->issue_from;
+	unsigned long long max_len =3D fid->clnt->msize - P9_IOHDRSZ;
+
+	estimate->issue_at =3D stream->issue_from + umin(max_len, limit);
+	return 0;
+}
+
 /*
  * Issue a subrequest to write to the server.
  */
 static void v9fs_issue_write(struct netfs_io_subrequest *subreq)
 {
+	struct iov_iter iter;
 	struct p9_fid *fid =3D subreq->rreq->netfs_priv;
 	int err, len;
=20
-	len =3D p9_client_write(fid, subreq->start, &subreq->io_iter, &err);
+	subreq->len =3D umin(subreq->len, fid->clnt->msize - P9_IOHDRSZ);
+
+	err =3D netfs_prepare_write_buffer(subreq, INT_MAX);
+	if (err < 0)
+		goto term;
+
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
+	len =3D p9_client_write(fid, subreq->start, &iter, &err);
 	if (len > 0)
 		__set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
-	netfs_write_subrequest_terminated(subreq, len ?: err);
+
+term:
+	return netfs_write_subrequest_terminated(subreq, len ?: err);
 }
=20
 /**
  * v9fs_issue_read - Issue a read from 9P
  * @subreq: The read to make
+ * @rctx: Read generation context
  */
 static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
+	struct iov_iter iter;
 	struct p9_fid *fid =3D rreq->netfs_priv;
 	unsigned long long pos =3D subreq->start + subreq->transferred;
 	int total, err;
=20
-	total =3D p9_client_read(fid, pos, &subreq->io_iter, &err);
+	err =3D netfs_prepare_read_buffer(subreq, INT_MAX);
+	if (err < 0)
+		goto term;
+
+	iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
+	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+
+	total =3D p9_client_read(fid, pos, &iter, &err);
=20
 	/* if we just extended the file size, any portion not in
 	 * cache won't be on server and is zeroes */
@@ -87,8 +126,9 @@ static void v9fs_issue_read(struct netfs_io_subrequest *=
subreq)
 		__set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 	}
=20
+term:
 	subreq->error =3D err;
-	netfs_read_subreq_terminated(subreq);
+	return netfs_read_subreq_terminated(subreq);
 }
=20
 /**
@@ -154,6 +194,7 @@ const struct netfs_request_ops v9fs_req_ops =3D {
 	.free_request		=3D v9fs_free_request,
 	.issue_read		=3D v9fs_issue_read,
 	.begin_writeback	=3D v9fs_begin_writeback,
+	.estimate_write		=3D v9fs_estimate_write,
 	.issue_write		=3D v9fs_issue_write,
 };
=20
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 774d86bf878e..4a2e4c10ba21 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -248,8 +248,8 @@ static ssize_t afs_do_read_single(struct afs_vnode *dvn=
ode, struct file *file)
 	if (dvnode->directory_size < i_size) {
 		size_t cur_size =3D dvnode->directory_size;
=20
-		ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size, i_size,
-					  GFP_KERNEL);
+		ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size,
+					  round_up(i_size, PAGE_SIZE), GFP_KERNEL);
 		dvnode->directory_size =3D cur_size;
 		if (ret < 0)
 			return ret;
@@ -2217,11 +2217,10 @@ static int afs_dir_writepages(struct address_space =
*mapping,
 	}
=20
 	if (test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) {
+		size_t len =3D i_size_read(&dvnode->netfs.inode);
 		iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0,
-				    i_size_read(&dvnode->netfs.inode));
-		ret =3D netfs_writeback_single(mapping, wbc, &iter);
-		if (ret =3D=3D 1)
-			ret =3D 0; /* Skipped write due to lock conflict. */
+				    round_up(len, PAGE_SIZE));
+		ret =3D netfs_writeback_single(mapping, wbc, &iter, len);
 	}
=20
 	up_read(&dvnode->validate_lock);
diff --git a/fs/afs/file.c b/fs/afs/file.c
index 67f38e99ada7..d2e75f044a7a 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -337,6 +337,7 @@ static void afs_issue_read(struct netfs_io_subrequest *=
subreq)
 	struct afs_operation *op;
 	struct afs_vnode *vnode =3D AFS_FS_I(subreq->rreq->inode);
 	struct key *key =3D subreq->rreq->netfs_priv;
+	int ret;
=20
 	_enter("%s{%llx:%llu.%u},%x,,,",
 	       vnode->volume->name,
@@ -345,11 +346,14 @@ static void afs_issue_read(struct netfs_io_subrequest=
 *subreq)
 	       vnode->fid.unique,
 	       key_serial(key));
=20
+	ret =3D netfs_prepare_read_buffer(subreq, INT_MAX);
+	if (ret < 0)
+		goto failed;
+
 	op =3D afs_alloc_operation(key, vnode->volume);
 	if (IS_ERR(op)) {
-		subreq->error =3D PTR_ERR(op);
-		netfs_read_subreq_terminated(subreq);
-		return;
+		ret =3D PTR_ERR(op);
+		goto failed;
 	}
=20
 	afs_op_set_vnode(op, 0, vnode);
@@ -364,20 +368,21 @@ static void afs_issue_read(struct netfs_io_subrequest=
 *subreq)
 		op->flags |=3D AFS_OPERATION_ASYNC;
=20
 		if (!afs_begin_vnode_operation(op)) {
-			subreq->error =3D afs_put_operation(op);
-			netfs_read_subreq_terminated(subreq);
-			return;
+			ret =3D afs_put_operation(op);
+			goto failed;
 		}
=20
-		if (!afs_select_fileserver(op)) {
-			afs_end_read(op);
-			return;
-		}
+		if (!afs_select_fileserver(op))
+			afs_end_read(op); /* Error recorded here. */
=20
 		afs_issue_read_call(op);
 	} else {
 		afs_do_sync_operation(op);
 	}
+	return;
+failed:
+	subreq->error =3D ret;
+	return netfs_read_subreq_terminated(subreq);
 }
=20
 static int afs_init_request(struct netfs_io_request *rreq, struct file *fi=
le)
@@ -470,7 +475,7 @@ const struct netfs_request_ops afs_req_ops =3D {
 	.update_i_size		=3D afs_update_i_size,
 	.invalidate_cache	=3D afs_netfs_invalidate_cache,
 	.begin_writeback	=3D afs_begin_writeback,
-	.prepare_write		=3D afs_prepare_write,
+	.estimate_write		=3D afs_estimate_write,
 	.issue_write		=3D afs_issue_write,
 	.retry_request		=3D afs_retry_request,
 };
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index a2ffd60889f8..c332b733d7a7 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -339,7 +339,9 @@ static int afs_deliver_fs_fetch_data(struct afs_call *c=
all)
 		if (call->remaining =3D=3D 0)
 			goto no_more_data;
=20
-		call->iter =3D &subreq->io_iter;
+		iov_iter_bvec_queue(&call->def_iter, ITER_DEST, subreq->content.bvecq,
+				    subreq->content.slot, subreq->content.offset, subreq->len);
+
 		call->iov_len =3D umin(call->remaining, subreq->len - subreq->transferre=
d);
 		call->unmarshall++;
 		fallthrough;
@@ -1085,7 +1087,7 @@ static void afs_fs_store_data64(struct afs_operation =
*op)
 	if (!call)
 		return afs_op_nomem(op);
=20
-	call->write_iter =3D op->store.write_iter;
+	call->write_iter =3D &op->store.write_iter;
=20
 	/* marshall the parameters */
 	bp =3D call->request;
@@ -1139,7 +1141,7 @@ void afs_fs_store_data(struct afs_operation *op)
 	if (!call)
 		return afs_op_nomem(op);
=20
-	call->write_iter =3D op->store.write_iter;
+	call->write_iter =3D &op->store.write_iter;
=20
 	/* marshall the parameters */
 	bp =3D call->request;
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index d2641efc756f..f20126000524 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -915,7 +915,7 @@ struct afs_operation {
 			afs_lock_type_t type;
 		} lock;
 		struct {
-			struct iov_iter	*write_iter;
+			struct iov_iter	write_iter;
 			loff_t	pos;
 			loff_t	size;
 			loff_t	i_size;
@@ -1698,7 +1698,9 @@ extern int afs_check_volume_status(struct afs_volume =
*, struct afs_operation *);
 /*
  * write.c
  */
-void afs_prepare_write(struct netfs_io_subrequest *subreq);
+int afs_estimate_write(struct netfs_io_request *wreq,
+		       struct netfs_io_stream *stream,
+		       struct netfs_write_estimate *estimate);
 void afs_issue_write(struct netfs_io_subrequest *subreq);
 void afs_begin_writeback(struct netfs_io_request *wreq);
 void afs_retry_request(struct netfs_io_request *wreq, struct netfs_io_stre=
am *stream);
diff --git a/fs/afs/symlink.c b/fs/afs/symlink.c
index 6709b119e8a0..2c0791c32609 100644
--- a/fs/afs/symlink.c
+++ b/fs/afs/symlink.c
@@ -243,9 +243,10 @@ int afs_symlink_writepages(struct address_space *mappi=
ng,
=20
 	if (vnode->directory &&
 	    atomic64_read(&vnode->cb_expires_at) !=3D AFS_NO_CB_PROMISE) {
+		size_t len =3D i_size_read(&vnode->netfs.inode);
 		iov_iter_bvec_queue(&iter, ITER_SOURCE, vnode->directory, 0, 0,
-				    i_size_read(&vnode->netfs.inode));
-		ret =3D netfs_writeback_single(mapping, wbc, &iter);
+				    round_up(len, PAGE_SIZE));
+		ret =3D netfs_writeback_single(mapping, wbc, &iter, len);
 	}
=20
 	if (ret =3D=3D 0) {
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 7f34b939706a..8b6053ebc2b3 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -83,17 +83,20 @@ static const struct afs_operation_ops afs_store_data_op=
eration =3D {
 };
=20
 /*
- * Prepare a subrequest to write to the server.  This sets the max_len
- * parameter.
+ * Estimate the maximum size of a write we can send to the server.
  */
-void afs_prepare_write(struct netfs_io_subrequest *subreq)
+int afs_estimate_write(struct netfs_io_request *wreq,
+		       struct netfs_io_stream *stream,
+		       struct netfs_write_estimate *estimate)
 {
-	struct netfs_io_stream *stream =3D &subreq->rreq->io_streams[subreq->stre=
am_nr];
+	unsigned long long limit =3D ULLONG_MAX - stream->issue_from;
+	unsigned long long max_len =3D 256 * 1024 * 1024;
=20
 	//if (test_bit(NETFS_SREQ_RETRYING, &subreq->flags))
-	//	subreq->max_len =3D 512 * 1024;
-	//else
-	stream->sreq_max_len =3D 256 * 1024 * 1024;
+	//	max_len =3D 512 * 1024;
+
+	estimate->issue_at =3D stream->issue_from + umin(max_len, limit);
+	return 0;
 }
=20
 /*
@@ -139,12 +142,15 @@ static void afs_issue_write_worker(struct work_struct=
 *work)
 	op->flags		|=3D AFS_OPERATION_UNINTR;
 	op->ops			=3D &afs_store_data_operation;
=20
+	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 	afs_begin_vnode_operation(op);
=20
-	op->store.write_iter	=3D &subreq->io_iter;
 	op->store.i_size	=3D umax(pos + len, netfs_read_remote_i_size(&vnode->net=
fs.inode));
 	op->mtime		=3D inode_get_mtime(&vnode->netfs.inode);
=20
+	iov_iter_bvec_queue(&op->store.write_iter, ITER_SOURCE, subreq->content.b=
vecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
 	afs_wait_for_operation(op);
 	ret =3D afs_put_operation(op);
 	switch (ret) {
@@ -170,6 +176,14 @@ static void afs_issue_write_worker(struct work_struct =
*work)
=20
 void afs_issue_write(struct netfs_io_subrequest *subreq)
 {
+	int ret;
+
+	if (subreq->len > 256 * 1024 * 1024)
+		subreq->len =3D 256 * 1024 * 1024;
+	ret =3D netfs_prepare_write_buffer(subreq, INT_MAX);
+	if (ret < 0)
+		return netfs_write_subrequest_terminated(subreq, ret);
+
 	subreq->work.func =3D afs_issue_write_worker;
 	if (!queue_work(system_dfl_wq, &subreq->work))
 		WARN_ON_ONCE(1);
@@ -183,6 +197,8 @@ void afs_begin_writeback(struct netfs_io_request *wreq)
 {
 	if (S_ISREG(wreq->inode->i_mode))
 		afs_get_writeback_key(wreq);
+
+	wreq->io_streams[0].avail =3D true;
 }
=20
 /*
diff --git a/fs/afs/yfsclient.c b/fs/afs/yfsclient.c
index d941179730a9..52c588092050 100644
--- a/fs/afs/yfsclient.c
+++ b/fs/afs/yfsclient.c
@@ -385,7 +385,9 @@ static int yfs_deliver_fs_fetch_data64(struct afs_call =
*call)
 		if (call->remaining =3D=3D 0)
 			goto no_more_data;
=20
-		call->iter =3D &subreq->io_iter;
+		iov_iter_bvec_queue(&call->def_iter, ITER_DEST, subreq->content.bvecq,
+				    subreq->content.slot, subreq->content.offset, subreq->len);
+
 		call->iov_len =3D min(call->remaining, subreq->len - subreq->transferred=
);
 		call->unmarshall++;
 		fallthrough;
@@ -1357,7 +1359,7 @@ void yfs_fs_store_data(struct afs_operation *op)
 	if (!call)
 		return afs_op_nomem(op);
=20
-	call->write_iter =3D op->store.write_iter;
+	call->write_iter =3D &op->store.write_iter;
=20
 	/* marshall the parameters */
 	bp =3D call->request;
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index eebebda46a09..8256d7d66da3 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -26,7 +26,10 @@ struct cachefiles_kiocb {
 	};
 	struct cachefiles_object *object;
 	netfs_io_terminated_t	term_func;
-	void			*term_func_priv;
+	union {
+		struct netfs_io_subrequest *subreq;
+		void			*term_func_priv;
+	};
 	bool			was_async;
 	unsigned int		inval_counter;	/* Copy of cookie->inval_counter */
 	u64			b_writing;
@@ -194,12 +197,133 @@ static int cachefiles_read(struct netfs_cache_resour=
ces *cres,
 	return ret;
 }
=20
+/*
+ * Handle completion of a read from the cache issued by netfslib.
+ */
+static void cachefiles_issue_read_complete(struct kiocb *iocb, long ret)
+{
+	struct cachefiles_kiocb *ki =3D container_of(iocb, struct cachefiles_kioc=
b, iocb);
+	struct netfs_io_subrequest *subreq =3D ki->subreq;
+	struct inode *inode =3D file_inode(ki->iocb.ki_filp);
+
+	_enter("%ld", ret);
+
+	if (ret < 0) {
+		subreq->error =3D -ESTALE;
+		trace_cachefiles_io_error(ki->object, inode, ret,
+					  cachefiles_trace_read_error);
+	}
+
+	if (ret >=3D 0) {
+		if (ki->object->cookie->inval_counter =3D=3D ki->inval_counter) {
+			subreq->error =3D 0;
+			if (ret > 0) {
+				subreq->transferred +=3D ret;
+				__set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+			}
+		} else {
+			subreq->error =3D -ESTALE;
+		}
+	}
+
+	netfs_read_subreq_terminated(subreq);
+	cachefiles_put_kiocb(ki);
+}
+
+/*
+ * Issue a read operation to the cache.
+ */
+static void cachefiles_issue_read(struct netfs_io_subrequest *subreq)
+{
+	struct netfs_cache_resources *cres =3D &subreq->rreq->cache_resources;
+	struct cachefiles_object *object;
+	struct cachefiles_kiocb *ki;
+	struct iov_iter iter;
+	struct file *file;
+	unsigned int old_nofs;
+	ssize_t ret =3D -ENOBUFS;
+
+	if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ))
+		goto failed;
+
+	fscache_count_read();
+	object =3D cachefiles_cres_object(cres);
+	file =3D cachefiles_cres_file(cres);
+
+	_enter("%pD,%lli,%llx,%zx/%llx",
+	       file, file_inode(file)->i_ino, subreq->start, subreq->len,
+	       i_size_read(file_inode(file)));
+
+	if (subreq->len > MAX_RW_COUNT)
+		subreq->len =3D MAX_RW_COUNT;
+
+	ret =3D netfs_prepare_read_buffer(subreq, BIO_MAX_VECS);
+	if (ret < 0)
+		goto failed;
+
+	iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
+	ret =3D -ENOMEM;
+	ki =3D kzalloc_obj(struct cachefiles_kiocb);
+	if (!ki)
+		goto failed;
+
+	refcount_set(&ki->ki_refcnt, 2);
+	ki->iocb.ki_filp	=3D file;
+	ki->iocb.ki_pos		=3D subreq->start;
+	ki->iocb.ki_flags	=3D IOCB_DIRECT;
+	ki->iocb.ki_ioprio	=3D get_current_ioprio();
+	ki->iocb.ki_complete	=3D cachefiles_issue_read_complete;
+	ki->object		=3D object;
+	ki->inval_counter	=3D cres->inval_counter;
+	ki->subreq		=3D subreq;
+	ki->was_async		=3D true;
+
+	get_file(ki->iocb.ki_filp);
+	cachefiles_grab_object(object, cachefiles_obj_get_ioreq);
+
+	trace_cachefiles_read(object, file_inode(file), ki->iocb.ki_pos, subreq->=
len);
+	old_nofs =3D memalloc_nofs_save();
+	ret =3D cachefiles_inject_read_error();
+	if (ret =3D=3D 0)
+		ret =3D vfs_iocb_iter_read(file, &ki->iocb, &iter);
+	memalloc_nofs_restore(old_nofs);
+
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/* There's no easy way to restart the syscall since other AIOs
+		 * may already be running. Just fail this IO with EINTR.
+		 */
+		ret =3D -EINTR;
+		fallthrough;
+	default:
+		ki->was_async =3D false;
+		cachefiles_issue_read_complete(&ki->iocb, ret);
+		break;
+	}
+
+	cachefiles_put_kiocb(ki);
+	_leave(" =3D %zd", ret);
+	return;
+failed:
+	subreq->error =3D ret;
+	return netfs_read_subreq_terminated(subreq);
+}
+
 /*
  * Query the occupancy of the cache in a region, returning the extent of t=
he
- * next two chunks of cached data and the next hole.
+ * next two chunks of cached data and the next hole.  The occupancy map is
+ * preloaded to show just one giant hole.
  */
-static int cachefiles_query_occupancy(struct netfs_cache_resources *cres,
-				      struct fscache_occupancy *occ)
+static void cachefiles_query_occupancy(struct netfs_cache_resources *cres,
+				       struct fscache_occupancy *occ)
 {
 	struct cachefiles_object *object;
 	struct inode *inode;
@@ -209,7 +333,7 @@ static int cachefiles_query_occupancy(struct netfs_cach=
e_resources *cres,
 	int i;
=20
 	if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ))
-		return -ENOBUFS;
+		return;
=20
 	object =3D cachefiles_cres_object(cres);
 	file =3D cachefiles_cres_file(cres);
@@ -244,7 +368,7 @@ static int cachefiles_query_occupancy(struct netfs_cach=
e_resources *cres,
 			ret =3D vfs_llseek(file, occ->query_from, SEEK_DATA);
 		if (IS_ERR_VALUE_LL(ret)) {
 			if (ret !=3D -ENXIO)
-				return ret;
+				goto done;
 			occ->query_from =3D ULLONG_MAX;
 			goto done;
 		}
@@ -257,7 +381,7 @@ static int cachefiles_query_occupancy(struct netfs_cach=
e_resources *cres,
 			ret =3D vfs_llseek(file, occ->query_from, SEEK_HOLE);
 		if (IS_ERR_VALUE_LL(ret)) {
 			if (ret !=3D -ENXIO)
-				return ret;
+				goto done;
 			occ->query_from =3D ULLONG_MAX;
 			goto done;
 		}
@@ -270,7 +394,6 @@ static int cachefiles_query_occupancy(struct netfs_cach=
e_resources *cres,
 done:
 	_debug("query[0] %llx-%llx", occ->cached_from[0], occ->cached_to[0]);
 	_debug("query[1] %llx-%llx", occ->cached_from[1], occ->cached_to[1]);
-	return 0;
 }
=20
 /*
@@ -610,47 +733,13 @@ int __cachefiles_prepare_write(struct cachefiles_obje=
ct *object,
 				    cachefiles_has_space_for_write);
 }
=20
-static int cachefiles_prepare_write(struct netfs_cache_resources *cres,
-				    loff_t *_start, size_t *_len, size_t upper_len,
-				    loff_t i_size, bool no_space_allocated_yet)
+static int cachefiles_estimate_write(struct netfs_io_request *wreq,
+				     struct netfs_io_stream *stream,
+				     struct netfs_write_estimate *estimate)
 {
-	struct cachefiles_object *object =3D cachefiles_cres_object(cres);
-	struct cachefiles_cache *cache =3D object->volume->cache;
-	const struct cred *saved_cred;
-	int ret;
-
-	if (!cachefiles_cres_file(cres)) {
-		if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE))
-			return -ENOBUFS;
-		if (!cachefiles_cres_file(cres))
-			return -ENOBUFS;
-	}
-
-	cachefiles_begin_secure(cache, &saved_cred);
-	ret =3D __cachefiles_prepare_write(object, cachefiles_cres_file(cres),
-					 _start, _len, upper_len,
-					 no_space_allocated_yet);
-	cachefiles_end_secure(cache, saved_cred);
-	return ret;
-}
-
-static void cachefiles_prepare_write_subreq(struct netfs_io_subrequest *su=
breq)
-{
-	struct netfs_io_request *wreq =3D subreq->rreq;
-	struct netfs_cache_resources *cres =3D &wreq->cache_resources;
-	struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr];
-
-	_enter("W=3D%x[%x] %llx", wreq->debug_id, subreq->debug_index, subreq->st=
art);
-
-	stream->sreq_max_len =3D MAX_RW_COUNT;
-	stream->sreq_max_segs =3D BIO_MAX_VECS;
-
-	if (!cachefiles_cres_file(cres)) {
-		if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE))
-			return netfs_prepare_write_failed(subreq);
-		if (!cachefiles_cres_file(cres))
-			return netfs_prepare_write_failed(subreq);
-	}
+	estimate->issue_at =3D stream->issue_from + MAX_RW_COUNT;
+	estimate->max_segs =3D BIO_MAX_VECS;
+	return 0;
 }
=20
 static void cachefiles_issue_write(struct netfs_io_subrequest *subreq)
@@ -659,55 +748,55 @@ static void cachefiles_issue_write(struct netfs_io_su=
brequest *subreq)
 	struct netfs_cache_resources *cres =3D &wreq->cache_resources;
 	struct cachefiles_object *object =3D cachefiles_cres_object(cres);
 	struct cachefiles_cache *cache =3D object->volume->cache;
+	struct iov_iter iter;
 	const struct cred *saved_cred;
-	size_t off, pre, post, len =3D subreq->len;
 	loff_t start =3D subreq->start;
-	int ret;
+	size_t len =3D subreq->len;
+	int ret =3D -EINVAL;
=20
 	_enter("W=3D%x[%x] %llx-%llx",
 	       wreq->debug_id, subreq->debug_index, start, start + len - 1);
=20
-	/* We need to start on the cache granularity boundary */
-	off =3D start & (cache->bsize - 1);
-	if (off) {
-		pre =3D cache->bsize - off;
-		if (pre >=3D len) {
-			fscache_count_dio_misfit();
-			netfs_write_subrequest_terminated(subreq, len);
-			return;
-		}
-		subreq->transferred +=3D pre;
-		start +=3D pre;
-		len -=3D pre;
-		iov_iter_advance(&subreq->io_iter, pre);
-	}
-
-	/* We also need to end on the cache granularity boundary */
-	post =3D len & (cache->bsize - 1);
-	if (post) {
-		len -=3D post;
-		if (len =3D=3D 0) {
-			fscache_count_dio_misfit();
-			netfs_write_subrequest_terminated(subreq, post);
-			return;
-		}
-		iov_iter_truncate(&subreq->io_iter, len);
+	if (!cachefiles_cres_file(cres)) {
+		if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE))
+			goto failed;
+		if (!cachefiles_cres_file(cres))
+			goto failed;
+	}
+
+	ret =3D netfs_prepare_write_buffer(subreq, BIO_MAX_VECS);
+	if (ret < 0)
+		goto failed;
+
+	/* The buffer extraction func may round out start and end. */
+	start =3D subreq->start;
+	len =3D subreq->len;
+
+	/* We need to start and end on cache granularity boundaries. */
+	if (WARN_ON_ONCE(start & (cache->bsize - 1)) ||
+	    WARN_ON_ONCE(len   & (cache->bsize - 1))) {
+		fscache_count_dio_misfit();
+		ret =3D -EIO;
+		goto failed;
 	}
=20
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, len);
+
 	trace_netfs_sreq(subreq, netfs_sreq_trace_cache_prepare);
 	cachefiles_begin_secure(cache, &saved_cred);
 	ret =3D __cachefiles_prepare_write(object, cachefiles_cres_file(cres),
 					 &start, &len, len, true);
 	cachefiles_end_secure(cache, saved_cred);
-	if (ret < 0) {
-		netfs_write_subrequest_terminated(subreq, ret);
-		return;
-	}
+	if (ret < 0)
+		goto failed;
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_cache_write);
-	cachefiles_write(&subreq->rreq->cache_resources,
-			 subreq->start, &subreq->io_iter,
+	cachefiles_write(&subreq->rreq->cache_resources, subreq->start, &iter,
 			 netfs_write_subrequest_terminated, subreq);
+	return;
+failed:
+	return netfs_write_subrequest_terminated(subreq, ret);
 }
=20
 /*
@@ -881,9 +970,9 @@ static const struct netfs_cache_ops cachefiles_netfs_ca=
che_ops =3D {
 	.end_operation		=3D cachefiles_end_operation,
 	.read			=3D cachefiles_read,
 	.write			=3D cachefiles_write,
+	.issue_read		=3D cachefiles_issue_read,
 	.issue_write		=3D cachefiles_issue_write,
-	.prepare_write		=3D cachefiles_prepare_write,
-	.prepare_write_subreq	=3D cachefiles_prepare_write_subreq,
+	.estimate_write		=3D cachefiles_estimate_write,
 	.prepare_ondemand_read	=3D cachefiles_prepare_ondemand_read,
 	.query_occupancy	=3D cachefiles_query_occupancy,
 	.collect_write		=3D cachefiles_collect_write,
diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig
index 3d64a316ca31..aa6ccd7794d2 100644
--- a/fs/ceph/Kconfig
+++ b/fs/ceph/Kconfig
@@ -4,6 +4,7 @@ config CEPH_FS
 	depends on INET
 	select CEPH_LIB
 	select NETFS_SUPPORT
+	select NETFS_PGPRIV2
 	select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
 	default n
 	help
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 0a86f672cc09..9f22d0a894a2 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -274,7 +274,7 @@ static void finish_netfs_read(struct ceph_osd_request *=
req)
 	ceph_dec_osd_stopping_blocker(fsc->mdsc);
 }
=20
-static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq)
+static int ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
 	struct inode *inode =3D rreq->inode;
@@ -283,7 +283,8 @@ static bool ceph_netfs_issue_op_inline(struct netfs_io_=
subrequest *subreq)
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb);
 	struct ceph_inode_info *ci =3D ceph_inode(inode);
-	ssize_t err =3D 0;
+	struct iov_iter iter;
+	ssize_t err;
 	size_t len;
 	int mode;
=20
@@ -292,21 +293,32 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i=
o_subrequest *subreq)
 		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
 	__clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
=20
-	if (subreq->start >=3D inode->i_size)
+	if (subreq->start >=3D inode->i_size) {
+		__set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
+		err =3D 0;
 		goto out;
+	}
+
+	err =3D netfs_prepare_read_buffer(subreq, INT_MAX);
+	if (err < 0)
+		return err;
+
+	iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset,
+			    subreq->len);
=20
 	/* We need to fetch the inline data. */
 	mode =3D ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA);
 	req =3D ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode);
-	if (IS_ERR(req)) {
-		err =3D PTR_ERR(req);
-		goto out;
-	}
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
 	req->r_ino1 =3D ci->i_vino;
 	req->r_args.getattr.mask =3D cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA);
 	req->r_num_caps =3D 2;
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+
 	err =3D ceph_mdsc_do_request(mdsc, NULL, req);
 	if (err < 0)
 		goto out;
@@ -316,11 +328,11 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i=
o_subrequest *subreq)
 	if (iinfo->inline_version =3D=3D CEPH_INLINE_NONE) {
 		/* The data got uninlined */
 		ceph_mdsc_put_request(req);
-		return false;
+		return 1;
 	}
=20
 	len =3D min_t(size_t, iinfo->inline_len - subreq->start, subreq->len);
-	err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io=
_iter);
+	err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &iter);
 	if (err =3D=3D 0) {
 		err =3D -EFAULT;
 	} else {
@@ -333,23 +345,7 @@ static bool ceph_netfs_issue_op_inline(struct netfs_io=
_subrequest *subreq)
 	subreq->error =3D err;
 	trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
 	netfs_read_subreq_terminated(subreq);
-	return true;
-}
-
-static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq)
-{
-	struct netfs_io_request *rreq =3D subreq->rreq;
-	struct inode *inode =3D rreq->inode;
-	struct ceph_inode_info *ci =3D ceph_inode(inode);
-	struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode);
-	u64 objno, objoff;
-	u32 xlen;
-
-	/* Truncate the extent at the end of the current block */
-	ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
-				      &objno, &objoff, &xlen);
-	rreq->io_streams[0].sreq_max_len =3D umin(xlen, fsc->mount_options->rsize=
);
-	return 0;
+	return -EIOCBQUEUED;
 }
=20
 static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
@@ -361,26 +357,34 @@ static void ceph_netfs_issue_read(struct netfs_io_sub=
request *subreq)
 	struct ceph_client *cl =3D fsc->client;
 	struct ceph_osd_request *req =3D NULL;
 	struct ceph_vino vino =3D ceph_vino(inode);
-	int err;
-	u64 len;
+	struct iov_iter iter;
+	u64 objno, objoff, len, off =3D subreq->start;
+	u32 maxlen;
+	int err =3D -EIO;
 	bool sparse =3D IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREA=
D);
-	u64 off =3D subreq->start;
 	int extent_cnt;
=20
-	if (ceph_inode_is_shutdown(inode)) {
-		err =3D -EIO;
-		goto out;
+	if (ceph_inode_is_shutdown(inode))
+		goto failed_noput;
+
+	if (ceph_has_inline_data(ci)) {
+		err =3D ceph_netfs_issue_op_inline(subreq);
+		if (err !=3D 1)
+			goto failed_noput;
 	}
=20
-	if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq))
-		return;
+	/* Truncate the extent at the end of the current block */
+	ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len,
+				      &objno, &objoff, &maxlen);
+	maxlen =3D umin(maxlen, fsc->mount_options->rsize);
+	len =3D umin(subreq->len, maxlen);
+	subreq->len =3D len;
=20
 	// TODO: This rounding here is slightly dodgy.  It *should* work, for
 	// now, as the cache only deals in blocks that are a multiple of
 	// PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE.  What needs to
 	// happen is for the fscrypt driving to be moved into netfslib and the
 	// data in the cache also to be stored encrypted.
-	len =3D subreq->len;
 	ceph_fscrypt_adjust_off_and_len(inode, &off, &len);
=20
 	req =3D ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino,
@@ -389,20 +393,27 @@ static void ceph_netfs_issue_read(struct netfs_io_sub=
request *subreq)
 			ci->i_truncate_size, false);
 	if (IS_ERR(req)) {
 		err =3D PTR_ERR(req);
-		req =3D NULL;
-		goto out;
+		goto failed_noput;
 	}
=20
 	if (sparse) {
 		extent_cnt =3D __ceph_sparse_read_ext_count(inode, len);
 		err =3D ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt);
 		if (err)
-			goto out;
+			goto failed;
 	}
=20
 	doutc(cl, "%llx.%llx pos=3D%llu orig_len=3D%zu len=3D%llu\n",
 	      ceph_vinop(inode), subreq->start, subreq->len, len);
=20
+	err =3D netfs_prepare_read_buffer(subreq, INT_MAX);
+	if (err < 0)
+		goto failed;
+
+	iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset,
+			    subreq->len);
+
 	/*
 	 * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for
 	 * encrypted inodes. We'd need infrastructure that handles an iov_iter
@@ -421,13 +432,11 @@ static void ceph_netfs_issue_read(struct netfs_io_sub=
request *subreq)
 		 * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
 		 * if msg->sparse_read_total > msg->data_length.
 		 */
-		subreq->io_iter.count =3D len;
-
-		err =3D iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_o=
ff);
+		err =3D iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off);
 		if (err < 0) {
 			doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
 			      ceph_vinop(inode), err);
-			goto out;
+			goto eio;
 		}
=20
 		/* should always give us a page-aligned read */
@@ -438,12 +447,10 @@ static void ceph_netfs_issue_read(struct netfs_io_sub=
request *subreq)
 		osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
 						 false);
 	} else {
-		osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
-	}
-	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
-		err =3D -EIO;
-		goto out;
+		osd_req_op_extent_osd_iter(req, 0, &iter);
 	}
+	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc))
+		goto eio;
 	req->r_callback =3D finish_netfs_read;
 	req->r_priv =3D subreq;
 	req->r_inode =3D inode;
@@ -451,19 +458,21 @@ static void ceph_netfs_issue_read(struct netfs_io_sub=
request *subreq)
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 	ceph_osdc_start_request(req->r_osdc, req);
-out:
 	ceph_osdc_put_request(req);
-	if (err) {
-		subreq->error =3D err;
-		netfs_read_subreq_terminated(subreq);
-	}
-	doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err);
+	doutc(cl, "%llx.%llx result -EIOCBQUEUED\n", ceph_vinop(inode));
+	return;
+eio:
+	err =3D -EIO;
+failed:
+	ceph_osdc_put_request(req);
+failed_noput:
+	subreq->error =3D err;
+	return netfs_read_subreq_terminated(subreq);
 }
=20
 static int ceph_init_request(struct netfs_io_request *rreq, struct file *f=
ile)
 {
 	struct inode *inode =3D rreq->inode;
-	struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl =3D ceph_inode_to_client(inode);
 	int got =3D 0, want =3D CEPH_CAP_FILE_CACHE;
 	struct ceph_netfs_request_data *priv;
@@ -515,7 +524,6 @@ static int ceph_init_request(struct netfs_io_request *r=
req, struct file *file)
=20
 	priv->caps =3D got;
 	rreq->netfs_priv =3D priv;
-	rreq->io_streams[0].sreq_max_len =3D fsc->mount_options->rsize;
=20
 out:
 	if (ret < 0) {
@@ -543,7 +551,6 @@ static void ceph_netfs_free_request(struct netfs_io_req=
uest *rreq)
 const struct netfs_request_ops ceph_netfs_ops =3D {
 	.init_request		=3D ceph_init_request,
 	.free_request		=3D ceph_netfs_free_request,
-	.prepare_read		=3D ceph_netfs_prepare_read,
 	.issue_read		=3D ceph_netfs_issue_read,
 	.expand_readahead	=3D ceph_netfs_expand_readahead,
 	.check_write_begin	=3D ceph_netfs_check_write_begin,
diff --git a/fs/netfs/Kconfig b/fs/netfs/Kconfig
index 7701c037c328..d0e7b0971fa3 100644
--- a/fs/netfs/Kconfig
+++ b/fs/netfs/Kconfig
@@ -22,6 +22,9 @@ config NETFS_STATS
 	  between CPUs.  On the other hand, the stats are very useful for
 	  debugging purposes.  Saying 'Y' here is recommended.
=20
+config NETFS_PGPRIV2
+	bool
+
 config NETFS_DEBUG
 	bool "Enable dynamic debugging netfslib and FS-Cache"
 	depends on NETFS_SUPPORT
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index 0621e6870cbd..421dd0be413b 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -12,13 +12,13 @@ netfs-y :=3D \
 	misc.o \
 	objects.o \
 	read_collect.o \
-	read_pgpriv2.o \
 	read_retry.o \
 	read_single.o \
 	write_collect.o \
 	write_issue.o \
 	write_retry.o
=20
+netfs-$(CONFIG_NETFS_PGPRIV2) +=3D read_pgpriv2.o
 netfs-$(CONFIG_NETFS_STATS) +=3D stats.o
=20
 netfs-$(CONFIG_FSCACHE) +=3D \
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 92716a6c9133..b47b3760fe0d 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -100,70 +100,98 @@ static int netfs_begin_cache_read(struct netfs_io_req=
uest *rreq, struct netfs_in
 }
=20
 /*
- * netfs_prepare_read_iterator - Prepare the subreq iterator for I/O
- * @subreq: The subrequest to be set up
- *
- * Prepare the I/O iterator representing the read buffer on a subrequest f=
or
- * the filesystem to use for I/O (it can be passed directly to a socket). =
 This
- * is intended to be called from the ->issue_read() method once the filesy=
stem
- * has trimmed the request to the size it wants.
- *
- * Returns the limited size if successful and -ENOMEM if insufficient memo=
ry
- * available.
+ * Prepare the I/O buffer on a buffered read subrequest for the filesystem=
 to
+ * use as a bvec queue.
  */
-static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub=
req)
+static int netfs_prepare_buffered_read_buffer(struct netfs_io_subrequest *=
subreq,
+					      unsigned int max_segs)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
 	ssize_t extracted;
-	size_t rsize =3D subreq->len;
=20
-	if (subreq->source =3D=3D NETFS_DOWNLOAD_FROM_SERVER)
-		rsize =3D umin(rsize, stream->sreq_max_len);
+	_enter("R=3D%08x[%x] l=3D%zx s=3D%u",
+	       rreq->debug_id, subreq->debug_index, subreq->len, max_segs);
=20
-	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-	extracted =3D bvecq_slice(&rreq->dispatch_cursor, subreq->len,
-				stream->sreq_max_segs, &subreq->nr_segs);
-	if (extracted < rsize) {
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	extracted =3D bvecq_slice(&stream->dispatch_cursor, subreq->len,
+				max_segs, &subreq->nr_segs);
+
+	if (extracted < subreq->len) {
 		subreq->len =3D extracted;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
 	}
+	stream->buffered -=3D extracted;
+	stream->issue_from =3D subreq->start + subreq->len;
+	rreq->submitted =3D stream->issue_from;
=20
-	return subreq->len;
+	if (!stream->buffered)
+		netfs_all_subreqs_queued(rreq);
+	return 0;
 }
=20
-/*
- * Issue a read against the cache.
- * - Eats the caller's ref on subreq.
+/**
+ * netfs_prepare_read_buffer - Get the buffer for a subrequest
+ * @subreq: The subrequest to get the buffer for
+ * @max_segs: Maximum number of segments in buffer (or INT_MAX)
+ *
+ * Extract a slice of buffer from the stream and attach it to the subreque=
st as
+ * a bio_vec queue.  The maximum amount of data attached is set by
+ * @subreq->len, but this may be shortened if @max_segs would be exceeded.
+ *
+ * [!] NOTE: This must be run in the thread that called ->issue_read(),
+ * as we access the readahead_control struct if there is one.
  */
-static void netfs_read_cache_to_pagecache(struct netfs_io_request *rreq,
-					  struct netfs_io_subrequest *subreq)
+int netfs_prepare_read_buffer(struct netfs_io_subrequest *subreq,
+			      unsigned int max_segs)
 {
-	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
-
-	netfs_stat(&netfs_n_rh_read);
-	cres->ops->read(cres, subreq->start, &subreq->io_iter, NETFS_READ_HOLE_IG=
NORE,
-			netfs_cache_read_terminated, subreq);
+	switch (subreq->rreq->origin) {
+	case NETFS_READAHEAD:
+	case NETFS_READPAGE:
+	case NETFS_READ_FOR_WRITE:
+		if (subreq->retry_count)
+			return netfs_prepare_buffered_read_retry_buffer(subreq, max_segs);
+		return netfs_prepare_buffered_read_buffer(subreq, max_segs);
+
+	case NETFS_UNBUFFERED_READ:
+	case NETFS_DIO_READ:
+	case NETFS_READ_GAPS:
+		return netfs_prepare_unbuffered_read_buffer(subreq, max_segs);
+	case NETFS_READ_SINGLE:
+		return netfs_prepare_read_single_buffer(subreq, max_segs);
+	default:
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
 }
+EXPORT_SYMBOL(netfs_prepare_read_buffer);
=20
-int netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_o=
ccupancy *occ)
+void netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_=
occupancy *occ)
 {
 	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
=20
 	occ->granularity =3D PAGE_SIZE;
 	if (occ->query_from >=3D occ->query_to)
-		return 0;
+		return;
 	if (!cres->ops)
-		return 0;
+		return;
 	occ->query_from =3D round_up(occ->query_from, occ->granularity);
-	return cres->ops->query_occupancy(cres, occ);
+	cres->ops->query_occupancy(cres, occ);
 }
=20
-void netfs_queue_read(struct netfs_io_request *rreq,
-		      struct netfs_io_subrequest *subreq)
+/*
+ * Allocate and prepare a read subrequest.
+ */
+struct netfs_io_subrequest *netfs_alloc_read_subrequest(struct netfs_io_re=
quest *rreq)
 {
+	struct netfs_io_subrequest *subreq;
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
=20
+	subreq =3D netfs_alloc_subrequest(rreq);
+	if (!subreq)
+		return subreq;
+
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
=20
 	/* We add to the end of the list whilst the collector may be walking
@@ -182,30 +210,37 @@ void netfs_queue_read(struct netfs_io_request *rreq,
 	}
=20
 	spin_unlock(&rreq->lock);
+	return subreq;
 }
=20
 static void netfs_issue_read(struct netfs_io_request *rreq,
 			     struct netfs_io_subrequest *subreq)
 {
-	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
+
+	_enter("R=3D%08x[%x]", rreq->debug_id, subreq->debug_index);
=20
 	switch (subreq->source) {
 	case NETFS_DOWNLOAD_FROM_SERVER:
-		rreq->netfs_ops->issue_read(subreq);
-		break;
-	case NETFS_READ_FROM_CACHE:
-		netfs_read_cache_to_pagecache(rreq, subreq);
-		break;
+		return rreq->netfs_ops->issue_read(subreq);
+	case NETFS_READ_FROM_CACHE: {
+		struct netfs_cache_resources *cres =3D &rreq->cache_resources;
+
+		netfs_stat(&netfs_n_rh_read);
+		return cres->ops->issue_read(subreq);
+	}
 	default:
-		bvecq_zero(&rreq->dispatch_cursor, subreq->len);
+		WARN_ON_ONCE(1);
+		fallthrough;
+	case NETFS_FILL_WITH_ZEROES:
+		stream->issue_from =3D subreq->start + subreq->len;
+		stream->buffered =3D 0;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+		netfs_all_subreqs_queued(rreq);
+		bvecq_zero(&stream->dispatch_cursor, subreq->len);
 		subreq->transferred =3D subreq->len;
 		subreq->error =3D 0;
-		iov_iter_zero(subreq->len, &subreq->io_iter);
-		subreq->transferred =3D subreq->len;
-		netfs_read_subreq_terminated(subreq);
-		break;
+		return netfs_read_subreq_terminated(subreq);
 	}
 }
=20
@@ -225,20 +260,17 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 		.cached_to[1]	=3D ULLONG_MAX,
 	};
 	struct fscache_occupancy *occ =3D &_occ;
-	unsigned long long start =3D rreq->start;
-	ssize_t size =3D rreq->len;
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
 	int ret =3D 0;
=20
 	_enter("R=3D%08x", rreq->debug_id);
=20
-	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
-	bvecq_pos_set(&rreq->collect_cursor, &rreq->dispatch_cursor);
+	bvecq_pos_set(&stream->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_set(&rreq->collect_cursor, &rreq->load_cursor);
=20
 	do {
-		int (*prepare_read)(struct netfs_io_subrequest *subreq) =3D NULL;
 		struct netfs_io_subrequest *subreq;
-		unsigned long long hole_to, cache_to;
-		ssize_t slice;
+		unsigned long long hole_to, cache_to, stop;
=20
 		/* If we don't have any, find out the next couple of data
 		 * extents from the cache, containing of following the
@@ -247,7 +279,7 @@ static void netfs_read_to_pagecache(struct netfs_io_req=
uest *rreq)
 		 */
 		hole_to =3D occ->cached_from[0];
 		cache_to =3D occ->cached_to[0];
-		if (start >=3D cache_to) {
+		if (stream->issue_from >=3D cache_to) {
 			/* Extent exhausted; shuffle down. */
 			int i;
=20
@@ -263,51 +295,45 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 				continue;
=20
 			/* Get new extents */
-			ret =3D netfs_read_query_cache(rreq, occ);
-			if (ret < 0)
-				break;
+			netfs_read_query_cache(rreq, occ);
 			continue;
 		}
=20
-		subreq =3D netfs_alloc_subrequest(rreq);
+		subreq =3D netfs_alloc_read_subrequest(rreq);
 		if (!subreq) {
 			ret =3D -ENOMEM;
 			break;
 		}
=20
-		subreq->start	=3D start;
-		subreq->len	=3D size;
-
-		netfs_queue_read(rreq, subreq);
+		subreq->start =3D stream->issue_from;
+		stop =3D stream->issue_from + stream->buffered;
=20
 		unsigned long long zero_point =3D netfs_read_zero_point(rreq->inode);
 		unsigned long long zlimit =3D umin(zero_point, rreq->i_size);
=20
 		_debug("rsub %llx %llx-%llx", subreq->start, hole_to, cache_to);
=20
-		if (start >=3D hole_to && start < cache_to) {
+		if (stream->issue_from >=3D hole_to && stream->issue_from < cache_to) {
 			/* Overlap with a cached region, where the cache may
 			 * record a block of zeroes.
 			 */
-			_debug("cached s=3D%llx c=3D%llx l=3D%zx", start, cache_to, size);
-			subreq->len =3D umin(cache_to - start, size);
+			_debug("cached s=3D%llx c=3D%llx l=3D%zx",
+			       stream->issue_from, cache_to, stream->buffered);
+			subreq->len =3D umin(cache_to - stream->issue_from, stream->buffered);
 			subreq->len =3D round_up(subreq->len, occ->granularity);
 			if (occ->cached_type[0] =3D=3D FSCACHE_EXTENT_ZERO) {
 				subreq->source =3D NETFS_FILL_WITH_ZEROES;
 				netfs_stat(&netfs_n_rh_zero);
 			} else {
 				subreq->source =3D NETFS_READ_FROM_CACHE;
-				prepare_read =3D rreq->cache_resources.ops->prepare_read;
 			}
-
-			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-
-		} else if (subreq->start >=3D zlimit && size > 0) {
+		} else if (subreq->start >=3D zlimit &&
+			   subreq->start < stop) {
 			/* If this range lies beyond the zero-point, that part
 			 * can just be cleared locally.
 			 */
-			_debug("zero %llx-%llx", start, start + size);
-			subreq->len =3D size;
+			_debug("zero %llx-%llx", subreq->start, stop);
+			subreq->len =3D stream->buffered;
 			subreq->source =3D NETFS_FILL_WITH_ZEROES;
 			if (rreq->cache_resources.ops)
 				__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
@@ -317,10 +343,10 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 			 * this range lies beyond the zero-point or the EOF,
 			 * that part can just be cleared locally.
 			 */
-			unsigned long long limit =3D min3(zlimit, start + size, hole_to);
+			unsigned long long limit =3D min3(zlimit, stop, hole_to);
=20
 			_debug("limit %llx %llx", rreq->i_size, zero_point);
-			_debug("download %llx-%llx", start, start + size);
+			_debug("download %llx-%llx", subreq->start, stop);
 			subreq->len =3D umin(limit - subreq->start, ULONG_MAX);
 			subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER;
 			if (rreq->cache_resources.ops)
@@ -328,41 +354,15 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 			netfs_stat(&netfs_n_rh_download);
 		}
=20
-		if (size =3D=3D 0) {
+		if (subreq->len =3D=3D 0) {
 			pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D%=
llx",
 			       rreq->debug_id, subreq->debug_index,
-			       subreq->len, size,
+			       subreq->len, stream->buffered,
 			       subreq->start, zero_point, rreq->i_size);
 			netfs_cancel_read(subreq, ret);
 			break;
 		}
=20
-		rreq->io_streams[0].sreq_max_len =3D MAX_RW_COUNT;
-		rreq->io_streams[0].sreq_max_segs =3D INT_MAX;
-
-		if (prepare_read) {
-			ret =3D prepare_read(subreq);
-			if (ret < 0) {
-				netfs_cancel_read(subreq, ret);
-				break;
-			}
-			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-		}
-
-		slice =3D netfs_prepare_read_iterator(subreq);
-		if (slice < 0) {
-			ret =3D slice;
-			netfs_cancel_read(subreq, ret);
-			break;
-		}
-		start +=3D slice;
-		size -=3D slice;
-		if (size <=3D 0) {
-			smp_wmb(); /* Write lists before ALL_QUEUED. */
-			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
-		}
-
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 		netfs_issue_read(rreq, subreq);
 		netfs_maybe_bulk_drop_ra_refs(rreq);
=20
@@ -371,19 +371,19 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 		if (test_bit(NETFS_RREQ_FAILED, &rreq->flags))
 			break;
 		cond_resched();
-	} while (size > 0);
+	} while (stream->buffered > 0);
=20
-	if (unlikely(size > 0)) {
-		smp_wmb(); /* Write lists before ALL_QUEUED. */
-		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+	if (unlikely(!netfs_are_all_subreqs_queued(rreq))) {
+		netfs_all_subreqs_queued(rreq);
 		netfs_wake_collector(rreq);
 	}
=20
 	/* Defer error return as we may need to wait for outstanding I/O. */
-	cmpxchg(&rreq->error, 0, ret);
+	if (ret < 0)
+		cmpxchg(&rreq->error, 0, ret);
=20
 	bvecq_pos_unset(&rreq->load_cursor);
-	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&stream->dispatch_cursor);
 }
=20
 /**
@@ -404,17 +404,22 @@ static void netfs_read_to_pagecache(struct netfs_io_r=
equest *rreq)
 void netfs_readahead(struct readahead_control *ractl)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	struct netfs_inode *ictx =3D netfs_inode(ractl->mapping->host);
 	unsigned long long start =3D readahead_pos(ractl);
 	ssize_t added;
 	size_t size =3D readahead_length(ractl);
 	int ret;
=20
+	_enter("");
+
 	rreq =3D netfs_alloc_request(ractl->mapping, ractl->file, start, size,
 				   NETFS_READAHEAD);
 	if (IS_ERR(rreq))
 		return;
=20
+	stream =3D &rreq->io_streams[0];
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags);
=20
 	ret =3D netfs_begin_cache_read(rreq, ictx);
@@ -441,6 +446,8 @@ void netfs_readahead(struct readahead_control *ractl)
 	rreq->submitted =3D rreq->start + added;
 	rreq->cleaned_to =3D rreq->start;
 	rreq->front_folio_order =3D get_order(rreq->load_cursor.bvecq->bv[0].bv_l=
en);
+	stream->issue_from =3D rreq->start;
+	stream->buffered =3D added;
=20
 	netfs_read_to_pagecache(rreq);
 	netfs_maybe_bulk_drop_ra_refs(rreq);
@@ -456,6 +463,7 @@ EXPORT_SYMBOL(netfs_readahead);
  */
 static int netfs_create_singular_buffer(struct netfs_io_request *rreq, str=
uct folio *folio)
 {
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
 	struct bvecq *bq;
 	size_t fsize =3D folio_size(folio);
=20
@@ -466,6 +474,8 @@ static int netfs_create_singular_buffer(struct netfs_io=
_request *rreq, struct fo
 	bvec_set_folio(&bq->bv[0], folio, fsize, 0);
 	bvecq_filled_to(bq, 1);
 	rreq->submitted =3D rreq->start + fsize;
+	stream->issue_from =3D rreq->start;
+	stream->buffered =3D fsize;
 	return 0;
 }
=20
@@ -475,6 +485,7 @@ static int netfs_create_singular_buffer(struct netfs_io=
_request *rreq, struct fo
 static int netfs_read_gaps(struct file *file, struct folio *folio)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	struct address_space *mapping =3D folio->mapping;
 	struct netfs_group *group =3D netfs_folio_group(folio);
 	struct netfs_folio *finfo =3D netfs_folio_info(folio);
@@ -496,6 +507,7 @@ static int netfs_read_gaps(struct file *file, struct fo=
lio *folio)
 		ret =3D PTR_ERR(rreq);
 		goto alloc_error;
 	}
+	stream =3D &rreq->io_streams[0];
=20
 	ret =3D netfs_begin_cache_read(rreq, ctx);
 	if (ret =3D=3D -ENOMEM || ret =3D=3D -EINTR || ret =3D=3D -ERESTARTSYS)
@@ -544,6 +556,8 @@ static int netfs_read_gaps(struct file *file, struct fo=
lio *folio)
 	bvecq_filled_to(bq, slot);
=20
 	rreq->submitted =3D rreq->start + flen;
+	stream->issue_from =3D rreq->start;
+	stream->buffered =3D flen;
=20
 	netfs_read_to_pagecache(rreq);
=20
@@ -622,6 +636,7 @@ int netfs_read_folio(struct file *file, struct folio *f=
olio)
 		goto discard;
=20
 	netfs_read_to_pagecache(rreq);
+
 	ret =3D netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret < 0 ? ret : 0;
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index fb3120fb24db..d11ae0c1722a 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -95,8 +95,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct io=
v_iter *iter,
 		.range_start	=3D iocb->ki_pos,
 		.range_end	=3D iocb->ki_pos + iter->count,
 	};
-	struct netfs_io_request *wreq =3D NULL;
-	struct folio *folio =3D NULL, *writethrough =3D NULL;
+	struct netfs_writethrough *writethrough =3D NULL;
+	struct folio *folio =3D NULL;
 	unsigned int bdp_flags =3D (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
 	ssize_t written =3D 0, ret, ret2;
 	loff_t pos =3D iocb->ki_pos;
@@ -113,15 +113,13 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struc=
t iov_iter *iter,
 			goto out;
 		}
=20
-		wreq =3D netfs_begin_writethrough(iocb, iter->count);
-		if (IS_ERR(wreq)) {
+		writethrough =3D netfs_begin_writethrough(iocb, iter->count);
+		if (IS_ERR(writethrough)) {
 			wbc_detach_inode(&wbc);
-			ret =3D PTR_ERR(wreq);
-			wreq =3D NULL;
+			ret =3D PTR_ERR(writethrough);
+			writethrough =3D NULL;
 			goto out;
 		}
-		if (!is_sync_kiocb(iocb))
-			wreq->iocb =3D iocb;
 		netfs_stat(&netfs_n_wh_writethrough);
 	} else {
 		netfs_stat(&netfs_n_wh_buffered_write);
@@ -387,14 +385,15 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struc=
t iov_iter *iter,
 		pos +=3D copied;
 		written +=3D copied;
=20
-		if (likely(!wreq)) {
+		if (likely(!writethrough)) {
 			folio_mark_dirty(folio);
 			folio_unlock(folio);
 		} else {
-			netfs_advance_writethrough(wreq, &wbc, folio, copied,
-						   offset + copied =3D=3D flen,
-						   &writethrough);
+			ret =3D netfs_advance_writethrough(writethrough, &wbc, folio, copied,
+							 offset + copied =3D=3D flen);
 			/* Folio unlocked */
+			if (ret < 0)
+				break;
 		}
 	retry:
 		folio_put(folio);
@@ -417,8 +416,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct =
iov_iter *iter,
 			ctx->ops->post_modify(inode);
 	}
=20
-	if (unlikely(wreq)) {
-		ret2 =3D netfs_end_writethrough(wreq, &wbc, writethrough);
+	if (unlikely(writethrough)) {
+		ret2 =3D netfs_end_writethrough(writethrough, &wbc);
 		wbc_detach_inode(&wbc);
 		if (ret2 =3D=3D -EIOCBQUEUED)
 			return ret2;
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 3c52c7584489..d2675e981405 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -16,6 +16,32 @@
 #include <linux/netfs.h>
 #include "internal.h"
=20
+int netfs_prepare_unbuffered_read_buffer(struct netfs_io_subrequest *subre=
q,
+					 unsigned int max_segs)
+{
+	struct netfs_io_request *rreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len =3D bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len =3D len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	stream->buffered   -=3D subreq->len;
+	stream->issue_from +=3D subreq->len;
+	rreq->submitted =3D stream->issue_from;
+
+	if (stream->buffered =3D=3D 0)
+		netfs_all_subreqs_queued(rreq);
+	return 0;
+}
+
 /*
  * Perform a read to a buffer from the server, slicing up the region to be=
 read
  * according to the network rsize.
@@ -23,16 +49,13 @@
 static void netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 {
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
-	unsigned long long start =3D rreq->start;
-	ssize_t size =3D rreq->len;
-	int ret;
=20
-	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_transfer(&stream->dispatch_cursor, &rreq->load_cursor);
=20
 	do {
 		struct netfs_io_subrequest *subreq;
=20
-		subreq =3D netfs_alloc_subrequest(rreq);
+		subreq =3D netfs_alloc_read_subrequest(rreq);
 		if (!subreq) {
 			/* Stash the error in the request if there's not
 			 * already an error set.
@@ -42,37 +65,10 @@ static void netfs_dispatch_unbuffered_reads(struct netf=
s_io_request *rreq)
 		}
=20
 		subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
-		subreq->start	=3D start;
-		subreq->len	=3D size;
-
-		netfs_queue_read(rreq, subreq);
+		subreq->start	=3D stream->issue_from;
+		subreq->len	=3D stream->buffered;
=20
 		netfs_stat(&netfs_n_rh_download);
-		if (rreq->netfs_ops->prepare_read) {
-			ret =3D rreq->netfs_ops->prepare_read(subreq);
-			if (ret < 0) {
-				netfs_cancel_read(subreq, ret);
-				break;
-			}
-		}
-
-		bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-		bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
-		subreq->len =3D bvecq_slice(&rreq->dispatch_cursor,
-					  umin(size, stream->sreq_max_len),
-					  stream->sreq_max_segs,
-					  &subreq->nr_segs);
-
-		size -=3D subreq->len;
-		start +=3D subreq->len;
-		rreq->submitted +=3D subreq->len;
-		if (size <=3D 0) {
-			smp_wmb(); /* Write lists before ALL_QUEUED. */
-			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
-		}
-
-		iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-				    subreq->content.slot, subreq->content.offset, subreq->len);
=20
 		rreq->netfs_ops->issue_read(subreq);
=20
@@ -81,15 +77,14 @@ static void netfs_dispatch_unbuffered_reads(struct netf=
s_io_request *rreq)
 		if (test_bit(NETFS_RREQ_FAILED, &rreq->flags))
 			break;
 		cond_resched();
-	} while (size > 0);
+	} while (stream->buffered > 0);
=20
-	if (unlikely(size > 0)) {
-		smp_wmb(); /* Write lists before ALL_QUEUED. */
-		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+	if (unlikely(stream->buffered > 0)) {
+		netfs_all_subreqs_queued(rreq);
 		netfs_wake_collector(rreq);
 	}
=20
-	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&stream->dispatch_cursor);
 }
=20
 /*
@@ -140,6 +135,7 @@ static ssize_t netfs_unbuffered_read(struct netfs_io_re=
quest *rreq, bool sync)
 ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_i=
ter *iter)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	ssize_t ret;
 	size_t orig_count =3D iov_iter_count(iter);
 	bool sync =3D is_sync_kiocb(iocb);
@@ -164,6 +160,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb =
*iocb, struct iov_iter *i
 	netfs_stat(&netfs_n_rh_dio_read);
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_dio_read);
=20
+	stream =3D &rreq->io_streams[0];
+
 	/* If this is an async op, we have to keep track of the destination
 	 * buffer for ourselves as the caller's iterator will be trashed when
 	 * we return.
@@ -179,6 +177,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb =
*iocb, struct iov_iter *i
 		goto error_put;
=20
 	rreq->len =3D ret;
+	stream->buffered =3D ret;
+	stream->issue_from =3D rreq->start;
=20
 	// TODO: Set up bounce buffer if needed
=20
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 0309dd3c37d2..c51f3cbacd40 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -9,6 +9,34 @@
 #include <linux/uio.h>
 #include "internal.h"
=20
+/*
+ * Prepare the buffer for an unbuffered/DIO write.
+ */
+int netfs_prepare_unbuffered_write_buffer(struct netfs_io_subrequest *subr=
eq,
+					  unsigned int max_segs)
+{
+	struct netfs_io_stream *stream =3D &subreq->rreq->io_streams[subreq->stre=
am_nr];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len =3D bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len =3D len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	// TODO: Wait here for completion of prev subreq
+
+	stream->issue_from +=3D subreq->len;
+	stream->buffered   -=3D subreq->len;
+	if (stream->buffered =3D=3D 0)
+		netfs_all_subreqs_queued(subreq->rreq);
+	return 0;
+}
+
 /*
  * Perform the cleanup rituals after an unbuffered write is complete.
  */
@@ -74,9 +102,9 @@ static void netfs_unbuffered_write_collect(struct netfs_=
io_request *wreq,
=20
 	wreq->transferred +=3D subreq->transferred;
 	if (subreq->transferred < subreq->len) {
-		bvecq_pos_unset(&wreq->dispatch_cursor);
-		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
-		bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+		bvecq_pos_unset(&stream->dispatch_cursor);
+		bvecq_pos_transfer(&stream->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_advance(&stream->dispatch_cursor, subreq->transferred);
 	}
=20
 	stream->collected_to =3D subreq->start + subreq->transferred;
@@ -85,6 +113,7 @@ static void netfs_unbuffered_write_collect(struct netfs_=
io_request *wreq,
=20
 	trace_netfs_collect_stream(wreq, stream);
 	trace_netfs_collect_state(wreq, wreq->collected_to, 0);
+	/* TODO: Progressively clean up wreq->direct_bq */
 }
=20
 /*
@@ -103,60 +132,36 @@ static int netfs_unbuffered_write(struct netfs_io_req=
uest *wreq)
=20
 	_enter("%llx", wreq->len);
=20
-	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	stream->issue_from =3D wreq->start;
+	stream->buffered =3D wreq->len;
+	bvecq_pos_set(&stream->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &stream->dispatch_cursor);
=20
 	if (wreq->origin =3D=3D NETFS_DIO_WRITE)
 		inode_dio_begin(wreq->inode);
=20
-	stream->collected_to =3D wreq->start;
-
 	for (;;) {
 		bool retry =3D false;
=20
 		if (!subreq) {
-			netfs_prepare_write(wreq, stream, wreq->start + wreq->transferred);
-			subreq =3D stream->construct;
-			stream->construct =3D NULL;
-		} else {
-			bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
-		}
-
-		/* Check if (re-)preparation failed. */
-		if (unlikely(test_bit(NETFS_SREQ_FAILED, &subreq->flags))) {
-			netfs_write_subrequest_terminated(subreq, subreq->error);
-			wreq->error =3D subreq->error;
-			break;
+			subreq =3D netfs_alloc_write_subreq(wreq, stream);
+			if (!subreq)
+				return -ENOMEM;
 		}
=20
-		subreq->len =3D bvecq_slice(&wreq->dispatch_cursor, stream->sreq_max_len,
-					  stream->sreq_max_segs, &subreq->nr_segs);
-		bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-
-		iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
-				    subreq->content.bvecq, subreq->content.slot,
-				    subreq->content.offset,
-				    subreq->len);
-
-		if (!iov_iter_count(&subreq->io_iter))
-			break;
-
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 		stream->issue_write(subreq);
=20
-		/* Async, need to wait. */
-		netfs_wait_for_in_progress_stream(wreq, stream);
-
-		if (test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
+		ret =3D netfs_wait_for_in_progress_subreq(wreq, subreq);
+		if (ret < 0) {
+			if (ret !=3D -EAGAIN) {
+				list_del_init(&subreq->rreq_link);
+				ret =3D subreq->error;
+				netfs_put_subrequest(subreq, netfs_sreq_trace_put_failed);
+				subreq =3D NULL;
+				goto failed;
+			}
 			retry =3D true;
-		} else if (test_bit(NETFS_SREQ_FAILED, &subreq->flags)) {
-			ret =3D subreq->error;
-			wreq->error =3D ret;
-			netfs_see_subrequest(subreq, netfs_sreq_trace_see_failed);
-			subreq =3D NULL;
-			break;
 		}
-		ret =3D 0;
=20
 		if (!retry) {
 			netfs_unbuffered_write_collect(wreq, stream, subreq);
@@ -171,20 +176,21 @@ static int netfs_unbuffered_write(struct netfs_io_req=
uest *wreq)
 			continue;
 		}
=20
-		/* We need to retry the last subrequest, so first reset the
-		 * iterator, taking into account what, if anything, we managed
-		 * to transfer.
+		/* We need to retry the last subrequest, so first wind back the
+		 * buffer position.
 		 */
 		subreq->error =3D -EAGAIN;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
=20
 		bvecq_pos_unset(&subreq->content);
-		bvecq_pos_unset(&wreq->dispatch_cursor);
-		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_unset(&stream->dispatch_cursor);
+		bvecq_pos_transfer(&stream->dispatch_cursor, &subreq->dispatch_pos);
=20
+		stream->issue_from -=3D subreq->len - subreq->transferred;
+		stream->buffered   +=3D subreq->len - subreq->transferred;
 		if (subreq->transferred > 0) {
 			wreq->transferred +=3D subreq->transferred;
-			bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+			bvecq_pos_advance(&stream->dispatch_cursor, subreq->transferred);
 		}
=20
 		if (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER &&
@@ -192,25 +198,21 @@ static int netfs_unbuffered_write(struct netfs_io_req=
uest *wreq)
 			wreq->netfs_ops->retry_request(wreq, stream);
=20
 		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-		__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
-		subreq->start		=3D wreq->start + wreq->transferred;
-		subreq->len		=3D wreq->len   - wreq->transferred;
+		__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+		subreq->start		=3D stream->issue_from;
+		subreq->len		=3D stream->buffered;
 		subreq->transferred	=3D 0;
 		subreq->retry_count	+=3D 1;
-		stream->sreq_max_len	=3D UINT_MAX;
-		stream->sreq_max_segs	=3D INT_MAX;
=20
 		netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
=20
-		if (stream->prepare_write)
-			stream->prepare_write(subreq);
 		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 		netfs_stat(&netfs_n_wh_retry_write_subreq);
 	}
=20
-	bvecq_pos_unset(&wreq->dispatch_cursor);
-	bvecq_pos_unset(&wreq->load_cursor);
+failed:
+	bvecq_pos_unset(&stream->dispatch_cursor);
 	netfs_unbuffered_write_done(wreq);
 	_leave(" =3D %d", ret);
 	return ret;
@@ -254,6 +256,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb=
 *iocb, struct iov_iter *
 	if (IS_ERR(wreq))
 		return PTR_ERR(wreq);
=20
+	wreq->len =3D iov_iter_count(iter);
 	wreq->io_streams[0].avail =3D true;
 	trace_netfs_write(wreq, (iocb->ki_flags & IOCB_DIRECT ?
 				 netfs_write_trace_dio_write :
@@ -264,9 +267,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb=
 *iocb, struct iov_iter *
 		 * we have to save the source buffer as the iterator is only
 		 * good until we return.  In such a case, extract an iterator
 		 * to represent as much of the the output buffer as we can
-		 * manage.  Note that the extraction might not be able to
-		 * allocate a sufficiently large bvec array and may shorten the
-		 * request.
+		 * manage.  Note that the extraction may shorten the request.
 		 */
 		ssize_t n =3D netfs_extract_iter(iter, len, INT_MAX, iocb->ki_pos,
 					       &wreq->load_cursor.bvecq, 0);
@@ -281,8 +282,6 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb=
 *iocb, struct iov_iter *
 		       wreq->load_cursor.bvecq->max_slots);
 	}
=20
-	__set_bit(NETFS_RREQ_USE_IO_ITER, &wreq->flags);
-
 	/* Copy the data into the bounce buffer and encrypt it. */
 	// TODO
=20
diff --git a/fs/netfs/fscache_io.c b/fs/netfs/fscache_io.c
index fafa8c6bec57..e4b7888fe757 100644
--- a/fs/netfs/fscache_io.c
+++ b/fs/netfs/fscache_io.c
@@ -239,10 +239,6 @@ void __fscache_write_to_cache(struct fscache_cookie *c=
ookie,
 				    fscache_access_io_write) < 0)
 		goto abandon_free;
=20
-	ret =3D cres->ops->prepare_write(cres, &start, &len, len, i_size, false);
-	if (ret < 0)
-		goto abandon_end;
-
 	/* TODO: Consider clearing page bits now for space the write isn't
 	 * covering.  This is more complicated than it appears when THPs are
 	 * taken into account.
@@ -252,8 +248,6 @@ void __fscache_write_to_cache(struct fscache_cookie *co=
okie,
 	fscache_write(cres, start, &iter, fscache_wreq_done, wreq);
 	return;
=20
-abandon_end:
-	return fscache_wreq_done(wreq, ret);
 abandon_free:
 	kfree(wreq);
 abandon:
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 5674a57f2e22..bcbcbc804d91 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -22,10 +22,9 @@
 /*
  * buffered_read.c
  */
-int netfs_read_query_cache(struct netfs_io_request *rreq,
-			   struct fscache_occupancy *occ);
-void netfs_queue_read(struct netfs_io_request *rreq,
-		      struct netfs_io_subrequest *subreq);
+void netfs_read_query_cache(struct netfs_io_request *rreq,
+			    struct fscache_occupancy *occ);
+struct netfs_io_subrequest *netfs_alloc_read_subrequest(struct netfs_io_re=
quest *rreq);
 void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error);
 int netfs_prefetch_for_write(struct file *file, struct folio *folio,
 			     size_t offset, size_t len);
@@ -36,6 +35,18 @@ int netfs_prefetch_for_write(struct file *file, struct f=
olio *folio,
 void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
 			 loff_t pos, size_t copied);
=20
+/*
+ * direct_read.c
+ */
+int netfs_prepare_unbuffered_read_buffer(struct netfs_io_subrequest *subre=
q,
+					 unsigned int max_segs);
+
+/*
+ * direct_write.c
+ */
+int netfs_prepare_unbuffered_write_buffer(struct netfs_io_subrequest *subr=
eq,
+					  unsigned int max_segs);
+
 /*
  * main.c
  */
@@ -72,6 +83,8 @@ struct bvecq *netfs_buffer_make_space(struct netfs_io_req=
uest *rreq,
 				      enum netfs_bvecq_trace trace);
 void netfs_wake_collector(struct netfs_io_request *rreq);
 void netfs_subreq_clear_in_progress(struct netfs_io_subrequest *subreq);
+int netfs_wait_for_in_progress_subreq(struct netfs_io_request *rreq,
+				      struct netfs_io_subrequest *subreq);
 void netfs_wait_for_in_progress_stream(struct netfs_io_request *rreq,
 				       struct netfs_io_stream *stream);
 ssize_t netfs_wait_for_read(struct netfs_io_request *rreq);
@@ -117,16 +130,53 @@ void netfs_cache_read_terminated(void *priv, ssize_t =
transferred_or_error);
 /*
  * read_pgpriv2.c
  */
+#ifdef CONFIG_NETFS_PGPRIV2
+int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subrequest *subreq,
+				       unsigned int max_segs);
 void netfs_pgpriv2_copy_to_cache(struct netfs_io_request *rreq, struct fol=
io *folio);
 void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq);
 bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *wreq);
+static inline bool netfs_using_pgpriv2(const struct netfs_io_request *rreq)
+{
+	return test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
+}
+#else
+static inline int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subre=
quest *subreq,
+						     unsigned int max_segs)
+{
+	return -EIO;
+}
+static inline void netfs_pgpriv2_copy_to_cache(struct netfs_io_request *rr=
eq, struct folio *folio)
+{
+}
+static inline void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request=
 *rreq)
+{
+}
+static inline bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_requ=
est *wreq)
+{
+	return true;
+}
+static inline bool netfs_using_pgpriv2(const struct netfs_io_request *rreq)
+{
+	return false;
+}
+#endif
=20
 /*
  * read_retry.c
  */
+int netfs_prepare_buffered_read_retry_buffer(struct netfs_io_subrequest *s=
ubreq,
+					     unsigned int max_segs);
+int netfs_reset_for_read_retry(struct netfs_io_subrequest *subreq);
 void netfs_retry_reads(struct netfs_io_request *rreq);
 void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq);
=20
+/*
+ * read_single.c
+ */
+int netfs_prepare_read_single_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs);
+
 /*
  * stats.c
  */
@@ -198,30 +248,25 @@ void netfs_write_collection_worker(struct work_struct=
 *work);
 /*
  * write_issue.c
  */
+struct netfs_writethrough;
 struct netfs_io_request *netfs_create_write_req(struct address_space *mapp=
ing,
 						struct file *file,
 						loff_t start,
 						enum netfs_io_origin origin);
-void netfs_prepare_write(struct netfs_io_request *wreq,
-			 struct netfs_io_stream *stream,
-			 loff_t start);
-void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq);
-void netfs_issue_write(struct netfs_io_request *wreq,
-		       struct netfs_io_stream *stream);
-size_t netfs_advance_write(struct netfs_io_request *wreq,
-			   struct netfs_io_stream *stream,
-			   loff_t start, size_t len, bool to_eof);
-struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size=
_t len);
-int netfs_advance_writethrough(struct netfs_io_request *wreq, struct write=
back_control *wbc,
-			       struct folio *folio, size_t copied, bool to_page_end,
-			       struct folio **writethrough_cache);
-ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct write=
back_control *wbc,
-			       struct folio *writethrough_cache);
+struct netfs_io_subrequest *netfs_alloc_write_subreq(struct netfs_io_reque=
st *wreq,
+						     struct netfs_io_stream *stream);
+struct netfs_writethrough *netfs_begin_writethrough(struct kiocb *iocb, si=
ze_t len);
+int netfs_advance_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc,
+			       struct folio *folio, size_t copied, bool to_page_end);
+ssize_t netfs_end_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc);
=20
 /*
  * write_retry.c
  */
+int netfs_prepare_write_retry_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs);
 void netfs_retry_writes(struct netfs_io_request *wreq);
=20
 /*
@@ -304,6 +349,25 @@ static inline bool netfs_check_subreq_in_progress(cons=
t struct netfs_io_subreque
 	return test_bit_acquire(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 }
=20
+/*
+ * Indicate that we've generated and queued all the subrequests we're goin=
g to.
+ */
+static inline void netfs_all_subreqs_queued(struct netfs_io_request *rreq)
+{
+	smp_wmb(); /* Write lists before ALL_QUEUED. */
+	set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+	trace_netfs_rreq(rreq, netfs_rreq_trace_all_queued);
+}
+
+/*
+ * Query if all subrequests are queued.
+ */
+static inline bool netfs_are_all_subreqs_queued(const struct netfs_io_requ=
est *rreq)
+{
+	/* Read lists after ALL_QUEUED. */
+	return test_bit_acquire(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+}
+
 /*
  * fscache-cache.c
  */
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 3040be52c293..e29aad1da0b3 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -102,14 +102,14 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, siz=
e_t max_len, size_t max_pag
 			}
=20
 			if (got =3D=3D 0) {
-				pr_err("extract_pages gave nothing from %zu, %zu\n",
+				pr_err("extract_pages gave nothing from %zx, %zx\n",
 				       extracted, max_len);
 				ret =3D -EIO;
 				goto out;
 			}
=20
 			if (WARN(got > max_len,
-				 "%s: extract_pages overrun %zd > %zu bytes\n",
+				 "%s: extract_pages overrun %zx > %zx bytes\n",
 				 __func__, got, max_len)) {
 				ret =3D -EIO;
 				break;
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index 8fc4e5ef2152..0af45204fabc 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -250,6 +250,37 @@ void netfs_subreq_clear_in_progress(struct netfs_io_su=
brequest *subreq)
 		netfs_wake_collector(rreq);
 }
=20
+/*
+ * Wait for a subrequest to come to completion.
+ */
+int netfs_wait_for_in_progress_subreq(struct netfs_io_request *rreq,
+				      struct netfs_io_subrequest *subreq)
+{
+	if (netfs_check_subreq_in_progress(subreq)) {
+		DEFINE_WAIT(myself);
+
+		trace_netfs_rreq(rreq, netfs_rreq_trace_wait_quiesce);
+		for (;;) {
+			prepare_to_wait(&rreq->waitq, &myself, TASK_UNINTERRUPTIBLE);
+
+			if (!netfs_check_subreq_in_progress(subreq))
+				break;
+
+			trace_netfs_sreq(subreq, netfs_sreq_trace_wait_for);
+			schedule();
+		}
+
+		trace_netfs_rreq(rreq, netfs_rreq_trace_waited_quiesce);
+		finish_wait(&rreq->waitq, &myself);
+	}
+
+	if (test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
+		return -EAGAIN;
+	if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
+		return subreq->error;
+	return 0;
+}
+
 /*
  * Wait for all outstanding I/O in a stream to quiesce.
  */
@@ -310,7 +341,7 @@ static int netfs_collect_in_app(struct netfs_io_request=
 *rreq,
 			need_collect =3D true;
 			break;
 		}
-		if (subreq || !test_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags))
+		if (subreq || !netfs_are_all_subreqs_queued(rreq))
 			done =3D false;
 	}
=20
@@ -380,7 +411,7 @@ static ssize_t netfs_wait_for_in_progress(struct netfs_=
io_request *rreq,
 		case NETFS_UNBUFFERED_WRITE:
 			break;
 		default:
-			if (rreq->submitted < rreq->len) {
+			if (rreq->transferred < rreq->len) {
 				trace_netfs_failure(rreq, NULL, ret, netfs_fail_short_read);
 				ret =3D -EIO;
 			}
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index 7f5187c64ae9..d4a95a462576 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -46,8 +46,6 @@ struct netfs_io_request *netfs_alloc_request(struct addre=
ss_space *mapping,
 	rreq->i_size	=3D i_size_read(inode);
 	rreq->debug_id	=3D atomic_inc_return(&debug_ids);
 	rreq->wsize	=3D INT_MAX;
-	rreq->io_streams[0].sreq_max_len =3D ULONG_MAX;
-	rreq->io_streams[0].sreq_max_segs =3D 0;
 	spin_lock_init(&rreq->lock);
 	INIT_LIST_HEAD(&rreq->io_streams[0].subrequests);
 	INIT_LIST_HEAD(&rreq->io_streams[1].subrequests);
@@ -134,9 +132,11 @@ static void netfs_deinit_request(struct netfs_io_reque=
st *rreq)
 	if (rreq->cache_resources.ops)
 		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
 	bvecq_pos_unset(&rreq->load_cursor);
-	bvecq_pos_unset(&rreq->dispatch_cursor);
 	bvecq_pos_unset(&rreq->collect_cursor);
+	bvecq_pos_unset(&rreq->retry_cursor);
 	bvecq_put(rreq->spare);
+	for (int i =3D 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&rreq->io_streams[i].dispatch_cursor);
=20
 	if (atomic_dec_and_test(&ictx->io_count))
 		wake_up_var(&ictx->io_count);
@@ -227,6 +227,7 @@ static void netfs_free_subrequest(struct netfs_io_subre=
quest *subreq)
 	struct netfs_io_request *rreq =3D subreq->rreq;
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
+	WARN_ON_ONCE(!list_empty(&subreq->rreq_link));
 	if (rreq->netfs_ops->free_subrequest)
 		rreq->netfs_ops->free_subrequest(subreq);
 	bvecq_pos_unset(&subreq->dispatch_pos);
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index fccc6c2d891e..aa7e206fccf2 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -36,6 +36,7 @@ static void netfs_clear_unread(struct netfs_io_subrequest=
 *subreq)
=20
 	if (subreq->start + subreq->transferred >=3D subreq->rreq->i_size)
 		__set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
+	trace_netfs_rreq(subreq->rreq, netfs_rreq_trace_zero_unread);
 }
=20
 /*
@@ -58,7 +59,7 @@ static void netfs_unlock_read_folio(struct netfs_io_reque=
st *rreq,
 	flush_dcache_folio(folio);
 	folio_mark_uptodate(folio);
=20
-	if (!test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags)) {
+	if (!netfs_using_pgpriv2(rreq)) {
 		finfo =3D netfs_folio_info(folio);
 		if (finfo) {
 			trace_netfs_folio(folio, netfs_folio_trace_filled_gaps);
@@ -258,8 +259,7 @@ static void netfs_collect_read_results(struct netfs_io_=
request *rreq)
 				transferred =3D front->len;
 				trace_netfs_rreq(rreq, netfs_rreq_trace_set_abandon);
 			}
-			if (front->start + transferred >=3D rreq->cleaned_to + fsize ||
-			    test_bit(NETFS_SREQ_HIT_EOF, &front->flags))
+			if (front->start + transferred >=3D rreq->cleaned_to + fsize)
 				netfs_read_unlock_folios(rreq, &notes);
 		} else {
 			stream->collected_to =3D front->start + transferred;
@@ -378,31 +378,6 @@ static void netfs_rreq_assess_dio(struct netfs_io_requ=
est *rreq)
 		inode_dio_end(rreq->inode);
 }
=20
-/*
- * Do processing after reading a monolithic single object.
- */
-static void netfs_rreq_assess_single(struct netfs_io_request *rreq)
-{
-	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
-
-	if (!rreq->error && stream->source =3D=3D NETFS_DOWNLOAD_FROM_SERVER &&
-	    fscache_resources_valid(&rreq->cache_resources)) {
-		trace_netfs_rreq(rreq, netfs_rreq_trace_dirty);
-		netfs_single_mark_inode_dirty(rreq->inode);
-	}
-
-	if (rreq->iocb) {
-		rreq->iocb->ki_pos +=3D rreq->transferred;
-		if (rreq->iocb->ki_complete) {
-			trace_netfs_rreq(rreq, netfs_rreq_trace_ki_complete);
-			rreq->iocb->ki_complete(
-				rreq->iocb, rreq->error ? rreq->error : rreq->transferred);
-		}
-	}
-	if (rreq->netfs_ops->done)
-		rreq->netfs_ops->done(rreq);
-}
-
 /*
  * Perform the collection of subrequests and folios.
  *
@@ -418,9 +393,8 @@ bool netfs_read_collection(struct netfs_io_request *rre=
q)
 	/* We're done when the app thread has finished posting subreqs and the
 	 * queue is empty.
 	 */
-	if (!test_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags))
+	if (!netfs_are_all_subreqs_queued(rreq))
 		return false;
-	smp_rmb(); /* Read ALL_QUEUED before subreq lists. */
=20
 	if (!list_empty(&stream->subrequests))
 		return false;
@@ -438,7 +412,7 @@ bool netfs_read_collection(struct netfs_io_request *rre=
q)
 		netfs_rreq_assess_dio(rreq);
 		break;
 	case NETFS_READ_SINGLE:
-		netfs_rreq_assess_single(rreq);
+		WARN_ON_ONCE(1);
 		break;
 	default:
 		break;
@@ -563,6 +537,11 @@ void netfs_read_subreq_terminated(struct netfs_io_subr=
equest *subreq)
 		} else if (test_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags)) {
 			__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_partial_read);
+		} else if (subreq->source =3D=3D NETFS_READ_FROM_CACHE) {
+			netfs_stat(&netfs_n_rh_read_failed);
+			__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+			subreq->error =3D -ENODATA;
+			trace_netfs_sreq(subreq, netfs_sreq_trace_short);
 		} else {
 			__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
 			subreq->error =3D -ENODATA;
@@ -581,6 +560,8 @@ void netfs_read_subreq_terminated(struct netfs_io_subre=
quest *subreq)
=20
 	if (unlikely(subreq->error < 0)) {
 		trace_netfs_failure(rreq, subreq, subreq->error, netfs_fail_read);
+		if (subreq->error =3D=3D -ENOMEM)
+			set_bit(NETFS_RREQ_SAW_ENOMEM, &rreq->flags);
 		if (subreq->source =3D=3D NETFS_READ_FROM_CACHE) {
 			netfs_stat(&netfs_n_rh_read_failed);
 			__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index f9a0fb3e89e3..e6a60cf67e4a 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -13,8 +13,37 @@
 #include <linux/task_io_accounting_ops.h>
 #include "internal.h"
=20
+int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subrequest *subreq,
+				       unsigned int max_segs)
+{
+	struct netfs_io_request *creq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &creq->io_streams[1];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len =3D bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len =3D len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	/* TODO: Wait here for completion of prev subreq */
+
+	stream->issue_from +=3D subreq->len;
+	stream->buffered   -=3D subreq->len;
+	if (stream->buffered =3D=3D 0)
+		netfs_all_subreqs_queued(creq);
+	return 0;
+}
+
 /*
- * [DEPRECATED] Copy a folio to the cache with PG_private_2 set.
+ * [DEPRECATED] Copy a folio to the cache with PG_private_2 set.  Note tha=
t the
+ * folio won't necessarily be contiguous with the previous one as there mi=
ght
+ * be a mixture of folios read from the cache and downloaded from the serv=
er
+ * (or just zeroed).
  */
 static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct=
 folio *folio)
 {
@@ -24,7 +53,6 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_requ=
est *creq, struct folio
 	size_t dio_size =3D PAGE_SIZE;
 	size_t fsize =3D folio_size(folio), flen =3D fsize;
 	loff_t fpos =3D folio_pos(folio), i_size;
-	bool to_eof =3D false;
=20
 	_enter("");
=20
@@ -44,12 +72,8 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_req=
uest *creq, struct folio
 	if (fpos + fsize > creq->i_size)
 		creq->i_size =3D i_size;
=20
-	if (flen > i_size - fpos) {
+	if (flen > i_size - fpos)
 		flen =3D i_size - fpos;
-		to_eof =3D true;
-	} else if (flen =3D=3D i_size - fpos) {
-		to_eof =3D true;
-	}
=20
 	flen =3D round_up(flen, dio_size);
=20
@@ -81,37 +105,9 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_re=
quest *creq, struct folio
 	bvecq_filled_to(queue, slot);
 	creq->load_cursor.slot =3D slot;
 	creq->load_cursor.offset =3D 0;
+	trace_netfs_wback(creq, folio, 0);
=20
-	bvecq_pos_nudge(&creq->dispatch_cursor);
-=09
-	cache->submit_off =3D 0;
-	cache->submit_len =3D flen;
-
-	/* Attach the folio to one or more subrequests.  For a big folio, we
-	 * could end up with thousands of subrequests if the wsize is small -
-	 * but we might need to wait during the creation of subrequests for
-	 * network resources (eg. SMB credits).
-	 */
-	do {
-		ssize_t part;
-
-		creq->dispatch_cursor.offset =3D cache->submit_off;
-
-		atomic64_set(&creq->issued_to, fpos + cache->submit_off);
-		part =3D netfs_advance_write(creq, cache, fpos + cache->submit_off,
-					   cache->submit_len, to_eof);
-		cache->submit_off +=3D part;
-		if (part > cache->submit_len)
-			cache->submit_len =3D 0;
-		else
-			cache->submit_len -=3D part;
-	} while (cache->submit_len > 0);
-
-	bvecq_pos_step(&creq->dispatch_cursor);
-	atomic64_set(&creq->issued_to, fpos + fsize);
-
-	if (flen < fsize)
-		netfs_issue_write(creq, cache);
+	cache->buffered +=3D flen;
 }
=20
 /*
@@ -121,6 +117,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_cop=
y_to_cache(
 	struct netfs_io_request *rreq, struct folio *folio)
 {
 	struct netfs_io_request *creq;
+	struct netfs_io_stream *cache;
=20
 	if (!fscache_resources_valid(&rreq->cache_resources))
 		goto cancel;
@@ -130,12 +127,15 @@ static struct netfs_io_request *netfs_pgpriv2_begin_c=
opy_to_cache(
 	if (IS_ERR(creq))
 		goto cancel;
=20
-	if (!creq->io_streams[1].avail)
+	cache =3D &creq->io_streams[1];
+	if (!cache->avail)
 		goto cancel_put;
=20
-	bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL);
-	bvecq_pos_set(&creq->dispatch_cursor, &creq->load_cursor);
-	bvecq_pos_set(&creq->collect_cursor, &creq->dispatch_cursor);
+	if (bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL) < 0)
+		goto cancel_put;
+
+	bvecq_pos_set(&cache->dispatch_cursor, &creq->load_cursor);
+	bvecq_pos_set(&creq->collect_cursor, &creq->load_cursor);
=20
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
 	trace_netfs_copy2cache(rreq, creq);
@@ -178,19 +178,44 @@ void netfs_pgpriv2_copy_to_cache(struct netfs_io_requ=
est *rreq, struct folio *fo
 	netfs_pgpriv2_copy_folio(creq, folio);
 }
=20
+/*
+ * Issue all pending writes on the cache stream.
+ */
+static int netfs_pgpriv2_issue_stream(struct netfs_io_request *wreq,
+				      struct netfs_io_stream *stream)
+{
+	int ret =3D 0;
+
+	atomic64_set_release(&stream->issued_to, wreq->start);
+
+	do {
+		struct netfs_io_subrequest *subreq;
+
+		subreq =3D netfs_alloc_write_subreq(wreq, stream);
+		if (!subreq)
+			return -ENOMEM;
+
+		stream->issue_write(subreq);
+		if (test_bit(NETFS_RREQ_SAW_ENOMEM, &wreq->flags))
+			return -ENOMEM;
+
+	} while (stream->buffered > 0);
+
+	return ret;
+}
+
 /*
  * [DEPRECATED] End writing to the cache, flushing out any outstanding wri=
tes.
  */
 void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 {
 	struct netfs_io_request *creq =3D rreq->copy_to_cache;
+	struct netfs_io_stream *stream =3D &creq->io_streams[1];
=20
 	if (IS_ERR_OR_NULL(creq))
 		return;
=20
-	netfs_issue_write(creq, &creq->io_streams[1]);
-	smp_wmb(); /* Write lists before ALL_QUEUED. */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	netfs_pgpriv2_issue_stream(creq, stream);
 	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
 	if (list_empty_careful(&creq->io_streams[1].subrequests))
 		netfs_wake_collector(creq);
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index c45aef8dc03c..a5cd6e20cae1 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -9,20 +9,55 @@
 #include <linux/slab.h>
 #include "internal.h"
=20
-static void netfs_reissue_read(struct netfs_io_request *rreq,
-			       struct netfs_io_subrequest *subreq)
+/*
+ * Prepare the I/O buffer on a buffered read subrequest for the filesystem=
 to
+ * use as a bvec queue.
+ */
+int netfs_prepare_buffered_read_retry_buffer(struct netfs_io_subrequest *s=
ubreq,
+					     unsigned int max_segs)
 {
-	bvecq_pos_unset(&subreq->content);
+	struct netfs_io_request *rreq =3D subreq->rreq;
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->retry_cursor);
 	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
-	iov_iter_advance(&subreq->io_iter, subreq->transferred);
+	len =3D bvecq_slice(&rreq->retry_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+	if (len < subreq->len) {
+		subreq->len =3D len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+	rreq->retry_buffered -=3D subreq->len;
+	rreq->retry_start    +=3D subreq->len;
+	return 0;
+}
=20
-	subreq->error =3D 0;
+/*
+ * Reset the state of the subrequest and discard any buffering so that we =
can
+ * retry (where this may include sending it to the server instead of the
+ * cache).
+ */
+int netfs_reset_for_read_retry(struct netfs_io_subrequest *subreq)
+{
+	trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
+
+	if (subreq->retry_count > 3) {
+		trace_netfs_sreq(subreq, netfs_sreq_trace_too_many_retries);
+		return subreq->error;
+	}
+
+	subreq->retry_count++;
 	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+	__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+	__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-	netfs_stat(&netfs_n_rh_retry_read_subreq);
-	subreq->rreq->netfs_ops->issue_read(subreq);
+	bvecq_pos_unset(&subreq->content);
+	bvecq_pos_unset(&subreq->dispatch_pos);
+	subreq->error =3D 0;
+	subreq->transferred =3D 0;
+	netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
+	netfs_stat(&netfs_n_rh_retry_read_subreq);
+	return 0;
 }
=20
 /*
@@ -33,8 +68,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_=
request *rreq)
 {
 	struct netfs_io_subrequest *subreq;
 	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
-	struct bvecq_pos dispatch_cursor =3D {};
 	struct list_head *next;
+	int ret;
=20
 	_enter("R=3D%x", rreq->debug_id);
=20
@@ -44,46 +79,18 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 	if (rreq->netfs_ops->retry_request)
 		rreq->netfs_ops->retry_request(rreq, NULL);
=20
-	/* If there's no renegotiation to do, just resend each retryable subreq
-	 * up to the first permanently failed one.
-	 */
-	if (!rreq->netfs_ops->prepare_read &&
-	    !rreq->cache_resources.ops) {
-		list_for_each_entry(subreq, &stream->subrequests, rreq_link) {
-			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
-				break;
-			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				subreq->retry_count++;
-				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-				netfs_reissue_read(rreq, subreq);
-			}
-		}
-		return;
-	}
-
-	/* Okay, we need to renegotiate all the download requests and flip any
-	 * failed cache reads over to being download requests and negotiate
-	 * those also.  All fully successful subreqs have been removed from the
-	 * list and any spare data from those has been donated.
-	 *
-	 * What we do is decant the list and rebuild it one subreq at a time so
-	 * that we don't end up with donations jumping over a gap we're busy
-	 * populating with smaller subrequests.  In the event that the subreq
-	 * we just launched finishes before we insert the next subreq, it'll
-	 * fill in rreq->prev_donated instead.
-	 *
-	 * Note: Alternatively, we could split the tail subrequest right before
-	 * we reissue it and fix up the donations under lock.
+	/* Renegotiate all the download requests and flip any failed cache
+	 * reads over to being download requests and negotiate those also.
 	 */
 	next =3D stream->subrequests.next;
=20
 	do {
 		struct netfs_io_subrequest *from, *to, *tmp;
-		unsigned long long start, len;
-		size_t part;
-		bool boundary =3D false, subreq_superfluous =3D false;
+		unsigned long long start;
+		size_t len;
+		bool subreq_superfluous =3D false;
=20
-		bvecq_pos_unset(&dispatch_cursor);
+		bvecq_pos_unset(&rreq->retry_cursor);
=20
 		/* Go through the subreqs and find the next span of contiguous
 		 * buffer that we then rejig (cifs, for example, needs the
@@ -98,8 +105,7 @@ static void netfs_retry_read_subrequests(struct netfs_io=
_request *rreq)
 		       rreq->debug_id, from->debug_index,
 		       from->start, from->transferred, from->len);
=20
-		if (test_bit(NETFS_SREQ_FAILED, &from->flags) ||
-		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) {
+		if (!test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) {
 			subreq =3D from;
 			goto abandon;
 		}
@@ -108,20 +114,21 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 			subreq =3D list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start !=3D start + len ||
 			    subreq->transferred > 0 ||
-			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
 			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
 			to =3D subreq;
 			len +=3D to->len;
 		}
=20
-		_debug(" - range: %llx-%llx %llx", start, start + len - 1, len);
+		_debug(" - range: %llx-%llx %zx", start, start + len - 1, len);
=20
 		/* Determine the set of buffers we're going to use.  Each
-		 * subreq gets a subset of a single overall contiguous buffer.
+		 * subreq takes a subset of a single overall contiguous buffer.
 		 */
-		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
-		bvecq_pos_advance(&dispatch_cursor, from->transferred);
+		bvecq_pos_transfer(&rreq->retry_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&rreq->retry_cursor, from->transferred);
+		rreq->retry_start =3D start;
+		rreq->retry_buffered =3D len;
 		from->transferred =3D 0;
=20
 		/* Work through the sublist.  The chain of buffers we're going
@@ -130,51 +137,25 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 		 */
 		subreq =3D from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
-			if (!len) {
+			if (rreq->retry_buffered =3D=3D 0) {
 				subreq_superfluous =3D true;
 				break;
 			}
 			subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
-			subreq->start	=3D start;
-			subreq->len	=3D len;
-			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-			__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
-			subreq->retry_count++;
-			subreq->transferred =3D 0;
+			subreq->start	=3D rreq->retry_start;
+			subreq->len	=3D rreq->retry_buffered;
=20
-			bvecq_pos_unset(&subreq->content);
-			bvecq_pos_unset(&subreq->dispatch_pos);
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-
-			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-
-			/* Renegotiate max_len (rsize) */
-			stream->sreq_max_len =3D len;
-			stream->sreq_max_segs =3D INT_MAX;
-			if (rreq->netfs_ops->prepare_read &&
-			    rreq->netfs_ops->prepare_read(subreq) < 0) {
-				trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
+			ret =3D netfs_reset_for_read_retry(subreq);
+			if (ret < 0) {
 				__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+				rreq->error =3D ret;
 				goto abandon;
 			}
=20
-			part =3D bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len =3D part;
-
-			len -=3D part;
-			start +=3D part;
-			if (!len) {
-				if (boundary)
-					__set_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
-			} else {
-				__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
-			}
-
-			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			netfs_reissue_read(rreq, subreq);
+			netfs_stat(&netfs_n_rh_download);
+			rreq->netfs_ops->issue_read(subreq);
+			if (test_bit(NETFS_RREQ_SAW_ENOMEM, &rreq->flags))
+				goto abandon_after;
 			if (subreq =3D=3D to) {
 				subreq_superfluous =3D false;
 				break;
@@ -184,7 +165,7 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 		/* If we managed to use fewer subreqs, we can discard the
 		 * excess; if we used the same number, then we're done.
 		 */
-		if (!len) {
+		if (rreq->retry_buffered =3D=3D 0) {
 			if (!subreq_superfluous)
 				continue;
 			list_for_each_entry_safe_from(subreq, tmp,
@@ -202,7 +183,8 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 		}
=20
 		/* We ran out of subrequests, so we need to allocate some more
-		 * and insert them after.
+		 * and insert them after.  They must start out marked for retry
+		 * so that they switch to the retry cursor.
 		 */
 		do {
 			subreq =3D netfs_alloc_subrequest(rreq);
@@ -211,8 +193,8 @@ static void netfs_retry_read_subrequests(struct netfs_i=
o_request *rreq)
 				goto abandon_after;
 			}
 			subreq->source		=3D NETFS_DOWNLOAD_FROM_SERVER;
-			subreq->start		=3D start;
-			subreq->len		=3D len;
+			subreq->start		=3D rreq->retry_start;
+			subreq->len		=3D rreq->retry_buffered;
 			subreq->stream_nr	=3D stream->stream_nr;
 			subreq->retry_count	=3D 1;
=20
@@ -220,43 +202,27 @@ static void netfs_retry_read_subrequests(struct netfs=
_io_request *rreq)
 					     refcount_read(&subreq->ref),
 					     netfs_sreq_trace_new);
=20
+			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+
 			spin_lock(&rreq->lock);
+			/* Write IN_PROGRESS before pointer to new subreq */
+			smp_wmb();
 			list_add(&subreq->rreq_link, &to->rreq_link);
 			spin_unlock(&rreq->lock);
 			to =3D subreq;
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
=20
-			stream->sreq_max_len	=3D umin(len, rreq->rsize);
-			stream->sreq_max_segs	=3D INT_MAX;
-
 			netfs_stat(&netfs_n_rh_download);
-			if (rreq->netfs_ops->prepare_read(subreq) < 0) {
-				trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
-				__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
-				goto abandon;
-			}
-
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part =3D bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len =3D part;
-
-			len -=3D part;
-			start +=3D part;
-			if (!len && boundary) {
-				__set_bit(NETFS_SREQ_BOUNDARY, &to->flags);
-				boundary =3D false;
-			}
+			rreq->netfs_ops->issue_read(subreq);
+			if (test_bit(NETFS_RREQ_SAW_ENOMEM, &rreq->flags))
+				goto abandon_after;
=20
-			netfs_reissue_read(rreq, subreq);
-		} while (len);
+		} while (rreq->retry_buffered > 0);
=20
 	} while (!list_is_head(next, &stream->subrequests));
=20
 out:
-	bvecq_pos_unset(&dispatch_cursor);
+	bvecq_pos_unset(&rreq->retry_cursor);
 	return;
=20
 	/* If we hit an error, fail all remaining incomplete subrequests */
@@ -322,6 +288,7 @@ void netfs_unlock_abandoned_read_pages(struct netfs_io_=
request *rreq)
 			}
 			trace_netfs_folio(folio, netfs_folio_trace_abandon);
 			folio_unlock(folio);
+			p->bv[slot].bv_page =3D NULL;
 		}
 	}
 }
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index 98938a54810e..8237894c8fd8 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -16,6 +16,22 @@
 #include <linux/netfs.h>
 #include "internal.h"
=20
+int netfs_prepare_read_single_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs)
+{
+	struct netfs_io_request *rreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &rreq->io_streams[0];
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->load_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+	stream->buffered =3D 0;
+	stream->issue_from +=3D subreq->len;
+	rreq->submitted =3D stream->issue_from;
+	netfs_all_subreqs_queued(rreq);
+	return 0;
+}
+
 /**
  * netfs_single_mark_inode_dirty - Mark a single, monolithic object inode =
dirty
  * @inode: The inode to mark
@@ -58,17 +74,6 @@ static int netfs_single_begin_cache_read(struct netfs_io=
_request *rreq, struct n
 	return fscache_begin_read_operation(&rreq->cache_resources, netfs_i_cooki=
e(ctx));
 }
=20
-static void netfs_single_read_cache(struct netfs_io_request *rreq,
-				    struct netfs_io_subrequest *subreq)
-{
-	struct netfs_cache_resources *cres =3D &rreq->cache_resources;
-
-	_enter("R=3D%08x[%x]", rreq->debug_id, subreq->debug_index);
-	netfs_stat(&netfs_n_rh_read);
-	cres->ops->read(cres, subreq->start, &subreq->io_iter, NETFS_READ_HOLE_FA=
IL,
-			netfs_cache_read_terminated, subreq);
-}
-
 /*
  * Perform a read to a buffer from the cache or the server.  Only a single
  * subreq is permitted as the object must be fetched in a single transacti=
on.
@@ -84,73 +89,74 @@ static int netfs_single_dispatch_read(struct netfs_io_r=
equest *rreq)
 		.cached_to[1]	=3D ULLONG_MAX,
 	};
 	struct netfs_io_subrequest *subreq;
-	int ret =3D 0;
+	int ret;
+
+	netfs_read_query_cache(rreq, &occ);
=20
-	subreq =3D netfs_alloc_subrequest(rreq);
+	subreq =3D netfs_alloc_read_subrequest(rreq);
 	if (!subreq)
 		return -ENOMEM;
=20
-	subreq->source	=3D NETFS_DOWNLOAD_FROM_SERVER;
 	subreq->start	=3D 0;
 	subreq->len	=3D rreq->len;
=20
-	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-	bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
-
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
-
-	netfs_queue_read(rreq, subreq);
+	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
=20
 	/* Try to use the cache if the cache content matches the size of the
 	 * remote file.
 	 */
-	netfs_read_query_cache(rreq, &occ);
 	if (occ.cached_from[0] =3D=3D 0 &&
-	    occ.cached_to[0] =3D=3D rreq->len)
+	    occ.cached_to[0] =3D=3D rreq->len) {
+		struct netfs_cache_resources *cres =3D &rreq->cache_resources;
+
 		subreq->source =3D NETFS_READ_FROM_CACHE;
+		netfs_stat(&netfs_n_rh_read);
+		cres->ops->issue_read(subreq);
+		ret =3D netfs_wait_for_in_progress_subreq(rreq, subreq);
+		if (ret =3D=3D -ENOMEM)
+			goto cancel;
+		if (ret =3D=3D 0)
+			goto success;
+
+		/* Didn't manage to retrieve from the cache, so toss it to the
+		 * server instead.
+		 */
+		if (netfs_reset_for_read_retry(subreq) < 0)
+			goto cancel;
+	}
=20
-	switch (subreq->source) {
-	case NETFS_DOWNLOAD_FROM_SERVER:
-		netfs_stat(&netfs_n_rh_download);
-		if (rreq->netfs_ops->prepare_read) {
-			ret =3D rreq->netfs_ops->prepare_read(subreq);
-			if (ret < 0)
-				goto cancel;
-		}
+	__set_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
=20
-		smp_wmb(); /* Write lists before ALL_QUEUED. */
-		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+	/* Try to send it to the cache. */
+	for (;;) {
+		subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER;
+		netfs_stat(&netfs_n_rh_download);
 		rreq->netfs_ops->issue_read(subreq);
-		rreq->submitted +=3D subreq->len;
-		break;
-	case NETFS_READ_FROM_CACHE:
-		if (rreq->cache_resources.ops->prepare_read) {
-			ret =3D rreq->cache_resources.ops->prepare_read(subreq);
-			if (ret < 0)
-				goto cancel;
-		}
-
-		smp_wmb(); /* Write lists before ALL_QUEUED. */
-		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-		netfs_single_read_cache(rreq, subreq);
-		rreq->submitted +=3D subreq->len;
-		ret =3D 0;
-		break;
-	default:
-		pr_warn("Unexpected single-read source %u\n", subreq->source);
-		WARN_ON_ONCE(true);
-		ret =3D -EIO;
-		goto cancel;
+		ret =3D netfs_wait_for_in_progress_subreq(rreq, subreq);
+		if (ret =3D=3D 0)
+			goto success;
+		if (ret =3D=3D -ENOMEM)
+			goto cancel;
+		if (ret !=3D -EAGAIN)
+			goto failed;
+		if (netfs_reset_for_read_retry(subreq) < 0)
+			goto cancel;
 	}
=20
-	return ret;
+success:
+	rreq->transferred =3D subreq->transferred;
+	list_del_init(&subreq->rreq_link);
+	netfs_put_subrequest(subreq, netfs_sreq_trace_put_consumed);
+	return 0;
 cancel:
-	netfs_cancel_read(subreq, ret);
-	smp_wmb(); /* Write lists before ALL_QUEUED. */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
-	netfs_wake_collector(rreq);
+	rreq->error =3D ret;
+	list_del_init(&subreq->rreq_link);
+	netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
+	return ret;
+failed:
+	rreq->error =3D ret;
+	list_del_init(&subreq->rreq_link);
+	netfs_put_subrequest(subreq, netfs_sreq_trace_put_failed);
 	return ret;
 }
=20
@@ -182,7 +188,7 @@ ssize_t netfs_read_single(struct inode *inode, struct f=
ile *file, struct iov_ite
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
=20
-	ret =3D netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->dispatch_c=
ursor.bvecq, 0);
+	ret =3D netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->load_curso=
r.bvecq, 0);
 	if (ret < 0)
 		goto cleanup_free;
=20
@@ -193,9 +199,29 @@ ssize_t netfs_read_single(struct inode *inode, struct =
file *file, struct iov_ite
 	netfs_stat(&netfs_n_rh_read_single);
 	trace_netfs_read(rreq, 0, rreq->len, netfs_read_trace_read_single);
=20
-	netfs_single_dispatch_read(rreq);
+	ret =3D netfs_single_dispatch_read(rreq);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_complete);
+	if (ret =3D=3D 0) {
+		task_io_account_read(rreq->transferred);
+
+		if (test_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags) &&
+		    fscache_resources_valid(&rreq->cache_resources)) {
+			trace_netfs_rreq(rreq, netfs_rreq_trace_dirty);
+			netfs_single_mark_inode_dirty(rreq->inode);
+		}
+		ret =3D rreq->transferred;
+	}
+
+	if (rreq->netfs_ops->done)
+		rreq->netfs_ops->done(rreq);
+
+	netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, netfs_rreq_trace_wake_=
ip);
+	/* As we cleared NETFS_RREQ_IN_PROGRESS, we acquired its ref. */
+	netfs_put_request(rreq, netfs_rreq_trace_put_work_ip);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
=20
-	ret =3D netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret;
=20
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index a91b34cf01f5..6dc656fdecd1 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -28,8 +28,8 @@ static void netfs_dump_request(const struct netfs_io_requ=
est *rreq)
 	       rreq->origin, rreq->error);
 	pr_err("  st=3D%llx tsl=3D%zx/%llx/%llx\n",
 	       rreq->start, rreq->transferred, rreq->submitted, rreq->len);
-	pr_err("  cci=3D%llx/%llx/%llx\n",
-	       rreq->cleaned_to, rreq->collected_to, atomic64_read(&rreq->issued_=
to));
+	pr_err("  cci=3D%llx/%llx\n",
+	       rreq->cleaned_to, rreq->collected_to);
 	pr_err("  iw=3D%pSR\n", rreq->netfs_ops->issue_write);
 	for (int i =3D 0; i < NR_IO_STREAMS; i++) {
 		const struct netfs_io_subrequest *sreq;
@@ -38,8 +38,9 @@ static void netfs_dump_request(const struct netfs_io_requ=
est *rreq)
 		pr_err("  str[%x] s=3D%x e=3D%d acnf=3D%u,%u,%u,%u\n",
 		       s->stream_nr, s->source, s->error,
 		       s->avail, s->active, s->need_retry, s->failed);
-		pr_err("  str[%x] ct=3D%llx t=3D%zx\n",
-		       s->stream_nr, s->collected_to, s->transferred);
+		pr_err("  str[%x] it=3D%llx ct=3D%llx t=3D%zx\n",
+		       s->stream_nr, atomic64_read(&s->issued_to),
+		       s->collected_to, s->transferred);
 		list_for_each_entry(sreq, &s->subrequests, rreq_link) {
 			pr_err("  sreq[%x:%x] sc=3D%u s=3D%llx t=3D%zx/%zx r=3D%d f=3D%lx\n",
 			       sreq->stream_nr, sreq->debug_index, sreq->source,
@@ -56,7 +57,7 @@ static void netfs_dump_request(const struct netfs_io_requ=
est *rreq)
  */
 int netfs_folio_written_back(struct folio *folio)
 {
-	enum netfs_folio_trace why =3D netfs_folio_trace_clear;
+	enum netfs_folio_trace why =3D netfs_folio_trace_endwb;
 	struct inode *inode =3D folio_inode(folio);
 	struct netfs_inode *ictx =3D netfs_inode(inode);
 	struct netfs_folio *finfo;
@@ -79,13 +80,13 @@ int netfs_folio_written_back(struct folio *folio)
 		group =3D finfo->netfs_group;
 		gcount++;
 		kfree(finfo);
-		why =3D netfs_folio_trace_clear_s;
+		why =3D netfs_folio_trace_endwb_s;
 		goto end_wb;
 	}
=20
 	if ((group =3D netfs_folio_group(folio))) {
 		if (group =3D=3D NETFS_FOLIO_COPY_TO_CACHE) {
-			why =3D netfs_folio_trace_clear_cc;
+			why =3D netfs_folio_trace_endwb_cc;
 			folio_detach_private(folio);
 			goto end_wb;
 		}
@@ -98,7 +99,7 @@ int netfs_folio_written_back(struct folio *folio)
 		if (!folio_test_dirty(folio)) {
 			folio_detach_private(folio);
 			gcount++;
-			why =3D netfs_folio_trace_clear_g;
+			why =3D netfs_folio_trace_endwb_g;
 		}
 	}
=20
@@ -150,6 +151,12 @@ static void netfs_writeback_unlock_folios(struct netfs=
_io_request *wreq,
 			slot  =3D wreq->collect_cursor.slot;
 		}
=20
+		if (!bvecq->bv[slot].bv_page) {
+			WARN_ONCE(1, "R=3D%08x slot already cleared?\n", wreq->debug_id);
+			fsize =3D bvecq->bv[slot].bv_len;
+			goto skip;
+		}
+
 		folio =3D page_folio(bvecq->bv[slot].bv_page);
 		if (WARN_ONCE(!folio_test_writeback(folio),
 			      "R=3D%08x: folio %lx is not under writeback\n",
@@ -174,6 +181,7 @@ static void netfs_writeback_unlock_folios(struct netfs_=
io_request *wreq,
 		*notes |=3D MADE_PROGRESS;
=20
 		bvecq->bv[slot].bv_page =3D NULL;
+	skip:
 		slot++;
 		if (fpos + fsize >=3D collected_to)
 			break;
@@ -224,9 +232,7 @@ static void netfs_collect_write_results(struct netfs_io=
_request *wreq)
 	trace_netfs_rreq(wreq, netfs_rreq_trace_collect);
=20
 reassess_streams:
-	/* Order reading the issued_to point before reading the queue it refers t=
o. */
-	issued_to =3D atomic64_read_acquire(&wreq->issued_to);
-	smp_rmb();
+	issued_to =3D ULLONG_MAX;
 	collected_to =3D ULLONG_MAX;
 	if (wreq->origin =3D=3D NETFS_WRITEBACK ||
 	    wreq->origin =3D=3D NETFS_WRITETHROUGH ||
@@ -241,17 +247,30 @@ static void netfs_collect_write_results(struct netfs_=
io_request *wreq)
 	 * to the tail whilst we're doing this.
 	 */
 	for (s =3D 0; s < NR_IO_STREAMS; s++) {
+		unsigned long long s_issued_to;
+
 		stream =3D &wreq->io_streams[s];
-		/* Read active flag before list pointers */
+		/* Read active flag before issued_to */
 		if (!smp_load_acquire(&stream->active))
 			continue;
=20
-		front =3D list_first_entry_or_null_acquire(&stream->subrequests,
-							 struct netfs_io_subrequest, rreq_link);
-		/* Read first subreq pointer before IN_PROGRESS flag. */
-
-		while (front) {
+		for (;;) {
 			bool cancelled;
+
+			/* Order reading the issued_to point before reading the
+			 * queue it refers to.
+			 */
+			s_issued_to =3D atomic64_read_acquire(&stream->issued_to);
+			if (s_issued_to < issued_to)
+				issued_to =3D s_issued_to;
+
+			front =3D list_first_entry_or_null_acquire(&stream->subrequests,
+								 struct netfs_io_subrequest,
+								 rreq_link);
+			/* Read first subreq pointer before IN_PROGRESS flag. */
+			if (!front)
+				break;
+
 			trace_netfs_collect_sreq(wreq, front);
 			//_debug("sreq [%x] %llx %zx/%zx",
 			//       front->debug_index, front->start, front->transferred, front->l=
en);
@@ -420,9 +439,8 @@ bool netfs_write_collection(struct netfs_io_request *wr=
eq)
 	/* We're done when the app thread has finished posting subreqs and all
 	 * the queues in all the streams are empty.
 	 */
-	if (!test_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags))
+	if (!netfs_are_all_subreqs_queued(wreq))
 		return false;
-	smp_rmb(); /* Read ALL_QUEUED before lists. */
=20
 	transferred =3D LONG_MAX;
 	for (s =3D 0; s < NR_IO_STREAMS; s++) {
@@ -533,6 +551,8 @@ void netfs_write_subrequest_terminated(void *_op, ssize=
_t transferred_or_error)
=20
 	if (IS_ERR_VALUE(transferred_or_error)) {
 		subreq->error =3D transferred_or_error;
+		if (transferred_or_error =3D=3D -ENOMEM)
+			set_bit(NETFS_RREQ_SAW_ENOMEM, &wreq->flags);
=20
 		switch (subreq->source) {
 		case NETFS_WRITE_TO_CACHE:
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 986a578fd0da..66f5daf9d8cf 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -36,6 +36,39 @@
 #include <linux/pagemap.h>
 #include "internal.h"
=20
+#define NOTE_UPLOAD_AVAIL	0x001	/* Upload is available */
+#define NOTE_CACHE_AVAIL	0x002	/* Local cache is available */
+#define NOTE_CACHE_COPY		0x004	/* Copy folio to cache */
+#define NOTE_UPLOAD		0x008	/* Upload folio to server */
+#define NOTE_UPLOAD_STARTED	0x010	/* Upload started */
+#define NOTE_STREAMW		0x020	/* Folio is from a streaming write */
+#define NOTE_DISCONTIG_BEFORE	0x040	/* Folio discontiguous with the previo=
us folio */
+#define NOTE_DISCONTIG_AFTER	0x080	/* Folio discontiguous with the next fo=
lio */
+#define NOTE_TO_EOF		0x100	/* Data in folio ends at EOF */
+#define NOTE_FLUSH_ANYWAY	0x200	/* Flush data, even if not hit estimated l=
imit */
+
+#define NOTES__KEEP_MASK (NOTE_UPLOAD_AVAIL | NOTE_CACHE_AVAIL | NOTE_UPLO=
AD_STARTED)
+
+struct netfs_wb_params {
+	unsigned long long	last_end;	/* End file pos of previous folio */
+	unsigned long long	folio_start;	/* File pos of folio */
+	unsigned int		folio_len;	/* Length of folio */
+	unsigned int		dirty_offset;	/* Offset of dirty region in folio */
+	unsigned int		dirty_len;	/* Length of dirty region in folio */
+	unsigned int		notes;		/* Notes on applicability */
+	struct bvecq_pos	dispatch_cursor; /* Folio queue anchor for issue_at */
+	struct netfs_write_estimate estimates[2];
+};
+
+struct netfs_writethrough {
+	struct netfs_wb_params	params;
+	struct netfs_io_request	*wreq;
+	struct folio		*in_progress;
+};
+
+static int netfs_prepare_write_single_buffer(struct netfs_io_subrequest *s=
ubreq,
+					     unsigned int max_segs);
+
 /*
  * Kill all dirty folios in the event of an unrecoverable error, starting =
with
  * a locked folio we've already obtained from writeback_iter().
@@ -115,65 +148,48 @@ struct netfs_io_request *netfs_create_write_req(struc=
t address_space *mapping,
=20
 	wreq->io_streams[0].stream_nr		=3D 0;
 	wreq->io_streams[0].source		=3D NETFS_UPLOAD_TO_SERVER;
-	wreq->io_streams[0].prepare_write	=3D ictx->ops->prepare_write;
+	wreq->io_streams[0].applicable		=3D NOTE_UPLOAD;
+	wreq->io_streams[0].estimate_write	=3D ictx->ops->estimate_write;
 	wreq->io_streams[0].issue_write		=3D ictx->ops->issue_write;
 	wreq->io_streams[0].collected_to	=3D start;
 	wreq->io_streams[0].transferred		=3D 0;
=20
 	wreq->io_streams[1].stream_nr		=3D 1;
 	wreq->io_streams[1].source		=3D NETFS_WRITE_TO_CACHE;
+	wreq->io_streams[1].applicable		=3D NOTE_CACHE_COPY;
 	wreq->io_streams[1].collected_to	=3D start;
 	wreq->io_streams[1].transferred		=3D 0;
 	if (fscache_resources_valid(&wreq->cache_resources)) {
 		wreq->io_streams[1].avail	=3D true;
 		wreq->io_streams[1].active	=3D true;
-		wreq->io_streams[1].prepare_write =3D wreq->cache_resources.ops->prepare=
_write_subreq;
+		wreq->io_streams[1].estimate_write =3D wreq->cache_resources.ops->estima=
te_write;
 		wreq->io_streams[1].issue_write =3D wreq->cache_resources.ops->issue_wri=
te;
 	}
=20
 	return wreq;
 }
=20
-/**
- * netfs_prepare_write_failed - Note write preparation failed
- * @subreq: The subrequest to mark
- *
- * Mark a subrequest to note that preparation for write failed.
- */
-void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq)
-{
-	__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
-	trace_netfs_sreq(subreq, netfs_sreq_trace_prep_failed);
-}
-EXPORT_SYMBOL(netfs_prepare_write_failed);
-
 /*
- * Prepare a write subrequest.  We need to allocate a new subrequest
- * if we don't have one.
+ * Allocate and prepare a write subrequest.
  */
-void netfs_prepare_write(struct netfs_io_request *wreq,
-			 struct netfs_io_stream *stream,
-			 loff_t start)
+struct netfs_io_subrequest *netfs_alloc_write_subreq(struct netfs_io_reque=
st *wreq,
+						     struct netfs_io_stream *stream)
 {
 	struct netfs_io_subrequest *subreq;
=20
 	subreq =3D netfs_alloc_subrequest(wreq);
 	subreq->source		=3D stream->source;
-	subreq->start		=3D start;
+	subreq->start		=3D stream->issue_from;
+	subreq->len		=3D stream->buffered;
 	subreq->stream_nr	=3D stream->stream_nr;
=20
-	bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
-
 	_enter("R=3D%x[%x]", wreq->debug_id, subreq->debug_index);
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
=20
-	stream->sreq_max_len	=3D UINT_MAX;
-	stream->sreq_max_segs	=3D INT_MAX;
 	switch (stream->source) {
 	case NETFS_UPLOAD_TO_SERVER:
 		netfs_stat(&netfs_n_wh_upload);
-		stream->sreq_max_len =3D wreq->wsize;
 		break;
 	case NETFS_WRITE_TO_CACHE:
 		netfs_stat(&netfs_n_wh_write);
@@ -183,9 +199,6 @@ void netfs_prepare_write(struct netfs_io_request *wreq,
 		break;
 	}
=20
-	if (stream->prepare_write)
-		stream->prepare_write(subreq);
-
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
=20
 	/* We add to the end of the list whilst the collector may be walking
@@ -195,99 +208,46 @@ void netfs_prepare_write(struct netfs_io_request *wre=
q,
 	spin_lock(&wreq->lock);
 	/* Write IN_PROGRESS before pointer to new subreq */
 	list_add_tail_release(&subreq->rreq_link, &stream->subrequests);
-	if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-		if (!stream->active) {
-			stream->collected_to =3D subreq->start;
-			/* Write list pointers before active flag */
-			smp_store_release(&stream->active, true);
-		}
-	}
+	if (list_is_first(&subreq->rreq_link, &stream->subrequests) &&
+	    stream->collected_to =3D=3D 0)
+		stream->collected_to =3D subreq->start;
=20
 	spin_unlock(&wreq->lock);
-
-	stream->construct =3D subreq;
+	return subreq;
 }
=20
 /*
- * Set the I/O iterator for the filesystem/cache to use and dispatch the I=
/O
- * operation.  The operation may be asynchronous and should call
- * netfs_write_subrequest_terminated() when complete.
+ * Prepare the buffer for a buffered write.
  */
-static void netfs_do_issue_write(struct netfs_io_stream *stream,
-				 struct netfs_io_subrequest *subreq)
+static int netfs_prepare_buffered_write_buffer(struct netfs_io_subrequest =
*subreq,
+					       unsigned int max_segs)
 {
 	struct netfs_io_request *wreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr];
+	ssize_t len;
=20
-	_enter("R=3D%x[%x],%zx", wreq->debug_id, subreq->debug_index, subreq->len=
);
+	_enter("%zx,{,%u,%u},%u",
+	       subreq->len, stream->dispatch_cursor.slot, stream->dispatch_cursor=
.offset, max_segs);
=20
-	if (stream->source =3D=3D NETFS_WRITE_TO_CACHE &&
-	    unlikely(test_bit(NETFS_RREQ_CACHE_STOP, &wreq->flags))) {
-		size_t dio_size =3D wreq->cache_resources.dio_size;
-		size_t len, disp;
-
-		disp =3D subreq->start & (dio_size - 1);
-		len =3D round_up(subreq->len + disp, dio_size);
-
-		subreq->start -=3D disp;
-		subreq->len =3D len;
-
-		__set_bit(NETFS_SREQ_CANCELLED, &subreq->flags);
-		return netfs_write_subrequest_terminated(subreq, subreq->len);
-	}
-
-	if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
-		return netfs_write_subrequest_terminated(subreq, subreq->error);
-
-	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-	stream->issue_write(subreq);
-}
-
-void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq)
-{
-	// TODO: Use encrypted buffer
-	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
-			    subreq->content.bvecq, subreq->content.slot,
-			    subreq->content.offset,
-			    subreq->len);
-	iov_iter_advance(&subreq->io_iter, subreq->transferred);
-
-	subreq->retry_count++;
-	subreq->error =3D 0;
-	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
-	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-	netfs_stat(&netfs_n_wh_retry_write_subreq);
-	netfs_do_issue_write(stream, subreq);
-}
-
-void netfs_issue_write(struct netfs_io_request *wreq,
-		       struct netfs_io_stream *stream)
-{
-	struct netfs_io_subrequest *subreq =3D stream->construct;
-
-	if (!subreq)
-		return;
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
=20
 	/* If we have a write to the cache, we need to round out the first and
 	 * last entries (only those as the data will be on virtually contiguous
 	 * folios) to cache DIO boundaries.
 	 */
 	if (subreq->source =3D=3D NETFS_WRITE_TO_CACHE) {
-		struct bvecq_pos tmp_pos;
 		struct bio_vec *bv;
 		struct bvecq *bq;
 		size_t dio_size =3D wreq->cache_resources.dio_size;
-		size_t disp, len;
-		int ret;
+		size_t disp, dlen;
=20
-		bvecq_pos_set(&tmp_pos, &subreq->dispatch_pos);
-		ret =3D bvecq_extract(&tmp_pos, subreq->len, INT_MAX, &subreq->content.b=
vecq);
-		bvecq_pos_unset(&tmp_pos);
-		if (ret < 0) {
-			netfs_write_subrequest_terminated(subreq, -ENOMEM);
-			return;
-		}
+		len =3D bvecq_extract(&stream->dispatch_cursor, subreq->len, max_segs,
+				    &subreq->content.bvecq);
+		if (len < 0)
+			return -ENOMEM;
+
+		_debug("extract %zx/%zx", len, subreq->len);
+		subreq->len =3D len;
=20
 		/* Round the first entry down.  We should be able to get away
 		 * with this as this path only happens for buffered reads and
@@ -315,88 +275,292 @@ void netfs_issue_write(struct netfs_io_request *wreq,
 		while (bq->next)
 			bq =3D bq->next;
 		bv =3D &bq->bv[bq->nr_slots - 1];
-		len =3D round_up(bv->bv_len, dio_size);
-		if (len > bv->bv_len) {
-			subreq->len +=3D len - bv->bv_len;
-			bv->bv_len =3D len;
+		dlen =3D round_up(bv->bv_len, dio_size);
+		if (dlen > bv->bv_len) {
+			subreq->len +=3D dlen - bv->bv_len;
+			bv->bv_len =3D dlen;
 		}
 	} else {
-		bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+		bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+		len =3D bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+				  &subreq->nr_segs);
+
+		if (len < subreq->len) {
+			subreq->len =3D len;
+			trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+		}
+	}
+
+	stream->issue_from +=3D len;
+	stream->buffered   -=3D len;
+	if (stream->buffered =3D=3D 0) {
+		stream->buffering =3D false;
+		bvecq_pos_unset(&stream->dispatch_cursor);
+	}
+	/* Order loading the queue before updating the issued_to point */
+	atomic64_set_release(&stream->issued_to, stream->issue_from);
+	return 0;
+}
+
+/**
+ * netfs_prepare_write_buffer - Get the buffer for a subrequest
+ * @subreq: The subrequest to get the buffer for
+ * @max_segs: Maximum number of segments in buffer (or INT_MAX)
+ *
+ * Extract a slice of buffer from the stream and attach it to the subreque=
st as
+ * a bio_vec queue.  The maximum amount of data attached is set by
+ * @subreq->len, but this may be shortened if @max_segs would be exceeded.
+ */
+int netfs_prepare_write_buffer(struct netfs_io_subrequest *subreq,
+			       unsigned int max_segs)
+{
+	struct netfs_io_request *rreq =3D subreq->rreq;
+
+	switch (rreq->origin) {
+	case NETFS_WRITEBACK:
+	case NETFS_WRITETHROUGH:
+		if (test_bit(NETFS_RREQ_RETRYING, &rreq->flags))
+			return netfs_prepare_write_retry_buffer(subreq, max_segs);
+		return netfs_prepare_buffered_write_buffer(subreq, max_segs);
+
+	case NETFS_UNBUFFERED_WRITE:
+	case NETFS_DIO_WRITE:
+		return netfs_prepare_unbuffered_write_buffer(subreq, max_segs);
+
+	case NETFS_WRITEBACK_SINGLE:
+		return netfs_prepare_write_single_buffer(subreq, max_segs);
+
+	case NETFS_PGPRIV2_COPY_TO_CACHE:
+		return netfs_prepare_pgpriv2_write_buffer(subreq, max_segs);
+
+	default:
+		WARN_ON_ONCE(1);
+		return -EIO;
 	}
+}
+EXPORT_SYMBOL(netfs_prepare_write_buffer);
+
+/*
+ * Issue writes for a stream.
+ */
+static int netfs_issue_writes(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_wb_params *params)
+{
+	struct netfs_write_estimate *estimate =3D &params->estimates[stream->stre=
am_nr];
+
+	for (;;) {
+		struct netfs_io_subrequest *subreq;
+
+		if (test_bit(NETFS_RREQ_PAUSE, &wreq->flags))
+			netfs_wait_for_paused_write(wreq);
+
+		subreq =3D netfs_alloc_write_subreq(wreq, stream);
+		if (!subreq)
+			return -ENOMEM;
+
+		if (stream->source =3D=3D NETFS_WRITE_TO_CACHE &&
+		    unlikely(test_bit(NETFS_RREQ_CACHE_STOP, &wreq->flags))) {
+			size_t dio_size =3D wreq->cache_resources.dio_size;
+			size_t len, disp;
+
+			disp =3D subreq->start & (dio_size - 1);
+			len =3D round_up(subreq->len + disp, dio_size);
+
+			subreq->start -=3D disp;
+			subreq->len =3D len;
+
+			stream->issue_from =3D subreq->start + len;
+			stream->buffered =3D 0;
+			stream->buffering =3D false;
+			bvecq_pos_unset(&stream->dispatch_cursor);
+			estimate->issue_at =3D subreq->start + len + 16 * 1024 * 1024;
+			estimate->max_segs =3D INT_MAX;
+			__set_bit(NETFS_SREQ_CANCELLED, &subreq->flags);
+			netfs_write_subrequest_terminated(subreq, len);
+			return 0;
+		}
+
+		stream->issue_write(subreq);
+		if (test_bit(NETFS_RREQ_SAW_ENOMEM, &wreq->flags))
+			return -ENOMEM;
=20
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
-			    subreq->content.bvecq, subreq->content.slot,
-			    subreq->content.offset,
-			    subreq->len);
+		if (stream->buffered =3D=3D 0) {
+			if (stream->stream_nr =3D=3D 0)
+				params->notes &=3D ~NOTE_UPLOAD_STARTED;
+			return 0;
+		}
=20
-	stream->construct =3D NULL;
-	netfs_do_issue_write(stream, subreq);
+		if (!(params->notes & NOTE_FLUSH_ANYWAY)) {
+			estimate->issue_at =3D ULLONG_MAX;
+			estimate->max_segs =3D INT_MAX;
+			stream->estimate_write(wreq, stream, estimate);
+			if (stream->issue_from + stream->buffered < estimate->issue_at &&
+			    estimate->max_segs > 0)
+				return 0;
+		}
+	}
 }
=20
 /*
- * Add data to the write subrequest, dispatching each as we fill it up or =
if it
- * is discontiguous with the previous.  We only fill one part at a time so=
 that
- * we can avoid overrunning the credits obtained (cifs) and try to paralle=
lise
- * content-crypto preparation with network writes.
+ * Issue pending writes on a stream.
  */
-size_t netfs_advance_write(struct netfs_io_request *wreq,
-			   struct netfs_io_stream *stream,
-			   loff_t start, size_t len, bool to_eof)
+static int netfs_issue_stream(struct netfs_io_request *wreq,
+			      struct netfs_wb_params *params, int s)
 {
-	struct netfs_io_subrequest *subreq =3D stream->construct;
-	size_t part;
+	struct netfs_write_estimate *estimate =3D &params->estimates[s];
+	struct netfs_io_stream *stream =3D &wreq->io_streams[s];
+	unsigned long long dirty_start;
+	bool discontig_before =3D params->notes & NOTE_DISCONTIG_BEFORE;
+	int ret;
+
+	_enter("%x", params->notes);
=20
-	if (!stream->avail) {
-		_leave("no write");
-		return len;
+	/* If the current folio doesn't contribute to this stream, see if we
+	 * need to flush it.
+	 */
+	if (!(params->notes & stream->applicable)) {
+		if (!stream->buffering) {
+			atomic64_set_release(&stream->issued_to,
+					     params->folio_start + params->folio_len);
+			return 0;
+		}
+		discontig_before =3D true;
 	}
=20
-	_enter("R=3D%x[%x]", wreq->debug_id, subreq ? subreq->debug_index : 0);
+	/* Issue writes if we meet a discontiguity before the current folio.
+	 * Even if the filesystem can do sparse/vectored writes, we still
+	 * generate a subreq per contiguous region rather than generating
+	 * separate extent lists.
+	 */
+	if (stream->buffering && discontig_before) {
+		params->notes |=3D NOTE_FLUSH_ANYWAY;
+		ret =3D netfs_issue_writes(wreq, stream, params);
+		if (ret < 0)
+			return ret;
+		stream->buffering =3D false;
+		params->notes &=3D ~NOTE_FLUSH_ANYWAY;
+	}
=20
-	if (subreq && start !=3D subreq->start + subreq->len) {
-		netfs_issue_write(wreq, stream);
-		subreq =3D NULL;
+	if (!(params->notes & stream->applicable)) {
+		atomic64_set_release(&stream->issued_to,
+				     params->folio_start + params->folio_len);
+		return 0;
 	}
=20
-	if (!stream->construct)
-		netfs_prepare_write(wreq, stream, start);
-	subreq =3D stream->construct;
+	/* If we're not currently buffering on this stream, we need to get an
+	 * estimate of when we need to issue a write.  It might be within the
+	 * starting folio.
+	 */
+	dirty_start =3D params->folio_start + params->dirty_offset;
+	if (!stream->buffering) {
+		stream->buffering =3D true;
+		stream->issue_from =3D dirty_start;
+		bvecq_pos_set(&stream->dispatch_cursor, &params->dispatch_cursor);
+		estimate->issue_at =3D ULLONG_MAX;
+		estimate->max_segs =3D INT_MAX;
+		stream->estimate_write(wreq, stream, estimate);
+	}
=20
-	part =3D umin(stream->sreq_max_len - subreq->len, len);
-	_debug("part %zx/%zx %zx/%zx", subreq->len, stream->sreq_max_len, part, l=
en);
-	subreq->len +=3D part;
-	subreq->nr_segs++;
+	stream->buffered +=3D params->dirty_len;
+	estimate->max_segs--;
=20
-	if (subreq->len >=3D stream->sreq_max_len ||
-	    subreq->nr_segs >=3D stream->sreq_max_segs ||
-	    to_eof) {
-		netfs_issue_write(wreq, stream);
-		subreq =3D NULL;
+	/* Poke the filesystem to issue writes when we hit the limit it set or
+	 * if the data ends before the end of the page.
+	 */
+	if (params->notes & NOTE_DISCONTIG_AFTER)
+		params->notes |=3D NOTE_FLUSH_ANYWAY;
+	_debug("[%u] %llx + %zx >=3D %llx, %u %x",
+	       s, stream->issue_from, stream->buffered, estimate->issue_at,
+	       estimate->max_segs, params->notes);
+	if (stream->issue_from + stream->buffered >=3D estimate->issue_at ||
+	    estimate->max_segs <=3D 0 ||
+	    (params->notes & NOTE_FLUSH_ANYWAY)) {
+		ret =3D netfs_issue_writes(wreq, stream, params);
+		if (ret < 0)
+			return ret;
 	}
=20
-	return part;
+	return 0;
 }
=20
 /*
- * Write some of a pending folio data back to the server.
+ * See which streams need writes issuing and issue them.
  */
-static int netfs_write_folio(struct netfs_io_request *wreq,
-			     struct writeback_control *wbc,
-			     struct folio *folio)
+static int netfs_issue_streams(struct netfs_io_request *wreq,
+			       struct netfs_wb_params *params)
+{
+	int ret =3D 0, ret2;
+
+	_enter("%x", params->notes);
+
+	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
+		ret2 =3D netfs_issue_stream(wreq, params, s);
+		if (ret2 < 0)
+			ret =3D ret2;
+	}
+	return ret;
+}
+
+/*
+ * End the issuing of writes, letting the collector know we're done.
+ */
+static void netfs_end_issue_write(struct netfs_io_request *wreq,
+				  struct netfs_wb_params *params)
+{
+	bool needs_poke =3D true;
+
+	params->notes |=3D NOTE_FLUSH_ANYWAY;
+
+	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
+		struct netfs_io_stream *stream =3D &wreq->io_streams[s];
+		int ret;
+
+		if (stream->buffering) {
+			ret =3D netfs_issue_writes(wreq, stream, params);
+			if (ret < 0 && stream->source !=3D NETFS_WRITE_TO_CACHE) {
+				/* Leave the error somewhere the completion
+				 * path can pick it up if there isn't already
+				 * another error logged.
+				 */
+				cmpxchg(&wreq->error, 0, ret);
+			}
+			stream->buffering =3D false;
+		}
+	}
+
+	netfs_all_subreqs_queued(wreq);
+
+	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
+		struct netfs_io_stream *stream =3D &wreq->io_streams[s];
+
+		if (!stream->active)
+			continue;
+		if (!list_empty(&stream->subrequests))
+			needs_poke =3D false;
+	}
+
+	if (needs_poke)
+		netfs_wake_collector(wreq);
+}
+
+/*
+ * Queue a folio for writeback.
+ */
+static int netfs_queue_wb_folio(struct netfs_io_request *wreq,
+				struct writeback_control *wbc,
+				struct folio *folio,
+				struct netfs_wb_params *params)
 {
-	struct netfs_io_stream *upload =3D &wreq->io_streams[0];
-	struct netfs_io_stream *cache  =3D &wreq->io_streams[1];
-	struct netfs_io_stream *stream;
 	struct netfs_group *fgroup; /* TODO: Use this with ceph */
 	struct netfs_folio *finfo;
 	struct bvecq *queue =3D wreq->load_cursor.bvecq;
 	unsigned int slot;
 	size_t fsize =3D folio_size(folio), flen =3D fsize, foff =3D 0;
 	loff_t fpos =3D folio_pos(folio), i_size;
-	bool to_eof =3D false, streamw =3D false;
-	bool debug =3D false;
+	int ret;
=20
-	_enter("");
+	_enter("%x", params->notes);
=20
 	if (!wreq->spare) {
 		wreq->spare =3D bvecq_alloc_one(BVECQ_STD_SLOTS, GFP_NOFS);
@@ -431,23 +595,36 @@ static int netfs_write_folio(struct netfs_io_request =
*wreq,
 	if (finfo) {
 		foff =3D finfo->dirty_offset;
 		flen =3D foff + finfo->dirty_len;
-		streamw =3D true;
+		params->notes |=3D NOTE_STREAMW;
+		if (foff > 0)
+			params->notes |=3D NOTE_DISCONTIG_BEFORE;
+		if (flen < fsize)
+			params->notes |=3D NOTE_DISCONTIG_AFTER;
 	}
=20
+	if (params->last_end && fpos !=3D params->last_end)
+		params->notes |=3D NOTE_DISCONTIG_BEFORE;
+	params->last_end =3D fpos + fsize;
+
 	if (wreq->origin =3D=3D NETFS_WRITETHROUGH) {
-		to_eof =3D false;
 		if (flen > i_size - fpos)
 			flen =3D i_size - fpos;
+		/* EOF may be changing. */
 	} else if (flen > i_size - fpos) {
 		flen =3D i_size - fpos;
-		if (!streamw)
+		if (!(params->notes & NOTE_STREAMW))
 			folio_zero_segment(folio, flen, fsize);
-		to_eof =3D true;
+		params->notes |=3D NOTE_TO_EOF;
 	} else if (flen =3D=3D i_size - fpos) {
-		to_eof =3D true;
+		params->notes |=3D NOTE_TO_EOF;
 	}
 	flen -=3D foff;
=20
+	params->folio_start	=3D fpos;
+	params->folio_len	=3D fsize;
+	params->dirty_offset	=3D foff;
+	params->dirty_len	=3D flen;
+
 	_debug("folio %zx %zx %zx", foff, flen, fsize);
=20
 	/* Deal with discontinuities in the stream of dirty pages.  These can
@@ -467,168 +644,84 @@ static int netfs_write_folio(struct netfs_io_request=
 *wreq,
 	 *     write-back group.
 	 */
 	if (fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE) {
-		netfs_issue_write(wreq, upload);
+		if (!(params->notes & NOTE_CACHE_AVAIL)) {
+			trace_netfs_folio(folio, netfs_folio_trace_cancel_copy);
+			goto cancel_folio;
+		}
+		params->notes |=3D NOTE_CACHE_COPY;
+		trace_netfs_folio(folio, netfs_folio_trace_store_copy);
 	} else if (fgroup !=3D wreq->group) {
 		/* We can't write this page to the server yet. */
 		kdebug("wrong group");
-		folio_redirty_for_writepage(wbc, folio);
-		folio_unlock(folio);
-		netfs_issue_write(wreq, upload);
-		netfs_issue_write(wreq, cache);
-		return 0;
+		goto skip_folio;
+	} else if (!(params->notes & (NOTE_UPLOAD_AVAIL | NOTE_CACHE_AVAIL))) {
+		trace_netfs_folio(folio, netfs_folio_trace_cancel_store);
+		goto cancel_folio_discard;
+	} else {
+		if (params->notes & NOTE_UPLOAD_STARTED) {
+			params->notes |=3D NOTE_UPLOAD;
+			trace_netfs_folio(folio, netfs_folio_trace_store_plus);
+		} else {
+			params->notes |=3D NOTE_UPLOAD | NOTE_UPLOAD_STARTED;
+			trace_netfs_folio(folio, netfs_folio_trace_store);
+		}
+		if ((params->notes & NOTE_CACHE_AVAIL) &&
+		    !(params->notes & NOTE_STREAMW))
+			params->notes |=3D NOTE_CACHE_COPY;
 	}
=20
-	if (foff > 0)
-		netfs_issue_write(wreq, upload);
-	if (streamw)
-		netfs_issue_write(wreq, cache);
-
 	folio_start_writeback(folio);
 	folio_unlock(folio);
=20
-	if (fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE) {
-		if (!cache->avail) {
-			trace_netfs_folio(folio, netfs_folio_trace_cancel_copy);
-			netfs_issue_write(wreq, upload);
-			netfs_folio_written_back(folio);
-			return 0;
-		}
-		trace_netfs_folio(folio, netfs_folio_trace_store_copy);
-	} else if (!upload->avail && !cache->avail) {
-		trace_netfs_folio(folio, netfs_folio_trace_cancel_store);
-		netfs_folio_written_back(folio);
-		return 0;
-	} else if (!upload->construct) {
-		trace_netfs_folio(folio, netfs_folio_trace_store);
-	} else {
-		trace_netfs_folio(folio, netfs_folio_trace_store_plus);
-	}
-
 	/* Institute a new bvec queue segment if the current one is full or if
 	 * we encounter a discontiguity.  The discontiguity break is important
 	 * when it comes to bulk unlocking folios by file range.
 	 */
 	if (bvecq_is_full(queue) ||
-	    (fpos !=3D wreq->last_end && wreq->last_end > 0)) {
+	    ((params->notes & NOTE_DISCONTIG_BEFORE) && queue->nr_slots > 0)) {
 		bvecq_buffer_append(&wreq->load_cursor, wreq->spare);
 		wreq->spare =3D NULL;
=20
 		queue =3D wreq->load_cursor.bvecq;
 		queue->fpos =3D fpos;
-		if (fpos !=3D wreq->last_end)
+		if (params->notes & NOTE_DISCONTIG_BEFORE)
 			queue->discontig =3D true;
-		bvecq_pos_move(&wreq->dispatch_cursor, queue);
-		wreq->dispatch_cursor.slot =3D 0;
+		bvecq_pos_move(&params->dispatch_cursor, queue);
+		params->dispatch_cursor.slot =3D 0;
 	}
=20
 	/* Attach the folio to the rolling buffer. */
 	slot =3D queue->nr_slots;
-	bvec_set_folio(&queue->bv[slot], folio, flen, 0);
+	bvec_set_folio(&queue->bv[slot], folio, flen, foff);
 	trace_netfs_bv_slot(queue, slot);
 	slot++;
 	bvecq_filled_to(queue, slot);
 	wreq->load_cursor.slot =3D slot;
 	wreq->load_cursor.offset =3D 0;
-	wreq->last_end =3D fpos + foff + flen;
+	trace_netfs_wback(wreq, folio, params->notes);
=20
-	/* Move the submission point forward to allow for write-streaming data
-	 * not starting at the front of the page.  We don't do write-streaming
-	 * with the cache as the cache requires DIO alignment.
-	 *
-	 * Also skip uploading for data that's been read and just needs copying
-	 * to the cache.
-	 */
-	bvecq_pos_nudge(&wreq->dispatch_cursor);
-=09
-	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
-		stream =3D &wreq->io_streams[s];
-		stream->submit_off =3D 0;
-		stream->submit_len =3D flen;
-		if (!stream->avail ||
-		    (stream->source =3D=3D NETFS_WRITE_TO_CACHE && streamw) ||
-		    (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER &&
-		     fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE)) {
-			stream->submit_off =3D UINT_MAX;
-			stream->submit_len =3D 0;
-		}
-	}
-
-	/* Attach the folio to one or more subrequests.  For a big folio, we
-	 * could end up with thousands of subrequests if the wsize is small -
-	 * but we might need to wait during the creation of subrequests for
-	 * network resources (eg. SMB credits).
-	 */
-	for (;;) {
-		ssize_t part;
-		size_t lowest_off =3D ULONG_MAX;
-		int choose_s =3D -1;
-
-		/* Always add to the lowest-submitted stream first. */
-		for (int s =3D 0; s < NR_IO_STREAMS; s++) {
-			stream =3D &wreq->io_streams[s];
-			if (stream->submit_len > 0 &&
-			    stream->submit_off < lowest_off) {
-				lowest_off =3D stream->submit_off;
-				choose_s =3D s;
-			}
-		}
-
-		if (choose_s < 0)
-			break;
-		stream =3D &wreq->io_streams[choose_s];
-
-		/* Advance the cursor. */
-		wreq->dispatch_cursor.offset =3D stream->submit_off;
-
-		atomic64_set(&wreq->issued_to, fpos + foff + stream->submit_off);
-		part =3D netfs_advance_write(wreq, stream, fpos + foff + stream->submit_=
off,
-					   stream->submit_len, to_eof);
-		stream->submit_off +=3D part;
-		if (part > stream->submit_len)
-			stream->submit_len =3D 0;
-		else
-			stream->submit_len -=3D part;
-		if (part > 0)
-			debug =3D true;
-	}
-
-	bvecq_pos_step(&wreq->dispatch_cursor);
-	/* Order loading the queue before updating the issue_to point */
-	atomic64_set_release(&wreq->issued_to, fpos + fsize);
-
-	if (!debug)
-		kdebug("R=3D%x: No submit", wreq->debug_id);
-
-	if (foff + flen < fsize)
-		for (int s =3D 0; s < NR_IO_STREAMS; s++)
-			netfs_issue_write(wreq, &wreq->io_streams[s]);
-
-	_leave(" =3D 0");
+out:
+	_leave(" =3D %x", params->notes);
 	return 0;
-}
=20
-/*
- * End the issuing of writes, letting the collector know we're done.
- */
-static void netfs_end_issue_write(struct netfs_io_request *wreq)
-{
-	bool needs_poke =3D true;
-
-	smp_wmb(); /* Write subreq lists before ALL_QUEUED. */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
-
-	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
-		struct netfs_io_stream *stream =3D &wreq->io_streams[s];
-
-		if (!stream->active)
-			continue;
-		if (!list_empty(&stream->subrequests))
-			needs_poke =3D false;
-		netfs_issue_write(wreq, stream);
-	}
-
-	if (needs_poke)
-		netfs_wake_collector(wreq);
+skip_folio:
+	ret =3D folio_redirty_for_writepage(wbc, folio);
+	folio_unlock(folio);
+	if (ret < 0)
+		return ret;
+	params->notes |=3D NOTE_DISCONTIG_BEFORE;
+	goto out;
+cancel_folio_discard:
+	netfs_put_group(fgroup);
+cancel_folio:
+	folio_detach_private(folio);
+	kfree(finfo);
+	folio_unlock(folio);
+	folio_cancel_dirty(folio);
+	if (wreq->origin =3D=3D NETFS_WRITETHROUGH)
+		folio_end_writeback(folio);
+	params->notes |=3D NOTE_DISCONTIG_BEFORE;
+	goto out;
 }
=20
 /*
@@ -639,6 +732,7 @@ int netfs_writepages(struct address_space *mapping,
 {
 	struct netfs_inode *ictx =3D netfs_inode(mapping->host);
 	struct netfs_io_request *wreq =3D NULL;
+	struct netfs_wb_params params =3D {};
 	struct folio *folio;
 	int error =3D 0;
=20
@@ -664,35 +758,48 @@ int netfs_writepages(struct address_space *mapping,
=20
 	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0)
 		goto nomem;
-	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	bvecq_pos_set(&params.dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor);
=20
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback);
 	netfs_stat(&netfs_n_wh_writepages);
=20
-	do {
-		_debug("wbiter %lx %llx", folio->index, atomic64_read(&wreq->issued_to));
+	if (wreq->io_streams[1].avail)
+		params.notes |=3D NOTE_CACHE_AVAIL;
=20
-		/* It appears we don't have to handle cyclic writeback wrapping. */
-		WARN_ON_ONCE(wreq && folio_pos(folio) < atomic64_read(&wreq->issued_to));
+	do {
+		_debug("wbiter %lx", folio->index);
=20
 		if (netfs_folio_group(folio) !=3D NETFS_FOLIO_COPY_TO_CACHE &&
 		    unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
 			set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
 			wreq->netfs_ops->begin_writeback(wreq);
+			if (wreq->io_streams[0].avail) {
+				params.notes |=3D NOTE_UPLOAD_AVAIL;
+				/* Order setting the active flag after other fields. */
+				smp_store_release(&wreq->io_streams[0].active, true);
+			}
 		}
=20
-		error =3D netfs_write_folio(wreq, wbc, folio);
+		params.notes &=3D NOTES__KEEP_MASK;
+		error =3D netfs_queue_wb_folio(wreq, wbc, folio, &params);
+		if (error < 0)
+			break;
+		error =3D netfs_issue_streams(wreq, &params);
 		if (error < 0)
 			break;
+
+		bvecq_pos_step(&params.dispatch_cursor);
 	} while ((folio =3D writeback_iter(mapping, wbc, folio, &error)));
=20
-	netfs_end_issue_write(wreq);
+	netfs_end_issue_write(wreq, &params);
=20
 	mutex_unlock(&ictx->wb_lock);
 	bvecq_pos_unset(&wreq->load_cursor);
-	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&params.dispatch_cursor);
+	for (int i =3D 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&wreq->io_streams[i].dispatch_cursor);
 	netfs_wake_collector(wreq);
=20
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
@@ -714,32 +821,60 @@ EXPORT_SYMBOL(netfs_writepages);
 /*
  * Begin a write operation for writing through the pagecache.
  */
-struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size=
_t len)
+struct netfs_writethrough *netfs_begin_writethrough(struct kiocb *iocb, si=
ze_t len)
 {
+	struct netfs_writethrough *wthru =3D NULL;
 	struct netfs_io_request *wreq =3D NULL;
 	struct netfs_inode *ictx =3D netfs_inode(file_inode(iocb->ki_filp));
=20
+	wthru =3D kzalloc_obj(struct netfs_writethrough);
+	if (!wthru)
+		return ERR_PTR(-ENOMEM);
+
 	mutex_lock(&ictx->wb_lock);
=20
 	wreq =3D netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
 				      iocb->ki_pos, NETFS_WRITETHROUGH);
 	if (IS_ERR(wreq)) {
 		mutex_unlock(&ictx->wb_lock);
-		return wreq;
+		kfree(wthru);
+		return ERR_CAST(wreq);
 	}
+	wthru->wreq =3D wreq;
=20
-	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0) {
-		netfs_put_failed_request(wreq);
-		mutex_unlock(&ictx->wb_lock);
-		return ERR_PTR(-ENOMEM);
-	}
+	wreq->spare =3D bvecq_alloc_one(BVECQ_STD_SLOTS, GFP_NOFS);
+	if (!wreq->spare)
+		goto nomem_unlock;
=20
-	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0)
+		goto nomem_unlock;
+
+	bvecq_pos_set(&wthru->params.dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor);
+
+	if (wreq->io_streams[1].avail)
+		wthru->params.notes |=3D NOTE_CACHE_AVAIL;
=20
 	wreq->io_streams[0].avail =3D true;
 	trace_netfs_write(wreq, netfs_write_trace_writethrough);
-	return wreq;
+	if (!is_sync_kiocb(iocb))
+		wreq->iocb =3D iocb;
+
+	if (unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
+		set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
+		/* Don't call ->begin_writeback() as ->init_request() gets file*. */
+		if (wreq->io_streams[0].avail) {
+			wthru->params.notes |=3D NOTE_UPLOAD_AVAIL;
+			/* Order setting the active flag after other fields. */
+			smp_store_release(&wreq->io_streams[0].active, true);
+		}
+	}
+	return wthru;
+nomem_unlock:
+	netfs_put_failed_request(wreq);
+	mutex_unlock(&ictx->wb_lock);
+	kfree(wthru);
+	return ERR_PTR(-ENOMEM);
 }
=20
 /*
@@ -748,10 +883,11 @@ struct netfs_io_request *netfs_begin_writethrough(str=
uct kiocb *iocb, size_t len
  * to the request.  If we've added more than wsize then we need to create =
a new
  * subrequest.
  */
-int netfs_advance_writethrough(struct netfs_io_request *wreq, struct write=
back_control *wbc,
-			       struct folio *folio, size_t copied, bool to_page_end,
-			       struct folio **writethrough_cache)
+int netfs_advance_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc,
+			       struct folio *folio, size_t copied, bool to_page_end)
 {
+	struct netfs_io_request *wreq =3D wthru->wreq;
 	int ret;
=20
 	_enter("R=3D%x ws=3D%u cp=3D%zu tp=3D%u",
@@ -759,18 +895,18 @@ int netfs_advance_writethrough(struct netfs_io_reques=
t *wreq, struct writeback_c
=20
 	/* The folio is locked. */
=20
-	if (*writethrough_cache !=3D folio) {
-		if (*writethrough_cache) {
+	if (wthru->in_progress !=3D folio) {
+		if (wthru->in_progress) {
 			/* Did the folio get moved? */
-			folio_put(*writethrough_cache);
-			*writethrough_cache =3D NULL;
+			folio_put(wthru->in_progress);
+			wthru->in_progress =3D NULL;
 		}
 		/* We can make multiple writes to the folio... */
 		if (wreq->len =3D=3D 0)
 			trace_netfs_folio(folio, netfs_folio_trace_wthru);
 		else
 			trace_netfs_folio(folio, netfs_folio_trace_wthru_plus);
-		*writethrough_cache =3D folio;
+		wthru->in_progress =3D folio;
 		folio_get(folio);
 	}
=20
@@ -782,9 +918,20 @@ int netfs_advance_writethrough(struct netfs_io_request=
 *wreq, struct writeback_c
 		return 0;
 	}
=20
-	ret =3D netfs_write_folio(wreq, wbc, folio);
-	folio_put(*writethrough_cache);
-	*writethrough_cache =3D NULL;
+	wthru->params.notes &=3D NOTES__KEEP_MASK;
+	ret =3D netfs_queue_wb_folio(wreq, wbc, folio, &wthru->params);
+	if (ret < 0)
+		return ret;
+
+	if (!wreq->spare) {
+		wreq->spare =3D bvecq_alloc_one(BVECQ_STD_SLOTS, GFP_NOFS);
+		if (!wreq->spare)
+			return -ENOMEM;
+	}
+
+	ret =3D netfs_issue_streams(wreq, &wthru->params);
+	folio_put(wthru->in_progress);
+	wthru->in_progress =3D NULL;
 	wreq->submitted =3D wreq->len;
 	return ret;
 }
@@ -792,41 +939,85 @@ int netfs_advance_writethrough(struct netfs_io_reques=
t *wreq, struct writeback_c
 /*
  * End a write operation used when writing through the pagecache.
  */
-ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct write=
back_control *wbc,
-			       struct folio *writethrough_cache)
+ssize_t netfs_end_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc)
 {
+	struct netfs_io_request *wreq =3D wthru->wreq;
 	struct netfs_inode *ictx =3D netfs_inode(wreq->inode);
+	struct folio *folio =3D wthru->in_progress;
 	ssize_t ret;
=20
 	_enter("R=3D%x", wreq->debug_id);
=20
-	if (writethrough_cache) {
-		folio_lock(writethrough_cache);
-		netfs_write_folio(wreq, wbc, writethrough_cache);
-		folio_put(writethrough_cache);
+	if (folio) {
+		folio_lock(folio);
+		wthru->params.notes &=3D NOTES__KEEP_MASK;
+		ret =3D netfs_queue_wb_folio(wreq, wbc, folio, &wthru->params);
+		if (ret =3D=3D 0)
+			ret =3D netfs_issue_streams(wreq, &wthru->params);
+		folio_put(folio);
+		wthru->in_progress =3D NULL;
 		wreq->submitted =3D wreq->len;
 	}
=20
-	netfs_end_issue_write(wreq);
+	netfs_end_issue_write(wreq, &wthru->params);
=20
 	mutex_unlock(&ictx->wb_lock);
=20
 	bvecq_pos_unset(&wreq->load_cursor);
-	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&wthru->params.dispatch_cursor);
+	for (int i =3D 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&wreq->io_streams[i].dispatch_cursor);
=20
 	if (wreq->iocb)
 		ret =3D -EIOCBQUEUED;
 	else
 		ret =3D netfs_wait_for_write(wreq);
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
+	kfree(wthru);
 	return ret;
 }
=20
+/*
+ * Prepare a buffer for a single monolithic write.
+ */
+static int netfs_prepare_write_single_buffer(struct netfs_io_subrequest *s=
ubreq,
+					     unsigned int max_segs)
+{
+	struct netfs_io_request *wreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr];
+	struct bio_vec *bv;
+	struct bvecq *bq;
+	size_t dio_size =3D wreq->cache_resources.dio_size;
+	size_t dlen;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+	/* Round the end of the last entry up. */
+	bq =3D subreq->content.bvecq;
+	while (bq->next)
+		bq =3D bq->next;
+	bv =3D &bq->bv[bq->nr_slots - 1];
+	dlen =3D round_up(bv->bv_len, dio_size);
+	if (dlen > bv->bv_len) {
+		subreq->len +=3D dlen - bv->bv_len;
+		bv->bv_len =3D dlen;
+	}
+
+	stream->buffered   =3D 0;
+	stream->issue_from =3D subreq->len;
+	wreq->submitted    =3D subreq->len;
+	netfs_all_subreqs_queued(wreq);
+	return 0;
+}
+
 /**
  * netfs_writeback_single - Write back a monolithic payload
  * @mapping: The mapping to write from
  * @wbc: Hints from the VM
- * @iter: Data to write.
+ * @iter: Data to write
+ * @len: Amount of data to write
  *
  * Write a monolithic, non-pagecache object back to the server and/or
  * the cache.  There's a maximum of one subrequest per stream.
@@ -837,12 +1028,15 @@ ssize_t netfs_end_writethrough(struct netfs_io_reque=
st *wreq, struct writeback_c
  */
 int netfs_writeback_single(struct address_space *mapping,
 			   struct writeback_control *wbc,
-			   struct iov_iter *iter)
+			   struct iov_iter *iter,
+			   size_t len)
 {
 	struct netfs_io_request *wreq;
 	struct netfs_inode *ictx =3D netfs_inode(mapping->host);
 	int ret;
=20
+	_enter("%zx,%zx", iov_iter_count(iter), len);
+
 	if (!mutex_trylock(&ictx->wb_lock)) {
 		if (wbc->sync_mode =3D=3D WB_SYNC_NONE) {
 			/* The VFS will have undirtied the inode. */
@@ -859,23 +1053,24 @@ int netfs_writeback_single(struct address_space *map=
ping,
 		ret =3D PTR_ERR(wreq);
 		goto couldnt_start;
 	}
-	wreq->len =3D iov_iter_count(iter);
=20
-	ret =3D netfs_extract_iter(iter, wreq->len, INT_MAX, 0, &wreq->dispatch_c=
ursor.bvecq, 0);
+	wreq->len =3D len;
+
+	ret =3D netfs_extract_iter(iter, len, INT_MAX, 0, &wreq->load_cursor.bvec=
q, 0);
 	if (ret < 0)
 		goto cleanup_free;
-	if (ret < wreq->len) {
+	if (ret < len) {
 		ret =3D -EIO;
 		goto cleanup_free;
 	}
=20
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor);
=20
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback_single);
 	netfs_stat(&netfs_n_wh_writepages);
=20
-	if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
+	if (test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
 		wreq->netfs_ops->begin_writeback(wreq);
=20
 	for (int s =3D 0; s < NR_IO_STREAMS; s++) {
@@ -885,21 +1080,29 @@ int netfs_writeback_single(struct address_space *map=
ping,
 		if (!stream->avail)
 			continue;
=20
-		netfs_prepare_write(wreq, stream, 0);
+		stream->issue_from =3D 0;
+		stream->buffered   =3D len;
=20
-		subreq =3D stream->construct;
-		subreq->len =3D wreq->len;
-		stream->submit_len =3D subreq->len;
+		subreq =3D netfs_alloc_write_subreq(wreq, stream);
+		if (!subreq) {
+			ret =3D -ENOMEM;
+			break;
+		}
=20
-		netfs_issue_write(wreq, stream);
+		bvecq_pos_set(&stream->dispatch_cursor, &wreq->load_cursor);
+
+		stream->issue_write(subreq);
+
+		bvecq_pos_unset(&stream->dispatch_cursor);
 	}
=20
 	wreq->submitted =3D wreq->len;
-	smp_wmb(); /* Write lists before ALL_QUEUED. */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
-
 	mutex_unlock(&ictx->wb_lock);
-	netfs_wake_collector(wreq);
+
+	if (unlikely(!netfs_are_all_subreqs_queued(wreq))) {
+		netfs_all_subreqs_queued(wreq);
+		netfs_wake_collector(wreq);
+	}
=20
 	/* TODO: Might want to be async here if WB_SYNC_NONE, but then need to
 	 * wait before modifying.
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index de2f9b196fa5..fb73a37ecc91 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -12,12 +12,43 @@
 #include "internal.h"
=20
 /*
- * Perform retries on the streams that need it.
+ * Prepare the write buffer for a retry.  We can't necessarily reuse the w=
rite
+ * buffer from the previous run of a subrequest because the filesystem is
+ * permitted to modify it (add headers/trailers, encrypt it).  Further, the
+ * subrequest may now be a different size (e.g. cifs has to negotiate for
+ * maximum transfer size).  Also, we can't look at *stream as that may sti=
ll
+ * refer to the source material being broken up into original subrequests.
+ */
+int netfs_prepare_write_retry_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs)
+{
+	struct netfs_io_request *wreq =3D subreq->rreq;
+	struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &wreq->retry_cursor);
+	bvecq_pos_set(&subreq->content, &wreq->retry_cursor);
+	len =3D bvecq_slice(&wreq->retry_cursor, subreq->len, max_segs, &subreq->=
nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len =3D len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	stream->issue_from +=3D len;
+	stream->buffered   -=3D len;
+	if (stream->buffered =3D=3D 0)
+		bvecq_pos_unset(&wreq->retry_cursor);
+	return 0;
+}
+
+/*
+ * Perform retries on the streams that need it.  This only has to deal with
+ * buffered writes; unbuffered write retry is handled in direct_write.c.
  */
 static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 				     struct netfs_io_stream *stream)
 {
-	struct bvecq_pos dispatch_cursor =3D {};
 	struct list_head *next;
=20
 	_enter("R=3D%x[%x:]", wreq->debug_id, stream->stream_nr);
@@ -32,30 +63,13 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
 	if (unlikely(stream->failed))
 		return;
=20
-	/* If there's no renegotiation to do, just resend each failed subreq. */
-	if (!stream->prepare_write) {
-		struct netfs_io_subrequest *subreq;
-
-		list_for_each_entry(subreq, &stream->subrequests, rreq_link) {
-			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
-				break;
-			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-				netfs_reissue_write(stream, subreq);
-			}
-		}
-		return;
-	}
-
 	next =3D stream->subrequests.next;
=20
 	do {
 		struct netfs_io_subrequest *subreq =3D NULL, *from, *to, *tmp;
 		unsigned long long start, len;
-		size_t part;
-		bool boundary =3D false;
=20
-		bvecq_pos_unset(&dispatch_cursor);
+		bvecq_pos_unset(&wreq->retry_cursor);
=20
 		/* Go through the stream and find the next span of contiguous
 		 * data that we then rejig (cifs, for example, needs the wsize
@@ -74,7 +88,6 @@ static void netfs_retry_write_stream(struct netfs_io_requ=
est *wreq,
 			subreq =3D list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start !=3D start + len ||
 			    subreq->transferred > 0 ||
-			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
 			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
 			to =3D subreq;
@@ -84,8 +97,10 @@ static void netfs_retry_write_stream(struct netfs_io_req=
uest *wreq,
 		/* Determine the set of buffers we're going to use.  Each
 		 * subreq gets a subset of a single overall contiguous buffer.
 		 */
-		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
-		bvecq_pos_advance(&dispatch_cursor, from->transferred);
+		bvecq_pos_transfer(&wreq->retry_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&wreq->retry_cursor, from->transferred);
+		wreq->retry_start =3D start;
+		wreq->retry_buffered =3D len;
=20
 		/* Work through the sublist.  The chain of buffers we're going
 		 * to fill is attached to dispatch_cursor and we need to read
@@ -93,38 +108,29 @@ static void netfs_retry_write_stream(struct netfs_io_r=
equest *wreq,
 		 */
 		subreq =3D from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
-			if (!len)
+			if (!wreq->retry_buffered)
 				break;
=20
-			subreq->start	=3D start;
-			subreq->len	=3D len;
-			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-			subreq->transferred =3D 0;
-
 			bvecq_pos_unset(&subreq->dispatch_pos);
 			bvecq_pos_unset(&subreq->content);
+			subreq->content.bvecq =3D NULL;
+			subreq->content.slot =3D 0;
+			subreq->content.offset =3D 0;
=20
-			/* Renegotiate max_len (wsize) */
-			stream->sreq_max_len =3D len;
-			stream->sreq_max_segs =3D INT_MAX;
-			stream->prepare_write(subreq);
-
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part =3D bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len =3D part;
-
-			len -=3D part;
-			start +=3D part;
-			if (len && subreq =3D=3D to &&
-			    __test_and_clear_bit(NETFS_SREQ_BOUNDARY, &to->flags))
-				boundary =3D true;
-
+			__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+			__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
+			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+			subreq->start		=3D wreq->retry_start;
+			subreq->len		=3D wreq->retry_buffered;
+			subreq->transferred	=3D 0;
+			subreq->retry_count	+=3D 1;
+			subreq->error		=3D 0;
+
+			netfs_stat(&netfs_n_wh_retry_write_subreq);
+			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			netfs_reissue_write(stream, subreq);
+			stream->issue_write(subreq);
 			if (subreq =3D=3D to)
 				break;
 		}
@@ -155,6 +161,7 @@ static void netfs_retry_write_stream(struct netfs_io_re=
quest *wreq,
 			subreq =3D netfs_alloc_subrequest(wreq);
 			subreq->source		=3D to->source;
 			subreq->start		=3D start;
+			subreq->len		=3D len;
 			subreq->stream_nr	=3D to->stream_nr;
 			subreq->retry_count	=3D 1;
=20
@@ -163,18 +170,17 @@ static void netfs_retry_write_stream(struct netfs_io_=
request *wreq,
 					     netfs_sreq_trace_new);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_split);
=20
+			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 			spin_lock(&wreq->lock);
+			/* Write IN_PROGRESS before pointer to new subreq */
+			smp_wmb();
 			list_add(&subreq->rreq_link, &to->rreq_link);
 			spin_unlock(&wreq->lock);
 			to =3D subreq;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
=20
-			stream->sreq_max_len	=3D len;
-			stream->sreq_max_segs	=3D INT_MAX;
 			switch (stream->source) {
 			case NETFS_UPLOAD_TO_SERVER:
 				netfs_stat(&netfs_n_wh_upload);
-				stream->sreq_max_len =3D umin(len, wreq->wsize);
 				break;
 			case NETFS_WRITE_TO_CACHE:
 				netfs_stat(&netfs_n_wh_write);
@@ -183,32 +189,14 @@ static void netfs_retry_write_stream(struct netfs_io_=
request *wreq,
 				WARN_ON_ONCE(1);
 			}
=20
-			stream->prepare_write(subreq);
-
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part =3D bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len =3D subreq->transferred + part;
-
-			len -=3D part;
-			start +=3D part;
-			if (!len && boundary) {
-				__set_bit(NETFS_SREQ_BOUNDARY, &to->flags);
-				boundary =3D false;
-			}
-
-			netfs_reissue_write(stream, subreq);
-			if (!len)
-				break;
-
+			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
+			stream->issue_write(subreq);
 		} while (len);
=20
 	} while (!list_is_head(next, &stream->subrequests));
=20
 out:
-	bvecq_pos_unset(&dispatch_cursor);
+	bvecq_pos_unset(&wreq->retry_cursor);
 }
=20
 /*
diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 6bb30543eff0..e7862f35b72c 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -174,6 +174,7 @@ config NFS_FSCACHE
 	bool "Provide NFS client caching support"
 	depends on NFS_FS
 	select NETFS_SUPPORT
+	select NETFS_PGPRIV2
 	select FSCACHE
 	help
 	  Say Y here if you want NFS data to be cached locally on disc through
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 9b7fdad4a920..cf750faaec6a 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -23,6 +23,7 @@
 #include "iostat.h"
 #include "fscache.h"
 #include "nfstrace.h"
+#include <trace/events/netfs.h>
=20
 #define NFS_MAX_KEY_LEN 1000
=20
@@ -273,8 +274,6 @@ static int nfs_netfs_init_request(struct netfs_io_reque=
st *rreq, struct file *fi
 	rreq->debug_id =3D atomic_inc_return(&nfs_netfs_debug_id);
 	/* [DEPRECATED] Use PG_private_2 to mark folio being written to the cache=
. */
 	__set_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
-	rreq->io_streams[0].sreq_max_len =3D NFS_SB(rreq->inode->i_sb)->rsize;
-
 	return 0;
 }
=20
@@ -298,6 +297,7 @@ static struct nfs_netfs_io_data *nfs_netfs_alloc(struct=
 netfs_io_subrequest *sre
=20
 static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 {
+	struct netfs_io_request		*rreq =3D sreq->rreq;
 	struct nfs_netfs_io_data	*netfs;
 	struct nfs_pageio_descriptor	pgio;
 	struct inode *inode =3D sreq->rreq->inode;
@@ -307,6 +307,15 @@ static void nfs_netfs_issue_read(struct netfs_io_subre=
quest *sreq)
 	pgoff_t start, last;
 	int err;
=20
+	if (sreq->len > NFS_SB(rreq->inode->i_sb)->rsize)
+		sreq->len =3D NFS_SB(rreq->inode->i_sb)->rsize;
+
+	err =3D netfs_prepare_read_buffer(sreq, INT_MAX);
+	if (err < 0) {
+		sreq->error =3D err;
+		goto term;
+	}
+
 	start =3D (sreq->start + sreq->transferred) >> PAGE_SHIFT;
 	last =3D ((sreq->start + sreq->len - sreq->transferred - 1) >> PAGE_SHIFT=
);
=20
@@ -315,13 +324,15 @@ static void nfs_netfs_issue_read(struct netfs_io_subr=
equest *sreq)
=20
 	netfs =3D nfs_netfs_alloc(sreq);
 	if (!netfs) {
-		sreq->error =3D -ENOMEM;
-		return netfs_read_subreq_terminated(sreq);
+		sreq->error =3D -ENOMEM;
+		goto term;
 	}
=20
+	trace_netfs_sreq(sreq, netfs_sreq_trace_submit);
+
 	pgio.pg_netfs =3D netfs; /* used in completion */
=20
-	xa_for_each_range(&sreq->rreq->mapping->i_pages, idx, page, start, last) {
+	xa_for_each_range(&rreq->mapping->i_pages, idx, page, start, last) {
 		/* nfs_read_add_folio() may schedule() due to pNFS layout and other RPCs=
  */
 		err =3D nfs_read_add_folio(&pgio, ctx, page_folio(page));
 		if (err < 0) {
@@ -332,6 +343,8 @@ static void nfs_netfs_issue_read(struct netfs_io_subreq=
uest *sreq)
 out:
 	nfs_pageio_complete_read(&pgio);
 	nfs_netfs_put(netfs);
+term:
+	return netfs_read_subreq_terminated(sreq);
 }
=20
 void nfs_netfs_initiate_read(struct nfs_pgio_header *hdr)
diff --git a/fs/smb/client/cifssmb.c b/fs/smb/client/cifssmb.c
index 9e27bfa7376b..0b1051a17ca8 100644
--- a/fs/smb/client/cifssmb.c
+++ b/fs/smb/client/cifssmb.c
@@ -1467,8 +1467,7 @@ cifs_readv_callback(struct TCP_Server_Info *server, s=
truct mid_q_entry *mid)
 	struct cifs_tcon *tcon =3D tlink_tcon(rdata->req->cfile->tlink);
 	struct inode *inode =3D &ictx->inode;
 	struct smb_rqst rqst =3D { .rq_iov =3D rdata->iov,
-				 .rq_nvec =3D 1,
-				 .rq_iter =3D rdata->subreq.io_iter };
+				 .rq_nvec =3D 1};
 	struct cifs_credits credits =3D {
 		.value =3D 1,
 		.instance =3D 0,
@@ -1482,6 +1481,11 @@ cifs_readv_callback(struct TCP_Server_Info *server, =
struct mid_q_entry *mid)
 		 __func__, mid->mid, mid->mid_state, rdata->result,
 		 rdata->subreq.len);
=20
+	if (rdata->got_bytes)
+		iov_iter_bvec_queue(&rqst.rq_iter, ITER_DEST,
+				    rdata->subreq.content.bvecq, rdata->subreq.content.slot,
+				    rdata->subreq.content.offset, rdata->subreq.len);
+
 	switch (mid->mid_state) {
 	case MID_RESPONSE_RECEIVED:
 		/* result already set, check signature */
@@ -2003,7 +2007,10 @@ cifs_async_writev(struct cifs_io_subrequest *wdata)
=20
 	rqst.rq_iov =3D iov;
 	rqst.rq_nvec =3D 1;
-	rqst.rq_iter =3D wdata->subreq.io_iter;
+
+	iov_iter_bvec_queue(&rqst.rq_iter, ITER_SOURCE,
+			    wdata->subreq.content.bvecq, wdata->subreq.content.slot,
+			    wdata->subreq.content.offset, wdata->subreq.len);
=20
 	cifs_dbg(FYI, "async write at %llu %zu bytes\n",
 		 wdata->subreq.start, wdata->subreq.len);
diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index b60344125f27..d3a9041786ac 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -43,18 +43,36 @@ static int cifs_reopen_file(struct cifsFileInfo *cfile,=
 bool can_flush);
  * Prepare a subrequest to upload to the server.  We need to allocate cred=
its
  * so that we know the maximum amount of data that we can include in it.
  */
-static void cifs_prepare_write(struct netfs_io_subrequest *subreq)
+static int cifs_estimate_write(struct netfs_io_request *wreq,
+			       struct netfs_io_stream *stream,
+			       struct netfs_write_estimate *estimate)
+{
+	struct cifs_sb_info *cifs_sb =3D CIFS_SB(wreq->inode->i_sb);
+
+	estimate->issue_at =3D stream->issue_from + cifs_sb->ctx->wsize;
+	return 0;
+}
+
+/*
+ * Issue a subrequest to upload to the server.
+ */
+static void cifs_issue_write(struct netfs_io_subrequest *subreq)
 {
 	struct cifs_io_subrequest *wdata =3D
 		container_of(subreq, struct cifs_io_subrequest, subreq);
 	struct cifs_io_request *req =3D wdata->req;
-	struct netfs_io_stream *stream =3D &req->rreq.io_streams[subreq->stream_n=
r];
 	struct TCP_Server_Info *server;
 	struct cifsFileInfo *open_file =3D req->cfile;
-	struct cifs_sb_info *cifs_sb =3D CIFS_SB(wdata->rreq->inode->i_sb);
-	size_t wsize =3D req->rreq.wsize;
+	struct cifs_sb_info *cifs_sb =3D CIFS_SB(subreq->rreq->inode->i_sb);
+	unsigned int max_segs =3D INT_MAX;
+	size_t len;
 	int rc;
=20
+	if (cifs_forced_shutdown(cifs_sb)) {
+		rc =3D smb_EIO(smb_eio_trace_forced_shutdown);
+		goto fail;
+	}
+
 	if (!wdata->have_xid) {
 		wdata->xid =3D get_xid();
 		wdata->have_xid =3D true;
@@ -73,18 +91,16 @@ static void cifs_prepare_write(struct netfs_io_subreque=
st *subreq)
 		if (rc < 0) {
 			if (rc =3D=3D -EAGAIN)
 				goto retry;
-			subreq->error =3D rc;
-			return netfs_prepare_write_failed(subreq);
+			goto fail;
 		}
 	}
=20
-	rc =3D server->ops->wait_mtu_credits(server, wsize, &stream->sreq_max_len,
-					   &wdata->credits);
-	if (rc < 0) {
-		subreq->error =3D rc;
-		return netfs_prepare_write_failed(subreq);
-	}
+	len =3D umin(subreq->len, cifs_sb->ctx->wsize);
+	rc =3D server->ops->wait_mtu_credits(server, len, &len, &wdata->credits);
+	if (rc < 0)
+		goto fail;
=20
+	subreq->len =3D len;
 	wdata->credits.rreq_debug_id =3D subreq->rreq->debug_id;
 	wdata->credits.rreq_debug_index =3D subreq->debug_index;
 	wdata->credits.in_flight_check =3D 1;
@@ -100,44 +116,33 @@ static void cifs_prepare_write(struct netfs_io_subreq=
uest *subreq)
 		const struct smbdirect_socket_parameters *sp =3D
 			smbd_get_parameters(server->smbd_conn);
=20
-		stream->sreq_max_segs =3D sp->max_frmr_depth;
+		max_segs =3D sp->max_frmr_depth;
 	}
 #endif
-}
-
-/*
- * Issue a subrequest to upload to the server.
- */
-static void cifs_issue_write(struct netfs_io_subrequest *subreq)
-{
-	struct cifs_io_subrequest *wdata =3D
-		container_of(subreq, struct cifs_io_subrequest, subreq);
-	struct cifs_sb_info *sbi =3D CIFS_SB(subreq->rreq->inode->i_sb);
-	int rc;
=20
-	if (cifs_forced_shutdown(sbi)) {
-		rc =3D smb_EIO(smb_eio_trace_forced_shutdown);
-		goto fail;
-	}
+	rc =3D netfs_prepare_write_buffer(subreq, max_segs);
+	if (rc < 0)
+		goto fail_with_credits;
=20
-	rc =3D adjust_credits(wdata->server, wdata, cifs_trace_rw_credits_issue_w=
rite_adjust);
+	rc =3D adjust_credits(server, wdata, cifs_trace_rw_credits_issue_write_ad=
just);
 	if (rc)
-		goto fail;
+		goto fail_with_credits;
=20
 	rc =3D -EAGAIN;
 	if (wdata->req->cfile->invalidHandle)
-		goto fail;
+		goto fail_with_credits;
=20
 	wdata->server->ops->async_writev(wdata);
 out:
 	return;
=20
-fail:
+fail_with_credits:
 	if (rc =3D=3D -EAGAIN)
 		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 	else
 		trace_netfs_sreq(subreq, netfs_sreq_trace_fail);
 	add_credits_and_wake_if(wdata->server, &wdata->credits, 0);
+fail:
 	cifs_write_subrequest_terminated(wdata, rc);
 	goto out;
 }
@@ -148,17 +153,25 @@ static void cifs_netfs_invalidate_cache(struct netfs_=
io_request *wreq)
 }
=20
 /*
- * Negotiate the size of a read operation on behalf of the netfs library.
+ * Issue a read operation on behalf of the netfs helper functions.  We're =
asked
+ * to make a read of a certain size at a point in the file.  We are permit=
ted
+ * to only read a portion of that, but as long as we read something, the n=
etfs
+ * helper will call us again so that we can issue another read.
  */
-static int cifs_prepare_read(struct netfs_io_subrequest *subreq)
+static void cifs_issue_read(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq =3D subreq->rreq;
 	struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_=
subrequest, subreq);
 	struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io=
_request, rreq);
-	struct TCP_Server_Info *server;
+	struct TCP_Server_Info *server =3D rdata->server;
 	struct cifs_sb_info *cifs_sb =3D CIFS_SB(rreq->inode->i_sb);
-	size_t size;
-	int rc =3D 0;
+	unsigned int max_segs =3D INT_MAX;
+	size_t len;
+	int rc;
+
+	cifs_dbg(FYI, "%s: op=3D%08x[%x] mapping=3D%p len=3D%zu/%zu\n",
+		 __func__, rreq->debug_id, subreq->debug_index, rreq->mapping,
+		 subreq->transferred, subreq->len);
=20
 	if (!rdata->have_xid) {
 		rdata->xid =3D get_xid();
@@ -172,17 +185,15 @@ static int cifs_prepare_read(struct netfs_io_subreque=
st *subreq)
 		cifs_negotiate_rsize(server, cifs_sb->ctx,
 				     tlink_tcon(req->cfile->tlink));
=20
-	rc =3D server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize,
-					   &size, &rdata->credits);
+	len =3D umin(subreq->len, cifs_sb->ctx->rsize);
+	rc =3D server->ops->wait_mtu_credits(server, len, &len, &rdata->credits);
 	if (rc)
-		return rc;
-
-	rreq->io_streams[0].sreq_max_len =3D size;
+		goto failed;
=20
-	rdata->credits.in_flight_check =3D 1;
+	subreq->len =3D len;
 	rdata->credits.rreq_debug_id =3D rreq->debug_id;
 	rdata->credits.rreq_debug_index =3D subreq->debug_index;
-
+	rdata->credits.in_flight_check =3D 1;
 	trace_smb3_rw_credits(rdata->rreq->debug_id,
 			      rdata->subreq.debug_index,
 			      rdata->credits.value,
@@ -194,33 +205,17 @@ static int cifs_prepare_read(struct netfs_io_subreque=
st *subreq)
 		const struct smbdirect_socket_parameters *sp =3D
 			smbd_get_parameters(server->smbd_conn);
=20
-		rreq->io_streams[0].sreq_max_segs =3D sp->max_frmr_depth;
+		max_segs =3D sp->max_frmr_depth;
 	}
 #endif
-	return 0;
-}
=20
-/*
- * Issue a read operation on behalf of the netfs helper functions.  We're =
asked
- * to make a read of a certain size at a point in the file.  We are permit=
ted
- * to only read a portion of that, but as long as we read something, the n=
etfs
- * helper will call us again so that we can issue another read.
- */
-static void cifs_issue_read(struct netfs_io_subrequest *subreq)
-{
-	struct netfs_io_request *rreq =3D subreq->rreq;
-	struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_=
subrequest, subreq);
-	struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io=
_request, rreq);
-	struct TCP_Server_Info *server =3D rdata->server;
-	int rc =3D 0;
-
-	cifs_dbg(FYI, "%s: op=3D%08x[%x] mapping=3D%p len=3D%zu/%zu\n",
-		 __func__, rreq->debug_id, subreq->debug_index, rreq->mapping,
-		 subreq->transferred, subreq->len);
+	rc =3D netfs_prepare_read_buffer(subreq, max_segs);
+	if (rc < 0)
+		goto fail_with_credits;
=20
 	rc =3D adjust_credits(server, rdata, cifs_trace_rw_credits_issue_read_adj=
ust);
 	if (rc)
-		goto failed;
+		goto fail_with_credits;
=20
 	if (req->cfile->invalidHandle) {
 		do {
@@ -235,14 +230,21 @@ static void cifs_issue_read(struct netfs_io_subreques=
t *subreq)
 		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
=20
 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
+
 	rc =3D rdata->server->ops->async_readv(rdata);
 	if (rc)
 		goto failed;
 	return;
=20
+fail_with_credits:
+	if (rc =3D=3D -EAGAIN)
+		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
+	else
+		trace_netfs_sreq(subreq, netfs_sreq_trace_fail);
+	add_credits_and_wake_if(rdata->server, &rdata->credits, 0);
 failed:
 	subreq->error =3D rc;
-	netfs_read_subreq_terminated(subreq);
+	return netfs_read_subreq_terminated(subreq);
 }
=20
 /*
@@ -352,11 +354,10 @@ const struct netfs_request_ops cifs_req_ops =3D {
 	.init_request		=3D cifs_init_request,
 	.free_request		=3D cifs_free_request,
 	.free_subrequest	=3D cifs_free_subrequest,
-	.prepare_read		=3D cifs_prepare_read,
 	.issue_read		=3D cifs_issue_read,
 	.done			=3D cifs_rreq_done,
 	.begin_writeback	=3D cifs_begin_writeback,
-	.prepare_write		=3D cifs_prepare_write,
+	.estimate_write		=3D cifs_estimate_write,
 	.issue_write		=3D cifs_issue_write,
 	.invalidate_cache	=3D cifs_netfs_invalidate_cache,
 };
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index 9199baa5c315..b0b344724d2e 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -4732,6 +4732,7 @@ handle_read_data(struct TCP_Server_Info *server, stru=
ct mid_q_entry *mid,
 	unsigned int cur_page_idx;
 	unsigned int pad_len;
 	struct cifs_io_subrequest *rdata =3D mid->callback_data;
+	struct iov_iter iter;
 	struct smb2_hdr *shdr =3D (struct smb2_hdr *)buf;
 	size_t copied;
 	bool use_rdma_mr =3D false;
@@ -4804,6 +4805,10 @@ handle_read_data(struct TCP_Server_Info *server, str=
uct mid_q_entry *mid,
=20
 	pad_len =3D data_offset - server->vals->read_rsp_size;
=20
+	iov_iter_bvec_queue(&iter, ITER_DEST,
+			    rdata->subreq.content.bvecq, rdata->subreq.content.slot,
+			    rdata->subreq.content.offset, rdata->subreq.len);
+
 	if (buf_len <=3D data_offset) {
 		/* read response payload is in pages */
 		cur_page_idx =3D pad_len / PAGE_SIZE;
@@ -4833,7 +4838,7 @@ handle_read_data(struct TCP_Server_Info *server, stru=
ct mid_q_entry *mid,
=20
 		/* Copy the data to the output I/O iterator. */
 		rdata->result =3D cifs_copy_bvecq_to_iter(buffer, buffer_len,
-							cur_off, &rdata->subreq.io_iter);
+							cur_off, &iter);
 		if (rdata->result !=3D 0) {
 			if (is_offloaded)
 				mid->mid_state =3D MID_RESPONSE_MALFORMED;
@@ -4847,7 +4852,7 @@ handle_read_data(struct TCP_Server_Info *server, stru=
ct mid_q_entry *mid,
 		   buf_len >=3D end_off) {
 		/* read response payload is in buf */
 		WARN_ONCE(buffer, "read data can be either in buf or in buffer");
-		copied =3D copy_to_iter(buf + data_offset, data_len, &rdata->subreq.io_i=
ter);
+		copied =3D copy_to_iter(buf + data_offset, data_len, &iter);
 		if (copied =3D=3D 0)
 			return smb_EIO2(smb_eio_trace_rx_copy_to_iter, copied, data_len);
 		rdata->got_bytes =3D copied;
diff --git a/fs/smb/client/smb2pdu.c b/fs/smb/client/smb2pdu.c
index 3bd300347f16..971b075e2e7d 100644
--- a/fs/smb/client/smb2pdu.c
+++ b/fs/smb/client/smb2pdu.c
@@ -4553,9 +4553,13 @@ smb2_new_read_req(void **buf, unsigned int *total_le=
n,
 	 */
 	if (rdata && smb3_use_rdma_offload(io_parms)) {
 		struct smbdirect_buffer_descriptor_v1 *v1;
+		struct iov_iter iter;
 		bool need_invalidate =3D server->dialect =3D=3D SMB30_PROT_ID;
=20
-		rdata->mr =3D smbd_register_mr(server->smbd_conn, &rdata->subreq.io_iter,
+		iov_iter_bvec_queue(&iter, ITER_DEST,
+				    rdata->subreq.content.bvecq, rdata->subreq.content.slot,
+				    rdata->subreq.content.offset, rdata->subreq.len);
+		rdata->mr =3D smbd_register_mr(server->smbd_conn, &iter,
 					     true, need_invalidate);
 		if (!rdata->mr)
 			return -EAGAIN;
@@ -4619,9 +4623,10 @@ smb2_readv_callback(struct TCP_Server_Info *server, =
struct mid_q_entry *mid)
 	unsigned int rreq_debug_id =3D rdata->rreq->debug_id;
 	unsigned int subreq_debug_index =3D rdata->subreq.debug_index;
=20
-	if (rdata->got_bytes) {
-		rqst.rq_iter	  =3D rdata->subreq.io_iter;
-	}
+	if (rdata->got_bytes)
+		iov_iter_bvec_queue(&rqst.rq_iter, ITER_DEST,
+				    rdata->subreq.content.bvecq, rdata->subreq.content.slot,
+				    rdata->subreq.content.offset, rdata->subreq.len);
=20
 	WARN_ONCE(rdata->server !=3D server,
 		  "rdata server %p !=3D mid server %p",
@@ -5109,7 +5114,9 @@ smb2_async_writev(struct cifs_io_subrequest *wdata)
 		goto out;
=20
 	rqst.rq_iov =3D iov;
-	rqst.rq_iter =3D wdata->subreq.io_iter;
+	iov_iter_bvec_queue(&rqst.rq_iter, ITER_SOURCE,
+			    wdata->subreq.content.bvecq, wdata->subreq.content.slot,
+			    wdata->subreq.content.offset, wdata->subreq.len);
=20
 	rqst.rq_iov[0].iov_len =3D total_len - 1;
 	rqst.rq_iov[0].iov_base =3D (char *)req;
@@ -5148,9 +5155,14 @@ smb2_async_writev(struct cifs_io_subrequest *wdata)
 	 */
 	if (smb3_use_rdma_offload(io_parms)) {
 		struct smbdirect_buffer_descriptor_v1 *v1;
+		struct iov_iter iter;
 		bool need_invalidate =3D server->dialect =3D=3D SMB30_PROT_ID;
=20
-		wdata->mr =3D smbd_register_mr(server->smbd_conn, &wdata->subreq.io_iter,
+		iov_iter_bvec_queue(&iter, ITER_SOURCE,
+				    wdata->subreq.content.bvecq, wdata->subreq.content.slot,
+				    wdata->subreq.content.offset, wdata->subreq.len);
+
+		wdata->mr =3D smbd_register_mr(server->smbd_conn, &iter,
 					     false, need_invalidate);
 		if (!wdata->mr) {
 			rc =3D -EAGAIN;
@@ -5187,8 +5199,8 @@ smb2_async_writev(struct cifs_io_subrequest *wdata)
 		smb2_set_replay(server, &rqst);
 	}
=20
-	cifs_dbg(FYI, "async write at %llu %u bytes iter=3D%zx\n",
-		 io_parms->offset, io_parms->length, iov_iter_count(&wdata->subreq.io_it=
er));
+	cifs_dbg(FYI, "async write at %llu %u bytes len=3D%zx\n",
+		 io_parms->offset, io_parms->length, wdata->subreq.len);
=20
 	if (wdata->credits.value > 0) {
 		shdr->CreditCharge =3D cpu_to_le16(DIV_ROUND_UP(wdata->subreq.len,
diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c
index fdf4e50c27ce..be2f6b909c34 100644
--- a/fs/smb/client/transport.c
+++ b/fs/smb/client/transport.c
@@ -1267,12 +1267,19 @@ cifs_readv_receive(struct TCP_Server_Info *server, =
struct mid_q_entry *mid)
 	}
=20
 #ifdef CONFIG_CIFS_SMB_DIRECT
-	if (rdata->mr)
+	if (rdata->mr) {
 		length =3D data_len; /* An RDMA read is already done. */
-	else
+	} else {
+#endif
+		struct iov_iter iter;
+
+		iov_iter_bvec_queue(&iter, ITER_DEST, rdata->subreq.content.bvecq,
+				    rdata->subreq.content.slot, rdata->subreq.content.offset,
+				    data_len);
+		length =3D cifs_read_iter_from_socket(server, &iter, data_len);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	}
 #endif
-		length =3D cifs_read_iter_from_socket(server, &rdata->subreq.io_iter,
-						    data_len);
 	if (length > 0)
 		rdata->got_bytes +=3D length;
 	server->total_read +=3D length;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 7dca6a513509..86bef8fec14b 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -66,7 +66,7 @@ struct netfs_inode {
 #endif
 	struct mutex		wb_lock;	/* Writeback serialisation */
 	loff_t			_remote_i_size;	/* Size of the remote file */
-	loff_t			_zero_point;	/* Size after which we assume there's no data
+	unsigned long long	_zero_point;	/* Size after which we assume there's no =
data
 						 * on the server */
 	atomic_t		io_count;	/* Number of outstanding reqs */
 	unsigned long		flags;
@@ -126,25 +126,39 @@ static inline struct netfs_group *netfs_folio_group(s=
truct folio *folio)
 	return priv;
 }
=20
+/*
+ * Estimate of maximum write subrequest for writeback.  The filesystem is
+ * responsible for filling this in when called from ->estimate_write(), th=
ough
+ * netfslib will preset infinite defaults.
+ */
+struct netfs_write_estimate {
+	unsigned long long	issue_at;	/* Point at which we must submit */
+	int			max_segs;	/* Max number of segments in a single RPC */
+};
+
 /*
  * Stream of I/O subrequests going to a particular destination, such as the
  * server or the local cache.  This is mainly intended for writing where w=
e may
  * have to write to multiple destinations concurrently.
  */
 struct netfs_io_stream {
-	/* Submission tracking */
-	struct netfs_io_subrequest *construct;	/* Op being constructed */
-	size_t			sreq_max_len;	/* Maximum size of a subrequest */
-	unsigned int		sreq_max_segs;	/* 0 or max number of segments in an iterato=
r */
-	unsigned int		submit_off;	/* Folio offset we're submitting from */
-	unsigned int		submit_len;	/* Amount of data left to submit */
-	void (*prepare_write)(struct netfs_io_subrequest *subreq);
+	/* Submission tracking (main dispatch only; not retry) */
+	struct bvecq_pos	dispatch_cursor; /* Point from which buffers are dispatc=
hed */
+	unsigned long long	issue_from;	/* Current issue point */
+	size_t			buffered;	/* Amount in buffer */
+	u8			applicable;	/* What sources are applicable (NOTE_* mask) */
+	bool			buffering;	/* T if buffering on this stream */
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
 	void (*issue_write)(struct netfs_io_subrequest *subreq);
+	atomic64_t		issued_to;	/* Point to which data is considered issued */
+
 	/* Collection tracking */
 	struct list_head	subrequests;	/* Contributory I/O operations */
 	unsigned long long	collected_to;	/* Position we've collected results to */
 	size_t			transferred;	/* The amount transferred from this stream */
-	unsigned short		error;		/* Aggregate error for the stream */
+	short			error;		/* Aggregate error for the stream */
 	enum netfs_io_source	source;		/* Where to read from/write to */
 	unsigned char		stream_nr;	/* Index of stream in parent table */
 	bool			avail;		/* T if stream is available */
@@ -182,14 +196,13 @@ struct netfs_io_subrequest {
 	struct list_head	rreq_link;	/* Link in rreq->subrequests */
 	struct bvecq_pos	dispatch_pos;	/* Bookmark in the combined queue of the s=
tart */
 	struct bvecq_pos	content;	/* The (copied) content of the subrequest */
-	struct iov_iter		io_iter;	/* Iterator for this subrequest */
 	unsigned long long	start;		/* Where to start the I/O */
 	size_t			len;		/* Size of the I/O */
 	size_t			transferred;	/* Amount of data transferred */
+	unsigned int		nr_segs;	/* Number of segments in content */
 	refcount_t		ref;
 	short			error;		/* 0 or error that occurred */
 	unsigned short		debug_index;	/* Index in list (for debugging output) */
-	unsigned int		nr_segs;	/* Number of segs in io_iter */
 	u8			retry_count;	/* The number of retries (0 on initial pass) */
 	enum netfs_io_source	source;		/* Where to read from/write to */
 	unsigned char		stream_nr;	/* I/O stream this belongs to */
@@ -198,7 +211,6 @@ struct netfs_io_subrequest {
 #define NETFS_SREQ_CLEAR_TAIL		1	/* Set if the rest of the read should be =
cleared */
 #define NETFS_SREQ_MADE_PROGRESS	4	/* Set if we transferred at least some =
data */
 #define NETFS_SREQ_ONDEMAND		5	/* Set if it's from on-demand read mode */
-#define NETFS_SREQ_BOUNDARY		6	/* Set if ends on hard boundary (eg. ceph o=
bject) */
 #define NETFS_SREQ_HIT_EOF		7	/* Set if short due to EOF */
 #define NETFS_SREQ_IN_PROGRESS		8	/* Unlocked when the subrequest complete=
s */
 #define NETFS_SREQ_NEED_RETRY		9	/* Set if the filesystem requests a retry=
 */
@@ -246,23 +258,26 @@ struct netfs_io_request {
 	struct netfs_group	*group;		/* Writeback group being written back */
 	struct bvecq		*spare;		/* Advance allocation of bvecq */
 	struct bvecq_pos	load_cursor;	/* Point at which new folios are loaded in =
*/
-	struct bvecq_pos	dispatch_cursor; /* Point from which buffers are dispatc=
hed */
 	struct bvecq_pos	collect_cursor;	/* Clear-up point of I/O buffer */
+	struct bvecq_pos	retry_cursor;	/* Point from which retries are dispatched=
 */
 	wait_queue_head_t	waitq;		/* Processor waiter */
 	void			*netfs_priv;	/* Private data for the netfs */
 	void			*netfs_priv2;	/* Private data for the netfs */
-	unsigned long long	last_end;	/* End pos of last folio submitted */
 	unsigned long long	submitted;	/* Amount submitted for I/O so far */
 	unsigned long long	len;		/* Length of the request */
 	size_t			transferred;	/* Amount to be indicated as transferred */
 	long			error;		/* 0 or error that occurred */
 	unsigned long long	i_size;		/* Size of the file */
 	unsigned long long	start;		/* Start position */
-	atomic64_t		issued_to;	/* Write issuer folio cursor */
 	unsigned long long	collected_to;	/* Point we've collected to */
 	unsigned long long	cache_coll_to;	/* Point the cache has collected to */
 	unsigned long long	cleaned_to;	/* Position we've cleaned folios to */
 	unsigned long long	abandon_to;	/* Position to abandon folios to */
+#ifdef CONFIG_NETFS_PGPRIV2
+	unsigned long long	last_end;	/* End of last folio added */
+#endif
+	unsigned long long	retry_start;	/* Position to retry from */
+	size_t			retry_buffered;	/* Amount of data to retry */
 	const struct folio	*no_unlock_folio; /* Don't unlock this folio after rea=
d */
 	unsigned int		debug_id;
 	unsigned int		rsize;		/* Maximum read size (0 for none) */
@@ -280,6 +295,7 @@ struct netfs_io_request {
 #define NETFS_RREQ_FAILED		3	/* The request failed */
 #define NETFS_RREQ_RETRYING		4	/* Set if we're in the retry path */
 #define NETFS_RREQ_SHORT_TRANSFER	5	/* Set if we have a short transfer */
+#define NETFS_RREQ_SAW_ENOMEM		6	/* Set if we encountered ENOMEM */
 #define NETFS_RREQ_CACHE_STOP		8	/* Set to stop caching (ENOBUFS or error)=
 */
 #define NETFS_RREQ_CACHE_ERROR		9	/* Set if we got an error from the cache=
 */
 #define NETFS_RREQ_OFFLOAD_COLLECTION	12	/* Offload collection to workqueu=
e */
@@ -288,8 +304,10 @@ struct netfs_io_request {
 #define NETFS_RREQ_UPLOAD_TO_SERVER	15	/* Need to write to the server */
 #define NETFS_RREQ_USE_IO_ITER		16	/* Use ->io_iter rather than ->i_pages =
*/
 #define NETFS_RREQ_NEED_PUT_RA_REFS	17	/* Need to put the folio refs RA ga=
ve us */
+#ifdef CONFIG_NETFS_PGPRIV2
 #define NETFS_RREQ_USE_PGPRIV2		31	/* [DEPRECATED] Use PG_private_2 to mark
 						 * write to cache on read */
+#endif
 	const struct netfs_request_ops *netfs_ops;
 };
=20
@@ -305,7 +323,6 @@ struct netfs_request_ops {
=20
 	/* Read request handling */
 	void (*expand_readahead)(struct netfs_io_request *rreq);
-	int (*prepare_read)(struct netfs_io_subrequest *subreq);
 	void (*issue_read)(struct netfs_io_subrequest *subreq);
 	bool (*is_still_valid)(struct netfs_io_request *rreq);
 	int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
@@ -318,7 +335,9 @@ struct netfs_request_ops {
=20
 	/* Write request handling */
 	void (*begin_writeback)(struct netfs_io_request *wreq);
-	void (*prepare_write)(struct netfs_io_subrequest *subreq);
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
 	void (*issue_write)(struct netfs_io_subrequest *subreq);
 	void (*retry_request)(struct netfs_io_request *wreq, struct netfs_io_stre=
am *stream);
 	void (*invalidate_cache)(struct netfs_io_request *wreq);
@@ -360,6 +379,14 @@ struct netfs_cache_ops {
 		     netfs_io_terminated_t term_func,
 		     void *term_func_priv);
=20
+	/* Estimate the amount of data that can be written in an op. */
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
+
+	/* Read data from the cache for a netfs subrequest. */
+	void (*issue_read)(struct netfs_io_subrequest *subreq);
+
 	/* Write data to the cache from a netfs subrequest. */
 	void (*issue_write)(struct netfs_io_subrequest *subreq);
=20
@@ -369,25 +396,6 @@ struct netfs_cache_ops {
 				 unsigned long long *_len,
 				 unsigned long long i_size);
=20
-	/* Prepare a read operation, shortening it to a cached/uncached
-	 * boundary as appropriate.
-	 */
-	int (*prepare_read)(struct netfs_io_subrequest *subreq);
-
-	/* Prepare a write subrequest, working out if we're allowed to do it
-	 * and finding out the maximum amount of data to gather before
-	 * attempting to submit.  If we're not permitted to do it, the
-	 * subrequest should be marked failed.
-	 */
-	void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
-
-	/* Prepare a write operation, working out what part of the write we can
-	 * actually do.
-	 */
-	int (*prepare_write)(struct netfs_cache_resources *cres,
-			     loff_t *_start, size_t *_len, size_t upper_len,
-			     loff_t i_size, bool no_space_allocated_yet);
-
 	/* Prepare an on-demand read operation, shortening it to a cached/uncached
 	 * boundary as appropriate.
 	 */
@@ -399,8 +407,8 @@ struct netfs_cache_ops {
 	/* Query the occupancy of the cache in a region, returning where the
 	 * next chunk of data starts and how long it is.
 	 */
-	int (*query_occupancy)(struct netfs_cache_resources *cres,
-			       struct fscache_occupancy *occ);
+	void (*query_occupancy)(struct netfs_cache_resources *cres,
+				struct fscache_occupancy *occ);
=20
 	/* Collect the result of buffered writeback to the cache.  This
 	 * includes copying a read to the cache.  block_type is one of:
@@ -434,10 +442,9 @@ void netfs_single_mark_inode_dirty(struct inode *inode=
);
 ssize_t netfs_read_single(struct inode *inode, struct file *file, struct i=
ov_iter *iter);
 int netfs_writeback_single(struct address_space *mapping,
 			   struct writeback_control *wbc,
-			   struct iov_iter *iter);
+			   struct iov_iter *iter, size_t len);
=20
 /* Address operations API */
-struct readahead_control;
 void netfs_readahead(struct readahead_control *);
 int netfs_read_folio(struct file *, struct folio *);
 int netfs_write_begin(struct netfs_inode *, struct file *,
@@ -464,7 +471,8 @@ void netfs_put_subrequest(struct netfs_io_subrequest *s=
ubreq,
 ssize_t netfs_extract_iter(struct iov_iter *orig, size_t max_len, size_t m=
ax_pages,
 			   unsigned long long fpos, struct bvecq **_bvecq_head,
 			   iov_iter_extraction_t extraction_flags);
-void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);
+int netfs_prepare_read_buffer(struct netfs_io_subrequest *subreq, unsigned=
 int max_segs);
+int netfs_prepare_write_buffer(struct netfs_io_subrequest *subreq, unsigne=
d int max_segs);
 void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_e=
rror);
=20
 int netfs_start_io_read(struct inode *inode);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index cc29582f6245..fbd000399b26 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -49,6 +49,7 @@
 	E_(NETFS_PGPRIV2_COPY_TO_CACHE,		"2C")
=20
 #define netfs_rreq_traces					\
+	EM(netfs_rreq_trace_all_queued,		"ALL-Q  ")	\
 	EM(netfs_rreq_trace_assess,		"ASSESS ")	\
 	EM(netfs_rreq_trace_cache_cancelled,	"CA-CNCL")	\
 	EM(netfs_rreq_trace_cache_failed,	"CA-FAIL")	\
@@ -86,7 +87,8 @@
 	EM(netfs_rreq_trace_waited_quiesce,	"DONE-QUIESCE")	\
 	EM(netfs_rreq_trace_wake_ip,		"WAKE-IP")	\
 	EM(netfs_rreq_trace_wake_queue,		"WAKE-Q ")	\
-	E_(netfs_rreq_trace_write_done,		"WR-DONE")
+	EM(netfs_rreq_trace_write_done,		"WR-DONE")	\
+	E_(netfs_rreq_trace_zero_unread,	"ZERO-UR")
=20
 #define netfs_sreq_sources					\
 	EM(NETFS_SOURCE_UNKNOWN,		"----")		\
@@ -135,6 +137,7 @@
 	EM(netfs_sreq_trace_superfluous,	"SPRFL")	\
 	EM(netfs_sreq_trace_terminated,		"TERM ")	\
 	EM(netfs_sreq_trace_too_much,		"!TOOM")	\
+	EM(netfs_sreq_trace_too_many_retries,	"!RETR")	\
 	EM(netfs_sreq_trace_wait_for,		"_WAIT")	\
 	EM(netfs_sreq_trace_write,		"WRITE")	\
 	EM(netfs_sreq_trace_write_skip,		"SKIP ")	\
@@ -202,12 +205,12 @@
 	EM(netfs_folio_trace_alloc_buffer,	"alloc-buf")	\
 	EM(netfs_folio_trace_cancel_copy,	"cancel-copy")	\
 	EM(netfs_folio_trace_cancel_store,	"cancel-store")	\
-	EM(netfs_folio_trace_clear,		"clear")	\
-	EM(netfs_folio_trace_clear_cc,		"clear-cc")	\
-	EM(netfs_folio_trace_clear_g,		"clear-g")	\
-	EM(netfs_folio_trace_clear_s,		"clear-s")	\
 	EM(netfs_folio_trace_copy_to_cache,	"mark-copy")	\
 	EM(netfs_folio_trace_end_copy,		"end-copy")	\
+	EM(netfs_folio_trace_endwb,		"endwb")	\
+	EM(netfs_folio_trace_endwb_cc,		"endwb-cc")	\
+	EM(netfs_folio_trace_endwb_g,		"endwb-g")	\
+	EM(netfs_folio_trace_endwb_s,		"endwb-s")	\
 	EM(netfs_folio_trace_filled_gaps,	"filled-gaps")	\
 	EM(netfs_folio_trace_invalidate_all,	"inval-all")	\
 	EM(netfs_folio_trace_invalidate_front,	"inval-front")	\
@@ -400,10 +403,10 @@ TRACE_EVENT(netfs_sreq,
 		    __entry->len	=3D sreq->len;
 		    __entry->transferred =3D sreq->transferred;
 		    __entry->start	=3D sreq->start;
-		    __entry->slot	=3D sreq->dispatch_pos.slot;
+		    __entry->slot	=3D sreq->content.slot;
 			   ),
=20
-	    TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx qs=3D%u e=3D%=
d",
+	    TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx bv=3D%u e=3D%=
d",
 		      __entry->rreq, __entry->index,
 		      __print_symbolic(__entry->source, netfs_sreq_sources),
 		      __print_symbolic(__entry->what, netfs_sreq_traces),
@@ -511,6 +514,7 @@ TRACE_EVENT(netfs_folio,
 	    TP_STRUCT__entry(
 		    __field(u64,			ino)
 		    __field(pgoff_t,			index)
+		    __field(unsigned long,		pfn)
 		    __field(unsigned int,		nr)
 		    __field(enum netfs_folio_trace,	why)
 			     ),
@@ -521,13 +525,40 @@ TRACE_EVENT(netfs_folio,
 		    __entry->why =3D why;
 		    __entry->index =3D folio->index;
 		    __entry->nr =3D folio_nr_pages(folio);
+		    __entry->pfn =3D folio_pfn(folio);
 			   ),
=20
-	    TP_printk("i=3D%05llx ix=3D%05lx-%05lx %s",
+	    TP_printk("p=3D%lx i=3D%05llx ix=3D%05lx-%05lx %s",
+		      __entry->pfn,
 		      __entry->ino, __entry->index, __entry->index + __entry->nr - 1,
 		      __print_symbolic(__entry->why, netfs_folio_traces))
 	    );
=20
+TRACE_EVENT(netfs_wback,
+	    TP_PROTO(struct netfs_io_request *wreq, struct folio *folio, unsigned=
 int notes),
+
+	    TP_ARGS(wreq, folio, notes),
+
+	    TP_STRUCT__entry(
+		    __field(pgoff_t,			index)
+		    __field(unsigned int,		wreq)
+		    __field(unsigned int,		nr)
+		    __field(unsigned int,		notes)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->wreq =3D wreq->debug_id;
+		    __entry->notes =3D notes;
+		    __entry->index =3D folio->index;
+		    __entry->nr =3D folio_nr_pages(folio);
+			   ),
+
+	    TP_printk("R=3D%08x ix=3D%05lx-%05lx n=3D%02x",
+		      __entry->wreq,
+		      __entry->index, __entry->index + __entry->nr - 1,
+		      __entry->notes)
+	    );
+
 TRACE_EVENT(netfs_write_iter,
 	    TP_PROTO(const struct kiocb *iocb, const struct iov_iter *from),
=20
@@ -771,7 +802,7 @@ TRACE_EVENT(netfs_collect_stream,
 		    __entry->wreq	=3D wreq->debug_id;
 		    __entry->stream	=3D stream->stream_nr;
 		    __entry->collected_to =3D stream->collected_to;
-		    __entry->issued_to	=3D atomic64_read(&wreq->issued_to);
+		    __entry->issued_to	=3D atomic64_read(&stream->issued_to);
 			   ),
=20
 	    TP_printk("R=3D%08x[%x:] cto=3D%llx ito=3D%llx",
@@ -795,7 +826,7 @@ TRACE_EVENT(netfs_bvecq,
 		    __entry->trace	=3D trace;
 			   ),
=20
-	    TP_printk("fq=3D%x %s",
+	    TP_printk("bq=3D%x %s",
 		      __entry->id,
 		      __print_symbolic(__entry->trace, netfs_bvecq_traces))
 	    );
diff --git a/net/9p/client.c b/net/9p/client.c
index f0dcf252af7e..8d365c000553 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1561,6 +1561,7 @@ void
 p9_client_write_subreq(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *wreq =3D subreq->rreq;
+	struct iov_iter iter;
 	struct p9_fid *fid =3D wreq->netfs_priv;
 	struct p9_client *clnt =3D fid->clnt;
 	struct p9_req_t *req;
@@ -1571,14 +1572,17 @@ p9_client_write_subreq(struct netfs_io_subrequest *=
subreq)
 	p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu len %d\n",
 		 fid->fid, start, len);
=20
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
 	/* Don't bother zerocopy for small IO (< 1024) */
 	if (clnt->trans_mod->zc_request && len > 1024) {
-		req =3D p9_client_zc_rpc(clnt, P9_TWRITE, NULL, &subreq->io_iter,
+		req =3D p9_client_zc_rpc(clnt, P9_TWRITE, NULL, &iter,
 				       0, wreq->len, P9_ZC_HDR_SZ, "dqd",
 				       fid->fid, start, len);
 	} else {
 		req =3D p9_client_rpc(clnt, P9_TWRITE, "dqV", fid->fid,
-				    start, len, &subreq->io_iter);
+				    start, len, &iter);
 	}
 	if (IS_ERR(req)) {
 		netfs_write_subrequest_terminated(subreq, PTR_ERR(req));