From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Deepanshu Kartikey,
    syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Subject: [PATCH 01/26] netfs: Fix NULL pointer dereference in
 netfs_unbuffered_write() on retry
Date: Thu, 26 Mar 2026 10:45:16 +0000
Message-ID: <20260326104544.509518-2-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

From: Deepanshu Kartikey

When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking whether it is NULL.  Filesystems such as 9P do not set
the prepare_write operation, so stream->prepare_write remains NULL.  When
get_user_pages() fails with -EFAULT and the subrequest is flagged for
retry, this results in a NULL pointer dereference at
fs/netfs/direct_write.c:189.

Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles the iterator
reset, the IN_PROGRESS flag, the stats update and the reissue internally.
Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey
Signed-off-by: David Howells
Reviewed-by: Paulo Alcantara (Red Hat)
---
 fs/netfs/direct_write.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index dd1451bf7543..4d9760e36c11 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -186,10 +186,18 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 			stream->sreq_max_segs = INT_MAX;
 
 			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			stream->prepare_write(subreq);
 
-			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-			netfs_stat(&netfs_n_wh_retry_write_subreq);
+			if (stream->prepare_write) {
+				stream->prepare_write(subreq);
+				__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+				netfs_stat(&netfs_n_wh_retry_write_subreq);
+			} else {
+				struct iov_iter source;
+
+				netfs_reset_iter(subreq);
+				source = subreq->io_iter;
+				netfs_reissue_write(stream, subreq, &source);
+			}
 		}
 
 	netfs_unbuffered_write_done(wreq);
From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Deepanshu Kartikey,
    syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Subject: [PATCH 02/26] netfs: Fix kernel BUG in netfs_limit_iter() for
 ITER_KVEC iterators
Date: Thu, 26 Mar 2026 10:45:17 +0000
Message-ID: <20260326104544.509518-3-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

From: Deepanshu Kartikey

When a process crashes and the kernel writes a core dump to a 9P
filesystem, __kernel_write() creates an ITER_KVEC iterator.
This iterator reaches netfs_limit_iter() via netfs_unbuffered_write(),
which only handles the ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator
types and hits the BUG() for any other type.

Fix this by adding netfs_limit_kvec(), following the same pattern as
netfs_limit_bvec(), since both kvec and bvec are simple segment arrays
with pointer and length fields.  Dispatch to it from netfs_limit_iter()
when the iterator type is ITER_KVEC.

Fixes: cae932d3aee5 ("netfs: Add func to calculate pagecount/size-limited span of an iterator")
Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd
Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey
Signed-off-by: David Howells
Reviewed-by: Paulo Alcantara (Red Hat)
---
 fs/netfs/iterator.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 72a435e5fc6d..154a14bb2d7f 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -142,6 +142,47 @@ static size_t netfs_limit_bvec(const struct iov_iter *iter, size_t start_offset,
 	return min(span, max_size);
 }
 
+/*
+ * Select the span of a kvec iterator we're going to use.  Limit it by both
+ * maximum size and maximum number of segments.  Returns the size of the span
+ * in bytes.
+ */
+static size_t netfs_limit_kvec(const struct iov_iter *iter, size_t start_offset,
+			       size_t max_size, size_t max_segs)
+{
+	const struct kvec *kvecs = iter->kvec;
+	unsigned int nkv = iter->nr_segs, ix = 0, nsegs = 0;
+	size_t len, span = 0, n = iter->count;
+	size_t skip = iter->iov_offset + start_offset;
+
+	if (WARN_ON(!iov_iter_is_kvec(iter)) ||
+	    WARN_ON(start_offset > n) ||
+	    n == 0)
+		return 0;
+
+	while (n && ix < nkv && skip) {
+		len = kvecs[ix].iov_len;
+		if (skip < len)
+			break;
+		skip -= len;
+		n -= len;
+		ix++;
+	}
+
+	while (n && ix < nkv) {
+		len = min3(n, kvecs[ix].iov_len - skip, max_size);
+		span += len;
+		nsegs++;
+		ix++;
+		if (span >= max_size || nsegs >= max_segs)
+			break;
+		skip = 0;
+		n -= len;
+	}
+
+	return min(span, max_size);
+}
+
 /*
  * Select the span of an xarray iterator we're going to use.  Limit it by both
  * maximum size and maximum number of segments.  It is assumed that segments
@@ -245,6 +286,8 @@ size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
 		return netfs_limit_bvec(iter, start_offset, max_size, max_segs);
 	if (iov_iter_is_xarray(iter))
 		return netfs_limit_xarray(iter, start_offset, max_size, max_segs);
+	if (iov_iter_is_kvec(iter))
+		return netfs_limit_kvec(iter, start_offset, max_size, max_segs);
 	BUG();
 }
 EXPORT_SYMBOL(netfs_limit_iter);
From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Viacheslav Dubeyko, Paulo Alcantara
Subject: [PATCH 03/26] netfs: fix VM_BUG_ON_FOLIO() issue in
 netfs_write_begin() call
Date: Thu, 26 Mar 2026 10:45:18 +0000
Message-ID: <20260326104544.509518-4-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

From: Viacheslav Dubeyko

Repeated runs of the generic/013 test case are able
to reproduce a kernel BUG at mm/filemap.c:1504 with a probability of
about 30%:

    while true; do
        sudo ./check generic/013
    done

[ 9849.452376] page: refcount:3 mapcount:0 mapping:00000000e58ff252 index:0x10781 pfn:0x1c322
[ 9849.452412] memcg:ffff8881a1915800
[ 9849.452417] aops:ceph_aops ino:1000058db9e dentry name(?):"f9XXXXXX"
[ 9849.452432] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 9849.452441] raw: 0017ffffc0000000 0000000000000000 dead000000000122 ffff88816110d248
[ 9849.452445] raw: 0000000000010781 0000000000000000 00000003ffffffff ffff8881a1915800
[ 9849.452447] page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
[ 9849.452474] ------------[ cut here ]------------
[ 9849.452476] kernel BUG at mm/filemap.c:1504!
[ 9849.478635] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[ 9849.481772] CPU: 2 UID: 0 PID: 84223 Comm: fsstress Not tainted 7.0.0-rc1+ #18 PREEMPT(full)
[ 9849.482881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[ 9849.484539] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.485076] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.493818] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.495740] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.498678] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.500559] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.501097] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.502108] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.502516] FS:  00007e36cbe94740(0000) GS:ffff88824a899000(0000) knlGS:0000000000000000
[ 9849.502996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.503810] CR2: 000000c0002b0000 CR3: 000000011bbf6004 CR4: 0000000000772ef0
[ 9849.504459] PKRU: 55555554
[ 9849.504626] Call Trace:
[ 9849.505242]  <TASK>
[ 9849.505379]  netfs_write_begin+0x7c8/0x10a0
[ 9849.505877]  ? __kasan_check_read+0x11/0x20
[ 9849.506384]  ? __pfx_netfs_write_begin+0x10/0x10
[ 9849.507178]  ceph_write_begin+0x8c/0x1c0
[ 9849.507934]  generic_perform_write+0x391/0x8f0
[ 9849.508503]  ? __pfx_generic_perform_write+0x10/0x10
[ 9849.509062]  ? file_update_time_flags+0x19a/0x4b0
[ 9849.509581]  ? ceph_get_caps+0x63/0xf0
[ 9849.510259]  ? ceph_get_caps+0x63/0xf0
[ 9849.510530]  ceph_write_iter+0xe79/0x1ae0
[ 9849.511282]  ? __pfx_ceph_write_iter+0x10/0x10
[ 9849.511839]  ? lock_acquire+0x1ad/0x310
[ 9849.512334]  ? ksys_write+0xf9/0x230
[ 9849.512582]  ? lock_is_held_type+0xaa/0x140
[ 9849.513128]  vfs_write+0x512/0x1110
[ 9849.513634]  ? __fget_files+0x33/0x350
[ 9849.513893]  ? __pfx_vfs_write+0x10/0x10
[ 9849.514143]  ? mutex_lock_nested+0x1b/0x30
[ 9849.514394]  ksys_write+0xf9/0x230
[ 9849.514621]  ? __pfx_ksys_write+0x10/0x10
[ 9849.514887]  ? do_syscall_64+0x25e/0x1520
[ 9849.515122]  ? __kasan_check_read+0x11/0x20
[ 9849.515366]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.515655]  __x64_sys_write+0x72/0xd0
[ 9849.515885]  ? trace_hardirqs_on+0x24/0x1c0
[ 9849.516130]  x64_sys_call+0x22f/0x2390
[ 9849.516341]  do_syscall_64+0x12b/0x1520
[ 9849.516545]  ? do_syscall_64+0x27c/0x1520
[ 9849.516783]  ? do_syscall_64+0x27c/0x1520
[ 9849.517003]  ? lock_release+0x318/0x480
[ 9849.517220]  ? __x64_sys_io_getevents+0x143/0x2d0
[ 9849.517479]  ? percpu_ref_put_many.constprop.0+0x8f/0x210
[ 9849.517779]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.518073]  ? do_syscall_64+0x25e/0x1520
[ 9849.518291]  ? __kasan_check_read+0x11/0x20
[ 9849.518519]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.518799]  ? do_syscall_64+0x27c/0x1520
[ 9849.519024]  ? local_clock_noinstr+0xf/0x120
[ 9849.519262]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.519544]  ? do_syscall_64+0x25e/0x1520
[ 9849.519781]  ? __kasan_check_read+0x11/0x20
[ 9849.520008]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520273]  ? do_syscall_64+0x27c/0x1520
[ 9849.520491]  ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520767]  ? irqentry_exit+0x10c/0x6c0
[ 9849.520984]  ? trace_hardirqs_off+0x86/0x1b0
[ 9849.521224]  ? exc_page_fault+0xab/0x130
[ 9849.521472]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.521766] RIP: 0033:0x7e36cbd14907
[ 9849.521989] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 9849.523057] RSP: 002b:00007ffff2d2a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 9849.523484] RAX: ffffffffffffffda RBX: 000000000000e549 RCX: 00007e36cbd14907
[ 9849.523885] RDX: 000000000000e549 RSI: 00005bd797ec6370 RDI: 0000000000000004
[ 9849.524277] RBP: 0000000000000004 R08: 0000000000000047 R09: 00005bd797ec6370
[ 9849.524652] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000049
[ 9849.525062] R13: 0000000010781a37 R14: 00005bd797ec6370 R15: 0000000000000000
[ 9849.525447]  </TASK>
[ 9849.525574] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass ghash_clmulni_intel aesni_intel input_leds rapl mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 bochs qemu_fw_cfg i2c_smbus pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 9849.529150] ---[ end trace 0000000000000000 ]---
[ 9849.529502] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.530813] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.534986] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.536198] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.537718] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.539321] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.540862] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.542438] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.543996] FS:  00007e36cbe94740(0000) GS:ffff88824b899000(0000) knlGS:0000000000000000
[ 9849.545854] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.547092] CR2: 00007e36cb3ff000 CR3: 000000011bbf6006 CR4: 0000000000772ef0
[ 9849.548679] PKRU: 55555554

The race sequence:

1. Read completes -> netfs_read_collection() runs
2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...)
3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin()
4. netfs_unlock_abandoned_read_pages() unlocks the folio
5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO()

The key cause of the issue is that netfs_unlock_abandoned_read_pages()
does not check the NETFS_RREQ_NO_UNLOCK_FOLIO flag and executes
folio_unlock() unconditionally.  This patch makes
netfs_unlock_abandoned_read_pages() apply the same check as
netfs_unlock_read_folio().
Signed-off-by: Viacheslav Dubeyko
cc: David Howells
cc: Paulo Alcantara
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: Ceph Development
Signed-off-by: David Howells
Reviewed-by: Paulo Alcantara (Red Hat)
---
 fs/netfs/read_retry.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 7793ba5e3e8f..71a0c7ed163a 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -285,8 +285,15 @@ void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
 		struct folio *folio = folioq_folio(p, slot);
 
 		if (folio && !folioq_is_marked2(p, slot)) {
-			trace_netfs_folio(folio, netfs_folio_trace_abandon);
-			folio_unlock(folio);
+			if (folio->index == rreq->no_unlock_folio &&
+			    test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO,
+				     &rreq->flags)) {
+				_debug("no unlock");
+			} else {
+				trace_netfs_folio(folio,
+						  netfs_folio_trace_abandon);
+				folio_unlock(folio);
+			}
 		}
 	}
 }
From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara, Xiaoli Feng,
    stable@vger.kernel.org
Subject: [PATCH 04/26] netfs: fix error handling in netfs_extract_user_iter()
Date: Thu, 26 Mar 2026 10:45:19 +0000
Message-ID: <20260326104544.509518-5-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

From: Paulo Alcantara

In netfs_extract_user_iter(), if iov_iter_extract_pages() fails to
extract user pages, bail out on -ENOMEM; otherwise return the error code
only if @npages == 0, allowing short DIO reads and writes to be issued.

This fixes mmapstress02 from LTP tests against CIFS.
Reported-by: Xiaoli Feng
Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator")
Signed-off-by: Paulo Alcantara (Red Hat)
Reviewed-by: David Howells
Cc: netfs@lists.linux.dev
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: David Howells
---
 fs/netfs/iterator.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 154a14bb2d7f..adca78747f23 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -22,7 +22,7 @@
  *
  * Extract the page fragments from the given amount of the source iterator and
  * build up a second iterator that refers to all of those bits.  This allows
- * the original iterator to disposed of.
+ * the original iterator to be disposed of.
  *
  * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
  * allowed on the pages extracted.
@@ -67,8 +67,8 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 		ret = iov_iter_extract_pages(orig, &pages, count,
 					     max_pages - npages,
 					     extraction_flags, &offset);
-		if (ret < 0) {
-			pr_err("Couldn't get user pages (rc=%zd)\n", ret);
+		if (unlikely(ret <= 0)) {
+			ret = ret ?: -EIO;
 			break;
 		}
 
@@ -97,6 +97,13 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 		npages += cur_npages;
 	}
 
+	if (ret < 0 && (ret == -ENOMEM || npages == 0)) {
+		for (i = 0; i < npages; i++)
+			unpin_user_page(bv[i].bv_page);
+		kvfree(bv);
+		return ret;
+	}
+
 	iov_iter_bvec(new, orig->data_source, bv, npages, orig_len - count);
 	return npages;
 }
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 05/26] netfs: Fix read abandonment during retry
Date: Thu, 26 Mar 2026 10:45:20 +0000
Message-ID: <20260326104544.509518-6-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry.  The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from,
but it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).

Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).

Also clear the subreq pointer after discarding superfluous retryable
subrequests, so as to cause an oops if we do try to access it.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reviewed-by: Paulo Alcantara (Red Hat)
---
 fs/netfs/read_retry.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 71a0c7ed163a..68fc869513ef 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -93,8 +93,10 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 			  from->start, from->transferred, from->len);
 
 		if (test_bit(NETFS_SREQ_FAILED, &from->flags) ||
-		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags))
+		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) {
+			subreq = from;
 			goto abandon;
+		}
 
 		list_for_each_continue(next, &stream->subrequests) {
 			subreq = list_entry(next, struct netfs_io_subrequest, rreq_link);
@@ -178,6 +180,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 			if (subreq == to)
 				break;
 		}
+		subreq = NULL;
 		continue;
 	}

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 06/26] netfs: Fix the handling of stream->front by removing it
Date: Thu, 26 Mar 2026 10:45:21 +0000
Message-ID: <20260326104544.509518-7-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it).  However, there's a tracepoint
which looks at it.  Further, stream->front is actually redundant with
stream->subrequests.next.

Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.

Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara
Signed-off-by: David Howells
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat)
---
 fs/netfs/buffered_read.c     | 3 +--
 fs/netfs/direct_read.c       | 3 +--
 fs/netfs/direct_write.c      | 1 -
 fs/netfs/read_collect.c      | 4 ++--
 fs/netfs/read_single.c       | 1 -
 fs/netfs/write_collect.c     | 4 ++--
 fs/netfs/write_issue.c       | 3 +--
 include/linux/netfs.h        | 1 -
 include/trace/events/netfs.h | 8 ++++----
 9 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 88a0d801525f..a8c0d86118c5 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -171,9 +171,8 @@ static void netfs_queue_read(struct netfs_io_request *rreq,
 	spin_lock(&rreq->lock);
 	list_add_tail(&subreq->rreq_link, &stream->subrequests);
 	if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-		stream->front = subreq;
 		if (!stream->active) {
-			stream->collected_to = stream->front->start;
+			stream->collected_to = subreq->start;
 			/* Store list pointers before active flag */
 			smp_store_release(&stream->active, true);
 		}
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index a498ee8d6674..f72e6da88cca 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -71,9 +71,8 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 		spin_lock(&rreq->lock);
 		list_add_tail(&subreq->rreq_link, &stream->subrequests);
 		if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-			stream->front = subreq;
 			if (!stream->active) {
-				stream->collected_to = stream->front->start;
+				stream->collected_to = subreq->start;
 				/* Store list pointers before active flag */
 				smp_store_release(&stream->active, true);
 			}
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index 4d9760e36c11..f9ab69de3e29 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -111,7 +111,6 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 		netfs_prepare_write(wreq, stream, wreq->start + wreq->transferred);
 		subreq = stream->construct;
 		stream->construct = NULL;
-		stream->front = NULL;
 	}
 
 	/* Check if (re-)preparation failed. */
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 137f0e28a44c..e5f6665b3341 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -205,7 +205,8 @@ static void netfs_collect_read_results(struct netfs_io_request *rreq)
 	 * in progress.  The issuer thread may be adding stuff to the tail
 	 * whilst we're doing this.
 	 */
-	front = READ_ONCE(stream->front);
+	front = list_first_entry_or_null(&stream->subrequests,
+					 struct netfs_io_subrequest, rreq_link);
 	while (front) {
 		size_t transferred;
 
@@ -301,7 +302,6 @@ static void netfs_collect_read_results(struct netfs_io_request *rreq)
 		list_del_init(&front->rreq_link);
 		front = list_first_entry_or_null(&stream->subrequests,
 						 struct netfs_io_subrequest, rreq_link);
-		stream->front = front;
 		spin_unlock(&rreq->lock);
 		netfs_put_subrequest(remove, notes & ABANDON_SREQ ?
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index 8e6264f62a8f..d0e23bc42445 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -107,7 +107,6 @@ static int netfs_single_dispatch_read(struct netfs_io_request *rreq)
 	spin_lock(&rreq->lock);
 	list_add_tail(&subreq->rreq_link, &stream->subrequests);
 	trace_netfs_sreq(subreq, netfs_sreq_trace_added);
-	stream->front = subreq;
 	/* Store list pointers before active flag */
 	smp_store_release(&stream->active, true);
 	spin_unlock(&rreq->lock);
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index 83eb3dc1adf8..b194447f4b11 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -228,7 +228,8 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
 		if (!smp_load_acquire(&stream->active))
 			continue;
 
-		front = stream->front;
+		front = list_first_entry_or_null(&stream->subrequests,
+						 struct netfs_io_subrequest, rreq_link);
 		while (front) {
 			trace_netfs_collect_sreq(wreq, front);
 			//_debug("sreq [%x] %llx %zx/%zx",
@@ -279,7 +280,6 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
 			list_del_init(&front->rreq_link);
 			front = list_first_entry_or_null(&stream->subrequests,
 							 struct netfs_io_subrequest, rreq_link);
-			stream->front = front;
 			spin_unlock(&wreq->lock);
 			netfs_put_subrequest(remove, notes & SAW_FAILURE ?
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 437268f65640..2db688f94125 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -206,9 +206,8 @@ void netfs_prepare_write(struct netfs_io_request *wreq,
 	spin_lock(&wreq->lock);
 	list_add_tail(&subreq->rreq_link, &stream->subrequests);
 	if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-		stream->front = subreq;
 		if (!stream->active) {
-			stream->collected_to = stream->front->start;
+			stream->collected_to = subreq->start;
 			/* Write list pointers before active flag */
 			smp_store_release(&stream->active, true);
 		}
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 72ee7d210a74..ba17ac5bf356 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -140,7 +140,6 @@ struct netfs_io_stream {
 	void (*issue_write)(struct netfs_io_subrequest *subreq);
 	/* Collection tracking */
 	struct list_head subrequests;		/* Contributory I/O operations */
-	struct netfs_io_subrequest *front;	/* Op being collected */
 	unsigned long long collected_to;	/* Position we've collected results to */
 	size_t		transferred;		/* The amount transferred from this stream */
 	unsigned short	error;			/* Aggregate error for the stream */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index 2d366be46a1c..cbe28211106c 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -740,19 +740,19 @@ TRACE_EVENT(netfs_collect_stream,
 		    __field(unsigned int,		wreq)
 		    __field(unsigned char,		stream)
 		    __field(unsigned long long,	collected_to)
-		    __field(unsigned long long,	front)
+		    __field(unsigned long long,	issued_to)
 			     ),
 
 	    TP_fast_assign(
 		    __entry->wreq		= wreq->debug_id;
 		    __entry->stream		= stream->stream_nr;
 		    __entry->collected_to	= stream->collected_to;
-		    __entry->front = stream->front ? stream->front->start : UINT_MAX;
+		    __entry->issued_to	= atomic64_read(&wreq->issued_to);
 			    ),
 
-	    TP_printk("R=%08x[%x:] cto=%llx frn=%llx",
+	    TP_printk("R=%08x[%x:] cto=%llx ito=%llx",
 		      __entry->wreq, __entry->stream,
-		      __entry->collected_to, __entry->front)
+		      __entry->collected_to, __entry->issued_to)
 	    );
 
 TRACE_EVENT(netfs_folioq,

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, NeilBrown, Paulo Alcantara
Subject: [PATCH 07/26] cachefiles: Fix excess dput() after end_removing()
Date: Thu, 26 Mar 2026 10:45:22 +0000
Message-ID: <20260326104544.509518-8-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

When cachefiles_cull() calls cachefiles_bury_object(), the latter eats the
former's ref on the victim dentry that it obtained from
cachefiles_lookup_for_cull().  However, commit 7bb1eb45e43c left the dput
of the victim in place, resulting in occasional:

 WARNING: fs/dcache.c:829 at dput.part.0+0xf5/0x110, CPU#7: cachefilesd/11831
  cachefiles_cull+0x8c/0xe0 [cachefiles]
  cachefiles_daemon_cull+0xcd/0x120 [cachefiles]
  cachefiles_daemon_write+0x14e/0x1d0 [cachefiles]
  vfs_write+0xc3/0x480
  ...

reports.

Actually, it's worse than that: cachefiles_bury_object() eats the ref it
was given - and then may continue to use the now-unref'd dentry if it
turns out to be a directory.  So simply removing the aberrant dput() is
not sufficient.

Fix this by making cachefiles_bury_object() retain the ref itself around
end_removing() if it needs to keep it, and then drop the ref before
returning.
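The ownership rule being fixed above can be illustrated with a minimal refcount model in C. All names here (struct obj, get_ref(), put_ref(), bury()) are hypothetical stand-ins, not the cachefiles code: the callee consumes the reference it is handed, so if it still needs the object after the consuming call, it must take its own extra reference first and drop it before returning.

```c
#include <stddef.h>

/* Minimal hypothetical refcounted object, standing in for a dentry. */
struct obj { int refs; };

static void get_ref(struct obj *o) { o->refs++; }
static void put_ref(struct obj *o) { o->refs--; }

/*
 * A callee that consumes the reference it is passed, in the spirit of
 * cachefiles_bury_object() consuming the victim ref.  For the directory
 * case it needs the object past the consuming call, so it pins its own
 * reference around it and releases that reference before returning.
 */
static void bury(struct obj *victim, int is_dir)
{
	if (!is_dir) {
		put_ref(victim);	/* consume the caller's ref and stop */
		return;
	}
	get_ref(victim);		/* keep our own ref across the drop */
	put_ref(victim);		/* the point matching end_removing() */
	/* ... further work on the directory is safe here ... */
	put_ref(victim);		/* drop our extra ref before returning */
}
```

In this model, the bug corresponds to the caller doing another put_ref() after bury() returns: the count goes negative, which is the refcount underflow the dput warning reports.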
Fixes: bd6ede8a06e8 ("VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()")
Reported-by: Marc Dionne
Signed-off-by: David Howells
cc: NeilBrown
cc: Paulo Alcantara
cc: netfs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat)
---
 fs/cachefiles/namei.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index e5ec90dccc27..20138309733f 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -287,14 +287,14 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 	if (!d_is_dir(rep)) {
 		ret = cachefiles_unlink(cache, object, dir, rep, why);
 		end_removing(rep);
-		_leave(" = %d", ret);
 		return ret;
 	}
 
 	/* directories have to be moved to the graveyard */
 	_debug("move stale object to graveyard");
-	end_removing(rep);
+	dget(rep);
+	end_removing(rep); /* Drops ref on rep */
 
 try_again:
 	/* first step is to make up a grave dentry in the graveyard */
@@ -304,8 +304,10 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 
 	/* do the multiway lock magic */
 	trap = lock_rename(cache->graveyard, dir);
-	if (IS_ERR(trap))
-		return PTR_ERR(trap);
+	if (IS_ERR(trap)) {
+		ret = PTR_ERR(trap);
+		goto out;
+	}
 
 	/* do some checks before getting the grave dentry */
 	if (rep->d_parent != dir || IS_DEADDIR(d_inode(rep))) {
@@ -313,25 +315,27 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 		 * lock */
 		unlock_rename(cache->graveyard, dir);
 		_leave(" = 0 [culled?]");
-		return 0;
+		ret = 0;
+		goto out;
 	}
 
+	ret = -EIO;
 	if (!d_can_lookup(cache->graveyard)) {
 		unlock_rename(cache->graveyard, dir);
 		cachefiles_io_error(cache, "Graveyard no longer a directory");
-		return -EIO;
+		goto out;
 	}
 
 	if (trap == rep) {
 		unlock_rename(cache->graveyard, dir);
 		cachefiles_io_error(cache, "May not make directory loop");
-		return -EIO;
+		goto out;
 	}
 
 	if (d_mountpoint(rep)) {
 		unlock_rename(cache->graveyard, dir);
 		cachefiles_io_error(cache, "Mountpoint in cache");
-		return -EIO;
+		goto out;
 	}
 
 	grave = lookup_one(&nop_mnt_idmap, &QSTR(nbuffer), cache->graveyard);
@@ -343,11 +347,12 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 
 		if (PTR_ERR(grave) == -ENOMEM) {
 			_leave(" = -ENOMEM");
-			return -ENOMEM;
+			ret = -ENOMEM;
+			goto out;
 		}
 
 		cachefiles_io_error(cache, "Lookup error %ld", PTR_ERR(grave));
-		return -EIO;
+		goto out;
 	}
 
 	if (d_is_positive(grave)) {
@@ -362,7 +367,7 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 		unlock_rename(cache->graveyard, dir);
 		dput(grave);
 		cachefiles_io_error(cache, "Mountpoint in graveyard");
-		return -EIO;
+		goto out;
 	}
 
 	/* target should not be an ancestor of source */
@@ -370,7 +375,7 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 		unlock_rename(cache->graveyard, dir);
 		dput(grave);
 		cachefiles_io_error(cache, "May not make directory loop");
-		return -EIO;
 	}
 
 	/* attempt the rename */
@@ -404,8 +409,10 @@ int cachefiles_bury_object(struct cachefiles_cache *cache,
 	__cachefiles_unmark_inode_in_use(object, d_inode(rep));
 	unlock_rename(cache->graveyard, dir);
 	dput(grave);
-	_leave(" = 0");
-	return 0;
+	_leave(" = %d", ret);
+out:
+	dput(rep);
+	return ret;
 }
 
 /*
@@ -812,7 +819,6 @@ int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
 
 	ret = cachefiles_bury_object(cache, NULL, dir, victim,
 				     FSCACHE_OBJECT_WAS_CULLED);
-	dput(victim);
 	if (ret < 0)
 		goto error;

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 08/26] cachefiles: Don't rely on backing fs storage map for most use cases
Date: Thu, 26 Mar 2026 10:45:23 +0000
Message-ID: <20260326104544.509518-9-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
Cachefiles currently uses the backing filesystem's idea of what data is
held in a backing file and queries this by means of SEEK_DATA and
SEEK_HOLE.  However, this means it does two seek operations on the backing
file for each individual read call it wants to prepare (unless the first
returns -ENXIO).  Worse, the backing filesystem is at liberty to insert or
remove blocks of zeros in order to optimise its layout, which may cause
false positives and false negatives.

The problem is that keeping track of what is dirty is tricky (if storing
info in xattrs, which may have limited capacity and must be read and
written as one piece) and expensive (in terms of disk space at least) and
is basically duplicating what a filesystem does.

However, the most common write case, in which the application does:

	{ open(O_TRUNC); write(); write(); ... write(); close(); }

where each write follows directly on from the previous one and leaves no
gaps in the file, is reasonably easy to detect and can be noted in the
primary xattr as CACHEFILES_CONTENT_ALL, indicating that we have
everything up to the object size stored.  In this specific case, given
that it is known that there are no holes in the file, there's no need to
call SEEK_DATA/SEEK_HOLE or use any other mechanism to track the contents.
That speeds things up enormously.  Even when it is necessary to use
SEEK_DATA/SEEK_HOLE, it may not be necessary to call it for each cache
read subrequest generated.

Implement this by adding support for the CACHEFILES_CONTENT_ALL content
type (which is defined, but currently unused).  This requires a slight
adjustment in how backing files are managed: the driver needs to know how
much of the tail block is data and whether storing more data will create a
hole.  To this end, the way that the size of a backing file is managed is
changed.  Currently, the backing file is expanded to strictly match the
size of the network file, but this can be changed to carry more useful
information.
This makes two pieces of metadata available: xattr.object_size and the
backing file's i_size.  Apply the following schema:

 (a) i_size is always a multiple of the DIO block size.

 (b) i_size is only updated to the end of the highest write stored.  This
     is used to work out if we are following on without leaving a hole.

 (c) xattr.object_size is the size of the network filesystem file cached
     in this backing file.

 (d) xattr.object_size must point after the start of the last block
     (unless both are 0).

 (e) If xattr.object_size is at or after the block at the current end of
     the backing file (ie. i_size), then we have all the contents of the
     block (if xattr.content == CACHEFILES_CONTENT_ALL).

 (f) If xattr.object_size is somewhere in the middle of the last block,
     then the data following it is invalid and must be ignored.

 (g) If data is added to the last block, then that block must be fetched,
     modified and rewritten (it must be a buffered write through the
     pagecache and not DIO).

 (h) Writes to the cache are rounded out to blocks on both sides and the
     folios used as sources must contain data for any lower gap and must
     have been cleared for any upper gap, and so will rewrite any
     non-data area in the tail block.

To implement this, the following changes are made:

 (1) cookie->object_size is no longer updated when writes are copied into
     the pagecache, but rather only when a write request completes.  This
     prevents an object size miscomparison during the xattr check from
     causing the backing file to be invalidated (opening and marking the
     backing file and modifying the pagecache run in parallel).

 (2) The cache's current idea of the amount of data that should be stored
     in the backing file is kept track of in object->object_size.
     Possibly this is redundant with cookie->object_size, but the latter
     gets updated in some additional circumstances.
 (3) The size of the backing file at the start of a request is now
     tracked in struct netfs_cache_resources so that the partial EOF
     block can be located and cleaned.

 (4) The cache block size is now used consistently rather than
     CACHEFILES_DIO_BLOCK_SIZE (4096).

 (5) The backing file size is no longer adjusted when looking up an
     object.

 (6) When shortening a file, if the new size is not block aligned, the
     part beyond the new size is cleared.  If the file is truncated to
     zero, the content_info gets reset to CACHEFILES_CONTENT_NO_DATA.

 (7) A new struct, fscache_occupancy, is instituted to track the region
     being read.  Netfslib allocates it and fills in the start and end of
     the region to be read, then calls the ->query_occupancy() method to
     find and fill in the extents.  It also indicates whether a recorded
     extent contains data or just a region that's all zeros
     (FSCACHE_EXTENT_DATA or FSCACHE_EXTENT_ZERO).

 (8) The ->prepare_read() cache method is changed such that, if given, it
     just limits the amount that can be read from the cache in one go.
     It no longer indicates what source of read should be done; that
     information is now obtained from ->query_occupancy().

 (9) A new cache method, ->collect_write(), is added that is called when
     a contiguous series of writes has completed and a discontiguity or
     the end of the request has been hit.  It is supplied with the start
     and length of the write made to the backing file and can use this
     information to update the cache metadata.

(10) cachefiles_query_occupancy() is altered to find the next two
     "extents" of data stored in the backing file by doing SEEK_DATA/HOLE
     between the bounds set - unless it is known that there are no holes,
     in which case a whole-file first extent can be set.
(11) cachefiles_collect_write() is implemented to take the collated write
     completion information and use it to update the cache metadata, in
     particular working out whether there's now a hole in the backing
     file requiring future use of SEEK_DATA/HOLE instead of just assuming
     the data is all present.  It also uses
     fallocate(FALLOC_FL_ZERO_RANGE) to clean the part of a partial block
     that extended beyond the old object size.  It might be better to
     perform a synchronous DIO write for this purpose, but that would
     mandate an RMW cycle.  Ideally, it should be all zeros anyway, but,
     unfortunately, shared-writable mmap can interfere.

(12) cachefiles_begin_operation() is updated to note the current backing
     file size and the cache DIO size.

(13) cachefiles_create_tmpfile() no longer expands the backing file when
     it creates it.

(14) cachefiles_set_object_xattr() is changed to use object->object_size
     rather than cookie->object_size.

(15) cachefiles_check_auxdata() is altered to actually store the content
     type and to also set object->object_size.  The cachefiles_coherency
     tracepoint is also modified to display xattr.object_size.

(16) netfs_read_to_pagecache() is reworked.  The cache ->prepare_read()
     method is replaced with ->query_occupancy() as the arbiter of what
     region of the file is read from where, and that retrieves up to two
     occupied extents of the backing file at once.  The cache
     ->prepare_read() method is now repurposed to be the same as the
     equivalent network filesystem method and allows the cache to limit
     the size of the read before the iterator is prepared.
     netfs_single_dispatch_read() is similarly modified.

(17) netfs_update_i_size() and afs_update_i_size() no longer call
     fscache_update_cookie() to update cookie->object_size.

(18) Write collection now collates contiguous sequences of writes to the
     cache and calls the cache ->collect_write() method.
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/afs/file.c                     |   1 -
 fs/cachefiles/interface.c         |  82 ++-------
 fs/cachefiles/internal.h          |  10 +-
 fs/cachefiles/io.c                | 265 +++++++++++++++++++++++-------
 fs/cachefiles/namei.c             |  19 +--
 fs/cachefiles/xattr.c             |  20 ++-
 fs/netfs/buffered_read.c          | 185 +++++++++++++--------
 fs/netfs/buffered_write.c         |   3 -
 fs/netfs/internal.h               |   2 +
 fs/netfs/read_single.c            |  39 +++--
 fs/netfs/write_collect.c          |  49 +++++-
 fs/netfs/write_issue.c            |   3 +
 include/linux/fscache.h           |  17 ++
 include/linux/netfs.h             |  16 +-
 include/trace/events/cachefiles.h |  15 +-
 15 files changed, 466 insertions(+), 260 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index f609366fd2ac..424e0c98d67f 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -436,7 +436,6 @@ static void afs_update_i_size(struct inode *inode, loff_t new_i_size)
 		inode_set_bytes(&vnode->netfs.inode, new_i_size);
 	}
 	write_sequnlock(&vnode->cb_lock);
-	fscache_update_cookie(afs_vnode_cache(vnode), NULL, &new_i_size);
 }
 
 static void afs_netfs_invalidate_cache(struct netfs_io_request *wreq)
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index a08250d244ea..736bfcaa4e1d 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -105,73 +105,6 @@ void cachefiles_put_object(struct cachefiles_object *object,
 	_leave("");
 }
 
-/*
- * Adjust the size of a cache file if necessary to match the DIO size. We keep
- * the EOF marker a multiple of DIO blocks so that we don't fall back to doing
- * non-DIO for a partial block straddling the EOF, but we also have to be
- * careful of someone expanding the file and accidentally accreting the
- * padding.
- */
-static int cachefiles_adjust_size(struct cachefiles_object *object)
-{
-	struct iattr newattrs;
-	struct file *file = object->file;
-	uint64_t ni_size;
-	loff_t oi_size;
-	int ret;
-
-	ni_size = object->cookie->object_size;
-	ni_size = round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE);
-
-	_enter("{OBJ%x},[%llu]",
-	       object->debug_id, (unsigned long long) ni_size);
-
-	if (!file)
-		return -ENOBUFS;
-
-	oi_size = i_size_read(file_inode(file));
-	if (oi_size == ni_size)
-		return 0;
-
-	inode_lock(file_inode(file));
-
-	/* if there's an extension to a partial page at the end of the backing
-	 * file, we need to discard the partial page so that we pick up new
-	 * data after it */
-	if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
-		_debug("discard tail %llx", oi_size);
-		newattrs.ia_valid = ATTR_SIZE;
-		newattrs.ia_size = oi_size & PAGE_MASK;
-		ret = cachefiles_inject_remove_error();
-		if (ret == 0)
-			ret = notify_change(&nop_mnt_idmap, file->f_path.dentry,
-					    &newattrs, NULL);
-		if (ret < 0)
-			goto truncate_failed;
-	}
-
-	newattrs.ia_valid = ATTR_SIZE;
-	newattrs.ia_size = ni_size;
-	ret = cachefiles_inject_write_error();
-	if (ret == 0)
-		ret = notify_change(&nop_mnt_idmap, file->f_path.dentry,
-				    &newattrs, NULL);
-
-truncate_failed:
-	inode_unlock(file_inode(file));
-
-	if (ret < 0)
-		trace_cachefiles_io_error(NULL, file_inode(file), ret,
-					  cachefiles_trace_notify_change_error);
-	if (ret == -EIO) {
-		cachefiles_io_error_obj(object, "Size set failed");
-		ret = -ENOBUFS;
-	}
-
-	_leave(" = %d", ret);
-	return ret;
-}
-
 /*
  * Attempt to look up the nominated node in this cache
  */
@@ -204,7 +137,6 @@ static bool cachefiles_lookup_cookie(struct fscache_cookie *cookie)
 	spin_lock(&cache->object_list_lock);
 	list_add(&object->cache_link, &cache->object_list);
 	spin_unlock(&cache->object_list_lock);
-	cachefiles_adjust_size(object);
 
 	cachefiles_end_secure(cache, saved_cred);
 	_leave(" = t");
@@ -238,7 +170,7 @@ static bool
cachefiles_shorten_object(struct cachefiles= _object *object, loff_t i_size, dio_size; int ret; =20 - dio_size =3D round_up(new_size, CACHEFILES_DIO_BLOCK_SIZE); + dio_size =3D round_up(new_size, cache->bsize); i_size =3D i_size_read(inode); =20 trace_cachefiles_trunc(object, inode, i_size, dio_size, @@ -270,6 +202,7 @@ static bool cachefiles_shorten_object(struct cachefiles= _object *object, } } =20 + object->object_size =3D new_size; return true; } =20 @@ -284,15 +217,20 @@ static void cachefiles_resize_cookie(struct netfs_cac= he_resources *cres, struct fscache_cookie *cookie =3D object->cookie; const struct cred *saved_cred; struct file *file =3D cachefiles_cres_file(cres); - loff_t old_size =3D cookie->object_size; + unsigned long long i_size =3D i_size_read(file_inode(file)); =20 - _enter("%llu->%llu", old_size, new_size); + _enter("%llu->%llu", i_size, new_size); =20 - if (new_size < old_size) { + if (new_size < i_size) { + /* The file is being shrunk - we need to downsize the backing + * file and clear the end of the final block. 
+ */ cachefiles_begin_secure(cache, &saved_cred); cachefiles_shorten_object(object, file, new_size); cachefiles_end_secure(cache, saved_cred); object->cookie->object_size =3D new_size; + if (new_size =3D=3D 0) + object->content_info =3D CACHEFILES_CONTENT_NO_DATA; return; } =20 diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h index b62cd3e9a18e..00482a13fc48 100644 --- a/fs/cachefiles/internal.h +++ b/fs/cachefiles/internal.h @@ -18,8 +18,6 @@ #include #include =20 -#define CACHEFILES_DIO_BLOCK_SIZE 4096 - struct cachefiles_cache; struct cachefiles_object; =20 @@ -68,12 +66,16 @@ struct cachefiles_object { struct list_head cache_link; /* Link in cache->*_list */ struct file *file; /* The file representing this object */ char *d_name; /* Backing file name */ + unsigned long flags; +#define CACHEFILES_OBJECT_USING_TMPFILE 0 /* Have an unlinked tmpfile */ + unsigned long long object_size; /* Size of the object stored + * (independent of cookie->object_size for + * coherency reasons) + */ int debug_id; spinlock_t lock; refcount_t ref; enum cachefiles_content content_info:8; /* Info about content presence */ - unsigned long flags; -#define CACHEFILES_OBJECT_USING_TMPFILE 0 /* Have an unlinked tmpfile */ #ifdef CONFIG_CACHEFILES_ONDEMAND struct cachefiles_ondemand_info *ondemand; #endif diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c index eaf47851c65f..b5ff75697b3e 100644 --- a/fs/cachefiles/io.c +++ b/fs/cachefiles/io.c @@ -32,6 +32,8 @@ struct cachefiles_kiocb { u64 b_writing; }; =20 +#define IS_ERR_VALUE_LL(x) unlikely((x) >=3D (unsigned long long)-MAX_ERRN= O) + static inline void cachefiles_put_kiocb(struct cachefiles_kiocb *ki) { if (refcount_dec_and_test(&ki->ki_refcnt)) { @@ -193,60 +195,81 @@ static int cachefiles_read(struct netfs_cache_resourc= es *cres, } =20 /* - * Query the occupancy of the cache in a region, returning where the next = chunk - * of data starts and how long it is. 
+ * Query the occupancy of the cache in a region, returning the extent of t= he + * next two chunks of cached data and the next hole. */ static int cachefiles_query_occupancy(struct netfs_cache_resources *cres, - loff_t start, size_t len, size_t granularity, - loff_t *_data_start, size_t *_data_len) + struct fscache_occupancy *occ) { struct cachefiles_object *object; + struct inode *inode; struct file *file; - loff_t off, off2; - - *_data_start =3D -1; - *_data_len =3D 0; + unsigned long long i_size; + loff_t ret; + int i; =20 if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ)) return -ENOBUFS; =20 object =3D cachefiles_cres_object(cres); file =3D cachefiles_cres_file(cres); - granularity =3D max_t(size_t, object->volume->cache->bsize, granularity); + inode =3D file_inode(file); + occ->granularity =3D object->volume->cache->bsize; + + _enter("%pD,%li,%llx-%llx/%llx", + file, inode->i_ino, occ->query_from, occ->query_to, + i_size_read(inode)); + + if (i_size_read(inode) =3D=3D 0) + goto done; + + switch (object->content_info) { + case CACHEFILES_CONTENT_ALL: + case CACHEFILES_CONTENT_SINGLE: + i_size =3D i_size_read(inode); + if (i_size > occ->query_from) { + occ->cached_from[0] =3D 0; + occ->cached_to[0] =3D i_size; + occ->cached_type[0] =3D FSCACHE_EXTENT_DATA; + occ->query_from =3D ULLONG_MAX; + } + goto done; + default: + break; + } =20 - _enter("%pD,%li,%llx,%zx/%llx", - file, file_inode(file)->i_ino, start, len, - i_size_read(file_inode(file))); + for (i =3D 0; i < ARRAY_SIZE(occ->cached_from); i++) { + ret =3D cachefiles_inject_read_error(); + if (ret =3D=3D 0) + ret =3D file->f_op->llseek(file, occ->query_from, SEEK_DATA); + if (IS_ERR_VALUE_LL(ret)) { + if (ret !=3D -ENXIO) + return ret; + occ->query_from =3D ULLONG_MAX; + goto done; + } + occ->cached_type[i] =3D FSCACHE_EXTENT_DATA; + occ->cached_from[i] =3D ret; + occ->query_from =3D ret; + + ret =3D cachefiles_inject_read_error(); + if (ret =3D=3D 0) + ret =3D file->f_op->llseek(file, 
occ->query_from, SEEK_HOLE); + if (IS_ERR_VALUE_LL(ret)) { + if (ret !=3D -ENXIO) + return ret; + occ->query_from =3D ULLONG_MAX; + goto done; + } + occ->cached_to[i] =3D ret; + occ->query_from =3D ret; + if (occ->query_from >=3D occ->query_to) + break; + } =20 - off =3D cachefiles_inject_read_error(); - if (off =3D=3D 0) - off =3D vfs_llseek(file, start, SEEK_DATA); - if (off =3D=3D -ENXIO) - return -ENODATA; /* Beyond EOF */ - if (off < 0 && off >=3D (loff_t)-MAX_ERRNO) - return -ENOBUFS; /* Error. */ - if (round_up(off, granularity) >=3D start + len) - return -ENODATA; /* No data in range */ - - off2 =3D cachefiles_inject_read_error(); - if (off2 =3D=3D 0) - off2 =3D vfs_llseek(file, off, SEEK_HOLE); - if (off2 =3D=3D -ENXIO) - return -ENODATA; /* Beyond EOF */ - if (off2 < 0 && off2 >=3D (loff_t)-MAX_ERRNO) - return -ENOBUFS; /* Error. */ - - /* Round away partial blocks */ - off =3D round_up(off, granularity); - off2 =3D round_down(off2, granularity); - if (off2 <=3D off) - return -ENODATA; - - *_data_start =3D off; - if (off2 > start + len) - *_data_len =3D len; - else - *_data_len =3D off2 - off; +done: + _debug("query[0] %llx-%llx", occ->cached_from[0], occ->cached_to[0]); + _debug("query[1] %llx-%llx", occ->cached_from[1], occ->cached_to[1]); return 0; } =20 @@ -489,18 +512,6 @@ cachefiles_do_prepare_read(struct netfs_cache_resource= s *cres, return ret; } =20 -/* - * Prepare a read operation, shortening it to a cached/uncached - * boundary as appropriate. - */ -static enum netfs_io_source cachefiles_prepare_read(struct netfs_io_subreq= uest *subreq, - unsigned long long i_size) -{ - return cachefiles_do_prepare_read(&subreq->rreq->cache_resources, - subreq->start, &subreq->len, i_size, - &subreq->flags, subreq->rreq->inode->i_ino); -} - /* * Prepare an on-demand read operation, shortening it to a cached/uncached * boundary as appropriate. 
@@ -658,9 +669,9 @@ static void cachefiles_issue_write(struct netfs_io_subr= equest *subreq) wreq->debug_id, subreq->debug_index, start, start + len - 1); =20 /* We need to start on the cache granularity boundary */ - off =3D start & (CACHEFILES_DIO_BLOCK_SIZE - 1); + off =3D start & (cache->bsize - 1); if (off) { - pre =3D CACHEFILES_DIO_BLOCK_SIZE - off; + pre =3D cache->bsize - off; if (pre >=3D len) { fscache_count_dio_misfit(); netfs_write_subrequest_terminated(subreq, len); @@ -674,8 +685,8 @@ static void cachefiles_issue_write(struct netfs_io_subr= equest *subreq) =20 /* We also need to end on the cache granularity boundary */ if (start + len =3D=3D wreq->i_size) { - size_t part =3D len % CACHEFILES_DIO_BLOCK_SIZE; - size_t need =3D CACHEFILES_DIO_BLOCK_SIZE - part; + size_t part =3D len & (cache->bsize - 1); + size_t need =3D cache->bsize - part; =20 if (part && stream->submit_extendable_to >=3D need) { len +=3D need; @@ -684,7 +695,7 @@ static void cachefiles_issue_write(struct netfs_io_subr= equest *subreq) } } =20 - post =3D len & (CACHEFILES_DIO_BLOCK_SIZE - 1); + post =3D len & (cache->bsize - 1); if (post) { len -=3D post; if (len =3D=3D 0) { @@ -711,6 +722,134 @@ static void cachefiles_issue_write(struct netfs_io_su= brequest *subreq) netfs_write_subrequest_terminated, subreq); } =20 +/* + * Collect the result of buffered writeback to the cache. This includes + * copying a read to the cache. Netfslib collates the results, which might + * occur out of order, and delivers them to the cache so that it can updat= e its + * content record. + * + * The writes we made are all rounded out at both sides to the nearest DIO + * block boundary, so if the final block contains the EOF in the middle of= it + * (rather than at the end), padding will have been written to the file. = The + * backing file's filesize will have been updated if the write extended the + * file; the filesize may still change due to outstanding subreqs. 
+ * + * The metadata in the cache file xattr records the size of the object we = have + * stored, but the cache file EOF only goes up to where we've cached data = to + * and, furthermore, is rounded up to the nearest DIO block boundary. + */ +static void cachefiles_collect_write(struct netfs_io_request *wreq, + unsigned long long start, size_t len) +{ + struct netfs_cache_resources *cres =3D &wreq->cache_resources; + struct cachefiles_object *object =3D cachefiles_cres_object(cres); + struct cachefiles_cache *cache =3D object->volume->cache; + struct fscache_cookie *cookie =3D fscache_cres_cookie(cres); + struct file *file =3D cachefiles_cres_file(cres); + unsigned long long old_size =3D cres->cache_i_size; + unsigned long long new_size =3D i_size_read(file_inode(file)); + unsigned long long data_to =3D cookie->object_size; + unsigned long long end =3D start + len; + int ret; + + _enter("%llx,%zx,%x", start, len, cache->bsize); + + if (WARN_ON(old_size & (cache->bsize - 1)) || + WARN_ON(new_size & (cache->bsize - 1)) || + WARN_ON(start & (cache->bsize - 1)) || + WARN_ON(len & (cache->bsize - 1))) { + trace_cachefiles_io_error(object, file_inode(file), -EIO, + cachefiles_trace_alignment_error); + cachefiles_remove_object_xattr(cache, object, file->f_path.dentry); + return; + } + + /* Zeroth case: Single monolithic files are handled specially. + */ + if (wreq->origin =3D=3D NETFS_WRITEBACK_SINGLE) { + object->content_info =3D CACHEFILES_CONTENT_SINGLE; + goto update_sizes; + } + + /* First case: The backing file was empty. */ + if (old_size =3D=3D 0) { + if (start =3D=3D 0) + object->content_info =3D CACHEFILES_CONTENT_ALL; + else + object->content_info =3D CACHEFILES_CONTENT_BACKFS_MAP; + goto update_sizes; + } + + /* Second case: The backing file is entirely within the old object size + * and thus there can be no partial tail block to deal with in the + * cache file. 
+ */ + if (old_size <=3D data_to) { + if (start > old_size) + goto discontiguous; + goto update_sizes; + } + + /* Third case: The write happened entirely within the bounds of the + * current cache file's size. + */ + if (end <=3D old_size) + goto update_sizes; + + /* Fourth case: The write overwrote the partial tail block and extended + * the file. We only need to update the object size because netfslib + * rounds out/pads cache writes to whole disk blocks. + */ + if (start < old_size) + goto update_sizes; + + /* Fifth case: The write started from the end of the whole tail block + * and extended the file. Just extend our notion of the filesize. + */ + if (start =3D=3D old_size && old_size =3D=3D data_to) + goto update_sizes; + + /* Sixth case: The write continued on from the partial tail block and + * extended the file. Need to clear the gap. + */ + if (start =3D=3D old_size && old_size > data_to) + goto clear_gap; + +discontiguous: + /* Seventh case: The write was beyond the EOF on the cache file, so now + * there's a hole in the file and we can no longer say in the metadata + * that we can assume we have it all. We may also need to clear the + * end of the partial tail block. + */ + /* TODO: For the moment, we will have to use SEEK_HOLE/SEEK_DATA. */ + object->content_info =3D CACHEFILES_CONTENT_BACKFS_MAP; + +clear_gap: + /* We need to clear any partial padding that got jumped over. It + * *should* be all zeros, but shared-writable mmap exists... 
+ */ + if (old_size > data_to) { + trace_cachefiles_trunc(object, file_inode(file), data_to, old_size, + cachefiles_trunc_clear_padding); + ret =3D cachefiles_inject_write_error(); + if (ret =3D=3D 0) + ret =3D vfs_fallocate(file, FALLOC_FL_ZERO_RANGE, + data_to, old_size - data_to); + if (ret < 0) { + trace_cachefiles_io_error(object, file_inode(file), ret, + cachefiles_trace_fallocate_error); + cachefiles_io_error_obj(object, "fallocate zero pad failed %d", ret); + cachefiles_remove_object_xattr(cache, object, file->f_path.dentry); + return; + } + } + +update_sizes: + cres->cache_i_size =3D umax(old_size, end); + object->object_size =3D cookie->object_size; + return; +} + /* * Clean up an operation. */ @@ -728,11 +867,11 @@ static const struct netfs_cache_ops cachefiles_netfs_= cache_ops =3D { .read =3D cachefiles_read, .write =3D cachefiles_write, .issue_write =3D cachefiles_issue_write, - .prepare_read =3D cachefiles_prepare_read, .prepare_write =3D cachefiles_prepare_write, .prepare_write_subreq =3D cachefiles_prepare_write_subreq, .prepare_ondemand_read =3D cachefiles_prepare_ondemand_read, .query_occupancy =3D cachefiles_query_occupancy, + .collect_write =3D cachefiles_collect_write, }; =20 /* @@ -742,14 +881,18 @@ bool cachefiles_begin_operation(struct netfs_cache_re= sources *cres, enum fscache_want_state want_state) { struct cachefiles_object *object =3D cachefiles_cres_object(cres); + struct file *file; =20 if (!cachefiles_cres_file(cres)) { cres->ops =3D &cachefiles_netfs_cache_ops; if (object->file) { spin_lock(&object->lock); - if (!cres->cache_priv2 && object->file) - cres->cache_priv2 =3D get_file(object->file); + file =3D object->file; + if (!cres->cache_priv2 && file) + cres->cache_priv2 =3D get_file(file); spin_unlock(&object->lock); + cres->cache_i_size =3D i_size_read(file_inode(file)); + cres->dio_size =3D object->volume->cache->bsize; } } =20 diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index 20138309733f..38d730233658 100644 
--- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -449,7 +449,6 @@ struct file *cachefiles_create_tmpfile(struct cachefile= s_object *object) struct dentry *fan =3D volume->fanout[(u8)object->cookie->key_hash]; struct file *file; const struct path parentpath =3D { .mnt =3D cache->mnt, .dentry =3D fan }; - uint64_t ni_size; long ret; =20 =20 @@ -481,23 +480,6 @@ struct file *cachefiles_create_tmpfile(struct cachefil= es_object *object) if (ret < 0) goto err_unuse; =20 - ni_size =3D object->cookie->object_size; - ni_size =3D round_up(ni_size, CACHEFILES_DIO_BLOCK_SIZE); - - if (ni_size > 0) { - trace_cachefiles_trunc(object, file_inode(file), 0, ni_size, - cachefiles_trunc_expand_tmpfile); - ret =3D cachefiles_inject_write_error(); - if (ret =3D=3D 0) - ret =3D vfs_truncate(&file->f_path, ni_size); - if (ret < 0) { - trace_cachefiles_vfs_error( - object, file_inode(file), ret, - cachefiles_trace_trunc_error); - goto err_unuse; - } - } - ret =3D -EINVAL; if (unlikely(!file->f_op->read_iter) || unlikely(!file->f_op->write_iter)) { @@ -507,6 +489,7 @@ struct file *cachefiles_create_tmpfile(struct cachefile= s_object *object) } out: cachefiles_end_secure(cache, saved_cred); + object->content_info =3D CACHEFILES_CONTENT_ALL; return file; =20 err_unuse: diff --git a/fs/cachefiles/xattr.c b/fs/cachefiles/xattr.c index 52383b1d0ba6..27f969c41eef 100644 --- a/fs/cachefiles/xattr.c +++ b/fs/cachefiles/xattr.c @@ -54,7 +54,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object = *object) if (!buf) return -ENOMEM; =20 - buf->object_size =3D cpu_to_be64(object->cookie->object_size); + buf->object_size =3D cpu_to_be64(object->object_size); buf->zero_point =3D 0; buf->type =3D CACHEFILES_COOKIE_TYPE_DATA; buf->content =3D object->content_info; @@ -77,6 +77,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object = *object) trace_cachefiles_vfs_error(object, file_inode(file), ret, cachefiles_trace_setxattr_error); trace_cachefiles_coherency(object, 
file_inode(file)->i_ino, + object->object_size, be64_to_cpup((__be64 *)buf->data), buf->content, cachefiles_coherency_set_fail); @@ -86,6 +87,7 @@ int cachefiles_set_object_xattr(struct cachefiles_object = *object) "Failed to set xattr with error %d", ret); } else { trace_cachefiles_coherency(object, file_inode(file)->i_ino, + object->object_size, be64_to_cpup((__be64 *)buf->data), buf->content, cachefiles_coherency_set_ok); @@ -106,6 +108,7 @@ int cachefiles_check_auxdata(struct cachefiles_object *= object, struct file *file unsigned int len =3D object->cookie->aux_len, tlen; const void *p =3D fscache_get_aux(object->cookie); enum cachefiles_coherency_trace why; + unsigned long long obj_size; ssize_t xlen; int ret =3D -ESTALE; =20 @@ -127,29 +130,33 @@ int cachefiles_check_auxdata(struct cachefiles_object= *object, struct file *file cachefiles_io_error_obj( object, "Failed to read aux with error %zd", xlen); - why =3D cachefiles_coherency_check_xattr; + trace_cachefiles_coherency(object, file_inode(file)->i_ino, 0, 0, 0, + cachefiles_coherency_check_xattr); goto out; } =20 + obj_size =3D be64_to_cpu(buf->object_size); if (buf->type !=3D CACHEFILES_COOKIE_TYPE_DATA) { why =3D cachefiles_coherency_check_type; } else if (memcmp(buf->data, p, len) !=3D 0) { why =3D cachefiles_coherency_check_aux; - } else if (be64_to_cpu(buf->object_size) !=3D object->cookie->object_size= ) { + } else if (obj_size !=3D object->cookie->object_size) { why =3D cachefiles_coherency_check_objsize; } else if (buf->content =3D=3D CACHEFILES_CONTENT_DIRTY) { // TODO: Begin conflict resolution pr_warn("Dirty object in cache\n"); why =3D cachefiles_coherency_check_dirty; } else { + object->content_info =3D buf->content; + object->object_size =3D obj_size; why =3D cachefiles_coherency_check_ok; ret =3D 0; } =20 -out: - trace_cachefiles_coherency(object, file_inode(file)->i_ino, + trace_cachefiles_coherency(object, file_inode(file)->i_ino, obj_size, be64_to_cpup((__be64 *)buf->data), 
buf->content, why); +out: kfree(buf); return ret; } @@ -163,6 +170,9 @@ int cachefiles_remove_object_xattr(struct cachefiles_ca= che *cache, { int ret; =20 + trace_cachefiles_coherency(object, d_inode(dentry)->i_ino, 0, 0, 0, + cachefiles_coherency_remove); + ret =3D cachefiles_inject_remove_error(); if (ret =3D=3D 0) { ret =3D mnt_want_write(cache->mnt); diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c index a8c0d86118c5..aee59ccea257 100644 --- a/fs/netfs/buffered_read.c +++ b/fs/netfs/buffered_read.c @@ -127,21 +127,6 @@ static ssize_t netfs_prepare_read_iterator(struct netf= s_io_subrequest *subreq, return subreq->len; } =20 -static enum netfs_io_source netfs_cache_prepare_read(struct netfs_io_reque= st *rreq, - struct netfs_io_subrequest *subreq, - loff_t i_size) -{ - struct netfs_cache_resources *cres =3D &rreq->cache_resources; - enum netfs_io_source source; - - if (!cres->ops) - return NETFS_DOWNLOAD_FROM_SERVER; - source =3D cres->ops->prepare_read(subreq, i_size); - trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); - return source; - -} - /* * Issue a read against the cache. * - Eats the caller's ref on subreq. 
@@ -156,6 +141,19 @@ static void netfs_read_cache_to_pagecache(struct netfs= _io_request *rreq, netfs_cache_read_terminated, subreq); } =20 +int netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_o= ccupancy *occ) +{ + struct netfs_cache_resources *cres =3D &rreq->cache_resources; + + occ->granularity =3D PAGE_SIZE; + if (occ->query_from >=3D occ->query_to) + return 0; + if (!cres->ops) + return 0; + occ->query_from =3D round_up(occ->query_from, occ->granularity); + return cres->ops->query_occupancy(cres, occ); +} + static void netfs_queue_read(struct netfs_io_request *rreq, struct netfs_io_subrequest *subreq, bool last_subreq) @@ -214,16 +212,55 @@ static void netfs_issue_read(struct netfs_io_request = *rreq, static void netfs_read_to_pagecache(struct netfs_io_request *rreq, struct readahead_control *ractl) { + struct fscache_occupancy _occ =3D { + .query_from =3D rreq->start, + .query_to =3D rreq->start + rreq->len, + .cached_from[0] =3D 0, + .cached_to[0] =3D 0, + .cached_from[1] =3D ULLONG_MAX, + .cached_to[1] =3D ULLONG_MAX, + }; + struct fscache_occupancy *occ =3D &_occ; struct netfs_inode *ictx =3D netfs_inode(rreq->inode); unsigned long long start =3D rreq->start; ssize_t size =3D rreq->len; int ret =3D 0; =20 do { + int (*prepare_read)(struct netfs_io_subrequest *subreq) =3D NULL; struct netfs_io_subrequest *subreq; - enum netfs_io_source source =3D NETFS_SOURCE_UNKNOWN; + unsigned long long hole_to, cache_to; ssize_t slice; =20 + /* If we don't have any, find out the next couple of data + * extents from the cache, containing of following the + * specified start offset. Holes have to be fetched from the + * server; data regions from the cache. + */ + hole_to =3D occ->cached_from[0]; + cache_to =3D occ->cached_to[0]; + if (start >=3D cache_to) { + /* Extent exhausted; shuffle down. 
*/ + int i; + + for (i =3D 0; i < ARRAY_SIZE(occ->cached_from) - 1; i++) { + occ->cached_from[i] =3D occ->cached_from[i + 1]; + occ->cached_to[i] =3D occ->cached_to[i + 1]; + occ->cached_type[i] =3D occ->cached_type[i + 1]; + } + occ->cached_from[i] =3D ULLONG_MAX; + occ->cached_to[i] =3D ULLONG_MAX; + + if (occ->cached_from[0] !=3D ULLONG_MAX) + continue; + + /* Get new extents */ + ret =3D netfs_read_query_cache(rreq, occ); + if (ret < 0) + break; + continue; + } + subreq =3D netfs_alloc_subrequest(rreq); if (!subreq) { ret =3D -ENOMEM; @@ -233,65 +270,81 @@ static void netfs_read_to_pagecache(struct netfs_io_r= equest *rreq, subreq->start =3D start; subreq->len =3D size; =20 - source =3D netfs_cache_prepare_read(rreq, subreq, rreq->i_size); - subreq->source =3D source; - if (source =3D=3D NETFS_DOWNLOAD_FROM_SERVER) { - unsigned long long zp =3D umin(ictx->zero_point, rreq->i_size); - size_t len =3D subreq->len; - - if (unlikely(rreq->origin =3D=3D NETFS_READ_SINGLE)) - zp =3D rreq->i_size; - if (subreq->start >=3D zp) { - subreq->source =3D source =3D NETFS_FILL_WITH_ZEROES; - goto fill_with_zeroes; + _debug("rsub %llx %llx-%llx", subreq->start, hole_to, cache_to); + + if (start >=3D hole_to && start < cache_to) { + /* Overlap with a cached region, where the cache may + * record a block of zeroes. 
+ */ + _debug("cached s=3D%llx c=3D%llx l=3D%zx", start, cache_to, size); + subreq->len =3D umin(cache_to - start, size); + subreq->len =3D round_up(subreq->len, occ->granularity); + if (occ->cached_type[0] =3D=3D FSCACHE_EXTENT_ZERO) { + subreq->source =3D NETFS_FILL_WITH_ZEROES; + netfs_stat(&netfs_n_rh_zero); + } else { + subreq->source =3D NETFS_READ_FROM_CACHE; + prepare_read =3D rreq->cache_resources.ops->prepare_read; } =20 - if (len > zp - subreq->start) - len =3D zp - subreq->start; - if (len =3D=3D 0) { - pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D= %llx", - rreq->debug_id, subreq->debug_index, - subreq->len, size, - subreq->start, ictx->zero_point, rreq->i_size); - break; - } - subreq->len =3D len; - - netfs_stat(&netfs_n_rh_download); - if (rreq->netfs_ops->prepare_read) { - ret =3D rreq->netfs_ops->prepare_read(subreq); - if (ret < 0) { - subreq->error =3D ret; - /* Not queued - release both refs. */ - netfs_put_subrequest(subreq, - netfs_sreq_trace_put_cancel); - netfs_put_subrequest(subreq, - netfs_sreq_trace_put_cancel); - break; - } - trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); - } - goto issue; - } + trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); =20 - fill_with_zeroes: - if (source =3D=3D NETFS_FILL_WITH_ZEROES) { + } else if ((subreq->start >=3D ictx->zero_point || + subreq->start >=3D rreq->i_size) && + size > 0) { + /* If this range lies beyond the zero-point, that part + * can just be cleared locally. + */ + _debug("zero %llx-%llx", start, start + size); + subreq->len =3D size; subreq->source =3D NETFS_FILL_WITH_ZEROES; - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + if (rreq->cache_resources.ops) + __set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); netfs_stat(&netfs_n_rh_zero); - goto issue; + } else { + /* Read a cache hole from the server. If any part of + * this range lies beyond the zero-point or the EOF, + * that part can just be cleared locally. 
+ */ + unsigned long long zlimit =3D umin(rreq->i_size, ictx->zero_point); + unsigned long long limit =3D min3(zlimit, start + size, hole_to); + + _debug("limit %llx %llx", rreq->i_size, ictx->zero_point); + _debug("download %llx-%llx", start, start + size); + subreq->len =3D umin(limit - subreq->start, ULONG_MAX); + subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; + if (rreq->cache_resources.ops) + __set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); + netfs_stat(&netfs_n_rh_download); } =20 - if (source =3D=3D NETFS_READ_FROM_CACHE) { - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); - goto issue; + if (size =3D=3D 0) { + pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D%= llx", + rreq->debug_id, subreq->debug_index, + subreq->len, size, + subreq->start, ictx->zero_point, rreq->i_size); + trace_netfs_sreq(subreq, netfs_sreq_trace_cancel); + /* Not queued - release both refs. */ + netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel); + netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel); + break; } =20 - pr_err("Unexpected read source %u\n", source); - WARN_ON_ONCE(1); - break; + rreq->io_streams[0].sreq_max_len =3D MAX_RW_COUNT; + rreq->io_streams[0].sreq_max_segs =3D INT_MAX; + + if (prepare_read) { + ret =3D prepare_read(subreq); + if (ret < 0) { + subreq->error =3D ret; + /* Not queued - release both refs. 
*/ + netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel); + netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel); + break; + } + trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); + } =20 - issue: slice =3D netfs_prepare_read_iterator(subreq, ractl); if (slice < 0) { ret =3D slice; @@ -305,6 +358,8 @@ static void netfs_read_to_pagecache(struct netfs_io_req= uest *rreq, size -=3D slice; start +=3D slice; =20 + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + netfs_queue_read(rreq, subreq, size <=3D 0); netfs_issue_read(rreq, subreq); cond_resched(); diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c index 22a4d61631c9..bce3e7109ec1 100644 --- a/fs/netfs/buffered_write.c +++ b/fs/netfs/buffered_write.c @@ -73,9 +73,6 @@ void netfs_update_i_size(struct netfs_inode *ctx, struct = inode *inode, i_size =3D i_size_read(inode); if (end > i_size) { i_size_write(inode, end); -#if IS_ENABLED(CONFIG_FSCACHE) - fscache_update_cookie(ctx->cache, NULL, &end); -#endif =20 gap =3D SECTOR_SIZE - (i_size & (SECTOR_SIZE - 1)); if (copied > gap) { diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h index d436e20d3418..2fcf31de5f2c 100644 --- a/fs/netfs/internal.h +++ b/fs/netfs/internal.h @@ -23,6 +23,8 @@ /* * buffered_read.c */ +int netfs_read_query_cache(struct netfs_io_request *rreq, + struct fscache_occupancy *occ); void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error); int netfs_prefetch_for_write(struct file *file, struct folio *folio, size_t offset, size_t len); diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c index d0e23bc42445..d87a03859ebd 100644 --- a/fs/netfs/read_single.c +++ b/fs/netfs/read_single.c @@ -58,20 +58,6 @@ static int netfs_single_begin_cache_read(struct netfs_io= _request *rreq, struct n return fscache_begin_read_operation(&rreq->cache_resources, netfs_i_cooki= e(ctx)); } =20 -static void netfs_single_cache_prepare_read(struct netfs_io_request *rreq, - struct netfs_io_subrequest *subreq) -{ - 
struct netfs_cache_resources *cres =3D &rreq->cache_resources; - - if (!cres->ops) { - subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; - return; - } - subreq->source =3D cres->ops->prepare_read(subreq, rreq->i_size); - trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); - -} - static void netfs_single_read_cache(struct netfs_io_request *rreq, struct netfs_io_subrequest *subreq) { @@ -90,6 +76,14 @@ static void netfs_single_read_cache(struct netfs_io_requ= est *rreq, static int netfs_single_dispatch_read(struct netfs_io_request *rreq) { struct netfs_io_stream *stream =3D &rreq->io_streams[0]; + struct fscache_occupancy occ =3D { + .query_from =3D 0, + .query_to =3D rreq->len, + .cached_from[0] =3D ULLONG_MAX, + .cached_to[0] =3D ULLONG_MAX, + .cached_from[1] =3D ULLONG_MAX, + .cached_to[1] =3D ULLONG_MAX, + }; struct netfs_io_subrequest *subreq; int ret =3D 0; =20 @@ -97,11 +91,19 @@ static int netfs_single_dispatch_read(struct netfs_io_r= equest *rreq) if (!subreq) return -ENOMEM; =20 - subreq->source =3D NETFS_SOURCE_UNKNOWN; + subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; subreq->start =3D 0; subreq->len =3D rreq->len; subreq->io_iter =3D rreq->buffer.iter; =20 + /* Try to use the cache if the cache content matches the size of the + * remote file. 
+ */ + netfs_read_query_cache(rreq, &occ); + if (occ.cached_from[0] =3D=3D 0 && + occ.cached_to[0] =3D=3D rreq->len) + subreq->source =3D NETFS_READ_FROM_CACHE; + __set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags); =20 spin_lock(&rreq->lock); @@ -111,7 +113,6 @@ static int netfs_single_dispatch_read(struct netfs_io_r= equest *rreq) smp_store_release(&stream->active, true); spin_unlock(&rreq->lock); =20 - netfs_single_cache_prepare_read(rreq, subreq); switch (subreq->source) { case NETFS_DOWNLOAD_FROM_SERVER: netfs_stat(&netfs_n_rh_download); @@ -125,6 +126,12 @@ static int netfs_single_dispatch_read(struct netfs_io_= request *rreq) rreq->submitted +=3D subreq->len; break; case NETFS_READ_FROM_CACHE: + if (rreq->cache_resources.ops->prepare_read) { + ret =3D rreq->cache_resources.ops->prepare_read(subreq); + if (ret < 0) + goto cancel; + } + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); netfs_single_read_cache(rreq, subreq); rreq->submitted +=3D subreq->len; diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c index b194447f4b11..a839735d5675 100644 --- a/fs/netfs/write_collect.c +++ b/fs/netfs/write_collect.c @@ -185,6 +185,16 @@ static void netfs_writeback_unlock_folios(struct netfs= _io_request *wreq, wreq->buffer.first_tail_slot =3D slot; } =20 +static void netfs_cache_collect(struct netfs_io_request *wreq, + struct netfs_io_stream *stream) +{ + struct netfs_cache_resources *cres =3D &wreq->cache_resources; + + if (cres->ops && cres->ops->collect_write) + cres->ops->collect_write(wreq, wreq->cache_coll_to, + stream->collected_to - wreq->cache_coll_to); +} + /* * Collect and assess the results of various write subrequests. We may ne= ed to * retry some of the results - or even do an RMW cycle for content crypto. 
@@ -238,6 +248,11 @@ static void netfs_collect_write_results(struct netfs_i= o_request *wreq) if (stream->collected_to < front->start) { trace_netfs_collect_gap(wreq, stream, issued_to, 'F'); stream->collected_to =3D front->start; + if (stream->source =3D=3D NETFS_WRITE_TO_CACHE) { + if (wreq->cache_coll_to < stream->collected_to) + netfs_cache_collect(wreq, stream); + wreq->cache_coll_to =3D stream->collected_to; + } } =20 /* Stall if the front is still undergoing I/O. */ @@ -261,8 +276,19 @@ static void netfs_collect_write_results(struct netfs_i= o_request *wreq) if (test_bit(NETFS_SREQ_FAILED, &front->flags)) { stream->failed =3D true; stream->error =3D front->error; - if (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER) + switch (stream->source) { + case NETFS_UPLOAD_TO_SERVER: mapping_set_error(wreq->mapping, front->error); + break; + case NETFS_WRITE_TO_CACHE: + if (wreq->cache_coll_to < stream->collected_to) + netfs_cache_collect(wreq, stream); + wreq->cache_coll_to =3D stream->collected_to + front->len; + break; + default: + WARN_ON(1); + break; + } notes |=3D NEED_REASSESS | SAW_FAILURE; break; } @@ -355,6 +381,7 @@ static void netfs_collect_write_results(struct netfs_io= _request *wreq) */ bool netfs_write_collection(struct netfs_io_request *wreq) { + struct netfs_io_stream *cstream =3D &wreq->io_streams[1]; struct netfs_inode *ictx =3D netfs_inode(wreq->inode); size_t transferred; bool transferred_valid =3D false; @@ -390,13 +417,19 @@ bool netfs_write_collection(struct netfs_io_request *= wreq) wreq->transferred =3D transferred; trace_netfs_rreq(wreq, netfs_rreq_trace_write_done); =20 - if (wreq->io_streams[1].active && - wreq->io_streams[1].failed && - ictx->ops->invalidate_cache) { - /* Cache write failure doesn't prevent writeback completion - * unless we're in disconnected mode. 
- */ - ictx->ops->invalidate_cache(wreq); + if (cstream->active) { + if (cstream->failed) { + if (ictx->ops->invalidate_cache) + /* Cache write failure doesn't prevent + * writeback completion unless we're in + * disconnected mode. + */ + ictx->ops->invalidate_cache(wreq); + } else { + if (wreq->cache_coll_to < cstream->collected_to) + netfs_cache_collect(wreq, cstream); + wreq->cache_coll_to =3D cstream->collected_to; + } } =20 _debug("finished"); diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c index 2db688f94125..2de6b8621e11 100644 --- a/fs/netfs/write_issue.c +++ b/fs/netfs/write_issue.c @@ -112,6 +112,8 @@ struct netfs_io_request *netfs_create_write_req(struct = address_space *mapping, goto nomem; =20 wreq->cleaned_to =3D wreq->start; + if (wreq->cache_resources.dio_size > 1) + wreq->cache_coll_to =3D round_down(wreq->start, wreq->cache_resources.di= o_size); =20 wreq->io_streams[0].stream_nr =3D 0; wreq->io_streams[0].source =3D NETFS_UPLOAD_TO_SERVER; @@ -263,6 +265,7 @@ void netfs_issue_write(struct netfs_io_request *wreq, =20 if (!subreq) return; + stream->construct =3D NULL; subreq->io_iter.count =3D subreq->len; netfs_do_issue_write(stream, subreq); diff --git a/include/linux/fscache.h b/include/linux/fscache.h index 58fdb9605425..850d20241075 100644 --- a/include/linux/fscache.h +++ b/include/linux/fscache.h @@ -147,6 +147,23 @@ struct fscache_cookie { }; }; =20 +enum fscache_extent_type { + FSCACHE_EXTENT_DATA, + FSCACHE_EXTENT_ZERO, +} __mode(byte); + +/* + * Cache occupancy information. 
+ */ +struct fscache_occupancy { + unsigned long long query_from; /* Point to query from */ + unsigned long long query_to; /* Point to query to */ + unsigned long long cached_from[2]; /* Point at which cache extents start = */ + unsigned long long cached_to[2]; /* Point at which cache extents end */ + unsigned int granularity; /* Granularity desired */ + enum fscache_extent_type cached_type[2]; /* Type of cache extent */ +}; + /* * slow-path functions for when there is actually caching available, and t= he * netfs does actually have a valid token diff --git a/include/linux/netfs.h b/include/linux/netfs.h index ba17ac5bf356..77238bc4a712 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -22,6 +22,7 @@ =20 enum netfs_sreq_ref_trace; typedef struct mempool mempool_t; +struct fscache_occupancy; struct folio_queue; =20 /** @@ -159,8 +160,10 @@ struct netfs_cache_resources { const struct netfs_cache_ops *ops; void *cache_priv; void *cache_priv2; + unsigned long long cache_i_size; /* Initial size of cache file */ unsigned int debug_id; /* Cookie debug ID */ unsigned int inval_counter; /* object->inval_counter at begin_op */ + unsigned int dio_size; /* DIO block size */ }; =20 /* @@ -250,6 +253,7 @@ struct netfs_io_request { unsigned long long start; /* Start position */ atomic64_t issued_to; /* Write issuer folio cursor */ unsigned long long collected_to; /* Point we've collected to */ + unsigned long long cache_coll_to; /* Point the cache has collected to */ unsigned long long cleaned_to; /* Position we've cleaned folios to */ unsigned long long abandon_to; /* Position to abandon folios to */ pgoff_t no_unlock_folio; /* Don't unlock this folio after read */ @@ -354,8 +358,7 @@ struct netfs_cache_ops { /* Prepare a read operation, shortening it to a cached/uncached * boundary as appropriate. 
*/ - enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq, - unsigned long long i_size); + int (*prepare_read)(struct netfs_io_subrequest *subreq); =20 /* Prepare a write subrequest, working out if we're allowed to do it * and finding out the maximum amount of data to gather before @@ -383,8 +386,13 @@ struct netfs_cache_ops { * next chunk of data starts and how long it is. */ int (*query_occupancy)(struct netfs_cache_resources *cres, - loff_t start, size_t len, size_t granularity, - loff_t *_data_start, size_t *_data_len); + struct fscache_occupancy *occ); + + /* Collect the result of buffered writeback to the cache. + * This includes copying a read to the cache. + */ + void (*collect_write)(struct netfs_io_request *wreq, + unsigned long long start, size_t len); }; =20 /* High-level read API. */ diff --git a/include/trace/events/cachefiles.h b/include/trace/events/cache= files.h index a743b2a35ea7..4bba6fda1f8b 100644 --- a/include/trace/events/cachefiles.h +++ b/include/trace/events/cachefiles.h @@ -56,6 +56,7 @@ enum cachefiles_coherency_trace { cachefiles_coherency_check_ok, cachefiles_coherency_check_type, cachefiles_coherency_check_xattr, + cachefiles_coherency_remove, cachefiles_coherency_set_fail, cachefiles_coherency_set_ok, cachefiles_coherency_vol_check_cmp, @@ -67,6 +68,7 @@ enum cachefiles_coherency_trace { }; =20 enum cachefiles_trunc_trace { + cachefiles_trunc_clear_padding, cachefiles_trunc_dio_adjust, cachefiles_trunc_expand_tmpfile, cachefiles_trunc_shrink, @@ -84,6 +86,7 @@ enum cachefiles_prepare_read_trace { }; =20 enum cachefiles_error_trace { + cachefiles_trace_alignment_error, cachefiles_trace_fallocate_error, cachefiles_trace_getxattr_error, cachefiles_trace_link_error, @@ -144,6 +147,7 @@ enum cachefiles_error_trace { EM(cachefiles_coherency_check_ok, "OK ") \ EM(cachefiles_coherency_check_type, "BAD type") \ EM(cachefiles_coherency_check_xattr, "BAD xatt") \ + EM(cachefiles_coherency_remove, "REMOVE ") \ 
EM(cachefiles_coherency_set_fail, "SET fail") \ EM(cachefiles_coherency_set_ok, "SET ok ") \ EM(cachefiles_coherency_vol_check_cmp, "VOL BAD cmp ") \ @@ -154,6 +158,7 @@ enum cachefiles_error_trace { E_(cachefiles_coherency_vol_set_ok, "VOL SET ok ") =20 #define cachefiles_trunc_traces \ + EM(cachefiles_trunc_clear_padding, "CLRPAD") \ EM(cachefiles_trunc_dio_adjust, "DIOADJ") \ EM(cachefiles_trunc_expand_tmpfile, "EXPTMP") \ E_(cachefiles_trunc_shrink, "SHRINK") @@ -169,6 +174,7 @@ enum cachefiles_error_trace { E_(cachefiles_trace_read_seek_nxio, "seek-enxio") =20 #define cachefiles_error_traces \ + EM(cachefiles_trace_alignment_error, "align") \ EM(cachefiles_trace_fallocate_error, "fallocate") \ EM(cachefiles_trace_getxattr_error, "getxattr") \ EM(cachefiles_trace_link_error, "link") \ @@ -379,12 +385,12 @@ TRACE_EVENT(cachefiles_rename, =20 TRACE_EVENT(cachefiles_coherency, TP_PROTO(struct cachefiles_object *obj, - ino_t ino, + ino_t ino, unsigned long long obj_size, u64 disk_aux, enum cachefiles_content content, enum cachefiles_coherency_trace why), =20 - TP_ARGS(obj, ino, disk_aux, content, why), + TP_ARGS(obj, ino, obj_size, disk_aux, content, why), =20 /* Note that obj may be NULL */ TP_STRUCT__entry( @@ -392,6 +398,7 @@ TRACE_EVENT(cachefiles_coherency, __field(enum cachefiles_coherency_trace, why) __field(enum cachefiles_content, content) __field(u64, ino) + __field(u64, obj_size) __field(u64, aux) __field(u64, disk_aux) ), @@ -401,14 +408,16 @@ TRACE_EVENT(cachefiles_coherency, __entry->why =3D why; __entry->content =3D content; __entry->ino =3D ino; + __entry->obj_size =3D obj_size, __entry->aux =3D be64_to_cpup((__be64 *)obj->cookie->inline_aux); __entry->disk_aux =3D disk_aux; ), =20 - TP_printk("o=3D%08x %s B=3D%llx c=3D%u aux=3D%llx dsk=3D%llx", + TP_printk("o=3D%08x %s B=3D%llx oz=3D%llx c=3D%u aux=3D%llx dsk=3D%ll= x", __entry->obj, __print_symbolic(__entry->why, cachefiles_coherency_traces), __entry->ino, + __entry->obj_size, __entry->content, 
__entry->aux, __entry->disk_aux)

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara, linux-mm@kvack.org
Subject: [PATCH 09/26] mm: Make readahead store folio count in
readahead_control Date: Thu, 26 Mar 2026 10:45:24 +0000 Message-ID: <20260326104544.509518-10-dhowells@redhat.com> In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com> References: <20260326104544.509518-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Make readahead store the folio count in readahead_control so that the filesystem can know in advance how many folios it needs to keep track of. Signed-off-by: David Howells cc: Paulo Alcantara cc: Matthew Wilcox cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Reviewed-by: Paulo Alcantara (Red Hat) --- include/linux/pagemap.h | 1 + mm/readahead.c | 4 ++++ 2 files changed, 5 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index ec442af3f886..3c3e34e5fe8a 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1361,6 +1361,7 @@ struct readahead_control { struct file_ra_state *ra; /* private: use the readahead_* accessors instead */ pgoff_t _index; + unsigned int _nr_folios; unsigned int _nr_pages; unsigned int _batch_count; bool dropbehind; diff --git a/mm/readahead.c b/mm/readahead.c index 7b05082c89ea..53134c9d9fe9 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -292,6 +292,7 @@ void page_cache_ra_unbounded(struct readahead_control *= ractl, if (i =3D=3D mark) folio_set_readahead(folio); ractl->_workingset |=3D folio_test_workingset(folio); + ractl->_nr_folios++; ractl->_nr_pages +=3D min_nrpages; i +=3D min_nrpages; } @@ -459,6 +460,7 @@ static inline int ra_alloc_folio(struct readahead_contr= ol *ractl, pgoff_t index, return err; } =20 + ractl->_nr_folios++; ractl->_nr_pages +=3D 1UL << order; ractl->_workingset |=3D folio_test_workingset(folio); return 0; @@ -802,6 +804,7 @@ void readahead_expand(struct
readahead_control *ractl, ractl->_workingset =3D true; psi_memstall_enter(&ractl->_pflags); } + ractl->_nr_folios++; ractl->_nr_pages +=3D min_nrpages; ractl->_index =3D folio->index; } @@ -831,6 +834,7 @@ void readahead_expand(struct readahead_control *ractl, ractl->_workingset =3D true; psi_memstall_enter(&ractl->_pflags); } + ractl->_nr_folios++; ractl->_nr_pages +=3D min_nrpages; if (ra) { ra->size +=3D min_nrpages;

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov,
Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara, linux-mm@kvack.org Subject: [PATCH 10/26] netfs: Bulk load the readahead-provided folios up front Date: Thu, 26 Mar 2026 10:45:25 +0000 Message-ID: <20260326104544.509518-11-dhowells@redhat.com> In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com> References: <20260326104544.509518-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Load all the folios provided by the VM for readahead up front into the folio queue. With the number of folios known from the VM, the folio queue can be fully allocated first, and the loading can then happen in one go inside the RCU read lock. The folio refs acquired from readahead are dropped in bulk once the first subrequest is dispatched, as dropping them is quite a slow operation. This simplifies the buffer handling later and isn't noticeably slower, as the xarray doesn't need to be modified and the folios are all already pre-locked.
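The two-phase scheme described above, allocating the whole queue chain first so that ENOMEM can be handled before any state is consumed, and then filling the slots in a single pass, can be sketched generically. This is a minimal illustration, not the kernel code: the names (`seg`, `alloc_chain`, `fill_chain`) and the per-segment capacity are hypothetical, standing in for `folio_queue` and `folioq_nr_slots()`.

```c
#include <assert.h>
#include <stdlib.h>

#define SLOTS_PER_SEG 3 /* hypothetical per-segment capacity */

struct seg {
	int items[SLOTS_PER_SEG];
	int nr;              /* number of occupied slots */
	struct seg *next;
};

/* Phase 1: allocate enough segments for n items up front, so an
 * allocation failure can be reported before anything is consumed. */
static struct seg *alloc_chain(int n)
{
	struct seg *head = NULL, *tail = NULL;

	while (n > 0) {
		struct seg *s = calloc(1, sizeof(*s));

		if (!s) {
			while (head) { /* unwind the partial chain */
				struct seg *t = head->next;
				free(head);
				head = t;
			}
			return NULL;
		}
		if (tail)
			tail->next = s;
		else
			head = s;
		tail = s;
		n -= SLOTS_PER_SEG;
	}
	return head;
}

/* Phase 2: fill the preallocated chain in one pass; no allocation can
 * fail here, which is what lets the kernel version run under the RCU
 * read lock.  Returns the number of items loaded. */
static int fill_chain(struct seg *head, const int *src, int n)
{
	struct seg *s = head;
	int loaded = 0, slot = 0;

	for (int i = 0; i < n; i++) {
		s->items[slot++] = src[i];
		loaded++;
		if (slot == SLOTS_PER_SEG) {
			s->nr = slot;
			s = s->next;
			slot = 0;
		}
	}
	if (s)
		s->nr = slot;
	return loaded;
}
```

The design point mirrored here is that knowing the element count ahead of time (what `_nr_folios` provides in patch 09) is what makes phase 1 possible at all.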
Signed-off-by: David Howells cc: Paulo Alcantara cc: Matthew Wilcox cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Reviewed-by: Paulo Alcantara (Red Hat) --- fs/netfs/buffered_read.c | 95 +++++++++++++++++++++------------- fs/netfs/rolling_buffer.c | 75 +++++++++++++++++++++++++++ include/linux/netfs.h | 1 + include/linux/rolling_buffer.h | 3 ++ include/trace/events/netfs.h | 1 + 5 files changed, 138 insertions(+), 37 deletions(-) diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c index aee59ccea257..abdc990faaa2 100644 --- a/fs/netfs/buffered_read.c +++ b/fs/netfs/buffered_read.c @@ -54,6 +54,40 @@ static void netfs_rreq_expand(struct netfs_io_request *r= req, } } =20 +/* + * Drop the folio refs acquired from the readahead API. + */ +static void netfs_bulk_drop_ra_refs(struct netfs_io_request *rreq) +{ + struct folio_batch fbatch; + struct folio *folio; + pgoff_t nr_pages =3D DIV_ROUND_UP(rreq->len, PAGE_SIZE); + pgoff_t first =3D rreq->start / PAGE_SIZE; + XA_STATE(xas, &rreq->mapping->i_pages, first); + + folio_batch_init(&fbatch); + + rcu_read_lock(); + + xas_for_each(&xas, folio, first + nr_pages - 1) { + if (xas_retry(&xas, folio)) + continue; + + if (!folio_batch_add(&fbatch, folio)) + folio_batch_release(&fbatch); + } + + rcu_read_unlock(); + folio_batch_release(&fbatch); + trace_netfs_rreq(rreq, netfs_rreq_trace_ra_put_ref); +} + +static void netfs_maybe_bulk_drop_ra_refs(struct netfs_io_request *rreq) +{ + if (test_and_clear_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags)) + netfs_bulk_drop_ra_refs(rreq); +} + /* * Begin an operation, and fetch the stored zero point value from the cook= ie if * available. @@ -74,12 +108,8 @@ static int netfs_begin_cache_read(struct netfs_io_reque= st *rreq, struct netfs_in * * Returns the limited size if successful and -ENOMEM if insufficient memo= ry * available. - * - * [!] 
NOTE: This must be run in the same thread as ->issue_read() was cal= led - * in as we access the readahead_control struct. */ -static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub= req, - struct readahead_control *ractl) +static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub= req) { struct netfs_io_request *rreq =3D subreq->rreq; size_t rsize =3D subreq->len; @@ -87,28 +117,6 @@ static ssize_t netfs_prepare_read_iterator(struct netfs= _io_subrequest *subreq, if (subreq->source =3D=3D NETFS_DOWNLOAD_FROM_SERVER) rsize =3D umin(rsize, rreq->io_streams[0].sreq_max_len); =20 - if (ractl) { - /* If we don't have sufficient folios in the rolling buffer, - * extract a folioq's worth from the readahead region at a time - * into the buffer. Note that this acquires a ref on each page - * that we will need to release later - but we don't want to do - * that until after we've started the I/O. - */ - struct folio_batch put_batch; - - folio_batch_init(&put_batch); - while (rreq->submitted < subreq->start + rsize) { - ssize_t added; - - added =3D rolling_buffer_load_from_ra(&rreq->buffer, ractl, - &put_batch); - if (added < 0) - return added; - rreq->submitted +=3D added; - } - folio_batch_release(&put_batch); - } - subreq->len =3D rsize; if (unlikely(rreq->io_streams[0].sreq_max_segs)) { size_t limit =3D netfs_limit_iter(&rreq->buffer.iter, 0, rsize, @@ -209,8 +217,7 @@ static void netfs_issue_read(struct netfs_io_request *r= req, * slicing up the region to be read according to available cache blocks and * network rsize. 
 */
-static void netfs_read_to_pagecache(struct netfs_io_request *rreq,
-				    struct readahead_control *ractl)
+static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 {
	struct fscache_occupancy _occ = {
		.query_from	= rreq->start,
@@ -345,7 +352,7 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq,
			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
		}

-		slice = netfs_prepare_read_iterator(subreq, ractl);
+		slice = netfs_prepare_read_iterator(subreq);
		if (slice < 0) {
			ret = slice;
			subreq->error = ret;
@@ -362,6 +369,7 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq,

		netfs_queue_read(rreq, subreq, size <= 0);
		netfs_issue_read(rreq, subreq);
+		netfs_maybe_bulk_drop_ra_refs(rreq);
		cond_resched();
	} while (size > 0);

@@ -395,6 +403,7 @@ void netfs_readahead(struct readahead_control *ractl)
	struct netfs_io_request *rreq;
	struct netfs_inode *ictx = netfs_inode(ractl->mapping->host);
	unsigned long long start = readahead_pos(ractl);
+	ssize_t added;
	size_t size = readahead_length(ractl);
	int ret;

@@ -415,11 +424,23 @@ void netfs_readahead(struct readahead_control *ractl)

	netfs_rreq_expand(rreq, ractl);

-	rreq->submitted = rreq->start;
-	if (rolling_buffer_init(&rreq->buffer, rreq->debug_id, ITER_DEST) < 0)
+	/* Load the folios to be read into a bvecq chain.  Note that this
+	 * acquires a ref on each folio that we will need to release later -
+	 * but we don't want to do that until after we've started the I/O.
+	 */
+	added = rolling_buffer_bulk_load_from_ra(&rreq->buffer, ractl, rreq->debug_id);
+	if (added < 0) {
+		ret = added;
		goto cleanup_free;
-	netfs_read_to_pagecache(rreq, ractl);
+	}
+	__set_bit(NETFS_RREQ_NEED_PUT_RA_REFS, &rreq->flags);
+
+	rreq->submitted = rreq->start + added;
+	rreq->cleaned_to = rreq->start;
+	rreq->front_folio_order = folio_order(rreq->buffer.tail->vec.folios[0]);

+	netfs_read_to_pagecache(rreq);
+	netfs_maybe_bulk_drop_ra_refs(rreq);
	return netfs_put_request(rreq, netfs_rreq_trace_put_return);

 cleanup_free:
@@ -511,7 +532,7 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
	iov_iter_bvec(&rreq->buffer.iter, ITER_DEST, bvec, i, rreq->len);
	rreq->submitted = rreq->start + flen;

-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);

	if (sink)
		folio_put(sink);
@@ -580,7 +601,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)
	if (ret < 0)
		goto discard;

-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
	ret = netfs_wait_for_read(rreq);
	netfs_put_request(rreq, netfs_rreq_trace_put_return);
	return ret < 0 ? ret : 0;
@@ -737,7 +758,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
	if (ret < 0)
		goto error_put;

-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
	ret = netfs_wait_for_read(rreq);
	if (ret < 0)
		goto error;
@@ -802,7 +823,7 @@ int netfs_prefetch_for_write(struct file *file, struct folio *folio,
	if (ret < 0)
		goto error_put;

-	netfs_read_to_pagecache(rreq, NULL);
+	netfs_read_to_pagecache(rreq);
	ret = netfs_wait_for_read(rreq);
	netfs_put_request(rreq, netfs_rreq_trace_put_return);
	return ret < 0 ? ret : 0;
diff --git a/fs/netfs/rolling_buffer.c b/fs/netfs/rolling_buffer.c
index a17fbf9853a4..292011c1cacb 100644
--- a/fs/netfs/rolling_buffer.c
+++ b/fs/netfs/rolling_buffer.c
@@ -149,6 +149,81 @@ ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
	return size;
 }

+/*
+ * Decant the entire list of folios to read into a rolling buffer.
+ */
+ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
+					 struct readahead_control *ractl,
+					 unsigned int rreq_id)
+{
+	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
+	struct folio_queue *fq;
+	struct folio *folio;
+	ssize_t loaded = 0;
+	int nr, slot = 0, npages = 0;
+
+	/* First allocate all the folioqs we're going to need to avoid having
+	 * to deal with ENOMEM later.
+	 */
+	nr = ractl->_nr_folios;
+	do {
+		fq = netfs_folioq_alloc(rreq_id, GFP_KERNEL,
+					netfs_trace_folioq_make_space);
+		if (!fq) {
+			rolling_buffer_clear(roll);
+			return -ENOMEM;
+		}
+		fq->prev = roll->head;
+		if (!roll->tail)
+			roll->tail = fq;
+		else
+			roll->head->next = fq;
+		roll->head = fq;
+
+		nr -= folioq_nr_slots(fq);
+	} while (nr > 0);
+
+	rcu_read_lock();
+
+	fq = roll->tail;
+	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
+		unsigned int order;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+		order = folio_order(folio);
+		fq->orders[slot] = order;
+		fq->vec.folios[slot] = folio;
+		loaded += PAGE_SIZE << order;
+		npages += 1 << order;
+		trace_netfs_folio(folio, netfs_folio_trace_read);
+
+		slot++;
+		if (slot >= folioq_nr_slots(fq)) {
+			fq->vec.nr = slot;
+			fq = fq->next;
+			if (!fq) {
+				WARN_ON_ONCE(npages < readahead_count(ractl));
+				break;
+			}
+			slot = 0;
+		}
+	}
+
+	rcu_read_unlock();
+
+	if (fq)
+		fq->vec.nr = slot;
+
+	WRITE_ONCE(roll->iter.count, loaded);
+	iov_iter_folio_queue(&roll->iter, ITER_DEST, roll->tail, 0, 0, loaded);
+	ractl->_index += npages;
+	ractl->_nr_pages -= npages;
+	return loaded;
+}
+
 /*
  * Append a folio to the rolling buffer.
  */
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 77238bc4a712..cc56b6512769 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -280,6 +280,7 @@ struct netfs_io_request {
 #define NETFS_RREQ_FOLIO_COPY_TO_CACHE	10	/* Copy current folio to cache from read */
 #define NETFS_RREQ_UPLOAD_TO_SERVER	11	/* Need to write to the server */
 #define NETFS_RREQ_USE_IO_ITER		12	/* Use ->io_iter rather than ->i_pages */
+#define NETFS_RREQ_NEED_PUT_RA_REFS	13	/* Need to put the folio refs RA gave us */
 #define NETFS_RREQ_USE_PGPRIV2		31	/* [DEPRECATED] Use PG_private_2 to mark
					 * write to cache on read */
	const struct netfs_request_ops *netfs_ops;
diff --git a/include/linux/rolling_buffer.h b/include/linux/rolling_buffer.h
index ac15b1ffdd83..b35ef43f325f 100644
--- a/include/linux/rolling_buffer.h
+++ b/include/linux/rolling_buffer.h
@@ -48,6 +48,9 @@ int rolling_buffer_make_space(struct rolling_buffer *roll);
 ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
				    struct readahead_control *ractl,
				    struct folio_batch *put_batch);
+ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
+					 struct readahead_control *ractl,
+					 unsigned int rreq_id);
 ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *folio,
			      unsigned int flags);
 struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *roll);
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index cbe28211106c..b8236f9e940e 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -59,6 +59,7 @@
	EM(netfs_rreq_trace_free,		"FREE   ")	\
	EM(netfs_rreq_trace_intr,		"INTR   ")	\
	EM(netfs_rreq_trace_ki_complete,	"KI-CMPL")	\
+	EM(netfs_rreq_trace_ra_put_ref,		"RA-PUT ")	\
	EM(netfs_rreq_trace_recollect,		"RECLLCT")	\
	EM(netfs_rreq_trace_redirty,		"REDIRTY")	\
	EM(netfs_rreq_trace_resubmit,		"RESUBMT")	\

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara, linux-block@vger.kernel.org
Subject: [PATCH 11/26] Add a function to kmap one page of a multipage bio_vec
Date: Thu, 26 Mar 2026 10:45:26 +0000
Message-ID: <20260326104544.509518-12-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Jens Axboe
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Acked-by: Paulo Alcantara (Red Hat)
---
 include/linux/bvec.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 06fb60471aaf..9788bfd52818 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -308,4 +308,25 @@ static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
 }

+/**
+ * kmap_local_bvec - Map part of a bvec into the kernel virtual address space
+ * @bvec: bvec to map
+ * @offset: Offset into bvec
+ *
+ * Map the page containing the byte at @offset into the kernel virtual address
+ * space.  The caller is responsible for making sure this doesn't overrun.
+ *
+ * Call kunmap_local on the returned address to unmap.
+ */
+static inline void *kmap_local_bvec(struct bio_vec *bvec, size_t offset)
+{
+#ifdef CONFIG_HIGHMEM
+	offset += bvec->bv_offset;
+
+	return kmap_local_page(bvec->bv_page + offset / PAGE_SIZE) + offset % PAGE_SIZE;
+#else
+	return bvec_virt(bvec) + offset;
+#endif
+}
+
 #endif /* __LINUX_BVEC_H */

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara, linux-block@vger.kernel.org
Subject: [PATCH 12/26] iov_iter: Add a segmented queue of bio_vec[]
Date: Thu, 26 Mar 2026 10:45:27 +0000
Message-ID: <20260326104544.509518-13-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning, and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue
to be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated together with it,
but this is not required.  A flag is provided to indicate this case, as
comparing ->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Jens Axboe
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Paulo Alcantara (Red Hat)
---
 include/linux/bvecq.h      |  46 ++++++
 include/linux/iov_iter.h   |  63 +++++++-
 include/linux/uio.h        |  11 ++
 lib/iov_iter.c             | 288 ++++++++++++++++++++++++++++++++++++-
 lib/scatterlist.c          |  66 +++++++++
 lib/tests/kunit_iov_iter.c | 180 +++++++++++++++++++++++
 6 files changed, 649 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/bvecq.h

diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
new file mode 100644
index 000000000000..462125af1cc7
--- /dev/null
+++ b/include/linux/bvecq.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Implementation of a segmented queue of bio_vec[].
+ *
+ * Copyright (C) 2026 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_BVECQ_H
+#define _LINUX_BVECQ_H
+
+#include <linux/bvec.h>
+
+/*
+ * Segmented bio_vec queue.
+ *
+ * These can be linked together to form messages of indefinite length and
+ * iterated over with an ITER_BVECQ iterator.  The list is non-circular; next
+ * and prev are NULL at the ends.
+ *
+ * The bv pointer points to the segment array; this may be __bv if allocated
+ * together.  The caller is responsible for determining whether or not this is
+ * the case as the array pointed to by bv may follow on directly from the
+ * bvecq by accident of allocation (ie. ->bv == ->__bv is *not* sufficient to
+ * determine this).
+ *
+ * The file position and discontiguity flag allow non-contiguous data sets to
+ * be chained together, but still teased apart without the need to convert the
+ * info in the bio_vec back into a folio pointer.
+ */
+struct bvecq {
+	struct bvecq		*next;		/* Next bvec in the list or NULL */
+	struct bvecq		*prev;		/* Prev bvec in the list or NULL */
+	unsigned long long	fpos;		/* File position */
+	refcount_t		ref;
+	u32			priv;		/* Private data */
+	u16			nr_segs;	/* Number of elements in bv[] used */
+	u16			max_segs;	/* Number of elements allocated in bv[] */
+	bool			inline_bv:1;	/* T if __bv[] is being used */
+	bool			free:1;		/* T if the pages need freeing */
+	bool			unpin:1;	/* T if the pages need unpinning, not freeing */
+	bool			discontig:1;	/* T if not contiguous with previous bvecq */
+	struct bio_vec		*bv;		/* Pointer to array of page fragments */
+	struct bio_vec		__bv[];		/* Default array (if ->inline_bv) */
+};
+
+#endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index f9a17fbbd398..999607ece481 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -9,7 +9,7 @@
 #define _LINUX_IOV_ITER_H

 #include <linux/uio.h>
-#include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>

 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
@@ -141,6 +141,59 @@ size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
	return progress;
 }

+/*
+ * Handle ITER_BVECQ.
+ */
+static __always_inline
+size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		     iov_step_f step)
+{
+	const struct bvecq *bq = iter->bvecq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	if (slot == bq->nr_segs) {
+		/* The iterator may have been extended. */
+		bq = bq->next;
+		slot = 0;
+	}
+
+	do {
+		const struct bio_vec *bvec = &bq->bv[slot];
+		struct page *page = bvec->bv_page + (bvec->bv_offset + skip) / PAGE_SIZE;
+		size_t part, remain, consumed;
+		size_t poff = (bvec->bv_offset + skip) % PAGE_SIZE;
+		void *base;
+
+		part = umin(umin(bvec->bv_len - skip, PAGE_SIZE - poff), len);
+		base = kmap_local_page(page) + poff;
+		remain = step(base, progress, part, priv, priv2);
+		kunmap_local(base);
+		consumed = part - remain;
+		len -= consumed;
+		progress += consumed;
+		skip += consumed;
+		if (skip >= bvec->bv_len) {
+			skip = 0;
+			slot++;
+			if (slot >= bq->nr_segs) {
+				if (!bq->next)
+					break;
+				bq = bq->next;
+				slot = 0;
+			}
+		}
+		if (remain)
+			break;
+	} while (len);
+
+	iter->bvecq_slot = slot;
+	iter->bvecq = bq;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
 /*
  * Handle ITER_FOLIOQ.
  */
@@ -306,6 +359,8 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
		return iterate_bvec(iter, len, priv, priv2, step);
	if (iov_iter_is_kvec(iter))
		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
	if (iov_iter_is_folioq(iter))
		return iterate_folioq(iter, len, priv, priv2, step);
	if (iov_iter_is_xarray(iter))
@@ -342,8 +397,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
 * buffer is presented in segments, which for kernel iteration are broken up by
 * physical pages and mapped, with the mapped address being presented.
 *
- * [!] Note This will only handle BVEC, KVEC, FOLIOQ, XARRAY and DISCARD-type
- * iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
+ * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
 *
 * A step functions, @step, must be provided, one for handling mapped kernel
 * addresses and the other is given user addresses which have the potential to
@@ -370,6 +425,8 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len, void *priv,
		return iterate_bvec(iter, len, priv, priv2, step);
	if (iov_iter_is_kvec(iter))
		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
	if (iov_iter_is_folioq(iter))
		return iterate_folioq(iter, len, priv, priv2, step);
	if (iov_iter_is_xarray(iter))
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e3..aa50d348dfcc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,7 @@ enum iter_type {
	ITER_BVEC,
	ITER_KVEC,
	ITER_FOLIOQ,
+	ITER_BVECQ,
	ITER_XARRAY,
	ITER_DISCARD,
 };
@@ -69,6 +70,7 @@ struct iov_iter {
		const struct kvec *kvec;
		const struct bio_vec *bvec;
		const struct folio_queue *folioq;
+		const struct bvecq *bvecq;
		struct xarray *xarray;
		void __user *ubuf;
	};
@@ -78,6 +80,7 @@ struct iov_iter {
	union {
		unsigned long nr_segs;
		u8 folioq_slot;
+		u16 bvecq_slot;
		loff_t xarray_start;
	};
 };
@@ -150,6 +153,11 @@ static inline bool iov_iter_is_folioq(const struct iov_iter *i)
	return iov_iter_type(i) == ITER_FOLIOQ;
 }

+static inline bool iov_iter_is_bvecq(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_BVECQ;
+}
+
 static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 {
	return iov_iter_type(i) == ITER_XARRAY;
@@ -298,6 +306,9 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
			  const struct folio_queue *folioq,
			  unsigned int first_slot, unsigned int offset, size_t count);
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq,
+			 unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 0a63c7fba313..df8d037894b1 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -571,6 +571,39 @@ static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
	i->folioq = folioq;
 }

+static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	if (!i->count)
+		return;
+	i->count -= by;
+
+	if (slot >= bq->nr_segs) {
+		bq = bq->next;
+		slot = 0;
+	}
+
+	by += i->iov_offset;	/* From beginning of current segment. */
+	do {
+		size_t len = bq->bv[slot].bv_len;
+
+		if (likely(by < len))
+			break;
+		by -= len;
+		slot++;
+		if (slot >= bq->nr_segs && bq->next) {
+			bq = bq->next;
+			slot = 0;
+		}
+	} while (by);
+
+	i->iov_offset = by;
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 void iov_iter_advance(struct iov_iter *i, size_t size)
 {
	if (unlikely(i->count < size))
@@ -585,6 +618,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
		iov_iter_bvec_advance(i, size);
	} else if (iov_iter_is_folioq(i)) {
		iov_iter_folioq_advance(i, size);
+	} else if (iov_iter_is_bvecq(i)) {
+		iov_iter_bvecq_advance(i, size);
	} else if (iov_iter_is_discard(i)) {
		i->count -= size;
	}
@@ -617,6 +652,32 @@ static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
	i->folioq = folioq;
 }

+static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	for (;;) {
+		size_t len;
+
+		if (slot == 0) {
+			bq = bq->prev;
+			slot = bq->nr_segs;
+		}
+		slot--;
+
+		len = bq->bv[slot].bv_len;
+		if (unroll <= len) {
+			i->iov_offset = len - unroll;
+			break;
+		}
+		unroll -= len;
+	}
+
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 void iov_iter_revert(struct iov_iter *i, size_t unroll)
 {
	if (!unroll)
@@ -651,6 +712,9 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
	} else if (iov_iter_is_folioq(i)) {
		i->iov_offset = 0;
		iov_iter_folioq_revert(i, unroll);
+	} else if (iov_iter_is_bvecq(i)) {
+		i->iov_offset = 0;
+		iov_iter_bvecq_revert(i, unroll);
	} else { /* same logics for iovec and kvec */
		const struct iovec *iov = iter_iov(i);
		while (1) {
@@ -678,9 +742,12 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
		if (iov_iter_is_bvec(i))
			return min(i->count, i->bvec->bv_len - i->iov_offset);
	}
+	if (!i->count)
+		return 0;
	if (unlikely(iov_iter_is_folioq(i)))
-		return !i->count ? 0 :
-			umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+	if (unlikely(iov_iter_is_bvecq(i)))
+		return min(i->count, i->bvecq->bv[i->bvecq_slot].bv_len - i->iov_offset);
	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -747,6 +814,35 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_folio_queue);

+/**
+ * iov_iter_bvec_queue - Initialise an I/O iterator to use a segmented bvec queue
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @bvecq: The starting point in the bvec queue.
+ * @first_slot: The first slot in the bvec queue to use
+ * @offset: The offset into the bvec in the first slot to start at
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the buffers attached to an
+ * inode or to inject data into those buffers.  The pages *must* be prevented
+ * from evaporation, either by taking a ref on them or locking them by the
+ * caller.
+ */
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq, unsigned int first_slot,
+			 unsigned int offset, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter) {
+		.iter_type = ITER_BVECQ,
+		.data_source = direction,
+		.bvecq = bvecq,
+		.bvecq_slot = first_slot,
+		.count = count,
+		.iov_offset = offset,
+	};
+}
+EXPORT_SYMBOL(iov_iter_bvec_queue);
+
 /**
 * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
 * @i: The iterator to initialise.
@@ -839,6 +935,37 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
	return res;
 }

+static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
+{
+	const struct bvecq *bq;
+	unsigned long res = 0;
+	unsigned int slot = iter->bvecq_slot;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	if (!size)
+		return res;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_segs; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			res |= bvec->bv_offset + skip;
+			res |= part;
+
+			size -= part;
+			if (size == 0)
+				return res;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+
+	return res;
+}
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
	if (likely(iter_is_ubuf(i))) {
@@ -858,6 +985,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
	/* With both xarray and folioq types, we're dealing with whole folios. */
	if (iov_iter_is_folioq(i))
		return i->iov_offset | i->count;
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_alignment_bvecq(i);
	if (iov_iter_is_xarray(i))
		return (i->xarray_start + i->iov_offset) | i->count;

@@ -1124,6 +1253,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
	if (iov_iter_is_xarray(i))
		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
+	WARN_ON_ONCE(iov_iter_is_bvecq(i));
	return -EFAULT;
 }

@@ -1192,6 +1322,36 @@ static int bvec_npages(const struct iov_iter *i, int maxpages)
	return npages;
 }

+static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpages)
+{
+	const struct bvecq *bq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t npages = 0;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_segs; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t offs = (bvec->bv_offset + skip) % PAGE_SIZE;
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			npages += DIV_ROUND_UP(offs + part, PAGE_SIZE);
+			if (npages >= maxpages)
+				goto out;
+
+			size -= part;
+			if (!size)
+				goto out;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+out:
+	return umin(npages, maxpages);
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
	if (unlikely(!i->count))
@@ -1211,6 +1371,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
		return min(npages, maxpages);
	}
+	if (iov_iter_is_bvecq(i))
+		return iov_npages_bvecq(i, maxpages);
	if (iov_iter_is_xarray(i)) {
		unsigned offset = (i->xarray_start + i->iov_offset) % PAGE_SIZE;
		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1554,6 +1716,124 @@ static ssize_t iov_iter_extract_folioq_pages(struct iov_iter *i,
	return extracted;
 }

+/*
+ * Extract a list of virtually contiguous pages from an ITER_BVECQ iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
+					    struct page ***pages, size_t maxsize,
+					    unsigned int maxpages,
+					    iov_iter_extraction_t extraction_flags,
+					    size_t *offset0)
+{
+	const struct bvecq *bvecq = iter->bvecq;
+	struct page **p;
+	unsigned int seg = iter->bvecq_slot, count = 0, nr = 0;
+	size_t extracted = 0, offset = iter->iov_offset;
+
+	if (seg >= bvecq->nr_segs) {
+		bvecq = bvecq->next;
+		if (WARN_ON_ONCE(!bvecq))
+			return 0;
+		seg = 0;
+	}
+
+	/* First, we count the run of virtually contiguous pages. */
+	do {
+		const struct bio_vec *bv = &bvecq->bv[seg];
+		size_t boff = bv->bv_offset, blen = bv->bv_len;
+
+		if (!bv->bv_page)
+			blen = 0;
+		if (extracted > 0 && boff % PAGE_SIZE)
+			break;
+
+		if (offset < blen) {
+			size_t part = umin(maxsize - extracted, blen - offset);
+			size_t poff = (boff + offset) % PAGE_SIZE;
+			size_t pcount = DIV_ROUND_UP(poff + blen, PAGE_SIZE);
+
+			offset += part;
+			extracted += part;
+			count += pcount;
+			if ((boff + blen) % PAGE_SIZE)
+				break;
+		}
+
+		if (offset >= blen) {
+			offset = 0;
+			seg++;
+			if (seg >= bvecq->nr_segs) {
+				if (!bvecq->next) {
+					WARN_ON_ONCE(extracted < iter->count);
+					break;
+				}
+				bvecq = bvecq->next;
+				seg = 0;
+			}
+		}
+	} while (count < maxpages && extracted < maxsize);
+
+	maxpages = umin(maxpages, count);
+
+	if (!*pages) {
+		*pages = kvmalloc_array(maxpages, sizeof(struct page *), GFP_KERNEL);
+		if (!*pages)
+			return -ENOMEM;
+	}
+
+	p = *pages;
+
+	/* Now transcribe the page pointers. */
+	extracted = 0;
+	bvecq = iter->bvecq;
+	offset = iter->iov_offset;
+	seg = iter->bvecq_slot;
+
+	do {
+		const struct bio_vec *bv = &bvecq->bv[seg];
+		size_t boff = bv->bv_offset, blen = bv->bv_len;
+
+		if (!bv->bv_page)
+			blen = 0;
+
+		if (offset < blen) {
+			size_t part = umin(maxsize - extracted, blen - offset);
+			size_t poff = (boff + offset) % PAGE_SIZE;
+			size_t pix = (boff + offset) / PAGE_SIZE;
+
+			if (poff + part > PAGE_SIZE)
+				part = PAGE_SIZE - poff;
+
+			if (!extracted)
+				*offset0 = poff;
+
+			p[nr++] = bv->bv_page + pix;
+			offset += part;
+			extracted += part;
+		}
+
+		if (offset >= blen) {
+			offset = 0;
+			seg++;
+			if (seg >= bvecq->nr_segs) {
+				if (!bvecq->next) {
+					WARN_ON_ONCE(extracted < iter->count);
+					break;
+				}
+				bvecq = bvecq->next;
+				seg = 0;
+			}
+		}
+	} while (nr < maxpages && extracted < maxsize);
+
+	iter->bvecq = bvecq;
+	iter->bvecq_slot = seg;
+	iter->iov_offset = offset;
+	iter->count -= extracted;
+	return extracted;
+}
+
 /*
 * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
 * get references on the pages, nor does it get a pin on them.
@@ -1838,6 +2118,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
		return iov_iter_extract_folioq_pages(i, pages, maxsize,
						     maxpages, extraction_flags,
						     offset0);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
+						    maxpages, extraction_flags,
+						    offset0);
	if (iov_iter_is_xarray(i))
		return iov_iter_extract_xarray_pages(i, pages, maxsize,
						     maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d773720d11bf..03e3883a1a2d 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -10,6 +10,7 @@
 #include <linux/scatterlist.h>
 #include <linux/highmem.h>
 #include <linux/kmemleak.h>
+#include <linux/bvecq.h>
 #include <linux/bvec.h>
 #include <linux/uio.h>

@@ -1328,6 +1329,68 @@ static ssize_t extract_folioq_to_sg(struct iov_iter *iter,
	return ret;
 }

+/*
+ * Extract up to sg_max folios from a BVECQ-type iterator and add them to
+ * the scatterlist.
The pages are not pinned. + */ +static ssize_t extract_bvecq_to_sg(struct iov_iter *iter, + ssize_t maxsize, + struct sg_table *sgtable, + unsigned int sg_max, + iov_iter_extraction_t extraction_flags) +{ + const struct bvecq *bvecq =3D iter->bvecq; + struct scatterlist *sg =3D sgtable->sgl + sgtable->nents; + unsigned int seg =3D iter->bvecq_slot; + ssize_t ret =3D 0; + size_t offset =3D iter->iov_offset; + + if (seg >=3D bvecq->nr_segs) { + bvecq =3D bvecq->next; + if (WARN_ON_ONCE(!bvecq)) + return 0; + seg =3D 0; + } + + do { + const struct bio_vec *bv =3D &bvecq->bv[seg]; + size_t blen =3D bv->bv_len; + + if (!bv->bv_page) + blen =3D 0; + + if (offset < blen) { + size_t part =3D umin(maxsize - ret, blen - offset); + + sg_set_page(sg, bv->bv_page, part, bv->bv_offset + offset); + sgtable->nents++; + sg++; + sg_max--; + offset +=3D part; + ret +=3D part; + } + + if (offset >=3D blen) { + offset =3D 0; + seg++; + if (seg >=3D bvecq->nr_segs) { + if (!bvecq->next) { + WARN_ON_ONCE(ret < iter->count); + break; + } + bvecq =3D bvecq->next; + seg =3D 0; + } + } + } while (sg_max > 0 && ret < maxsize); + + iter->bvecq =3D bvecq; + iter->bvecq_slot =3D seg; + iter->iov_offset =3D offset; + iter->count -=3D ret; + return ret; +} + /* * Extract up to sg_max folios from an XARRAY-type iterator and add them to * the scatterlist. The pages are not pinned. 
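[Editorial note: the extractors above walk a chained segment list in two passes, first counting the page pointers a virtually contiguous run will need, then transcribing them. The counting logic can be illustrated with a small userspace model. All `model_*` names are hypothetical stand-ins, not the kernel API, and the page size is modelled as a constant; this is a sketch of the run-termination conditions, not the actual implementation.]

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical userspace stand-in for one bio_vec-style fragment. */
struct model_seg {
	size_t offset;	/* byte offset of the fragment within its pages */
	size_t len;	/* fragment length in bytes */
};

/* How many distinct pages does a fragment of @len bytes starting at
 * page offset @poff touch?  (A DIV_ROUND_UP(poff + len, PAGE_SIZE)
 * style calculation.) */
size_t model_pages_touched(size_t poff, size_t len)
{
	if (!len)
		return 0;
	return (poff % MODEL_PAGE_SIZE + len + MODEL_PAGE_SIZE - 1) /
		MODEL_PAGE_SIZE;
}

/* Count pages in one virtually contiguous run: the run ends when a
 * later fragment does not start page-aligned, or when a fragment stops
 * short of a page boundary - analogous to the break conditions in the
 * counting pass above. */
size_t model_count_run(const struct model_seg *segs, size_t nsegs)
{
	size_t count = 0, extracted = 0;

	for (size_t i = 0; i < nsegs; i++) {
		if (extracted > 0 && segs[i].offset % MODEL_PAGE_SIZE)
			break;
		count += model_pages_touched(segs[i].offset, segs[i].len);
		extracted += segs[i].len;
		if ((segs[i].offset + segs[i].len) % MODEL_PAGE_SIZE)
			break;
	}
	return count;
}
```

Two page-aligned 4KiB fragments count as a two-page run, while a fragment that ends mid-page terminates the run, which is why the kernel code breaks out of its counting loop in the same situations.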
@@ -1426,6 +1489,9 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, siz= e_t maxsize, case ITER_FOLIOQ: return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max, extraction_flags); + case ITER_BVECQ: + return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max, + extraction_flags); case ITER_XARRAY: return extract_xarray_to_sg(iter, maxsize, sgtable, sg_max, extraction_flags); diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c index bb847e5010eb..5bc941f64343 100644 --- a/lib/tests/kunit_iov_iter.c +++ b/lib/tests/kunit_iov_iter.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include =20 @@ -536,6 +537,183 @@ static void __init iov_kunit_copy_from_folioq(struct = kunit *test) KUNIT_SUCCEED(test); } =20 +static void iov_kunit_destroy_bvecq(void *data) +{ + struct bvecq *bq, *next; + + for (bq =3D data; bq; bq =3D next) { + next =3D bq->next; + for (int i =3D 0; i < bq->nr_segs; i++) + if (bq->bv[i].bv_page) + put_page(bq->bv[i].bv_page); + kfree(bq); + } +} + +static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned in= t max_segs) +{ + struct bvecq *bq; + + bq =3D kzalloc(struct_size(bq, __bv, max_segs), GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq); + bq->max_segs =3D max_segs; + return bq; +} + +static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned i= nt max_segs) +{ + struct bvecq *bq; + + bq =3D iov_kunit_alloc_bvecq(test, max_segs); + kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq); + return bq; +} + +static void __init iov_kunit_load_bvecq(struct kunit *test, + struct iov_iter *iter, int dir, + struct bvecq *bq_head, + struct page **pages, size_t npages) +{ + struct bvecq *bq =3D bq_head; + size_t size =3D 0; + + for (int i =3D 0; i < npages; i++) { + if (bq->nr_segs >=3D bq->max_segs) { + bq->next =3D iov_kunit_alloc_bvecq(test, 8); + bq->next->prev =3D bq; + bq =3D bq->next; + } + bvec_set_page(&bq->bv[bq->nr_segs], pages[i], PAGE_SIZE, 0); + bq->nr_segs++; + 
size +=3D PAGE_SIZE; + } + iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size); +} + +/* + * Test copying to a ITER_BVECQ-type iterator. + */ +static void __init iov_kunit_copy_to_bvecq(struct kunit *test) +{ + const struct kvec_test_range *pr; + struct iov_iter iter; + struct bvecq *bq; + struct page **spages, **bpages; + u8 *scratch, *buffer; + size_t bufsize, npages, size, copied; + int i, patt; + + bufsize =3D 0x100000; + npages =3D bufsize / PAGE_SIZE; + + bq =3D iov_kunit_create_bvecq(test, 8); + + scratch =3D iov_kunit_create_buffer(test, &spages, npages); + for (i =3D 0; i < bufsize; i++) + scratch[i] =3D pattern(i); + + buffer =3D iov_kunit_create_buffer(test, &bpages, npages); + memset(buffer, 0, bufsize); + + iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages); + + i =3D 0; + for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) { + size =3D pr->to - pr->from; + KUNIT_ASSERT_LE(test, pr->to, bufsize); + + iov_iter_bvec_queue(&iter, READ, bq, 0, 0, pr->to); + iov_iter_advance(&iter, pr->from); + copied =3D copy_to_iter(scratch + i, size, &iter); + + KUNIT_EXPECT_EQ(test, copied, size); + KUNIT_EXPECT_EQ(test, iter.count, 0); + i +=3D size; + if (test->status =3D=3D KUNIT_FAILURE) + goto stop; + } + + /* Build the expected image in the scratch buffer. */ + patt =3D 0; + memset(scratch, 0, bufsize); + for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) + for (i =3D pr->from; i < pr->to; i++) + scratch[i] =3D pattern(patt++); + + /* Compare the images */ + for (i =3D 0; i < bufsize; i++) { + KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=3D%x", i); + if (buffer[i] !=3D scratch[i]) + return; + } + +stop: + KUNIT_SUCCEED(test); +} + +/* + * Test copying from a ITER_BVECQ-type iterator. 
+ */ +static void __init iov_kunit_copy_from_bvecq(struct kunit *test) +{ + const struct kvec_test_range *pr; + struct iov_iter iter; + struct bvecq *bq; + struct page **spages, **bpages; + u8 *scratch, *buffer; + size_t bufsize, npages, size, copied; + int i, j; + + bufsize =3D 0x100000; + npages =3D bufsize / PAGE_SIZE; + + bq =3D iov_kunit_create_bvecq(test, 8); + + buffer =3D iov_kunit_create_buffer(test, &bpages, npages); + for (i =3D 0; i < bufsize; i++) + buffer[i] =3D pattern(i); + + scratch =3D iov_kunit_create_buffer(test, &spages, npages); + memset(scratch, 0, bufsize); + + iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages); + + i =3D 0; + for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) { + size =3D pr->to - pr->from; + KUNIT_ASSERT_LE(test, pr->to, bufsize); + + iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to); + iov_iter_advance(&iter, pr->from); + copied =3D copy_from_iter(scratch + i, size, &iter); + + KUNIT_EXPECT_EQ(test, copied, size); + KUNIT_EXPECT_EQ(test, iter.count, 0); + i +=3D size; + } + + /* Build the expected image in the main buffer. 
*/
+	i = 0;
+	memset(buffer, 0, bufsize);
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		for (j = pr->from; j < pr->to; j++) {
+			buffer[i++] = pattern(j);
+			if (i >= bufsize)
+				goto stop;
+		}
+	}
+stop:
+
+	/* Compare the images */
+	for (i = 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=%x", i);
+		if (scratch[i] != buffer[i])
+			return;
+	}
+
+	KUNIT_SUCCEED(test);
+}
+
 static void iov_kunit_destroy_xarray(void *data)
 {
 	struct xarray *xarray = data;
@@ -1016,6 +1194,8 @@ static struct kunit_case __refdata iov_kunit_cases[] = {
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
 	KUNIT_CASE(iov_kunit_copy_to_folioq),
 	KUNIT_CASE(iov_kunit_copy_from_folioq),
+	KUNIT_CASE(iov_kunit_copy_to_bvecq),
+	KUNIT_CASE(iov_kunit_copy_from_bvecq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust,
	netfs@lists.linux.dev, linux-afs@lists.infradead.org,
	linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
	linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 13/26] netfs: Add some tools for managing bvecq chains
Date: Thu, 26 Mar 2026 10:45:28 +0000
Message-ID: <20260326104544.509518-14-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Provide a selection of tools for managing bvec queue chains.  This
includes:

 (1) Allocation, prepopulation, expansion, shortening and refcounting of
     bvecqs and bvecq chains.  This can be used to do things like creating
     an encryption buffer in cifs or a directory content buffer in afs.
     The memory segments will be appropriately disposed of according to the
     flags on the bvecq.

 (2) Management of a bvecq chain as a rolling buffer and the management of
     positions within it.

 (3) Loading folios, slicing chains and clearing content.
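[Editorial note: the chain allocation and refcounting described in (1) can be sketched as a minimal userspace model. The `model_*` names and the fixed eight slots per node are hypothetical illustration, not the bvecq API: the allocator keeps appending nodes until the cumulative slot count covers the request, and the put walks the chain freeing nodes as their refcounts drop to zero.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of one node in a bvecq-style chain. */
struct model_node {
	struct model_node *next;
	unsigned int refs;
	unsigned int max_slots;
};

enum { MODEL_SLOTS_PER_NODE = 8 };

/* Mirror of chain allocation: add nodes until the cumulative slot
 * count is at least nr_slots. */
struct model_node *model_alloc_chain(unsigned int nr_slots)
{
	struct model_node *head = NULL, *tail = NULL;

	while (nr_slots > 0) {
		struct model_node *n = calloc(1, sizeof(*n));

		if (!n)
			break;	/* Caller would put the partial chain. */
		n->refs = 1;
		n->max_slots = MODEL_SLOTS_PER_NODE;
		if (tail)
			tail->next = n;
		else
			head = n;
		tail = n;
		nr_slots -= (nr_slots > n->max_slots) ? n->max_slots : nr_slots;
	}
	return head;
}

unsigned int model_chain_len(const struct model_node *n)
{
	unsigned int len = 0;

	for (; n; n = n->next)
		len++;
	return len;
}

/* Mirror of the put operation: free each node whose refcount reaches
 * zero, stopping at the first node that is still referenced. */
void model_put_chain(struct model_node *n)
{
	while (n && --n->refs == 0) {
		struct model_node *next = n->next;

		free(n);
		n = next;
	}
}
```

Asking for 20 slots with 8 slots per node yields a three-node chain; dropping the head's ref then tears the whole chain down, since each node holds a single ref in this model.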
Signed-off-by: David Howells cc: Paulo Alcantara cc: Matthew Wilcox cc: Christoph Hellwig cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org --- fs/netfs/Makefile | 1 + fs/netfs/bvecq.c | 706 +++++++++++++++++++++++++++++++++++ fs/netfs/internal.h | 1 + fs/netfs/stats.c | 4 +- include/linux/bvecq.h | 165 +++++++- include/linux/iov_iter.h | 4 +- include/linux/netfs.h | 1 + include/trace/events/netfs.h | 24 ++ lib/iov_iter.c | 16 +- lib/scatterlist.c | 4 +- lib/tests/kunit_iov_iter.c | 18 +- 11 files changed, 919 insertions(+), 25 deletions(-) create mode 100644 fs/netfs/bvecq.c diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile index b43188d64bd8..e1f12ecb5abf 100644 --- a/fs/netfs/Makefile +++ b/fs/netfs/Makefile @@ -3,6 +3,7 @@ netfs-y :=3D \ buffered_read.o \ buffered_write.o \ + bvecq.o \ direct_read.o \ direct_write.o \ iterator.o \ diff --git a/fs/netfs/bvecq.c b/fs/netfs/bvecq.c new file mode 100644 index 000000000000..c71646ea5243 --- /dev/null +++ b/fs/netfs/bvecq.c @@ -0,0 +1,706 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Buffering helpers for bvec queues + * + * Copyright (C) 2025 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + */ + +#include "internal.h" + +void bvecq_dump(const struct bvecq *bq) +{ + int b =3D 0; + + for (; bq; bq =3D bq->next, b++) { + int skipz =3D 0; + + pr_notice("BQ[%u] %u/%u fp=3D%llx\n", b, bq->nr_slots, bq->max_slots, bq= ->fpos); + for (int s =3D 0; s < bq->nr_slots; s++) { + const struct bio_vec *bv =3D &bq->bv[s]; + + if (!bv->bv_page && !bv->bv_len && skipz < 2) { + skipz =3D 1; + continue; + } + if (skipz =3D=3D 1) + pr_notice("BQ[%u:00-%02u] ...\n", b, s - 1); + skipz =3D 2; + pr_notice("BQ[%u:%02u] %10lx %04x %04x %u\n", + b, s, + bv->bv_page ? page_to_pfn(bv->bv_page) : 0, + bv->bv_offset, bv->bv_len, + bv->bv_page ? 
page_count(bv->bv_page) : 0);
+		}
+	}
+}
+EXPORT_SYMBOL(bvecq_dump);
+
+/**
+ * bvecq_alloc_one - Allocate a single bvecq node with unpopulated slots
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a single bvecq node and initialise the header.  A number of inline
+ * slots are also allocated, rounded up to fit after the header in a power-of-2
+ * slab object of up to 512 bytes (up to 29 slots on a 64-bit CPU).  The slot
+ * array is not initialised.
+ *
+ * Return: The node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp)
+{
+	struct bvecq *bq;
+	const size_t max_size = 512;
+	const size_t max_slots = (max_size - sizeof(*bq)) / sizeof(bq->__bv[0]);
+	size_t part = umin(nr_slots, max_slots);
+	size_t size = roundup_pow_of_two(struct_size(bq, __bv, part));
+
+	bq = kmalloc(size, gfp);
+	if (bq) {
+		*bq = (struct bvecq) {
+			.ref		= REFCOUNT_INIT(1),
+			.bv		= bq->__bv,
+			.inline_bv	= true,
+			.max_slots	= (size - sizeof(*bq)) / sizeof(bq->__bv[0]),
+		};
+		netfs_stat(&netfs_n_bvecq);
+	}
+	return bq;
+}
+EXPORT_SYMBOL(bvecq_alloc_one);
+
+/**
+ * bvecq_alloc_chain - Allocate an unpopulated bvecq chain
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a chain of bvecq nodes providing at least the requested cumulative
+ * number of slots.
+ *
+ * Return: The first node pointer or NULL on allocation failure.
+ */ +struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp) +{ + struct bvecq *head =3D NULL, *tail =3D NULL; + + _enter("%zu", nr_slots); + + for (;;) { + struct bvecq *bq; + + bq =3D bvecq_alloc_one(nr_slots, gfp); + if (!bq) + goto oom; + + if (tail) { + tail->next =3D bq; + bq->prev =3D tail; + } else { + head =3D bq; + } + tail =3D bq; + if (tail->max_slots >=3D nr_slots) + break; + nr_slots -=3D tail->max_slots; + } + + return head; +oom: + bvecq_put(head); + return NULL; +} +EXPORT_SYMBOL(bvecq_alloc_chain); + +/** + * bvecq_alloc_buffer - Allocate a bvecq chain and populate with buffers + * @size: Target size of the buffer (can be 0 for an empty buffer) + * @pre_slots: Number of preamble slots to set aside + * @gfp: The allocation constraints. + * + * Allocate a chain of bvecq nodes and populate the slots with sufficient = pages + * to provide at least the requested amount of space, leaving the first + * @pre_slots slots unset. The pages allocated may be compound pages larg= er + * than PAGE_SIZE and thus occupy fewer slots. The pages have their refco= unts + * set to 1 and can be passed to MSG_SPLICE_PAGES. + * + * Return: The first node pointer or NULL on allocation failure. 
+ */
+struct bvecq *bvecq_alloc_buffer(size_t size, unsigned int pre_slots, gfp_t gfp)
+{
+	struct bvecq *head = NULL, *tail = NULL, *p = NULL;
+	size_t count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+	_enter("%zx,%zx,%u", size, count, pre_slots);
+
+	do {
+		struct page **pages;
+		int want, got;
+
+		p = bvecq_alloc_one(umin(count, 32 - 3), gfp);
+		if (!p)
+			goto oom;
+
+		p->free = true;
+
+		if (tail) {
+			tail->next = p;
+			p->prev = tail;
+		} else {
+			head = p;
+		}
+		tail = p;
+		if (!count)
+			break;
+
+		pages = (struct page **)&p->bv[p->max_slots];
+		pages -= p->max_slots - pre_slots;
+		memset(pages, 0, (p->max_slots - pre_slots) * sizeof(pages[0]));
+
+		want = umin(count, p->max_slots - pre_slots);
+		got = alloc_pages_bulk(gfp, want, pages);
+		if (got < want) {
+			for (int i = 0; i < got; i++)
+				__free_page(pages[i]);
+			goto oom;
+		}
+
+		tail->nr_slots = pre_slots + got;
+		for (int i = 0; i < got; i++) {
+			int j = pre_slots + i;
+
+			set_page_count(pages[i], 1);
+			bvec_set_page(&tail->bv[j], pages[i], PAGE_SIZE, 0);
+		}
+
+		count -= got;
+		pre_slots = 0;
+	} while (count > 0);
+
+	return head;
+oom:
+	bvecq_put(head);
+	return NULL;
+}
+EXPORT_SYMBOL(bvecq_alloc_buffer);
+
+/*
+ * Free the page pointed to by a segment as necessary.
+ */
+static void bvecq_free_seg(struct bvecq *bq, unsigned int seg)
+{
+	if (!bq->free ||
+	    !bq->bv[seg].bv_page)
+		return;
+
+	if (bq->unpin)
+		unpin_user_page(bq->bv[seg].bv_page);
+	else
+		__free_page(bq->bv[seg].bv_page);
+}
+
+/**
+ * bvecq_put - Put a ref on a bvec queue
+ * @bq: The start of the bvec queue to free
+ *
+ * Put the ref(s) on the nodes in a bvec queue, freeing up the node and the
+ * page fragments it points to as the refcounts become zero.
+ */ +void bvecq_put(struct bvecq *bq) +{ + struct bvecq *next; + + for (; bq; bq =3D next) { + if (!refcount_dec_and_test(&bq->ref)) + break; + for (int seg =3D 0; seg < bq->nr_slots; seg++) + bvecq_free_seg(bq, seg); + next =3D bq->next; + netfs_stat_d(&netfs_n_bvecq); + kfree(bq); + } +} +EXPORT_SYMBOL(bvecq_put); + +/** + * bvecq_expand_buffer - Allocate buffer space into a bvec queue + * @_buffer: Pointer to the bvecq chain to expand (may point to a NULL; up= dated). + * @_cur_size: Current size of the buffer (updated). + * @size: Target size of the buffer. + * @gfp: The allocation constraints. + */ +int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t= size, gfp_t gfp) +{ + struct bvecq *tail =3D *_buffer; + const size_t max_slots =3D 32; + + size =3D round_up(size, PAGE_SIZE); + if (*_cur_size >=3D size) + return 0; + + if (tail) + while (tail->next) + tail =3D tail->next; + + do { + struct page *page; + int order =3D 0; + + if (!tail || bvecq_is_full(tail)) { + struct bvecq *p; + + p =3D bvecq_alloc_one(max_slots, gfp); + if (!p) + return -ENOMEM; + if (tail) { + tail->next =3D p; + p->prev =3D tail; + } else { + *_buffer =3D p; + } + tail =3D p; + } + + if (size - *_cur_size > PAGE_SIZE) + order =3D umin(ilog2(size - *_cur_size) - PAGE_SHIFT, + MAX_PAGECACHE_ORDER); + + page =3D alloc_pages(gfp | __GFP_COMP, order); + if (!page && order > 0) + page =3D alloc_pages(gfp | __GFP_COMP, 0); + if (!page) + return -ENOMEM; + + bvec_set_page(&tail->bv[tail->nr_slots++], page, PAGE_SIZE << order, 0); + *_cur_size +=3D PAGE_SIZE << order; + } while (*_cur_size < size); + + return 0; +} +EXPORT_SYMBOL(bvecq_expand_buffer); + +/** + * bvecq_shorten_buffer - Shorten a bvec queue buffer + * @bq: The start of the buffer to shorten + * @slot: The slot to start from + * @size: The size to retain + * + * Shorten the content of a bvec queue down to the minimum number of segme= nts, + * starting at the specified segment, to retain the specified size. 
+ *
+ * Return: 0 if successful; -EMSGSIZE if there is insufficient content.
+ */
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size)
+{
+	ssize_t retain = size;
+
+	/* Skip through the segments we want to keep. */
+	for (; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			retain -= bq->bv[slot].bv_len;
+			if (retain < 0)
+				goto found;
+		}
+		slot = 0;
+	}
+	if (WARN_ON_ONCE(retain > 0))
+		return -EMSGSIZE;
+	return 0;
+
+found:
+	/* Shorten the entry to be retained and clean the rest of this bvecq. */
+	bq->bv[slot].bv_len += retain;
+	slot++;
+	for (int i = slot; i < bq->nr_slots; i++)
+		bvecq_free_seg(bq, i);
+	bq->nr_slots = slot;
+
+	/* Free the queue tail. */
+	bvecq_put(bq->next);
+	bq->next = NULL;
+	return 0;
+}
+EXPORT_SYMBOL(bvecq_shorten_buffer);
+
+/**
+ * bvecq_buffer_init - Initialise a buffer and set position
+ * @pos: The position to point at the new buffer.
+ * @gfp: The allocation constraints.
+ *
+ * Initialise a rolling buffer.  We allocate an unpopulated bvecq node so
+ * that the pointers can be independently driven by the producer and the
+ * consumer.
+ *
+ * Return: 0 if successful; -ENOMEM on allocation failure.
+ */
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp)
+{
+	struct bvecq *bq;
+
+	bq = bvecq_alloc_one(13, gfp);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq = bq;	/* Comes with a ref. */
+	pos->slot = 0;
+	pos->offset = 0;
+	return 0;
+}
+
+/**
+ * bvecq_buffer_make_space - Start a new bvecq node in a buffer
+ * @pos: The position of the last node.
+ * @gfp: The allocation constraints.
+ *
+ * Add a new node on to the buffer chain at the specified position, either
+ * because the previous one is full or because we have a discontiguity to
+ * contend with, and update @pos to point to it.
+ *
+ * Return: 0 if successful; -ENOMEM on allocation failure.
+ */ +int bvecq_buffer_make_space(struct bvecq_pos *pos, gfp_t gfp) +{ + struct bvecq *bq, *head =3D pos->bvecq; + + bq =3D bvecq_alloc_one(14, gfp); + if (!bq) + return -ENOMEM; + bq->prev =3D head; + + pos->bvecq =3D bvecq_get(bq); + pos->slot =3D 0; + pos->offset =3D 0; + + /* Make sure the initialisation is stored before the next pointer. + * + * [!] NOTE: After we set head->next, the consumer is at liberty to + * immediately delete the old head. + */ + smp_store_release(&head->next, bq); + bvecq_put(head); + return 0; +} + +/** + * bvecq_pos_advance - Advance a bvecq position + * @pos: The position to advance. + * @amount: The amount of bytes to advance by. + * + * Advance the specified bvecq position by @amount bytes. @pos is updated= and + * bvecq ref counts may have been manipulated. If the position hits the e= nd of + * the queue, then it is left pointing beyond the last slot of the last bv= ecq + * so that it doesn't break the chain. + */ +void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount) +{ + struct bvecq *bq =3D pos->bvecq; + unsigned int slot =3D pos->slot; + size_t offset =3D pos->offset; + + if (slot >=3D bq->nr_slots) { + bq =3D bq->next; + slot =3D 0; + } + + while (amount) { + const struct bio_vec *bv =3D &bq->bv[slot]; + size_t part =3D umin(bv->bv_len - offset, amount); + + if (likely(part < bv->bv_len)) { + offset +=3D part; + break; + } + amount -=3D part; + offset =3D 0; + slot++; + if (slot >=3D bq->nr_slots) { + if (!bq->next) + break; + bq =3D bq->next; + slot =3D 0; + } + } + + pos->slot =3D slot; + pos->offset =3D offset; + bvecq_pos_move(pos, bq); +} + +/** + * bvecq_zero - Clear memory starting at the bvecq position. + * @pos: The position in the bvecq chain to start clearing. + * @amount: The number of bytes to clear. + * + * Clear memory fragments pointed to by a bvec queue. @pos is updated and + * bvecq ref counts may have been manipulated. 
If the position hits the e= nd of + * the queue, then it is left pointing beyond the last slot of the last bv= ecq + * so that it doesn't break the chain. + * + * Return: The number of bytes cleared. + */ +ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount) +{ + struct bvecq *bq =3D pos->bvecq; + unsigned int slot =3D pos->slot; + ssize_t cleared =3D 0; + size_t offset =3D pos->offset; + + if (WARN_ON_ONCE(!bq)) + return 0; + + if (slot >=3D bq->nr_slots) { + bq =3D bq->next; + if (WARN_ON_ONCE(!bq)) + return 0; + slot =3D 0; + } + + do { + const struct bio_vec *bv =3D &bq->bv[slot]; + + if (offset < bv->bv_len) { + size_t part =3D umin(amount - cleared, bv->bv_len - offset); + + memzero_page(bv->bv_page, bv->bv_offset + offset, part); + + offset +=3D part; + cleared +=3D part; + } + + if (offset >=3D bv->bv_len) { + offset =3D 0; + slot++; + if (slot >=3D bq->nr_slots) { + if (!bq->next) + break; + bq =3D bq->next; + slot =3D 0; + } + } + } while (cleared < amount); + + bvecq_pos_move(pos, bq); + pos->slot =3D slot; + pos->offset =3D offset; + return cleared; +} + +/** + * bvecq_slice - Find a slice of a bvecq queue + * @pos: The position to start at. + * @max_size: The maximum size of the slice (or ULONG_MAX). + * @max_segs: The maximum number of segments in the slice (or INT_MAX). + * @_nr_segs: Where to put the number of segments (updated). + * + * Determine the size and number of segments that can be obtained the next + * slice of bvec queue up to the maximum size and segment count specified.= The + * slice is also limited if a discontiguity is found. + * + * @pos is updated to the end of the slice. If the position hits the end = of + * the queue, then it is left pointing beyond the last slot of the last bv= ecq + * so that it doesn't break the chain. + * + * Return: The number of bytes in the slice. 
+ */ +size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size, + unsigned int max_segs, unsigned int *_nr_segs) +{ + struct bvecq *bq; + unsigned int slot =3D pos->slot, nsegs =3D 0; + size_t size =3D 0; + size_t offset =3D pos->offset; + + bq =3D pos->bvecq; + for (;;) { + for (; slot < bq->nr_slots; slot++) { + const struct bio_vec *bvec =3D &bq->bv[slot]; + + if (offset < bvec->bv_len && bvec->bv_page) { + size_t part =3D umin(bvec->bv_len - offset, max_size); + + size +=3D part; + offset +=3D part; + max_size -=3D part; + nsegs++; + if (!max_size || nsegs >=3D max_segs) + goto out; + } + offset =3D 0; + } + + /* pos->bvecq isn't allowed to go NULL as the queue may get + * extended and we would lose our place. + */ + if (!bq->next) + break; + slot =3D 0; + bq =3D bq->next; + if (bq->discontig && size > 0) + break; + } + +out: + *_nr_segs =3D nsegs; + if (slot =3D=3D bq->nr_slots && bq->next) { + bq =3D bq->next; + slot =3D 0; + offset =3D 0; + } + bvecq_pos_move(pos, bq); + pos->slot =3D slot; + pos->offset =3D offset; + return size; +} + +/** + * bvecq_extract - Extract a slice of a bvecq queue into a new bvecq queue + * @pos: The position to start at. + * @max_size: The maximum size of the slice (or ULONG_MAX). + * @max_segs: The maximum number of segments in the slice (or INT_MAX). + * @to: Where to put the extraction bvecq chain head (updated). + * + * Allocate a new bvecq and extract into it memory fragments from a slice = of + * bvec queue, starting at @pos. The slice is also limited if a discontig= uity + * is found. No refs are taken on the page. + * + * @pos is updated to the end of the slice. If the position hits the end = of + * the queue, then it is left pointing beyond the last slot of the last bv= ecq + * so that it doesn't break the chain. + * + * If successful, *@to is set to point to the head of the newly allocated = chain + * and the caller inherits a ref to it. 
+ * + * Return: The number of bytes extracted; -ENOMEM on allocation failure or= -EIO + * if no segments were available to extract. + */ +ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size, + unsigned int max_segs, struct bvecq **to) +{ + struct bvecq_pos tmp_pos; + struct bvecq *src, *dst =3D NULL; + unsigned int slot =3D pos->slot, nsegs; + ssize_t extracted =3D 0; + size_t offset =3D pos->offset, amount; + + *to =3D NULL; + if (WARN_ON_ONCE(!max_segs)) + max_segs =3D INT_MAX; + + bvecq_pos_set(&tmp_pos, pos); + amount =3D bvecq_slice(&tmp_pos, max_size, max_segs, &nsegs); + bvecq_pos_unset(&tmp_pos); + if (nsegs =3D=3D 0) + return -EIO; + + dst =3D bvecq_alloc_chain(nsegs, GFP_KERNEL); + if (!dst) + return -ENOMEM; + *to =3D dst; + max_segs =3D nsegs; + nsegs =3D 0; + + /* Transcribe the segments */ + src =3D pos->bvecq; + for (;;) { + for (; slot < src->nr_slots; slot++) { + const struct bio_vec *sv =3D &src->bv[slot]; + struct bio_vec *dv =3D &dst->bv[dst->nr_slots]; + + _debug("EXTR BQ=3D%x[%x] off=3D%zx am=3D%zx p=3D%lx", + src->priv, slot, offset, amount, page_to_pfn(sv->bv_page)); + + if (offset < sv->bv_len && sv->bv_page) { + size_t part =3D umin(sv->bv_len - offset, amount); + + bvec_set_page(dv, sv->bv_page, part, + sv->bv_offset + offset); + extracted +=3D part; + amount -=3D part; + offset +=3D part; + trace_netfs_bv_slot(dst, dst->nr_slots); + dst->nr_slots++; + nsegs++; + if (bvecq_is_full(dst)) + dst =3D dst->next; + if (nsegs >=3D max_segs) + goto out; + if (amount =3D=3D 0) + goto out; + } + offset =3D 0; + } + + /* pos->bvecq isn't allowed to go NULL as the queue may get + * extended and we would lose our place. 
+	 */
+		if (!src->next)
+			break;
+		slot = 0;
+		src = src->next;
+		if (src->discontig && extracted > 0)
+			break;
+	}
+
+out:
+	if (slot == src->nr_slots && src->next) {
+		src = src->next;
+		slot = 0;
+		offset = 0;
+	}
+	bvecq_pos_move(pos, src);
+	pos->slot = slot;
+	pos->offset = offset;
+	return extracted;
+}
+
+/**
+ * bvecq_load_from_ra - Allocate a bvecq chain and load from readahead
+ * @pos: Blank position object to attach the new chain to.
+ * @ractl: The readahead control context.
+ *
+ * Decant the set of folios to be read from the readahead context into a bvecq
+ * chain.  Each folio occupies one bio_vec element.
+ *
+ * Return: Amount of data loaded or -ENOMEM on allocation failure.
+ */
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control *ractl)
+{
+	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
+	struct folio *folio;
+	struct bvecq *bq;
+	size_t loaded = 0;
+
+	bq = bvecq_alloc_chain(ractl->_nr_folios, GFP_NOFS);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq = bq;
+	pos->slot = 0;
+	pos->offset = 0;
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
+		size_t len;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+		len = folio_size(folio);
+		bvec_set_folio(&bq->bv[bq->nr_slots++], folio, len, 0);
+		loaded += len;
+		trace_netfs_folio(folio, netfs_folio_trace_read);
+
+		if (bq->nr_slots >= bq->max_slots) {
+			bq = bq->next;
+			if (!bq)
+				break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	ractl->_index += ractl->_nr_pages;
+	ractl->_nr_pages = 0;
+	return loaded;
+}
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 2fcf31de5f2c..ad47bcc1947b 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -168,6 +168,7 @@ extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
 extern atomic_t netfs_n_folioq;
+extern atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v);
 
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index ab6b916addc4..84c2a4bcc762 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -48,6 +48,7 @@ atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
 atomic_t netfs_n_folioq;
+atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v)
 {
@@ -90,9 +91,10 @@ int netfs_stats_show(struct seq_file *m, void *v)
 		   atomic_read(&netfs_n_rh_retry_read_subreq),
 		   atomic_read(&netfs_n_wh_retry_write_req),
 		   atomic_read(&netfs_n_wh_retry_write_subreq));
-	seq_printf(m, "Objs : rr=%u sr=%u foq=%u wsc=%u\n",
+	seq_printf(m, "Objs : rr=%u sr=%u bq=%u foq=%u wsc=%u\n",
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq),
+		   atomic_read(&netfs_n_bvecq),
 		   atomic_read(&netfs_n_folioq),
 		   atomic_read(&netfs_n_wh_wstream_conflict));
 	seq_printf(m, "WbLock : skip=%u wait=%u\n",
diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
index 462125af1cc7..6c58a7fb6472 100644
--- a/include/linux/bvecq.h
+++ b/include/linux/bvecq.h
@@ -17,7 +17,7 @@
  * iterated over with an ITER_BVECQ iterator.  The list is non-circular; next
  * and prev are NULL at the ends.
  *
- * The bv pointer points to the segment array; this may be __bv if allocated
+ * The bv pointer points to the bio_vec array; this may be __bv if allocated
  * together.  The caller is responsible for determining whether or not this is
  * the case as the array pointed to by bv may be follow on directly from the
  * bvecq by accident of allocation (ie. ->bv == ->__bv is *not* sufficient to
@@ -33,8 +33,8 @@ struct bvecq {
 	unsigned long long fpos;	/* File position */
 	refcount_t	ref;
 	u32		priv;		/* Private data */
-	u16		nr_segs;	/* Number of elements in bv[] used */
-	u16		max_segs;	/* Number of elements allocated in bv[] */
+	u16		nr_slots;	/* Number of elements in bv[] used */
+	u16		max_slots;	/* Number of elements allocated in bv[] */
 	bool		inline_bv:1;	/* T if __bv[] is being used */
 	bool		free:1;		/* T if the pages need freeing */
 	bool		unpin:1;	/* T if the pages need unpinning, not freeing */
@@ -43,4 +43,163 @@ struct bvecq {
 	struct bio_vec	__bv[];		/* Default array (if ->inline_bv) */
 };
 
+/*
+ * Position in a bio_vec queue.  The bvecq holds a ref on the queue segment it
+ * points to.
+ */
+struct bvecq_pos {
+	struct bvecq	*bvecq;		/* The first bvecq */
+	unsigned int	offset;		/* The offset within the starting slot */
+	u16		slot;		/* The starting slot */
+};
+
+void bvecq_dump(const struct bvecq *bq);
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_buffer(size_t size, unsigned int pre_slots, gfp_t gfp);
+void bvecq_put(struct bvecq *bq);
+int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t size, gfp_t gfp);
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size);
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp);
+int bvecq_buffer_make_space(struct bvecq_pos *pos, gfp_t gfp);
+void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount);
+ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount);
+size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size,
+		   unsigned int max_segs, unsigned int *_nr_segs);
+ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size,
+		      unsigned int max_segs, struct bvecq **to);
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control *ractl);
+
+/**
+ * bvecq_get - Get a ref on a bvecq
+ * @bq: The bvecq to get a ref on
+ */
+static inline struct bvecq *bvecq_get(struct bvecq *bq)
+{
+	refcount_inc(&bq->ref);
+	return bq;
+}
+
+/**
+ * bvecq_is_full - Determine if a bvecq is full
+ * @bvecq: The object to query
+ *
+ * Return: true if full; false if not.
+ */
+static inline bool bvecq_is_full(const struct bvecq *bvecq)
+{
+	return bvecq->nr_slots >= bvecq->max_slots;
+}
+
+/**
+ * bvecq_pos_set - Set one position to be the same as another
+ * @pos: The position object to set
+ * @at: The source position.
+ *
+ * Set @pos to have the same position as @at.  This may take a ref on the
+ * bvecq pointed to.
+ */
+static inline void bvecq_pos_set(struct bvecq_pos *pos, const struct bvecq_pos *at)
+{
+	*pos = *at;
+	bvecq_get(pos->bvecq);
+}
+
+/**
+ * bvecq_pos_unset - Unset a position
+ * @pos: The position object to unset
+ *
+ * Unset @pos.  This does any needed ref cleanup.
+ */
+static inline void bvecq_pos_unset(struct bvecq_pos *pos)
+{
+	bvecq_put(pos->bvecq);
+	pos->bvecq = NULL;
+	pos->slot = 0;
+	pos->offset = 0;
+}
+
+/**
+ * bvecq_pos_transfer - Transfer one position to another, clearing the first
+ * @pos: The position object to set
+ * @from: The source position to clear.
+ *
+ * Set @pos to have the same position as @from and then clear @from.  This may
+ * transfer a ref on the bvecq pointed to.
+ */
+static inline void bvecq_pos_transfer(struct bvecq_pos *pos, struct bvecq_pos *from)
+{
+	*pos = *from;
+	from->bvecq = NULL;
+	from->slot = 0;
+	from->offset = 0;
+}
+
+/**
+ * bvecq_pos_move - Update a position to a new bvecq
+ * @pos: The position object to update.
+ * @to: The new bvecq to point at.
+ *
+ * Update @pos to point to @to if it doesn't already do so.  This may
+ * manipulate refs on the bvecqs pointed to.
+ */
+static inline void bvecq_pos_move(struct bvecq_pos *pos, struct bvecq *to)
+{
+	struct bvecq *old = pos->bvecq;
+
+	if (old != to) {
+		pos->bvecq = bvecq_get(to);
+		bvecq_put(old);
+	}
+}
+
+/**
+ * bvecq_pos_step - Step a position to the next slot if possible
+ * @pos: The position object to step.
+ *
+ * Update @pos to point to the next slot in the queue if not at the end.  This
+ * may manipulate refs on the bvecqs pointed to.
+ *
+ * Return: true if successful, false if it was at the end.
+ */
+static inline bool bvecq_pos_step(struct bvecq_pos *pos)
+{
+	struct bvecq *bq = pos->bvecq;
+
+	pos->slot++;
+	pos->offset = 0;
+	if (pos->slot < bq->nr_slots)
+		return true;
+	if (!bq->next)
+		return false;
+	bvecq_pos_move(pos, bq->next);
+	return true;
+}
+
+/**
+ * bvecq_delete_spent - Delete the bvecq at the front if possible
+ * @pos: The position object to update.
+ *
+ * Delete the used-up bvecq at the front of the queue that @pos points to if it
+ * is not the last node in the queue; if it is the last node in the queue, it
+ * is kept so that the queue doesn't become detached from the other end.  This
+ * may manipulate refs on the bvecqs pointed to.
+ */
+static inline struct bvecq *bvecq_delete_spent(struct bvecq_pos *pos)
+{
+	struct bvecq *spent = pos->bvecq;
+	/* Read the contents of the queue node after the pointer to it. */
+	struct bvecq *next = smp_load_acquire(&spent->next);
+
+	if (!next)
+		return NULL;
+	next->prev = NULL;
+	spent->next = NULL;
+	bvecq_put(spent);
+	pos->bvecq = next;	/* We take spent's ref */
+	pos->slot = 0;
+	pos->offset = 0;
+	return next;
+}
+
 #endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index 999607ece481..309642b3901f 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -152,7 +152,7 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 	unsigned int slot = iter->bvecq_slot;
 	size_t progress = 0, skip = iter->iov_offset;
 
-	if (slot == bq->nr_segs) {
+	if (slot == bq->nr_slots) {
 		/* The iterator may have been extended. */
 		bq = bq->next;
 		slot = 0;
@@ -176,7 +176,7 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 		if (skip >= bvec->bv_len) {
 			skip = 0;
 			slot++;
-			if (slot >= bq->nr_segs) {
+			if (slot >= bq->nr_slots) {
 				if (!bq->next)
 					break;
 				bq = bq->next;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index cc56b6512769..5bc48aacf7f6 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index b8236f9e940e..fbb094231659 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -779,6 +779,30 @@ TRACE_EVENT(netfs_folioq,
 		      __print_symbolic(__entry->trace, netfs_folioq_traces))
 	    );
 
+TRACE_EVENT(netfs_bv_slot,
+	    TP_PROTO(const struct bvecq *bq, int slot),
+
+	    TP_ARGS(bq, slot),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned long,	pfn)
+		    __field(unsigned int,	offset)
+		    __field(unsigned int,	len)
+		    __field(unsigned int,	slot)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->slot	= slot;
+		    __entry->pfn	= page_to_pfn(bq->bv[slot].bv_page);
+		    __entry->offset	= bq->bv[slot].bv_offset;
+		    __entry->len	= bq->bv[slot].bv_len;
+		    ),
+
+	    TP_printk("bq[%x] p=%lx %x-%x",
+		      __entry->slot,
+		      __entry->pfn, __entry->offset, __entry->offset + __entry->len)
+	    );
+
 #undef EM
 #undef E_
 #endif /* _TRACE_NETFS_H */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index df8d037894b1..4f091e6d4a22 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -580,7 +580,7 @@ static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
 		return;
 	i->count -= by;
 
-	if (slot >= bq->nr_segs) {
+	if (slot >= bq->nr_slots) {
 		bq = bq->next;
 		slot = 0;
 	}
@@ -593,7 +593,7 @@ static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
 			break;
 		by -= len;
 		slot++;
-		if (slot >= bq->nr_segs && bq->next) {
+		if (slot >= bq->nr_slots && bq->next) {
 			bq = bq->next;
 			slot = 0;
 		}
@@ -662,7 +662,7 @@ static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
 
 		if (slot == 0) {
 			bq = bq->prev;
-			slot = bq->nr_segs;
+			slot = bq->nr_slots;
 		}
 		slot--;
 
@@ -947,7 +947,7 @@ static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
 		return res;
 
 	for (bq = iter->bvecq; bq; bq = bq->next) {
-		for (; slot < bq->nr_segs; slot++) {
+		for (; slot < bq->nr_slots; slot++) {
 			const struct bio_vec *bvec = &bq->bv[slot];
 			size_t part = umin(bvec->bv_len - skip, size);
 
@@ -1331,7 +1331,7 @@ static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpages)
 	size_t size = iter->count;
 
 	for (bq = iter->bvecq; bq; bq = bq->next) {
-		for (; slot < bq->nr_segs; slot++) {
+		for (; slot < bq->nr_slots; slot++) {
 			const struct bio_vec *bvec = &bq->bv[slot];
 			size_t offs = (bvec->bv_offset + skip) % PAGE_SIZE;
 			size_t part = umin(bvec->bv_len - skip, size);
@@ -1731,7 +1731,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 	unsigned int seg = iter->bvecq_slot, count = 0, nr = 0;
 	size_t extracted = 0, offset = iter->iov_offset;
 
-	if (seg >= bvecq->nr_segs) {
+	if (seg >= bvecq->nr_slots) {
 		bvecq = bvecq->next;
 		if (WARN_ON_ONCE(!bvecq))
 			return 0;
@@ -1763,7 +1763,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(extracted < iter->count);
 					break;
@@ -1816,7 +1816,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(extracted < iter->count);
 					break;
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 03e3883a1a2d..93a3d194a914 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1345,7 +1345,7 @@ static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
 	ssize_t ret = 0;
 	size_t offset = iter->iov_offset;
 
-	if (seg >= bvecq->nr_segs) {
+	if (seg >= bvecq->nr_slots) {
 		bvecq = bvecq->next;
 		if (WARN_ON_ONCE(!bvecq))
 			return 0;
@@ -1373,7 +1373,7 @@ static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(ret < iter->count);
 					break;
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index 5bc941f64343..ff0621636ff1 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -543,28 +543,28 @@ static void iov_kunit_destroy_bvecq(void *data)
 
 	for (bq = data; bq; bq = next) {
 		next = bq->next;
-		for (int i = 0; i < bq->nr_segs; i++)
+		for (int i = 0; i < bq->nr_slots; i++)
 			if (bq->bv[i].bv_page)
 				put_page(bq->bv[i].bv_page);
 		kfree(bq);
 	}
 }
 
-static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_segs)
+static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_slots)
 {
 	struct bvecq *bq;
 
-	bq = kzalloc(struct_size(bq, __bv, max_segs), GFP_KERNEL);
+	bq = kzalloc(struct_size(bq, __bv, max_slots), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq);
-	bq->max_segs = max_segs;
+	bq->max_slots = max_slots;
 	return bq;
 }
 
-static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned int max_segs)
+static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned int max_slots)
 {
 	struct bvecq *bq;
 
-	bq = iov_kunit_alloc_bvecq(test, max_segs);
+	bq = iov_kunit_alloc_bvecq(test, max_slots);
 	kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq);
 	return bq;
 }
@@ -578,13 +578,13 @@ static void __init iov_kunit_load_bvecq(struct kunit *test,
 	size_t size = 0;
 
 	for (int i = 0; i < npages; i++) {
-		if (bq->nr_segs >= bq->max_segs) {
+		if (bq->nr_slots >= bq->max_slots) {
 			bq->next = iov_kunit_alloc_bvecq(test, 8);
 			bq->next->prev = bq;
 			bq = bq->next;
 		}
-		bvec_set_page(&bq->bv[bq->nr_segs], pages[i], PAGE_SIZE, 0);
-		bq->nr_segs++;
+		bvec_set_page(&bq->bv[bq->nr_slots], pages[i], PAGE_SIZE, 0);
+		bq->nr_slots++;
 		size += PAGE_SIZE;
 	}
 	iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size);

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 14/26] netfs: Add a function to extract from an iter into a bvecq
Date: Thu, 26 Mar 2026 10:45:29 +0000
Message-ID: <20260326104544.509518-15-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Add a function to extract a slice of data from an iterator of any type into
a bvec queue chain.
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/iterator.c   | 123 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |   3 ++
 2 files changed, 126 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index adca78747f23..e77fd39327c2 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -13,6 +13,129 @@
 #include
 #include "internal.h"
 
+/**
+ * netfs_extract_iter - Extract the pages from an iterator into a bvecq
+ * @orig: The original iterator
+ * @orig_len: The amount of iterator to copy
+ * @max_segs: Maximum number of contiguous segments
+ * @fpos: Starting file position to label the bvecq with
+ * @_bvecq_head: Where to cache the bvec queue
+ * @extraction_flags: Flags to qualify the request
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * build a bvec queue that refers to all of those bits.  This allows the
+ * original iterator to be disposed of.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer
+ * DMA be allowed on the pages extracted.
+ *
+ * On success, the amount of data in the bvec is returned and the original
+ * iterator will have been advanced by the amount extracted.
+ *
+ * The bvecq segments are marked with indications of how to clean up the
+ * extracted fragments.
+ */
+ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_segs,
+			   unsigned long long fpos, struct bvecq **_bvecq_head,
+			   iov_iter_extraction_t extraction_flags)
+{
+	struct bvecq *bq_tail = NULL;
+	ssize_t ret = 0;
+	size_t extracted = 0, nr_pages;
+
+	_enter("{%u,%zx},%zx", orig->iter_type, orig->count, orig_len);
+
+	WARN_ON_ONCE(orig_len > orig->count);
+
+	nr_pages = iov_iter_npages(orig, max_segs ?: INT_MAX);
+	if (WARN_ON(nr_pages == 0) ||
+	    WARN_ON(nr_pages > max_segs))
+		nr_pages = max_segs;
+	max_segs = nr_pages;
+
+	do {
+		struct bvecq *bq;
+
+		if (WARN_ON(max_segs == 0))
+			break;
+
+		bq = bvecq_alloc_one(max_segs, GFP_NOFS);
+		if (!bq) {
+			ret = -ENOMEM;
+			break;
+		}
+		bq->free = user_backed_iter(orig);
+		bq->unpin = iov_iter_extract_will_pin(orig);
+		bq->prev = bq_tail;
+		bq->fpos = fpos + extracted;
+
+		if (bq_tail)
+			bq_tail->next = bq;
+		else
+			*_bvecq_head = bq;
+		bq_tail = bq;
+
+		if (orig_len == 0)
+			break;
+
+		struct bio_vec *bv = bq->bv;
+		do {
+			struct page **pages;
+			ssize_t got;
+			size_t offset;
+			size_t space = bq->max_slots - bq->nr_slots;
+			size_t bv_size = array_size(bq->max_slots, sizeof(*bv));
+			size_t pg_size = array_size(space, sizeof(*pages));
+
+			/* Put the page list at the end of the bvec list
+			 * storage.  bvec elements are larger than page
+			 * pointers, so as long as we work 0->last, we should
+			 * be fine.
+			 */
+			pages = (void *)bv + bv_size - pg_size;
+
+			got = iov_iter_extract_pages(orig, &pages, orig_len,
+						     space, extraction_flags, &offset);
+			if (got < 0) {
+				ret = got;
+				goto out;
+			}
+
+			if (got == 0) {
+				pr_err("extract_pages gave nothing from %zu, %zu\n",
+				       extracted, orig_len);
+				ret = -EIO;
+				goto out;
+			}
+
+			if (got > orig_len - extracted) {
+				pr_err("extract_pages rc=%zd more than %zu\n",
+				       got, orig_len);
+				goto out;
+			}
+
+			extracted += got;
+			orig_len -= got;
+
+			do {
+				size_t len = umin(got, PAGE_SIZE - offset);
+
+				BUG_ON(bq->nr_slots >= bq->max_slots);
+
+				bvec_set_page(&bq->bv[bq->nr_slots],
+					      *pages++, len, offset);
+				bq->nr_slots++;
+				got -= len;
+				offset = 0;
+			} while (got > 0);
+		} while (orig_len > 0 && !bvecq_is_full(bq));
+	} while (orig_len > 0 && max_segs > 0);
+
+out:
+	return extracted ?: ret;
+}
+EXPORT_SYMBOL_GPL(netfs_extract_iter);
+
 /**
  * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
  * @orig: The original iterator
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 5bc48aacf7f6..b4602f7b6431 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -445,6 +445,9 @@ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
 			  enum netfs_sreq_ref_trace what);
 void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 			  enum netfs_sreq_ref_trace what);
+ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_segs,
+			   unsigned long long fpos, struct bvecq **_bvecq_head,
+			   iov_iter_extraction_t extraction_flags);
 ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 				struct iov_iter *new,
 				iov_iter_extraction_t extraction_flags);

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 15/26] afs: Use a bvecq to hold dir content rather than folioq
Date: Thu, 26 Mar 2026 10:45:30 +0000
Message-ID: <20260326104544.509518-16-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Use a bvecq to hold the contents of a directory rather than the folioq so
that the latter can be phased out.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Marc Dionne
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/afs/dir.c           |  39 +++++------
 fs/afs/dir_edit.c      |  42 +++++------
 fs/afs/dir_search.c    |  33 ++++-----
 fs/afs/inode.c         |  20 +++---
 fs/afs/internal.h      |   6 +-
 fs/netfs/write_issue.c | 156 ++++++-----------------------------------
 6 files changed, 88 insertions(+), 208 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 78caef3f1338..6627a0d38e73 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -136,9 +136,9 @@ static void afs_dir_dump(struct afs_vnode *dvnode)
 	pr_warn("DIR %llx:%llx is=%llx\n",
 		dvnode->fid.vid, dvnode->fid.vnode, i_size);
 
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
-	iterate_folioq(&iter, iov_iter_count(&iter), NULL, NULL,
-		       afs_dir_dump_step);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	iterate_bvecq(&iter, iov_iter_count(&iter), NULL, NULL,
+		      afs_dir_dump_step);
 }
 
 /*
@@ -199,9 +199,9 @@ static int afs_dir_check(struct afs_vnode *dvnode)
 	if (unlikely(!i_size))
 		return 0;
 
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
-	checked = iterate_folioq(&iter, iov_iter_count(&iter), dvnode, NULL,
-				 afs_dir_check_step);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	checked = iterate_bvecq(&iter, iov_iter_count(&iter), dvnode, NULL,
+				afs_dir_check_step);
 	if (checked != i_size) {
 		afs_dir_dump(dvnode);
 		return -EIO;
@@ -255,15 +255,14 @@ static ssize_t afs_do_read_single(struct afs_vnode *dvnode, struct file *file)
 	if (dvnode->directory_size < i_size) {
 		size_t cur_size = dvnode->directory_size;
 
-		ret = netfs_alloc_folioq_buffer(NULL,
-						&dvnode->directory, &cur_size, i_size,
-						mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
+		ret = bvecq_expand_buffer(&dvnode->directory, &cur_size, i_size,
+					  mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
 		dvnode->directory_size = cur_size;
 		if (ret < 0)
 			return ret;
 	}
 
-	iov_iter_folio_queue(&iter, ITER_DEST, dvnode->directory, 0, 0, dvnode->directory_size);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dvnode->directory, 0, 0, dvnode->directory_size);
 
 	/* AFS requires us to perform the read of a directory synchronously as
 	 * a single unit to avoid issues with the directory contents being
@@ -282,9 +281,9 @@ static ssize_t afs_do_read_single(struct afs_vnode *dvnode, struct file *file)
 
 		if (ret2 < 0)
 			ret = ret2;
-	} else if (i_size < folioq_folio_size(dvnode->directory, 0)) {
+	} else if (i_size < PAGE_SIZE) {
 		/* NUL-terminate a symlink. */
-		char *symlink = kmap_local_folio(folioq_folio(dvnode->directory, 0), 0);
+		char *symlink = kmap_local_bvec(&dvnode->directory->bv[0], 0);
 
 		symlink[i_size] = 0;
 		kunmap_local(symlink);
@@ -305,8 +304,8 @@ ssize_t afs_read_single(struct afs_vnode *dvnode, struct file *file)
 }
 
 /*
- * Read the directory into a folio_queue buffer in one go, scrubbing the
- * previous contents.  We return -ESTALE if the caller needs to call us again.
+ * Read the directory into the buffer in one go, scrubbing the previous
+ * contents.  We return -ESTALE if the caller needs to call us again.
  */
 ssize_t afs_read_dir(struct afs_vnode *dvnode, struct file *file)
 	__acquires(&dvnode->validate_lock)
@@ -487,7 +486,7 @@ static size_t afs_dir_iterate_step(void *iter_base, size_t progress, size_t len,
 }
 
 /*
- * Iterate through the directory folios.
+ * Iterate through the directory content.
  */
 static int afs_dir_iterate_contents(struct inode *dir, struct dir_context *dir_ctx)
 {
@@ -502,11 +501,11 @@ static int afs_dir_iterate_contents(struct inode *dir, struct dir_context *dir_c
 	if (i_size <= 0 || dir_ctx->pos >= i_size)
 		return 0;
 
-	iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, i_size);
 	iov_iter_advance(&iter, round_down(dir_ctx->pos, AFS_DIR_BLOCK_SIZE));
 
-	iterate_folioq(&iter, iov_iter_count(&iter), dvnode, &ctx,
-		       afs_dir_iterate_step);
+	iterate_bvecq(&iter, iov_iter_count(&iter), dvnode, &ctx,
+		      afs_dir_iterate_step);
 
 	if (ctx.error == -ESTALE)
 		afs_invalidate_dir(dvnode, afs_dir_invalid_iter_stale);
@@ -2211,8 +2210,8 @@ int afs_single_writepages(struct address_space *mapping,
 	if (is_dir ?
 	    test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags) :
 	    atomic64_read(&dvnode->cb_expires_at) != AFS_NO_CB_PROMISE) {
-		iov_iter_folio_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0,
-				     i_size_read(&dvnode->netfs.inode));
+		iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0,
+				    i_size_read(&dvnode->netfs.inode));
 		ret = netfs_writeback_single(mapping, wbc, &iter);
 	}
 
diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c
index fd3aa9f97ce6..59d3decf7692 100644
--- a/fs/afs/dir_edit.c
+++ b/fs/afs/dir_edit.c
@@ -110,9 +110,8 @@ static void afs_clear_contig_bits(union afs_xdr_dir_block *block,
  */
 static union afs_xdr_dir_block *afs_dir_get_block(struct afs_dir_iter *iter, size_t block)
 {
-	struct folio_queue *fq;
 	struct afs_vnode *dvnode = iter->dvnode;
-	struct folio *folio;
+	struct bvecq *bq;
 	size_t blpos = block * AFS_DIR_BLOCK_SIZE;
 	size_t blend = (block + 1) * AFS_DIR_BLOCK_SIZE, fpos = iter->fpos;
 	int ret;
@@ -120,41 +119,38 @@ static union afs_xdr_dir_block *afs_dir_get_block(struct afs_dir_iter *iter, siz
 	if (dvnode->directory_size < blend) {
 		size_t cur_size = dvnode->directory_size;
 
-		ret = netfs_alloc_folioq_buffer(
-			NULL, &dvnode->directory, &cur_size, blend,
-			mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
+		ret = bvecq_expand_buffer(&dvnode->directory, &cur_size, blend,
+					  mapping_gfp_mask(dvnode->netfs.inode.i_mapping));
 		dvnode->directory_size = cur_size;
 		if (ret < 0)
 			goto fail;
 	}
 
-	fq = iter->fq;
-	if (!fq)
-		fq = dvnode->directory;
+	bq = iter->bq;
+	if (!bq)
+		bq = dvnode->directory;
 
-	/* Search the folio queue for the folio containing the block... */
-	for (; fq; fq = fq->next) {
-		for (int s = iter->fq_slot; s < folioq_count(fq); s++) {
-			size_t fsize = folioq_folio_size(fq, s);
+	/* Search the contents for the region containing the block... */
+	for (; bq; bq = bq->next) {
+		for (int s = iter->bq_slot; s < bq->nr_slots; s++) {
+			struct bio_vec *bv = &bq->bv[s];
+			size_t bsize = bv->bv_len;
 
-			if (blend <= fpos + fsize) {
+			if (blend <= fpos + bsize) {
 				/* ... and then return the mapped block. */
-				folio = folioq_folio(fq, s);
-				if (WARN_ON_ONCE(folio_pos(folio) != fpos))
-					goto fail;
-				iter->fq = fq;
-				iter->fq_slot = s;
+				iter->bq = bq;
+				iter->bq_slot = s;
 				iter->fpos = fpos;
-				return kmap_local_folio(folio, blpos - fpos);
+				return kmap_local_bvec(bv, blpos - fpos);
 			}
-			fpos += fsize;
+			fpos += bsize;
 		}
-		iter->fq_slot = 0;
+		iter->bq_slot = 0;
 	}
 
 fail:
-	iter->fq = NULL;
-	iter->fq_slot = 0;
+	iter->bq = NULL;
+	iter->bq_slot = 0;
 	afs_invalidate_dir(dvnode, afs_dir_invalid_edit_get_block);
 	return NULL;
 }
diff --git a/fs/afs/dir_search.c b/fs/afs/dir_search.c
index d2516e55b5ed..f1d2b49bc6f0 100644
--- a/fs/afs/dir_search.c
+++ b/fs/afs/dir_search.c
@@ -66,12 +66,11 @@ bool afs_dir_init_iter(struct afs_dir_iter *iter, const struct qstr *name)
  */
 union afs_xdr_dir_block *afs_dir_find_block(struct afs_dir_iter *iter, size_t block)
 {
-	struct folio_queue *fq = iter->fq;
 	struct afs_vnode *dvnode = iter->dvnode;
-	struct folio *folio;
+	struct bvecq *bq = iter->bq;
 	size_t blpos = block * AFS_DIR_BLOCK_SIZE;
 	size_t blend = (block + 1) * AFS_DIR_BLOCK_SIZE, fpos = iter->fpos;
-	int slot = iter->fq_slot;
+	int slot = iter->bq_slot;
 
 	_enter("%zx,%d", block, slot);
 
@@ -83,36 +82,34 @@ union afs_xdr_dir_block *afs_dir_find_block(struct afs_dir_iter *iter, size_t bl
 	if (dvnode->directory_size < blend)
 		goto fail;
 
-	if (!fq || blpos < fpos) {
-		fq = dvnode->directory;
+	if (!bq || blpos < fpos) {
+		bq = dvnode->directory;
 		slot = 0;
 		fpos = 0;
 	}
 
 	/* Search the folio queue for the folio containing the block... */
-	for (; fq; fq = fq->next) {
-		for (; slot < folioq_count(fq); slot++) {
-			size_t fsize = folioq_folio_size(fq, slot);
+	for (; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			struct bio_vec *bv = &bq->bv[slot];
+			size_t bsize = bv->bv_len;
 
-			if (blend <= fpos + fsize) {
+			if (blend <= fpos + bsize) {
 				/* ... and then return the mapped block. */
-				folio = folioq_folio(fq, slot);
-				if (WARN_ON_ONCE(folio_pos(folio) != fpos))
-					goto fail;
-				iter->fq = fq;
-				iter->fq_slot = slot;
+				iter->bq = bq;
+				iter->bq_slot = slot;
 				iter->fpos = fpos;
-				iter->block = kmap_local_folio(folio, blpos - fpos);
+				iter->block = kmap_local_bvec(bv, blpos - fpos);
 				return iter->block;
 			}
-			fpos += fsize;
+			fpos += bsize;
 		}
 		slot = 0;
 	}
 
 fail:
-	iter->fq = NULL;
-	iter->fq_slot = 0;
+	iter->bq = NULL;
+	iter->bq_slot = 0;
 	afs_invalidate_dir(dvnode, afs_dir_invalid_edit_get_block);
 	return NULL;
 }
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index dde1857fcabb..94e3442da849 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -31,12 +31,12 @@ void afs_init_new_symlink(struct afs_vnode *vnode, struct afs_operation *op)
 	size_t dsize = 0;
 	char *p;
 
-	if (netfs_alloc_folioq_buffer(NULL, &vnode->directory, &dsize, size,
-				      mapping_gfp_mask(vnode->netfs.inode.i_mapping)) < 0)
+	if (bvecq_expand_buffer(&vnode->directory, &dsize, size,
+				mapping_gfp_mask(vnode->netfs.inode.i_mapping)) < 0)
 		return;
 
 	vnode->directory_size = dsize;
-	p = kmap_local_folio(folioq_folio(vnode->directory, 0), 0);
+	p = kmap_local_bvec(&vnode->directory->bv[0], 0);
 	memcpy(p, op->create.symlink, size);
 	kunmap_local(p);
 	set_bit(AFS_VNODE_DIR_READ, &vnode->flags);
@@ -45,17 +45,17 @@ void afs_init_new_symlink(struct afs_vnode *vnode, struct afs_operation *op)
 
 static void afs_put_link(void *arg)
 {
-	struct folio *folio = virt_to_folio(arg);
+	struct page *page = virt_to_page(arg);
 
 	kunmap_local(arg);
-	folio_put(folio);
+	put_page(page);
 }
 
 const char *afs_get_link(struct dentry *dentry, struct inode *inode,
 			 struct delayed_call *callback)
 {
 	struct afs_vnode *vnode = AFS_FS_I(inode);
-	struct folio *folio;
+	struct page *page;
 	char *content;
 	ssize_t ret;
 
@@ -84,9 +84,9 @@ const char *afs_get_link(struct dentry *dentry, struct inode *inode,
 	set_bit(AFS_VNODE_DIR_READ, &vnode->flags);
 
 good:
-	folio = folioq_folio(vnode->directory, 0);
-	folio_get(folio);
-	content = kmap_local_folio(folio, 0);
+	page = vnode->directory->bv[0].bv_page;
+	get_page(page);
+	content = kmap_local_page(page);
 	set_delayed_call(callback, afs_put_link, content);
 	return content;
 }
@@ -761,7 +761,7 @@ void afs_evict_inode(struct inode *inode)
 
 	netfs_wait_for_outstanding_io(inode);
 	truncate_inode_pages_final(&inode->i_data);
-	netfs_free_folioq_buffer(vnode->directory);
+	bvecq_put(vnode->directory);
 
 	afs_set_cache_aux(vnode, &aux);
 	netfs_clear_inode_writeback(inode, &aux);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 009064b8d661..9bf5d2f1dbc4 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -710,7 +710,7 @@ struct afs_vnode {
 #define AFS_VNODE_MODIFYING	10	/* Set if we're performing a modification op */
 #define AFS_VNODE_DIR_READ	11	/* Set if we've read a dir's contents */
 
-	struct folio_queue *directory;	/* Directory contents */
+	struct bvecq	*directory;	/* Directory contents
*/ struct list_head wb_keys; /* List of keys available for writeback */ struct list_head pending_locks; /* locks waiting to be granted */ struct list_head granted_locks; /* locks granted on this file */ @@ -983,9 +983,9 @@ static inline void afs_invalidate_cache(struct afs_vnod= e *vnode, unsigned int fl struct afs_dir_iter { struct afs_vnode *dvnode; union afs_xdr_dir_block *block; - struct folio_queue *fq; + struct bvecq *bq; unsigned int fpos; - int fq_slot; + int bq_slot; unsigned int loop_check; u8 nr_slots; u8 bucket; diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c index 2de6b8621e11..9ca2c780f469 100644 --- a/fs/netfs/write_issue.c +++ b/fs/netfs/write_issue.c @@ -700,124 +700,11 @@ ssize_t netfs_end_writethrough(struct netfs_io_reque= st *wreq, struct writeback_c return ret; } =20 -/* - * Write some of a pending folio data back to the server and/or the cache. - */ -static int netfs_write_folio_single(struct netfs_io_request *wreq, - struct folio *folio) -{ - struct netfs_io_stream *upload =3D &wreq->io_streams[0]; - struct netfs_io_stream *cache =3D &wreq->io_streams[1]; - struct netfs_io_stream *stream; - size_t iter_off =3D 0; - size_t fsize =3D folio_size(folio), flen; - loff_t fpos =3D folio_pos(folio); - bool to_eof =3D false; - bool no_debug =3D false; - - _enter(""); - - flen =3D folio_size(folio); - if (flen > wreq->i_size - fpos) { - flen =3D wreq->i_size - fpos; - folio_zero_segment(folio, flen, fsize); - to_eof =3D true; - } else if (flen =3D=3D wreq->i_size - fpos) { - to_eof =3D true; - } - - _debug("folio %zx/%zx", flen, fsize); - - if (!upload->avail && !cache->avail) { - trace_netfs_folio(folio, netfs_folio_trace_cancel_store); - return 0; - } - - if (!upload->construct) - trace_netfs_folio(folio, netfs_folio_trace_store); - else - trace_netfs_folio(folio, netfs_folio_trace_store_plus); - - /* Attach the folio to the rolling buffer. 
*/ - folio_get(folio); - rolling_buffer_append(&wreq->buffer, folio, NETFS_ROLLBUF_PUT_MARK); - - /* Move the submission point forward to allow for write-streaming data - * not starting at the front of the page. We don't do write-streaming - * with the cache as the cache requires DIO alignment. - * - * Also skip uploading for data that's been read and just needs copying - * to the cache. - */ - for (int s =3D 0; s < NR_IO_STREAMS; s++) { - stream =3D &wreq->io_streams[s]; - stream->submit_off =3D 0; - stream->submit_len =3D flen; - if (!stream->avail) { - stream->submit_off =3D UINT_MAX; - stream->submit_len =3D 0; - } - } - - /* Attach the folio to one or more subrequests. For a big folio, we - * could end up with thousands of subrequests if the wsize is small - - * but we might need to wait during the creation of subrequests for - * network resources (eg. SMB credits). - */ - for (;;) { - ssize_t part; - size_t lowest_off =3D ULONG_MAX; - int choose_s =3D -1; - - /* Always add to the lowest-submitted stream first. */ - for (int s =3D 0; s < NR_IO_STREAMS; s++) { - stream =3D &wreq->io_streams[s]; - if (stream->submit_len > 0 && - stream->submit_off < lowest_off) { - lowest_off =3D stream->submit_off; - choose_s =3D s; - } - } - - if (choose_s < 0) - break; - stream =3D &wreq->io_streams[choose_s]; - - /* Advance the iterator(s). 
*/ - if (stream->submit_off > iter_off) { - rolling_buffer_advance(&wreq->buffer, stream->submit_off - iter_off); - iter_off =3D stream->submit_off; - } - - atomic64_set(&wreq->issued_to, fpos + stream->submit_off); - stream->submit_extendable_to =3D fsize - stream->submit_off; - part =3D netfs_advance_write(wreq, stream, fpos + stream->submit_off, - stream->submit_len, to_eof); - stream->submit_off +=3D part; - if (part > stream->submit_len) - stream->submit_len =3D 0; - else - stream->submit_len -=3D part; - if (part > 0) - no_debug =3D true; - } - - wreq->buffer.iter.iov_offset =3D 0; - if (fsize > iter_off) - rolling_buffer_advance(&wreq->buffer, fsize - iter_off); - atomic64_set(&wreq->issued_to, fpos + fsize); - - if (!no_debug) - kdebug("R=3D%x: No submit", wreq->debug_id); - _leave(" =3D 0"); - return 0; -} - /** * netfs_writeback_single - Write back a monolithic payload * @mapping: The mapping to write from * @wbc: Hints from the VM - * @iter: Data to write, must be ITER_FOLIOQ. + * @iter: Data to write. * * Write a monolithic, non-pagecache object back to the server and/or * the cache. 
@@ -828,13 +715,8 @@ int netfs_writeback_single(struct address_space *mappi= ng, { struct netfs_io_request *wreq; struct netfs_inode *ictx =3D netfs_inode(mapping->host); - struct folio_queue *fq; - size_t size =3D iov_iter_count(iter); int ret; =20 - if (WARN_ON_ONCE(!iov_iter_is_folioq(iter))) - return -EIO; - if (!mutex_trylock(&ictx->wb_lock)) { if (wbc->sync_mode =3D=3D WB_SYNC_NONE) { netfs_stat(&netfs_n_wb_lock_skip); @@ -850,6 +732,9 @@ int netfs_writeback_single(struct address_space *mappin= g, goto couldnt_start; } =20 + wreq->buffer.iter =3D *iter; + wreq->len =3D iov_iter_count(iter); + __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags); trace_netfs_write(wreq, netfs_write_trace_writeback_single); netfs_stat(&netfs_n_wh_writepages); @@ -857,31 +742,34 @@ int netfs_writeback_single(struct address_space *mapp= ing, if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags)) wreq->netfs_ops->begin_writeback(wreq); =20 - for (fq =3D (struct folio_queue *)iter->folioq; fq; fq =3D fq->next) { - for (int slot =3D 0; slot < folioq_count(fq); slot++) { - struct folio *folio =3D folioq_folio(fq, slot); - size_t part =3D umin(folioq_folio_size(fq, slot), size); + for (int s =3D 0; s < NR_IO_STREAMS; s++) { + struct netfs_io_subrequest *subreq; + struct netfs_io_stream *stream =3D &wreq->io_streams[s]; + + if (!stream->avail) + continue; =20 - _debug("wbiter %lx %llx", folio->index, atomic64_read(&wreq->issued_to)= ); + netfs_prepare_write(wreq, stream, 0); =20 - ret =3D netfs_write_folio_single(wreq, folio); - if (ret < 0) - goto stop; - size -=3D part; - if (size <=3D 0) - goto stop; - } + subreq =3D stream->construct; + subreq->len =3D wreq->len; + stream->submit_len =3D subreq->len; + stream->submit_extendable_to =3D round_up(wreq->len, PAGE_SIZE); + + netfs_issue_write(wreq, stream); } =20 -stop: - for (int s =3D 0; s < NR_IO_STREAMS; s++) - netfs_issue_write(wreq, &wreq->io_streams[s]); smp_wmb(); /* Write lists before ALL_QUEUED. 
 */
 	set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags);
 
 	mutex_unlock(&ictx->wb_lock);
 	netfs_wake_collector(wreq);
 
+	/* TODO: Might want to be async here if WB_SYNC_NONE, but then need to
+	 * wait before modifying.
+	 */
+	ret = netfs_wait_for_write(wreq);
+	netfs_put_request(wreq, netfs_rreq_trace_put_return);
 	_leave(" = %d", ret);
 	return ret;

From nobody Thu Apr  2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 16/26] cifs: Use a bvecq for buffering instead of a folioq
Date: Thu, 26 Mar 2026 10:45:31 +0000
Message-ID: <20260326104544.509518-17-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Use a bvecq for internal buffering for crypto purposes instead of a folioq
so that the latter can be phased out.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/smb/client/cifsglob.h |  2 +-
 fs/smb/client/smb2ops.c  | 70 +++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 6f9b6c72962b..8f3c16b57a1f 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -290,7 +290,7 @@ struct smb_rqst {
 	struct kvec	*rq_iov;	/* array of kvecs */
 	unsigned int	rq_nvec;	/* number of kvecs in array */
 	struct iov_iter	rq_iter;	/* Data iterator */
-	struct folio_queue *rq_buffer;	/* Buffer for encryption */
+	struct bvecq	*rq_buffer;	/* Buffer for encryption */
 };
 
 struct mid_q_entry;
diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c
index 7f2d3459cbf9..173acca17af7 100644
--- a/fs/smb/client/smb2ops.c
+++ b/fs/smb/client/smb2ops.c
@@ -4517,19 +4517,17 @@ crypt_message(struct TCP_Server_Info *server, int num_rqst,
 }
 
 /*
- * Copy data from an iterator to the folios in a folio queue buffer.
+ * Copy data from an iterator to the pages in a bvec queue buffer.
  */
-static bool cifs_copy_iter_to_folioq(struct iov_iter *iter, size_t size,
-				     struct folio_queue *buffer)
+static bool cifs_copy_iter_to_bvecq(struct iov_iter *iter, size_t size,
+				    struct bvecq *buffer)
 {
 	for (; buffer; buffer = buffer->next) {
-		for (int s = 0; s < folioq_count(buffer); s++) {
-			struct folio *folio = folioq_folio(buffer, s);
-			size_t part = folioq_folio_size(buffer, s);
+		for (int s = 0; s < buffer->nr_slots; s++) {
+			struct bio_vec *bv = &buffer->bv[s];
+			size_t part = umin(bv->bv_len, size);
 
-			part = umin(part, size);
-
-			if (copy_folio_from_iter(folio, 0, part, iter) != part)
+			if (copy_page_from_iter(bv->bv_page, 0, part, iter) != part)
 				return false;
 			size -= part;
 		}
@@ -4541,7 +4539,7 @@ void
 smb3_free_compound_rqst(int num_rqst, struct smb_rqst *rqst)
 {
 	for (int i = 0; i < num_rqst; i++)
-		netfs_free_folioq_buffer(rqst[i].rq_buffer);
+		bvecq_put(rqst[i].rq_buffer);
 }
 
 /*
@@ -4568,7 +4566,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst,
 	for (int i = 1; i < num_rqst; i++) {
 		struct smb_rqst *old = &old_rq[i - 1];
 		struct smb_rqst *new = &new_rq[i];
-		struct folio_queue *buffer = NULL;
+		struct bvecq *buffer = NULL;
 		size_t size = iov_iter_count(&old->rq_iter);
 
 		orig_len += smb_rqst_len(server, old);
@@ -4576,17 +4574,16 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst,
 		new->rq_nvec = old->rq_nvec;
 
 		if (size > 0) {
-			size_t cur_size = 0;
-			rc = netfs_alloc_folioq_buffer(NULL, &buffer, &cur_size,
-						       size, GFP_NOFS);
-			if (rc < 0)
+			rc = -ENOMEM;
+			buffer = bvecq_alloc_buffer(size, 0, GFP_NOFS);
+			if (!buffer)
 				goto err_free;
 
 			new->rq_buffer = buffer;
-			iov_iter_folio_queue(&new->rq_iter, ITER_SOURCE,
-					     buffer, 0, 0, size);
+			iov_iter_bvec_queue(&new->rq_iter, ITER_SOURCE,
+					    buffer, 0, 0, size);
 
-			if (!cifs_copy_iter_to_folioq(&old->rq_iter, size, buffer)) {
+			if (!cifs_copy_iter_to_bvecq(&old->rq_iter, size, buffer)) {
 				rc = smb_EIO1(smb_eio_trace_tx_copy_iter_to_buf, size);
 				goto err_free;
 			}
@@ -4676,16 +4673,15 @@ decrypt_raw_data(struct TCP_Server_Info *server, char *buf,
 }
 
 static int
-cifs_copy_folioq_to_iter(struct folio_queue *folioq, size_t data_size,
-			 size_t skip, struct iov_iter *iter)
+cifs_copy_bvecq_to_iter(struct bvecq *bq, size_t data_size,
+			size_t skip, struct iov_iter *iter)
 {
-	for (; folioq; folioq = folioq->next) {
-		for (int s = 0; s < folioq_count(folioq); s++) {
-			struct folio *folio = folioq_folio(folioq, s);
-			size_t fsize = folio_size(folio);
-			size_t n, len = umin(fsize - skip, data_size);
+	for (; bq; bq = bq->next) {
+		for (int s = 0; s < bq->nr_slots; s++) {
+			struct bio_vec *bv = &bq->bv[s];
+			size_t n, len = umin(bv->bv_len - skip, data_size);
 
-			n = copy_folio_to_iter(folio, skip, len, iter);
+			n = copy_page_to_iter(bv->bv_page, bv->bv_offset + skip, len, iter);
 			if (n != len) {
 				cifs_dbg(VFS, "%s: something went wrong\n", __func__);
 				return smb_EIO2(smb_eio_trace_rx_copy_to_iter,
@@ -4701,7 +4697,7 @@ cifs_copy_folioq_to_iter(struct folio_queue *folioq, size_t data_size,
 
 static int
 handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
-		 char *buf, unsigned int buf_len, struct folio_queue *buffer,
+		 char *buf, unsigned int buf_len, struct bvecq *buffer,
 		 unsigned int buffer_len, bool is_offloaded)
 {
 	unsigned int data_offset;
@@ -4810,8 +4806,8 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 	}
 
 	/* Copy the data to the output I/O iterator. */
-	rdata->result = cifs_copy_folioq_to_iter(buffer, buffer_len,
-						 cur_off, &rdata->subreq.io_iter);
+	rdata->result = cifs_copy_bvecq_to_iter(buffer, buffer_len,
+						cur_off, &rdata->subreq.io_iter);
 	if (rdata->result != 0) {
 		if (is_offloaded)
 			mid->mid_state = MID_RESPONSE_MALFORMED;
@@ -4849,7 +4845,7 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 struct smb2_decrypt_work {
 	struct work_struct decrypt;
 	struct TCP_Server_Info *server;
-	struct folio_queue *buffer;
+	struct bvecq *buffer;
 	char *buf;
 	unsigned int len;
 };
@@ -4863,7 +4859,7 @@ static void smb2_decrypt_offload(struct work_struct *work)
 	struct mid_q_entry *mid;
 	struct iov_iter iter;
 
-	iov_iter_folio_queue(&iter, ITER_DEST, dw->buffer, 0, 0, dw->len);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dw->buffer, 0, 0, dw->len);
 	rc = decrypt_raw_data(dw->server, dw->buf, dw->server->vals->read_rsp_size,
 			      &iter, true);
 	if (rc) {
@@ -4912,7 +4908,7 @@ static void smb2_decrypt_offload(struct work_struct *work)
 	}
 
 free_pages:
-	netfs_free_folioq_buffer(dw->buffer);
+	bvecq_put(dw->buffer);
 	cifs_small_buf_release(dw->buf);
 	kfree(dw);
 }
@@ -4950,12 +4946,12 @@ receive_encrypted_read(struct TCP_Server_Info *server, struct mid_q_entry **mid,
 	dw->len = len;
 	len = round_up(dw->len, PAGE_SIZE);
 
-	size_t cur_size = 0;
-	rc = netfs_alloc_folioq_buffer(NULL, &dw->buffer, &cur_size, len, GFP_NOFS);
-	if (rc < 0)
+	rc = -ENOMEM;
+	dw->buffer = bvecq_alloc_buffer(len, 0, GFP_NOFS);
+	if (!dw->buffer)
 		goto discard_data;
 
-	iov_iter_folio_queue(&iter, ITER_DEST, dw->buffer, 0, 0, len);
+	iov_iter_bvec_queue(&iter, ITER_DEST, dw->buffer, 0, 0, len);
 
 	/* Read the data into the buffer and clear excess bufferage.
*/ rc =3D cifs_read_iter_from_socket(server, &iter, dw->len); @@ -5013,7 +5009,7 @@ receive_encrypted_read(struct TCP_Server_Info *server= , struct mid_q_entry **mid, } =20 free_pages: - netfs_free_folioq_buffer(dw->buffer); + bvecq_put(dw->buffer); free_dw: kfree(dw); return rc; From nobody Thu Apr 2 22:23:34 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D90213EFD2C for ; Thu, 26 Mar 2026 10:48:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522124; cv=none; b=YI1OVWsIcCbhVyf8DQ9SlLHYp2KkjCug+6RtuWVrH4x6Sy/9mtUZ+hlkcPhjl4hbStm04zy6qpXpTisYkVFQe91PFZcM9AmbIHAlLX8uXyn2SlBbRpeb2d5OqDkjmyanutXYXz2TqXv07w8nLf6VmgDioJPSCTZagyNvMu41iAE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522124; c=relaxed/simple; bh=ooJGB5pm2kKH1nS60havhbSkj8IMeaPspLWnYGj9zzg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=MsAQY1SLlWXRP3HEh9SA+VVSzcoS2XFQ6YZ7YlCLM8+FssjJPa/b2ew2Gd2Mq3wNkKdNfXxrXGNTDmlUrLqDyXzsBVNaLN3+ufnhgZsqmUQnQOlpKDGMEG79YaqPN06AwbO803rQNpQ7n66GiwfNVLe/JuQviVoevkJ1DBHsW/w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=is25/9mA; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com 
header.b="is25/9mA" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1774522121; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=eMP9bnV0yq18QtXiFKmTkxFw+bIJoptINTzT8/zA4D0=; b=is25/9mA8eVceG/S8f6xRz38TFtnXuchqdQM+SjoTK0jYXfOcZM+V9+uecUoF9owvXUGGd CewgkyB5nKh0J15HivnIsumVsuSSvcVy5ibYsausuG19QDXM3PuZPtHalRATDcy/+0kcoX zZ3wb+CE3BRszB/LU1eQasD/CrxmP2s= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-648-a-gXVbOYNM-z_sxtwygWsw-1; Thu, 26 Mar 2026 06:48:36 -0400 X-MC-Unique: a-gXVbOYNM-z_sxtwygWsw-1 X-Mimecast-MFC-AGG-ID: a-gXVbOYNM-z_sxtwygWsw_1774522114 Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id F17C6180049D; Thu, 26 Mar 2026 10:48:33 +0000 (UTC) Received: from warthog.procyon.org.com (unknown [10.44.33.121]) by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 218C219560B1; Thu, 26 Mar 2026 10:48:26 +0000 (UTC) From: David Howells To: Christian Brauner , Matthew Wilcox , Christoph Hellwig Cc: David Howells , Paulo Alcantara , Jens Axboe , Leon Romanovsky , Steve French , ChenXiaoSong , Marc Dionne , Eric Van Hensbergen , Dominique Martinet , Ilya Dryomov , Trond Myklebust , netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, 
ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara , Shyam Prasad N , Tom Talpey Subject: [PATCH 17/26] cifs: Support ITER_BVECQ in smb_extract_iter_to_rdma() Date: Thu, 26 Mar 2026 10:45:32 +0000 Message-ID: <20260326104544.509518-18-dhowells@redhat.com> In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com> References: <20260326104544.509518-1-dhowells@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 Content-Type: text/plain; charset="utf-8" Add support for ITER_BVECQ to smb_extract_iter_to_rdma(). Signed-off-by: David Howells cc: Paulo Alcantara cc: Matthew Wilcox cc: Christoph Hellwig cc: Steve French cc: Shyam Prasad N cc: Tom Talpey cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org --- fs/smb/client/smbdirect.c | 60 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/fs/smb/client/smbdirect.c b/fs/smb/client/smbdirect.c index c79304012b08..f8a6be83db98 100644 --- a/fs/smb/client/smbdirect.c +++ b/fs/smb/client/smbdirect.c @@ -3298,6 +3298,63 @@ static ssize_t smb_extract_folioq_to_rdma(struct iov= _iter *iter, return ret; } =20 +/* + * Extract memory fragments from a BVECQ-class iterator and add them to an= RDMA + * list. The folios are not pinned. 
+ */ +static ssize_t smb_extract_bvecq_to_rdma(struct iov_iter *iter, + struct smb_extract_to_rdma *rdma, + ssize_t maxsize) +{ + const struct bvecq *bq =3D iter->bvecq; + unsigned int slot =3D iter->bvecq_slot; + ssize_t ret =3D 0; + size_t offset =3D iter->iov_offset; + + if (slot >=3D bq->nr_slots) { + bq =3D bq->next; + if (WARN_ON_ONCE(!bq)) + return -EIO; + slot =3D 0; + } + + do { + struct bio_vec *bv =3D &bq->bv[slot]; + struct page *page =3D bv->bv_page; + size_t bsize =3D bv->bv_len; + + if (offset < bsize) { + size_t part =3D umin(maxsize, bsize - offset); + + if (!smb_set_sge(rdma, page, bv->bv_offset + offset, part)) + return -EIO; + + offset +=3D part; + ret +=3D part; + maxsize -=3D part; + } + + if (offset >=3D bsize) { + offset =3D 0; + slot++; + if (slot >=3D bq->nr_slots) { + if (!bq->next) { + WARN_ON_ONCE(ret < iter->count); + break; + } + bq =3D bq->next; + slot =3D 0; + } + } + } while (rdma->nr_sge < rdma->max_sge && maxsize > 0); + + iter->bvecq =3D bq; + iter->bvecq_slot =3D slot; + iter->iov_offset =3D offset; + iter->count -=3D ret; + return ret; +} + /* * Extract page fragments from up to the given amount of the source iterat= or * and build up an RDMA list that refers to all of those bits. 
The RDMA l= ist @@ -3325,6 +3382,9 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_it= er *iter, size_t len, case ITER_FOLIOQ: ret =3D smb_extract_folioq_to_rdma(iter, rdma, len); break; + case ITER_BVECQ: + ret =3D smb_extract_bvecq_to_rdma(iter, rdma, len); + break; default: WARN_ON_ONCE(1); return -EIO; From nobody Thu Apr 2 22:23:34 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0169D3E6DEE for ; Thu, 26 Mar 2026 10:48:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522136; cv=none; b=uHPwaxOqflirgpVd+wpTau5l9ne0MyBk37cyErOjg+OV1xIOkEKWujNmHGpCuhrrjl1WdOLx8ruPgyxTJK3C60cmsedgyWFWnm9qsy235bwiazd1A2XlXGx/HHtmarCz/3Fd8eYfCxvGzE4HOe9tHHwoZXCyURoXpmr7Pl64R7Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522136; c=relaxed/simple; bh=R8k+XNJfVpoadfcXtRALqlxAsDWkp6UZXOTlnSR4eiE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=JbsB4APYdY+gjMbFG3ANCgqJd7WHR3PAWi2VejNFRlc4XxTJL5Zj4ACeSzmonjYddc6ffD2Lvd6Vb0RVL8EnRsl/QO5hoz0xgOYL7GdT0rS8coNaa2+SoDDalku5282YeJ58uGs4VmVDd1Q065FdeHnH81ykVvHBVg5mAk8FFqA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=dCT+oA27; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com 
header.i=@redhat.com header.b="dCT+oA27" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1774522132; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DHzNGgNEUEYVKOYwuUioBIOm0YrjI0aYeod/gNOtkt4=; b=dCT+oA27hUQ1fAbZcU3/ZlOu03YV7FATBJ+MMTQtgXordeVjQkwx5xVI74jcT/uymOnCFR 1XIxJWw7ZzzNoSBwpoKXtypoTINbZMpFXsK7Ev9UWWZKhsSqIxpnFiO8txwHu4CuYLlvMD mQnK3QRWTIiHnF5Lig2pbvePZI5AoLs= Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-696-656NMLk4PUe8PBu6pePmmw-1; Thu, 26 Mar 2026 06:48:47 -0400 X-MC-Unique: 656NMLk4PUe8PBu6pePmmw-1 X-Mimecast-MFC-AGG-ID: 656NMLk4PUe8PBu6pePmmw_1774522124 Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 96E9A19560B5; Thu, 26 Mar 2026 10:48:44 +0000 (UTC) Received: from warthog.procyon.org.com (unknown [10.44.33.121]) by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id B14131955F25; Thu, 26 Mar 2026 10:48:35 +0000 (UTC) From: David Howells To: Christian Brauner , Matthew Wilcox , Christoph Hellwig Cc: David Howells , Paulo Alcantara , Jens Axboe , Leon Romanovsky , Steve French , ChenXiaoSong , Marc Dionne , Eric Van Hensbergen , Dominique Martinet , Ilya Dryomov , Trond Myklebust , netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, 
    linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara, Shyam Prasad N, Tom Talpey
Subject: [PATCH 18/26] netfs: Switch to using bvecq rather than folio_queue and rolling_buffer
Date: Thu, 26 Mar 2026 10:45:33 +0000
Message-ID: <20260326104544.509518-19-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Switch netfslib to using bvecq, a segmented bio_vec[] queue, instead of the
folio_queue and rolling_buffer constructs, to keep track of the regions of
memory it is performing I/O upon.  Each bvecq struct in the chain is marked
with the starting file position of that sequence so that discontiguities can
be handled (the contents of each individual bvecq must be contiguous).

For unbuffered/direct I/O, the iterator is extracted into the queue up front.
For buffered I/O, the folios are added to the queue as the operation
proceeds, much as is done now with folio_queues.  There is one important
change for buffered writes: only the relevant part of the folio is included;
this is expanded for writes to the cache in a copy of the bvecq segment (it
is known that each bio_vec corresponds to part of a folio in this case).

The bvecq structs are marked with information as to how the regions contained
therein should be disposed of (unlock-only, free, unpin).

When setting up a subrequest, netfslib will furnish it with a slice of the
main buffer queue as a pointer to the starting bvecq, slot and offset and,
for the moment, an ITER_BVECQ iterator is set to cover the slice in
subreq->io_iter.
Notes on the implementation:

 (1) This patch uses the concept of a 'bvecq position', which is a tuple of
     { bvecq, slot, offset }.  This is lighter weight than using a full
     iov_iter, though that would also suffice.  If not NULL, the position
     also holds a reference on the bvecq it is pointing to.  This is
     probably overkill as only the hindmost position (that of collection)
     needs to hold a reference.

 (2) There are three positions on the netfs_io_request struct; not all are
     used by every request type.  Firstly, there's ->load_cursor, which is
     used by buffered read and write to point to the next slot to have a
     folio inserted into it (either loaded from the readahead_control or
     from writeback_iter()).  Secondly, there's ->dispatch_cursor, which
     provides the position in the buffer from which we start dispatching a
     subrequest.  Thirdly, there's ->collect_cursor, which is used by the
     collection routines to point to the next memory region to be cleaned
     up.

 (3) There are two positions on the netfs_io_subrequest struct.  Firstly,
     there's ->dispatch_pos, which indicates the position at which a
     subrequest's buffer begins.  This is used as the base of the position
     from which to retry (advanced by ->transferred).  Secondly, there's
     ->content, which is normally the same as ->dispatch_pos; but if the
     bvecq chain got duplicated or the content got copied, then this will
     point to the copy, which will be disposed of on retry.

 (4) Maintenance of the position structs is done with helper functions,
     such as bvecq_pos_attach(), to hide the refcounting.

 (5) When sending a write to the cache, the bvecq will be duplicated and
     the ends rounded up/down to the backing file's DIO block alignment.

 (6) bvecq_slice() is used to select a slice of the source buffer and
     assign it to a subrequest.  The source buffer position is advanced.
 (7) netfs_extract_iter() is used by the unbuffered/direct I/O API
     functions to decant a chunk of the iov_iter supplied by the VFS into a
     bvecq chain - and to label the bvecqs with appropriate disposal
     information (e.g. unpin, free, nothing).

There are further options that can be explored in the future:

 (1) Allow the provision of a duplicated bvecq chain for just that region
     so that the filesystem can add bits on either end (such as adding
     protocol headers and trailers and gluing several things together into
     a compound operation).

 (2) If a filesystem supports vectored/sparse read and write ops, it can be
     given a chain with discontiguities in it to perform in a single op
     (Ceph, for example, can do this).

 (3) Because each bvecq notes the start file position of the regions
     contained therein, there's no need to translate the info in the
     bio_vec into folio pointers in order to unlock the page after I/O.
     Instead, the inode's pagecache can be iterated over and the xarray
     marks cleared en masse.

 (4) Make MSG_SPLICE_PAGES handling read the disposal info in the bvecq and
     use that to indicate how it should get rid of the stuff it pasted into
     an sk_buff.

 (5) If a bounce buffer is needed (encryption, for example), the bounce
     buffer can be held in a bvecq and sliced up instead of the main buffer
     queue.

 (6) Get rid of subreq->io_iter and move the iov_iter stuff down into the
     filesystem.  The I/O iterators are normally only needed transitorily,
     and the one currently in netfs_io_subrequest is unnecessary most of
     the time.

folio_queue and rolling_buffer will be removed in a follow-up patch.
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: Shyam Prasad N
cc: Tom Talpey
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/cachefiles/io.c           |  12 ---
 fs/netfs/Makefile            |   1 -
 fs/netfs/buffered_read.c     | 115 ++++++++++++----------
 fs/netfs/direct_read.c       |  73 +++++---------
 fs/netfs/direct_write.c      |  86 +++++++++--------
 fs/netfs/internal.h          |  10 +-
 fs/netfs/iterator.c          |   2 +
 fs/netfs/misc.c              |  20 +---
 fs/netfs/objects.c           |  16 +---
 fs/netfs/read_collect.c      |  83 +++++++++-------
 fs/netfs/read_pgpriv2.c      |  68 +++++++++----
 fs/netfs/read_retry.c        |  80 +++++++++-------
 fs/netfs/read_single.c       |  12 ++-
 fs/netfs/stats.c             |   4 +-
 fs/netfs/write_collect.c     |  40 ++++----
 fs/netfs/write_issue.c       | 180 ++++++++++++++++++++++++++---------
 fs/netfs/write_retry.c       |  45 +++++----
 include/linux/netfs.h        |  24 ++---
 include/trace/events/netfs.h |  46 ++++-----
 19 files changed, 520 insertions(+), 397 deletions(-)

diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index b5ff75697b3e..2af55a75b511 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -659,7 +659,6 @@ static void cachefiles_issue_write(struct netfs_io_subrequest *subreq)
 	struct netfs_cache_resources *cres = &wreq->cache_resources;
 	struct cachefiles_object *object = cachefiles_cres_object(cres);
 	struct cachefiles_cache *cache = object->volume->cache;
-	struct netfs_io_stream *stream = &wreq->io_streams[subreq->stream_nr];
 	const struct cred *saved_cred;
 	size_t off, pre, post, len = subreq->len;
 	loff_t start = subreq->start;
@@ -684,17 +683,6 @@ static void cachefiles_issue_write(struct netfs_io_subrequest *subreq)
 	}
 
 	/* We also need to end on the cache granularity boundary */
-	if (start + len == wreq->i_size) {
-		size_t part = len & (cache->bsize - 1);
-		size_t need = cache->bsize - part;
-
-		if (part && stream->submit_extendable_to >= need) {
-			len += need;
-			subreq->len += need;
-			subreq->io_iter.count += need;
-		}
-	}
-
 	post = len & (cache->bsize - 1);
 	if (post) {
 		len -= post;
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index e1f12ecb5abf..0621e6870cbd 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -15,7 +15,6 @@ netfs-y := \
 	read_pgpriv2.o \
 	read_retry.o \
 	read_single.o \
-	rolling_buffer.o \
 	write_collect.o \
 	write_issue.o \
 	write_retry.o
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index abdc990faaa2..2cfd33abfb80 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -112,26 +112,21 @@ static int netfs_begin_cache_read(struct netfs_io_request *rreq, struct netfs_in
 static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq = subreq->rreq;
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
+	ssize_t extracted;
 	size_t rsize = subreq->len;
 
 	if (subreq->source == NETFS_DOWNLOAD_FROM_SERVER)
-		rsize = umin(rsize, rreq->io_streams[0].sreq_max_len);
-
-	subreq->len = rsize;
-	if (unlikely(rreq->io_streams[0].sreq_max_segs)) {
-		size_t limit = netfs_limit_iter(&rreq->buffer.iter, 0, rsize,
-						rreq->io_streams[0].sreq_max_segs);
-
-		if (limit < rsize) {
-			subreq->len = limit;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
-		}
+		rsize = umin(rsize, stream->sreq_max_len);
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+	extracted = bvecq_slice(&rreq->dispatch_cursor, subreq->len,
+				stream->sreq_max_segs, &subreq->nr_segs);
+	if (extracted < rsize) {
+		subreq->len = extracted;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
 	}
 
-	subreq->io_iter = rreq->buffer.iter;
-
-	iov_iter_truncate(&subreq->io_iter, subreq->len);
-	rolling_buffer_advance(&rreq->buffer, subreq->len);
 	return subreq->len;
 }
 
@@ -195,6 +190,10 @@ static void netfs_queue_read(struct netfs_io_request *rreq,
 static void netfs_issue_read(struct netfs_io_request *rreq,
			     struct netfs_io_subrequest *subreq)
 {
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
 	switch (subreq->source) {
 	case NETFS_DOWNLOAD_FROM_SERVER:
 		rreq->netfs_ops->issue_read(subreq);
@@ -203,7 +202,8 @@ static void netfs_issue_read(struct netfs_io_request *rreq,
 		netfs_read_cache_to_pagecache(rreq, subreq);
 		break;
 	default:
-		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
+		bvecq_zero(&rreq->dispatch_cursor, subreq->len);
+		subreq->transferred = subreq->len;
 		subreq->error = 0;
 		iov_iter_zero(subreq->len, &subreq->io_iter);
 		subreq->transferred = subreq->len;
@@ -233,6 +233,11 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 	ssize_t size = rreq->len;
 	int ret = 0;
 
+	_enter("R=%08x", rreq->debug_id);
+
+	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_set(&rreq->collect_cursor, &rreq->dispatch_cursor);
+
 	do {
 		int (*prepare_read)(struct netfs_io_subrequest *subreq) = NULL;
 		struct netfs_io_subrequest *subreq;
@@ -381,6 +386,9 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 
 	/* Defer error return as we may need to wait for outstanding I/O. */
 	cmpxchg(&rreq->error, 0, ret);
+
+	bvecq_pos_unset(&rreq->load_cursor);
+	bvecq_pos_unset(&rreq->dispatch_cursor);
 }
 
 /**
@@ -428,7 +436,7 @@ void netfs_readahead(struct readahead_control *ractl)
	 * acquires a ref on each folio that we will need to release later -
	 * but we don't want to do that until after we've started the I/O.
	 */
-	added = rolling_buffer_bulk_load_from_ra(&rreq->buffer, ractl, rreq->debug_id);
+	added = bvecq_load_from_ra(&rreq->load_cursor, ractl);
 	if (added < 0) {
 		ret = added;
 		goto cleanup_free;
@@ -437,7 +445,7 @@ void netfs_readahead(struct readahead_control *ractl)
 
 	rreq->submitted = rreq->start + added;
 	rreq->cleaned_to = rreq->start;
-	rreq->front_folio_order = folio_order(rreq->buffer.tail->vec.folios[0]);
+	rreq->front_folio_order = get_order(rreq->load_cursor.bvecq->bv[0].bv_len);
 
 	netfs_read_to_pagecache(rreq);
 	netfs_maybe_bulk_drop_ra_refs(rreq);
@@ -449,20 +457,19 @@ void netfs_readahead(struct readahead_control *ractl)
 EXPORT_SYMBOL(netfs_readahead);
 
 /*
- * Create a rolling buffer with a single occupying folio.
+ * Create a buffer queue with a single occupying folio.
 */
-static int netfs_create_singular_buffer(struct netfs_io_request *rreq, struct folio *folio,
-					unsigned int rollbuf_flags)
+static int netfs_create_singular_buffer(struct netfs_io_request *rreq, struct folio *folio)
 {
-	ssize_t added;
+	struct bvecq *bq;
+	size_t fsize = folio_size(folio);
 
-	if (rolling_buffer_init(&rreq->buffer, rreq->debug_id, ITER_DEST) < 0)
+	if (bvecq_buffer_init(&rreq->load_cursor, GFP_KERNEL) < 0)
		return -ENOMEM;
 
-	added = rolling_buffer_append(&rreq->buffer, folio, rollbuf_flags);
-	if (added < 0)
-		return added;
-	rreq->submitted = rreq->start + added;
+	bq = rreq->load_cursor.bvecq;
+	bvec_set_folio(&bq->bv[bq->nr_slots++], folio, fsize, 0);
+	rreq->submitted = rreq->start + fsize;
	return 0;
 }
 
@@ -475,11 +482,11 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
 	struct address_space *mapping = folio->mapping;
 	struct netfs_folio *finfo = netfs_folio_info(folio);
 	struct netfs_inode *ctx = netfs_inode(mapping->host);
-	struct folio *sink = NULL;
-	struct bio_vec *bvec;
+	struct bvecq *bq = NULL;
+	struct page *sink = NULL;
 	unsigned int from = finfo->dirty_offset;
 	unsigned int to = from + finfo->dirty_len;
-	unsigned int off = 0, i = 0;
+	unsigned int off = 0;
 	size_t flen = folio_size(folio);
 	size_t nr_bvec = flen / PAGE_SIZE + 2;
 	size_t part;
@@ -504,38 +511,45 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
	 * end get copied to, but the middle is discarded.
	 */
	ret = -ENOMEM;
-	bvec = kmalloc_objs(*bvec, nr_bvec);
-	if (!bvec)
+	bq = bvecq_alloc_one(nr_bvec, GFP_KERNEL);
+	if (!bq)
		goto discard;
+	rreq->load_cursor.bvecq = bq;
 
-	sink = folio_alloc(GFP_KERNEL, 0);
-	if (!sink) {
-		kfree(bvec);
+	sink = alloc_page(GFP_KERNEL);
+	if (!sink)
		goto discard;
-	}
 
	trace_netfs_folio(folio, netfs_folio_trace_read_gaps);
 
-	rreq->direct_bv = bvec;
-	rreq->direct_bv_count = nr_bvec;
+	for (struct bvecq *p = bq; p; p = p->next)
+		p->free = true;
+
	if (from > 0) {
-		bvec_set_folio(&bvec[i++], folio, from, 0);
+		folio_get(folio);
+		bvec_set_folio(&bq->bv[bq->nr_slots++], folio, from, 0);
		off = from;
	}
	while (off < to) {
-		part = min_t(size_t, to - off, PAGE_SIZE);
-		bvec_set_folio(&bvec[i++], sink, part, 0);
+		if (bvecq_is_full(bq))
+			bq = bq->next;
+		part = umin(to - off, PAGE_SIZE);
+		get_page(sink);
+		bvec_set_page(&bq->bv[bq->nr_slots++], sink, part, 0);
		off += part;
	}
-	if (to < flen)
-		bvec_set_folio(&bvec[i++], folio, flen - to, to);
-	iov_iter_bvec(&rreq->buffer.iter, ITER_DEST, bvec, i, rreq->len);
+	if (to < flen) {
+		if (bvecq_is_full(bq))
+			bq = bq->next;
+		folio_get(folio);
+		bvec_set_folio(&bq->bv[bq->nr_slots++], folio, flen - to, to);
+	}
+
	rreq->submitted = rreq->start + flen;
 
	netfs_read_to_pagecache(rreq);
 
-	if (sink)
-		folio_put(sink);
+	put_page(sink);
 
	ret = netfs_wait_for_read(rreq);
	if (ret >= 0) {
@@ -547,6 +561,8 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
	return ret < 0 ? ret : 0;
 
 discard:
+	if (sink)
+		put_page(sink);
	netfs_put_failed_request(rreq);
 alloc_error:
	folio_unlock(folio);
@@ -597,7 +613,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)
	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage);
 
	/* Set up the output buffer */
-	ret = netfs_create_singular_buffer(rreq, folio, 0);
+	ret = netfs_create_singular_buffer(rreq, folio);
	if (ret < 0)
		goto discard;
 
@@ -754,7 +770,7 @@ int netfs_write_begin(struct netfs_inode *ctx,
	trace_netfs_read(rreq, pos, len, netfs_read_trace_write_begin);
 
	/* Set up the output buffer */
-	ret = netfs_create_singular_buffer(rreq, folio, 0);
+	ret = netfs_create_singular_buffer(rreq, folio);
	if (ret < 0)
		goto error_put;
 
@@ -819,9 +835,10 @@ int netfs_prefetch_for_write(struct file *file, struct folio *folio,
	trace_netfs_read(rreq, start, flen, netfs_read_trace_prefetch_for_write);
 
	/* Set up the output buffer */
-	ret = netfs_create_singular_buffer(rreq, folio, NETFS_ROLLBUF_PAGECACHE_MARK);
+	ret = netfs_create_singular_buffer(rreq, folio);
	if (ret < 0)
		goto error_put;
+	rreq->load_cursor.bvecq->free = true;
 
	netfs_read_to_pagecache(rreq);
	ret = netfs_wait_for_read(rreq);
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index f72e6da88cca..05d09ba3d0d0 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -16,31 +16,6 @@
 #include
 #include "internal.h"
 
-static void netfs_prepare_dio_read_iterator(struct netfs_io_subrequest *subreq)
-{
-	struct netfs_io_request *rreq = subreq->rreq;
-	size_t rsize;
-
-	rsize = umin(subreq->len, rreq->io_streams[0].sreq_max_len);
-	subreq->len = rsize;
-
-	if (unlikely(rreq->io_streams[0].sreq_max_segs)) {
-		size_t limit = netfs_limit_iter(&rreq->buffer.iter, 0, rsize,
-						rreq->io_streams[0].sreq_max_segs);
-
-		if (limit < rsize) {
-			subreq->len = limit;
-			trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
-		}
-	}
-
-	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-
-	subreq->io_iter = rreq->buffer.iter;
-	iov_iter_truncate(&subreq->io_iter, subreq->len);
-	iov_iter_advance(&rreq->buffer.iter, subreq->len);
-}
-
 /*
 * Perform a read to a buffer from the server, slicing up the region to be read
 * according to the network rsize.
 */
@@ -52,9 +27,10 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 	ssize_t size = rreq->len;
 	int ret = 0;
 
+	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+
 	do {
 		struct netfs_io_subrequest *subreq;
-		ssize_t slice;
 
 		subreq = netfs_alloc_subrequest(rreq);
 		if (!subreq) {
@@ -89,16 +65,24 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 			}
 		}
 
-		netfs_prepare_dio_read_iterator(subreq);
-		slice = subreq->len;
-		size -= slice;
-		start += slice;
-		rreq->submitted += slice;
+		bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+		bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
+		subreq->len = bvecq_slice(&rreq->dispatch_cursor,
+					  umin(size, stream->sreq_max_len),
+					  stream->sreq_max_segs,
+					  &subreq->nr_segs);
+
+		size -= subreq->len;
+		start += subreq->len;
+		rreq->submitted += subreq->len;
 		if (size <= 0) {
 			smp_wmb(); /* Write lists before ALL_QUEUED. */
 			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		}
 
+		iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+				    subreq->content.slot, subreq->content.offset, subreq->len);
+
 		rreq->netfs_ops->issue_read(subreq);
 
 		if (test_bit(NETFS_RREQ_PAUSE, &rreq->flags))
@@ -114,6 +98,7 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 		netfs_wake_collector(rreq);
 	}
 
+	bvecq_pos_unset(&rreq->dispatch_cursor);
 	return ret;
 }
 
@@ -197,25 +182,15 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
	 * buffer for ourselves as the caller's iterator will be trashed when
	 * we return.
	 *
-	 * In such a case, extract an iterator to represent as much of the the
-	 * output buffer as we can manage.  Note that the extraction might not
-	 * be able to allocate a sufficiently large bvec array and may shorten
-	 * the request.
+	 * Extract a buffer queue to represent as much of the output buffer as
+	 * we can manage.  The fragments are extracted into a bvecq which will
+	 * have sufficient nodes allocated to hold all the data, though this
+	 * may end up truncated if ENOMEM is encountered.
	 */
-	if (user_backed_iter(iter)) {
-		ret = netfs_extract_user_iter(iter, rreq->len, &rreq->buffer.iter, 0);
-		if (ret < 0)
-			goto error_put;
-		rreq->direct_bv = (struct bio_vec *)rreq->buffer.iter.bvec;
-		rreq->direct_bv_count = ret;
-		rreq->direct_bv_unpin = iov_iter_extract_will_pin(iter);
-		rreq->len = iov_iter_count(&rreq->buffer.iter);
-	} else {
-		rreq->buffer.iter = *iter;
-		rreq->len = orig_count;
-		rreq->direct_bv_unpin = false;
-		iov_iter_advance(iter, orig_count);
-	}
+	ret = netfs_extract_iter(iter, rreq->len, INT_MAX, iocb->ki_pos,
+				 &rreq->load_cursor.bvecq, 0);
+	if (ret < 0)
+		goto error_put;
 
	// TODO: Set up bounce buffer if needed
 
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index f9ab69de3e29..a61c6d6fd17f 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -73,7 +73,11 @@ static void netfs_unbuffered_write_collect(struct netfs_io_request *wreq,
 	spin_unlock(&wreq->lock);
 
 	wreq->transferred += subreq->transferred;
-	iov_iter_advance(&wreq->buffer.iter, subreq->transferred);
+	if (subreq->transferred < subreq->len) {
+		bvecq_pos_unset(&wreq->dispatch_cursor);
+		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+	}
 
 	stream->collected_to = subreq->start + subreq->transferred;
 	wreq->collected_to = stream->collected_to;
@@ -99,6 +103,9 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 	_enter("%llx", wreq->len);
 
+	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
 	if (wreq->origin == NETFS_DIO_WRITE)
 		inode_dio_begin(wreq->inode);
 
@@ -111,6 +118,8 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 		netfs_prepare_write(wreq, stream, wreq->start + wreq->transferred);
 		subreq = stream->construct;
 		stream->construct = NULL;
+	} else {
+		bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
 	}
 
 	/* Check if (re-)preparation failed. */
@@ -120,16 +129,18 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 		break;
 	}
 
-	iov_iter_truncate(&subreq->io_iter, wreq->len - wreq->transferred);
+	subreq->len = bvecq_slice(&wreq->dispatch_cursor, stream->sreq_max_len,
+				  stream->sreq_max_segs, &subreq->nr_segs);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+	iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+			    subreq->content.bvecq, subreq->content.slot,
+			    subreq->content.offset,
+			    subreq->len);
+
 	if (!iov_iter_count(&subreq->io_iter))
 		break;
 
-	subreq->len = netfs_limit_iter(&subreq->io_iter, 0,
-				       stream->sreq_max_len,
-				       stream->sreq_max_segs);
-	iov_iter_truncate(&subreq->io_iter, subreq->len);
-	stream->submit_extendable_to = subreq->len;
-
 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 	stream->issue_write(subreq);
 
@@ -166,8 +177,15 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
	 */
	subreq->error = -EAGAIN;
	trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-	if (subreq->transferred > 0)
-		iov_iter_advance(&wreq->buffer.iter, subreq->transferred);
+
+	bvecq_pos_unset(&subreq->content);
+	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+
+	if (subreq->transferred > 0) {
+		wreq->transferred += subreq->transferred;
+		bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+	}
 
	if (stream->source == NETFS_UPLOAD_TO_SERVER &&
	    wreq->netfs_ops->retry_request)
@@ -176,7 +194,6 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
	__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
	__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
	__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
-	subreq->io_iter = wreq->buffer.iter;
	subreq->start = wreq->start + wreq->transferred;
	subreq->len = wreq->len - wreq->transferred;
	subreq->transferred = 0;
@@ -186,19 +203,14 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 
	netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
 
-	if (stream->prepare_write) {
+	if (stream->prepare_write)
		stream->prepare_write(subreq);
-		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-		netfs_stat(&netfs_n_wh_retry_write_subreq);
-	} else {
-		struct iov_iter source;
-
-		netfs_reset_iter(subreq);
-		source = subreq->io_iter;
-		netfs_reissue_write(stream, subreq, &source);
-	}
+	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+	netfs_stat(&netfs_n_wh_retry_write_subreq);
 	}
 
+	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&wreq->load_cursor);
 	netfs_unbuffered_write_done(wreq);
	_leave(" = %d", ret);
	return ret;
@@ -217,12 +229,12 @@ static void netfs_unbuffered_write_async(struct work_struct *work)
 * encrypted file.  This can also be used for direct I/O writes.
 */
 ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
-					struct netfs_group *netfs_group)
+					   struct netfs_group *netfs_group)
 {
	struct netfs_io_request *wreq;
	unsigned long long start = iocb->ki_pos;
	unsigned long long end = start + iov_iter_count(iter);
-	ssize_t ret, n;
+	ssize_t ret;
	size_t len = iov_iter_count(iter);
	bool async = !is_sync_kiocb(iocb);
 
@@ -256,25 +268,17 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *
	 * allocate a sufficiently large bvec array and may shorten the
	 * request.
	 */
-	if (user_backed_iter(iter)) {
-		n = netfs_extract_user_iter(iter, len, &wreq->buffer.iter, 0);
-		if (n < 0) {
-			ret = n;
-			goto error_put;
-		}
-		wreq->direct_bv = (struct bio_vec *)wreq->buffer.iter.bvec;
-		wreq->direct_bv_count = n;
-		wreq->direct_bv_unpin = iov_iter_extract_will_pin(iter);
-	} else {
-		/* If this is a kernel-generated async DIO request,
-		 * assume that any resources the iterator points to
-		 * (eg. a bio_vec array) will persist till the end of
-		 * the op.
-		 */
-		wreq->buffer.iter = *iter;
-	}
+	ssize_t n = netfs_extract_iter(iter, len, INT_MAX, iocb->ki_pos,
+				       &wreq->load_cursor.bvecq, 0);
 
-	wreq->len = iov_iter_count(&wreq->buffer.iter);
+	if (n < 0) {
+		ret = n;
+		goto error_put;
+	}
+	wreq->len = n;
+	_debug("dio-write %zx/%zx %u/%u",
+	       n, len, wreq->load_cursor.bvecq->nr_slots,
+	       wreq->load_cursor.bvecq->max_slots);
 	}
 
	__set_bit(NETFS_RREQ_USE_IO_ITER, &wreq->flags);
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index ad47bcc1947b..ddae82f94ce0 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -7,7 +7,6 @@
 
 #include
 #include
-#include
 #include
 #include
 #include
@@ -67,9 +66,8 @@ static inline void netfs_proc_del_rreq(struct netfs_io_request *rreq) {}
 /*
 * misc.c
 */
-struct folio_queue *netfs_buffer_make_space(struct netfs_io_request *rreq,
-					    enum netfs_folioq_trace trace);
-void netfs_reset_iter(struct netfs_io_subrequest *subreq);
+struct bvecq *netfs_buffer_make_space(struct netfs_io_request *rreq,
+				      enum netfs_bvecq_trace trace);
 void netfs_wake_collector(struct netfs_io_request *rreq);
 void netfs_subreq_clear_in_progress(struct netfs_io_subrequest *subreq);
 void netfs_wait_for_in_progress_stream(struct netfs_io_request *rreq,
@@ -167,7 +165,6 @@ extern atomic_t netfs_n_wh_retry_write_req;
 extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
-extern atomic_t netfs_n_folioq;
 extern atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v);
@@ -205,8 +202,7 @@ void netfs_prepare_write(struct netfs_io_request *wreq,
			 struct netfs_io_stream *stream, loff_t start);
 void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq,
-			 struct iov_iter *source);
+			 struct netfs_io_subrequest *subreq);
 void netfs_issue_write(struct netfs_io_request *wreq,
		       struct netfs_io_stream *stream);
 size_t netfs_advance_write(struct netfs_io_request *wreq,
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index e77fd39327c2..581dbf650a19 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -136,6 +136,7 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_se
 }
 EXPORT_SYMBOL_GPL(netfs_extract_iter);
 
+#if 0
 /**
 * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
 * @orig: The original iterator
@@ -421,3 +422,4 @@ size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
	BUG();
 }
 EXPORT_SYMBOL(netfs_limit_iter);
+#endif
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index 6df89c92b10b..ab142cbaad35 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -8,6 +8,7 @@
 #include
 #include "internal.h"
 
+#if 0
 /**
 * netfs_alloc_folioq_buffer - Allocate buffer space into a folio queue
 * @mapping: Address space to set on the folio (or NULL).
@@ -103,24 +104,7 @@ void netfs_free_folioq_buffer(struct folio_queue *fq)
	folio_batch_release(&fbatch);
 }
 EXPORT_SYMBOL(netfs_free_folioq_buffer);
-
-/*
- * Reset the subrequest iterator to refer just to the region remaining to be
- * read.  The iterator may or may not have been advanced by socket ops or
- * extraction ops to an extent that may or may not match the amount actually
- * read.
- */
-void netfs_reset_iter(struct netfs_io_subrequest *subreq)
-{
-	struct iov_iter *io_iter = &subreq->io_iter;
-	size_t remain = subreq->len - subreq->transferred;
-
-	if (io_iter->count > remain)
-		iov_iter_advance(io_iter, io_iter->count - remain);
-	else if (io_iter->count < remain)
-		iov_iter_revert(io_iter, remain - io_iter->count);
-	iov_iter_truncate(&subreq->io_iter, remain);
-}
+#endif
 
 /**
 * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeback
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index b8c4918d3dcd..eff431cd7d6a 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -119,7 +119,6 @@ static void netfs_free_request_rcu(struct rcu_head *rcu)
 static void netfs_deinit_request(struct netfs_io_request *rreq)
 {
	struct netfs_inode *ictx = netfs_inode(rreq->inode);
-	unsigned int i;
 
	trace_netfs_rreq(rreq, netfs_rreq_trace_free);
 
@@ -134,16 +133,9 @@ static void netfs_deinit_request(struct netfs_io_request *rreq)
		rreq->netfs_ops->free_request(rreq);
	if (rreq->cache_resources.ops)
		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
-	if (rreq->direct_bv) {
-		for (i = 0; i < rreq->direct_bv_count; i++) {
-			if (rreq->direct_bv[i].bv_page) {
-				if (rreq->direct_bv_unpin)
-					unpin_user_page(rreq->direct_bv[i].bv_page);
-			}
-		}
-		kvfree(rreq->direct_bv);
-	}
-	rolling_buffer_clear(&rreq->buffer);
+	bvecq_pos_unset(&rreq->load_cursor);
+	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&rreq->collect_cursor);
 
	if (atomic_dec_and_test(&ictx->io_count))
		wake_up_var(&ictx->io_count);
@@ -236,6 +228,8 @@ static void netfs_free_subrequest(struct netfs_io_subrequest *subreq)
	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
	if (rreq->netfs_ops->free_subrequest)
		rreq->netfs_ops->free_subrequest(subreq);
+	bvecq_pos_unset(&subreq->dispatch_pos);
+	bvecq_pos_unset(&subreq->content);
	mempool_free(subreq, rreq->netfs_ops->subrequest_pool ?: &netfs_subrequest_pool);
netfs_stat_d(&netfs_n_rh_sreq); netfs_put_request(rreq, netfs_rreq_trace_put_subreq); diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c index e5f6665b3341..c7180680226c 100644 --- a/fs/netfs/read_collect.c +++ b/fs/netfs/read_collect.c @@ -27,9 +27,13 @@ */ static void netfs_clear_unread(struct netfs_io_subrequest *subreq) { - netfs_reset_iter(subreq); - WARN_ON_ONCE(subreq->len - subreq->transferred !=3D iov_iter_count(&subre= q->io_iter)); - iov_iter_zero(iov_iter_count(&subreq->io_iter), &subreq->io_iter); + struct iov_iter iter; + + iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + iov_iter_advance(&iter, subreq->transferred); + iov_iter_zero(subreq->len, &iter); + if (subreq->start + subreq->transferred >=3D subreq->rreq->i_size) __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags); } @@ -40,11 +44,11 @@ static void netfs_clear_unread(struct netfs_io_subreque= st *subreq) * dirty and let writeback handle it. */ static void netfs_unlock_read_folio(struct netfs_io_request *rreq, - struct folio_queue *folioq, + struct bvecq *bvecq, int slot) { struct netfs_folio *finfo; - struct folio *folio =3D folioq_folio(folioq, slot); + struct folio *folio =3D page_folio(bvecq->bv[slot].bv_page); =20 if (unlikely(folio_pos(folio) < rreq->abandon_to)) { trace_netfs_folio(folio, netfs_folio_trace_abandon); @@ -75,7 +79,7 @@ static void netfs_unlock_read_folio(struct netfs_io_reque= st *rreq, trace_netfs_folio(folio, netfs_folio_trace_read_done); } =20 - folioq_clear(folioq, slot); + bvecq->bv[slot].bv_page =3D NULL; } else { // TODO: Use of PG_private_2 is deprecated. 
         if (test_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags))
@@ -91,7 +95,7 @@ static void netfs_unlock_read_folio(struct netfs_io_request *rreq,
             folio_unlock(folio);
     }
 
-    folioq_clear(folioq, slot);
+    bvecq->bv[slot].bv_page = NULL;
 }
 
 /*
@@ -100,18 +104,24 @@ static void netfs_unlock_read_folio(struct netfs_io_request *rreq,
 static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
                      unsigned int *notes)
 {
-    struct folio_queue *folioq = rreq->buffer.tail;
+    struct bvecq *bvecq = rreq->collect_cursor.bvecq;
     unsigned long long collected_to = rreq->collected_to;
-    unsigned int slot = rreq->buffer.first_tail_slot;
+    unsigned int slot = rreq->collect_cursor.slot;
 
     if (rreq->cleaned_to >= rreq->collected_to)
         return;
 
     // TODO: Begin decryption
 
-    if (slot >= folioq_nr_slots(folioq)) {
-        folioq = rolling_buffer_delete_spent(&rreq->buffer);
-        if (!folioq) {
+    if (slot >= bvecq->nr_slots) {
+        /* We need to be very careful - the cleanup can catch the
+         * dispatcher, which could lead to us having nothing left in
+         * the queue, causing the front and back pointers to end up on
+         * different tracks.  To avoid this, we must always keep at
+         * least one segment in the queue.
+         */
+        bvecq = bvecq_delete_spent(&rreq->collect_cursor);
+        if (!bvecq) {
             rreq->front_folio_order = 0;
             return;
         }
@@ -127,13 +137,13 @@ static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
         if (*notes & COPY_TO_CACHE)
             set_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
 
-        folio = folioq_folio(folioq, slot);
+        folio = page_folio(bvecq->bv[slot].bv_page);
         if (WARN_ONCE(!folio_test_locked(folio),
                   "R=%08x: folio %lx is not locked\n",
                   rreq->debug_id, folio->index))
             trace_netfs_folio(folio, netfs_folio_trace_not_locked);
 
-        order = folioq_folio_order(folioq, slot);
+        order = folio_order(folio);
         rreq->front_folio_order = order;
         fsize = PAGE_SIZE << order;
         fpos = folio_pos(folio);
@@ -145,33 +155,32 @@ static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
         if (collected_to < fend)
             break;
 
-        netfs_unlock_read_folio(rreq, folioq, slot);
+        netfs_unlock_read_folio(rreq, bvecq, slot);
         WRITE_ONCE(rreq->cleaned_to, fpos + fsize);
         *notes |= MADE_PROGRESS;
 
         clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
 
-        /* Clean up the head folioq.  If we clear an entire folioq, then
-         * we can get rid of it provided it's not also the tail folioq
-         * being filled by the issuer.
+        /* Clean up the head bvecq segment.  If we clear an entire
+         * segment, then we can get rid of it provided it's not also
+         * the tail segment being filled by the issuer.
         */
-        folioq_clear(folioq, slot);
         slot++;
-        if (slot >= folioq_nr_slots(folioq)) {
-            folioq = rolling_buffer_delete_spent(&rreq->buffer);
-            if (!folioq)
+        if (slot >= bvecq->nr_slots) {
+            bvecq = bvecq_delete_spent(&rreq->collect_cursor);
+            if (!bvecq)
                 goto done;
             slot = 0;
-            trace_netfs_folioq(folioq, netfs_trace_folioq_read_progress);
+            //trace_netfs_bvecq(bvecq, netfs_trace_folioq_read_progress);
         }
 
         if (fpos + fsize >= collected_to)
             break;
     }
 
-    rreq->buffer.tail = folioq;
+    bvecq_pos_move(&rreq->collect_cursor, bvecq);
done:
-    rreq->buffer.first_tail_slot = slot;
+    rreq->collect_cursor.slot = slot;
 }
 
 /*
@@ -346,12 +355,14 @@ static void netfs_rreq_assess_dio(struct netfs_io_request *rreq)
 
     if (rreq->origin == NETFS_UNBUFFERED_READ ||
         rreq->origin == NETFS_DIO_READ) {
-        for (i = 0; i < rreq->direct_bv_count; i++) {
-            flush_dcache_page(rreq->direct_bv[i].bv_page);
-            // TODO: cifs marks pages in the destination buffer
-            // dirty under some circumstances after a read. Do we
-            // need to do that too?
-            set_page_dirty(rreq->direct_bv[i].bv_page);
+        for (struct bvecq *bq = rreq->collect_cursor.bvecq; bq; bq = bq->next) {
+            for (i = 0; i < bq->nr_slots; i++) {
+                flush_dcache_page(bq->bv[i].bv_page);
+                // TODO: cifs marks pages in the destination buffer
+                // dirty under some circumstances after a read. Do we
+                // need to do that too?
+                set_page_dirty(bq->bv[i].bv_page);
+            }
         }
     }
 
@@ -442,7 +453,15 @@ bool netfs_read_collection(struct netfs_io_request *rreq)
 
     trace_netfs_rreq(rreq, netfs_rreq_trace_done);
     netfs_clear_subrequests(rreq);
-    netfs_unlock_abandoned_read_pages(rreq);
+    switch (rreq->origin) {
+    case NETFS_READAHEAD:
+    case NETFS_READPAGE:
+    case NETFS_READ_FOR_WRITE:
+        netfs_unlock_abandoned_read_pages(rreq);
+        break;
+    default:
+        break;
+    }
     if (unlikely(rreq->copy_to_cache))
         netfs_pgpriv2_end_copy_to_cache(rreq);
     return true;
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index a1489aa29f78..fb783318318e 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -19,6 +19,9 @@
 static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio *folio)
 {
     struct netfs_io_stream *cache = &creq->io_streams[1];
+    struct bvecq *queue;
+    unsigned int slot;
+    size_t dio_size = PAGE_SIZE;
     size_t fsize = folio_size(folio), flen = fsize;
     loff_t fpos = folio_pos(folio), i_size;
     bool to_eof = false;
@@ -48,17 +51,40 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
         to_eof = true;
     }
 
+    flen = round_up(flen, dio_size);
+
     _debug("folio %zx %zx", flen, fsize);
 
     trace_netfs_folio(folio, netfs_folio_trace_store_copy);
 
-    /* Attach the folio to the rolling buffer. */
-    if (rolling_buffer_append(&creq->buffer, folio, 0) < 0) {
-        clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &creq->flags);
-        return;
+
+    /* Institute a new bvec queue segment if the current one is full or if
+     * we encounter a discontiguity.  The discontiguity break is important
+     * when it comes to bulk unlocking folios by file range.
+     */
+    queue = creq->load_cursor.bvecq;
+    if (bvecq_is_full(queue) ||
+        (fpos != creq->last_end && creq->last_end > 0)) {
+        if (bvecq_buffer_make_space(&creq->load_cursor, GFP_KERNEL) < 0) {
+            clear_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &creq->flags);
+            return;
+        }
+
+        queue = creq->load_cursor.bvecq;
+        queue->fpos = fpos;
+        if (fpos != creq->last_end)
+            queue->discontig = true;
     }
 
-    cache->submit_extendable_to = fsize;
+    /* Attach the folio to the rolling buffer. */
+    slot = queue->nr_slots;
+    bvec_set_folio(&queue->bv[slot], folio, fsize, 0);
+    /* Order incrementing the slot counter after the slot is filled. */
+    smp_store_release(&queue->nr_slots, slot + 1);
+    creq->load_cursor.slot = slot + 1;
+    creq->load_cursor.offset = 0;
+    trace_netfs_bv_slot(queue, slot);
+
     cache->submit_off = 0;
     cache->submit_len = flen;
 
@@ -70,10 +96,9 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
     do {
         ssize_t part;
 
-        creq->buffer.iter.iov_offset = cache->submit_off;
+        creq->dispatch_cursor.offset = cache->submit_off;
 
         atomic64_set(&creq->issued_to, fpos + cache->submit_off);
-        cache->submit_extendable_to = fsize - cache->submit_off;
         part = netfs_advance_write(creq, cache, fpos + cache->submit_off,
                        cache->submit_len, to_eof);
         cache->submit_off += part;
@@ -83,8 +108,7 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
         cache->submit_len -= part;
     } while (cache->submit_len > 0);
 
-    creq->buffer.iter.iov_offset = 0;
-    rolling_buffer_advance(&creq->buffer, fsize);
+    bvecq_pos_step(&creq->dispatch_cursor);
     atomic64_set(&creq->issued_to, fpos + fsize);
 
     if (flen < fsize)
@@ -110,6 +134,10 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
     if (!creq->io_streams[1].avail)
         goto cancel_put;
 
+    bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL);
+    bvecq_pos_set(&creq->dispatch_cursor, &creq->load_cursor);
+    bvecq_pos_set(&creq->collect_cursor,
+              &creq->dispatch_cursor);
+    __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
     trace_netfs_copy2cache(rreq, creq);
     trace_netfs_write(creq, netfs_write_trace_copy_to_cache);
@@ -170,22 +198,23 @@ void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
  */
 bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *creq)
 {
-    struct folio_queue *folioq = creq->buffer.tail;
+    struct bvecq *bq = creq->collect_cursor.bvecq;
     unsigned long long collected_to = creq->collected_to;
-    unsigned int slot = creq->buffer.first_tail_slot;
+    unsigned int slot;
     bool made_progress = false;
 
-    if (slot >= folioq_nr_slots(folioq)) {
-        folioq = rolling_buffer_delete_spent(&creq->buffer);
+    if (bvecq_is_full(bq)) {
+        bq = bvecq_delete_spent(&creq->collect_cursor);
         slot = 0;
     }
+    slot = creq->collect_cursor.slot;
 
     for (;;) {
         struct folio *folio;
         unsigned long long fpos, fend;
         size_t fsize, flen;
 
-        folio = folioq_folio(folioq, slot);
+        folio = page_folio(bq->bv[slot].bv_page);
         if (WARN_ONCE(!folio_test_private_2(folio),
                   "R=%08x: folio %lx is not marked private_2\n",
                   creq->debug_id, folio->index))
@@ -212,11 +241,11 @@ bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *creq)
          * we can get rid of it provided it's not also the tail folioq
          * being filled by the issuer.
         */
-        folioq_clear(folioq, slot);
+        bq->bv[slot].bv_page = NULL;
         slot++;
-        if (slot >= folioq_nr_slots(folioq)) {
-            folioq = rolling_buffer_delete_spent(&creq->buffer);
-            if (!folioq)
+        if (slot >= bq->nr_slots) {
+            bq = bvecq_delete_spent(&creq->collect_cursor);
+            if (!bq)
                 goto done;
             slot = 0;
         }
@@ -225,8 +254,7 @@ bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *creq)
             break;
     }
 
-    creq->buffer.tail = folioq;
done:
-    creq->buffer.first_tail_slot = slot;
+    creq->collect_cursor.slot = slot;
     return made_progress;
 }
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 68fc869513ef..6f2eb14aac72 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -12,6 +12,11 @@
 static void netfs_reissue_read(struct netfs_io_request *rreq,
                    struct netfs_io_subrequest *subreq)
 {
+    bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+    iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+                subreq->content.slot, subreq->content.offset, subreq->len);
+    iov_iter_advance(&subreq->io_iter, subreq->transferred);
+
     subreq->error = 0;
     __clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
     __set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
@@ -27,6 +32,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 {
     struct netfs_io_subrequest *subreq;
     struct netfs_io_stream *stream = &rreq->io_streams[0];
+    struct bvecq_pos dispatch_cursor = {};
     struct list_head *next;
 
     _enter("R=%x", rreq->debug_id);
@@ -48,7 +54,6 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
         if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
             __clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
             subreq->retry_count++;
-            netfs_reset_iter(subreq);
             netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
             netfs_reissue_read(rreq, subreq);
         }
@@ -74,11 +79,12 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 
     do {
         struct netfs_io_subrequest
                 *from, *to, *tmp;
-        struct iov_iter source;
         unsigned long long start, len;
         size_t part;
         bool boundary = false, subreq_superfluous = false;
 
+        bvecq_pos_unset(&dispatch_cursor);
+
         /* Go through the subreqs and find the next span of contiguous
         * buffer that we then rejig (cifs, for example, needs the
         * rsize renegotiating) and reissue.
@@ -113,9 +119,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
         /* Determine the set of buffers we're going to use.  Each
         * subreq gets a subset of a single overall contiguous buffer.
         */
-        netfs_reset_iter(from);
-        source = from->io_iter;
-        source.count = len;
+        bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
+        bvecq_pos_advance(&dispatch_cursor, from->transferred);
 
         /* Work through the sublist. */
         subreq = from;
@@ -131,10 +136,14 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
             __clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
             subreq->retry_count++;
 
+            bvecq_pos_unset(&subreq->dispatch_pos);
+            bvecq_pos_unset(&subreq->content);
+
             trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 
             /* Renegotiate max_len (rsize) */
             stream->sreq_max_len = subreq->len;
+            stream->sreq_max_segs = INT_MAX;
             if (rreq->netfs_ops->prepare_read &&
                 rreq->netfs_ops->prepare_read(subreq) < 0) {
                 trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
@@ -142,13 +151,13 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
                 goto abandon;
             }
 
-            part = umin(len, stream->sreq_max_len);
-            if (unlikely(stream->sreq_max_segs))
-                part = netfs_limit_iter(&source, 0, part, stream->sreq_max_segs);
+            bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
+            part = bvecq_slice(&dispatch_cursor,
                       umin(len, stream->sreq_max_len),
                       stream->sreq_max_segs,
                       &subreq->nr_segs);
             subreq->len = subreq->transferred + part;
-            subreq->io_iter = source;
-            iov_iter_truncate(&subreq->io_iter, part);
-            iov_iter_advance(&source, part);
+
             len -= part;
             start
                 += part;
             if (!len) {
@@ -208,9 +217,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
             trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 
             stream->sreq_max_len = umin(len, rreq->rsize);
-            stream->sreq_max_segs = 0;
-            if (unlikely(stream->sreq_max_segs))
-                part = netfs_limit_iter(&source, 0, part, stream->sreq_max_segs);
+            stream->sreq_max_segs = INT_MAX;
 
             netfs_stat(&netfs_n_rh_download);
             if (rreq->netfs_ops->prepare_read(subreq) < 0) {
@@ -219,11 +226,12 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
                 goto abandon;
             }
 
-            part = umin(len, stream->sreq_max_len);
+            bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
+            part = bvecq_slice(&dispatch_cursor,
+                       umin(len, stream->sreq_max_len),
+                       stream->sreq_max_segs,
+                       &subreq->nr_segs);
             subreq->len = subreq->transferred + part;
-            subreq->io_iter = source;
-            iov_iter_truncate(&subreq->io_iter, part);
-            iov_iter_advance(&source, part);
 
             len -= part;
             start += part;
@@ -237,6 +245,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 
     } while (!list_is_head(next, &stream->subrequests));
 
+out:
+    bvecq_pos_unset(&dispatch_cursor);
     return;
 
     /* If we hit an error, fail all remaining incomplete subrequests */
@@ -253,6 +263,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
         __set_bit(NETFS_SREQ_FAILED, &subreq->flags);
         __clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
     }
+    goto out;
 }
 
 /*
@@ -281,23 +292,24 @@ void netfs_retry_reads(struct netfs_io_request *rreq)
 */
 void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
 {
-    struct folio_queue *p;
-
-    for (p = rreq->buffer.tail; p; p = p->next) {
-        for (int slot = 0; slot < folioq_count(p); slot++) {
-            struct folio *folio = folioq_folio(p, slot);
-
-            if (folio && !folioq_is_marked2(p, slot)) {
-                if (folio->index == rreq->no_unlock_folio &&
-                    test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO,
                         &rreq->flags)) {
-                    _debug("no unlock");
-                } else {
-                    trace_netfs_folio(folio,
-                              netfs_folio_trace_abandon);
-                    folio_unlock(folio);
-                }
+    struct bvecq *p;
+
+    for (p = rreq->collect_cursor.bvecq; p; p = p->next) {
+        if (!p->free)
+            continue;
+        for (int slot = 0; slot < p->nr_slots; slot++) {
+            if (!p->bv[slot].bv_page)
+                continue;
+
+            struct folio *folio = page_folio(p->bv[slot].bv_page);
+
+            if (folio->index == rreq->no_unlock_folio &&
+                test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO, &rreq->flags)) {
+                _debug("no unlock");
+                continue;
             }
+            trace_netfs_folio(folio, netfs_folio_trace_abandon);
+            folio_unlock(folio);
         }
     }
 }
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index d87a03859ebd..b386cae77ece 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -94,7 +94,12 @@ static int netfs_single_dispatch_read(struct netfs_io_request *rreq)
     subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
     subreq->start = 0;
     subreq->len = rreq->len;
-    subreq->io_iter = rreq->buffer.iter;
+
+    bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
+    bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
+
+    iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
+                subreq->content.slot, subreq->content.offset, subreq->len);
 
     /* Try to use the cache if the cache content matches the size of the
      * remote file.
@@ -180,6 +185,10 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
     if (IS_ERR(rreq))
         return PTR_ERR(rreq);
 
+    ret = netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->dispatch_cursor.bvecq, 0);
+    if (ret < 0)
+        goto cleanup_free;
+
     ret = netfs_single_begin_cache_read(rreq, ictx);
     if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
         goto cleanup_free;
@@ -187,7 +196,6 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
     netfs_stat(&netfs_n_rh_read_single);
     trace_netfs_read(rreq, 0, rreq->len, netfs_read_trace_read_single);
 
-    rreq->buffer.iter = *iter;
     netfs_single_dispatch_read(rreq);
 
     ret = netfs_wait_for_read(rreq);
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index 84c2a4bcc762..1dfb5667b931 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -47,7 +47,6 @@ atomic_t netfs_n_wh_retry_write_req;
 atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
-atomic_t netfs_n_folioq;
 atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v)
@@ -91,11 +90,10 @@ int netfs_stats_show(struct seq_file *m, void *v)
            atomic_read(&netfs_n_rh_retry_read_subreq),
            atomic_read(&netfs_n_wh_retry_write_req),
            atomic_read(&netfs_n_wh_retry_write_subreq));
-    seq_printf(m, "Objs : rr=%u sr=%u bq=%u foq=%u wsc=%u\n",
+    seq_printf(m, "Objs : rr=%u sr=%u bq=%u wsc=%u\n",
            atomic_read(&netfs_n_rh_rreq),
            atomic_read(&netfs_n_rh_sreq),
            atomic_read(&netfs_n_bvecq),
-           atomic_read(&netfs_n_folioq),
            atomic_read(&netfs_n_wh_wstream_conflict));
     seq_printf(m, "WbLock : skip=%u wait=%u\n",
            atomic_read(&netfs_n_wb_lock_skip),
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index a839735d5675..fb8daf50c86d 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -111,12 +111,12 @@ int netfs_folio_written_back(struct folio *folio)
 static void
 netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
                   unsigned int *notes)
 {
-    struct folio_queue *folioq = wreq->buffer.tail;
+    struct bvecq *bvecq = wreq->collect_cursor.bvecq;
     unsigned long long collected_to = wreq->collected_to;
-    unsigned int slot = wreq->buffer.first_tail_slot;
+    unsigned int slot = wreq->collect_cursor.slot;
 
-    if (WARN_ON_ONCE(!folioq)) {
-        pr_err("[!] Writeback unlock found empty rolling buffer!\n");
+    if (WARN_ON_ONCE(!bvecq)) {
+        pr_err("[!] Writeback unlock found empty buffer!\n");
         netfs_dump_request(wreq);
         return;
     }
@@ -127,9 +127,15 @@ static void netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
         return;
     }
 
-    if (slot >= folioq_nr_slots(folioq)) {
-        folioq = rolling_buffer_delete_spent(&wreq->buffer);
-        if (!folioq)
+    if (slot >= bvecq->nr_slots) {
+        /* We need to be very careful - the cleanup can catch the
+         * dispatcher, which could lead to us having nothing left in
+         * the queue, causing the front and back pointers to end up on
+         * different tracks.  To avoid this, we must always keep at
+         * least one segment in the queue.
+         */
+        bvecq = bvecq_delete_spent(&wreq->collect_cursor);
+        if (!bvecq)
             return;
         slot = 0;
     }
@@ -140,7 +146,7 @@ static void netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
         unsigned long long fpos, fend;
         size_t fsize, flen;
 
-        folio = folioq_folio(folioq, slot);
+        folio = page_folio(bvecq->bv[slot].bv_page);
         if (WARN_ONCE(!folio_test_writeback(folio),
                   "R=%08x: folio %lx is not under writeback\n",
                   wreq->debug_id, folio->index))
@@ -163,15 +169,15 @@ static void netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
         wreq->cleaned_to = fpos + fsize;
         *notes |= MADE_PROGRESS;
 
-        /* Clean up the head folioq.  If we clear an entire folioq, then
-         * we can get rid of it provided it's not also the tail folioq
+        /* Clean up the head bvecq.
           If we clear an entire bvecq, then
+         * we can get rid of it provided it's not also the tail bvecq
         * being filled by the issuer.
         */
-        folioq_clear(folioq, slot);
+        bvecq->bv[slot].bv_page = NULL;
         slot++;
-        if (slot >= folioq_nr_slots(folioq)) {
-            folioq = rolling_buffer_delete_spent(&wreq->buffer);
-            if (!folioq)
+        if (slot >= bvecq->nr_slots) {
+            bvecq = bvecq_delete_spent(&wreq->collect_cursor);
+            if (!bvecq)
                 goto done;
             slot = 0;
         }
@@ -180,9 +186,8 @@ static void netfs_writeback_unlock_folios(struct netfs_io_request *wreq,
             break;
     }
 
-    wreq->buffer.tail = folioq;
done:
-    wreq->buffer.first_tail_slot = slot;
+    wreq->collect_cursor.slot = slot;
 }
 
 static void netfs_cache_collect(struct netfs_io_request *wreq,
@@ -217,7 +222,8 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
     trace_netfs_rreq(wreq, netfs_rreq_trace_collect);
 
reassess_streams:
-    issued_to = atomic64_read(&wreq->issued_to);
+    /* Order reading the issued_to point before reading the queue it refers to.
     */
+    issued_to = atomic64_read_acquire(&wreq->issued_to);
     smp_rmb();
     collected_to = ULLONG_MAX;
     if (wreq->origin == NETFS_WRITEBACK ||
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 9ca2c780f469..d4c4bee4299e 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -108,8 +108,6 @@ struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
     ictx = netfs_inode(wreq->inode);
     if (is_cacheable && netfs_is_cache_enabled(ictx))
         fscache_begin_write_operation(&wreq->cache_resources, netfs_i_cookie(ictx));
-    if (rolling_buffer_init(&wreq->buffer, wreq->debug_id, ITER_SOURCE) < 0)
-        goto nomem;
 
     wreq->cleaned_to = wreq->start;
     if (wreq->cache_resources.dio_size > 1)
@@ -134,9 +132,6 @@ struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
     }
 
     return wreq;
-nomem:
-    netfs_put_failed_request(wreq);
-    return ERR_PTR(-ENOMEM);
 }
 
 /**
@@ -161,21 +156,13 @@ void netfs_prepare_write(struct netfs_io_request *wreq,
              loff_t start)
 {
     struct netfs_io_subrequest *subreq;
-    struct iov_iter *wreq_iter = &wreq->buffer.iter;
-
-    /* Make sure we don't point the iterator at a used-up folio_queue
-     * struct being used as a placeholder to prevent the queue from
-     * collapsing.  In such a case, extend the queue.
-     */
-    if (iov_iter_is_folioq(wreq_iter) &&
-        wreq_iter->folioq_slot >= folioq_nr_slots(wreq_iter->folioq))
-        rolling_buffer_make_space(&wreq->buffer);
 
     subreq = netfs_alloc_subrequest(wreq);
     subreq->source = stream->source;
     subreq->start = start;
     subreq->stream_nr = stream->stream_nr;
-    subreq->io_iter = *wreq_iter;
+
+    bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
 
     _enter("R=%x[%x]", wreq->debug_id, subreq->debug_index);
 
@@ -240,15 +227,15 @@ static void netfs_do_issue_write(struct netfs_io_stream *stream,
 }
 
 void netfs_reissue_write(struct netfs_io_stream *stream,
-             struct netfs_io_subrequest *subreq,
-             struct iov_iter *source)
+             struct netfs_io_subrequest *subreq)
 {
-    size_t size = subreq->len - subreq->transferred;
-
     // TODO: Use encrypted buffer
-    subreq->io_iter = *source;
-    iov_iter_advance(source, size);
-    iov_iter_truncate(&subreq->io_iter, size);
+    bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+    iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+                subreq->content.bvecq, subreq->content.slot,
+                subreq->content.offset,
+                subreq->len);
+    iov_iter_advance(&subreq->io_iter, subreq->transferred);
 
     subreq->retry_count++;
     subreq->error = 0;
@@ -266,8 +253,57 @@ void netfs_issue_write(struct netfs_io_request *wreq,
     if (!subreq)
         return;
 
+    /* If we have a write to the cache, we need to round out the first and
+     * last entries (only those as the data will be on virtually contiguous
+     * folios) to cache DIO boundaries.
+     */
+    if (subreq->source == NETFS_WRITE_TO_CACHE) {
+        struct bvecq_pos tmp_pos;
+        struct bio_vec *bv;
+        struct bvecq *bq;
+        size_t dio_size = wreq->cache_resources.dio_size;
+        size_t disp, len;
+        int ret;
+
+        bvecq_pos_set(&tmp_pos, &subreq->dispatch_pos);
+        ret = bvecq_extract(&tmp_pos, subreq->len, INT_MAX, &subreq->content.bvecq);
+        bvecq_pos_unset(&tmp_pos);
+        if (ret < 0) {
+            netfs_write_subrequest_terminated(subreq, -ENOMEM);
+            return;
+        }
+
+        /* Round the first entry down. */
+        bq = subreq->content.bvecq;
+        bv = &bq->bv[0];
+        disp = bv->bv_offset & (dio_size - 1);
+        if (disp) {
+            bv->bv_offset -= disp;
+            bv->bv_len += disp;
+            bq->fpos -= disp;
+            subreq->start -= disp;
+            subreq->len += disp;
+        }
+
+        /* Round the end of the last entry up. */
+        while (bq->next)
+            bq = bq->next;
+        bv = &bq->bv[bq->nr_slots - 1];
+        len = round_up(bv->bv_len, dio_size);
+        if (len > bv->bv_len) {
+            subreq->len += len - bv->bv_len;
+            bv->bv_len = len;
+        }
+    } else {
+        bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+    }
+
+    iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
+                subreq->content.bvecq, subreq->content.slot,
+                subreq->content.offset,
+                subreq->len);
+
     stream->construct = NULL;
-    subreq->io_iter.count = subreq->len;
     netfs_do_issue_write(stream, subreq);
 }
 
@@ -304,7 +340,6 @@ size_t netfs_advance_write(struct netfs_io_request *wreq,
     _debug("part %zx/%zx %zx/%zx", subreq->len, stream->sreq_max_len, part, len);
     subreq->len += part;
     subreq->nr_segs++;
-    stream->submit_extendable_to -= part;
 
     if (subreq->len >= stream->sreq_max_len ||
         subreq->nr_segs >= stream->sreq_max_segs ||
@@ -328,16 +363,35 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
     struct netfs_io_stream *stream;
     struct netfs_group *fgroup; /* TODO: Use this with ceph */
     struct netfs_folio *finfo;
-    size_t iter_off = 0;
+    struct bvecq *queue = wreq->load_cursor.bvecq;
+    unsigned int slot;
     size_t fsize = folio_size(folio), flen
         = fsize, foff = 0;
     loff_t fpos = folio_pos(folio), i_size;
     bool to_eof = false, streamw = false;
     bool debug = false;
+    int ret;
 
     _enter("");
 
-    if (rolling_buffer_make_space(&wreq->buffer) < 0)
-        return -ENOMEM;
+    /* Institute a new bvec queue segment if the current one is full or if
+     * we encounter a discontiguity.  The discontiguity break is important
+     * when it comes to bulk unlocking folios by file range.
+     */
+    if (bvecq_is_full(queue) ||
+        (fpos != wreq->last_end && wreq->last_end > 0)) {
+        ret = bvecq_buffer_make_space(&wreq->load_cursor, GFP_NOFS);
+        if (ret < 0) {
+            folio_unlock(folio);
+            return ret;
+        }
+
+        queue = wreq->load_cursor.bvecq;
+        queue->fpos = fpos;
+        if (fpos != wreq->last_end)
+            queue->discontig = true;
+        bvecq_pos_move(&wreq->dispatch_cursor, queue);
+        wreq->dispatch_cursor.slot = 0;
+    }
 
     /* netfs_perform_write() may shift i_size around the page or from out
     * of the page to beyond it, but cannot move i_size into or through the
@@ -443,7 +497,13 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
     }
 
     /* Attach the folio to the rolling buffer. */
-    rolling_buffer_append(&wreq->buffer, folio, 0);
+    slot = queue->nr_slots;
+    bvec_set_folio(&queue->bv[slot], folio, flen, 0);
+    queue->nr_slots = slot + 1;
+    wreq->load_cursor.slot = slot + 1;
+    wreq->load_cursor.offset = 0;
+    wreq->last_end = fpos + foff + flen;
+    trace_netfs_bv_slot(queue, slot);
 
     /* Move the submission point forward to allow for write-streaming data
     * not starting at the front of the page.
       We don't do write-streaming
@@ -454,7 +514,7 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
     */
     for (int s = 0; s < NR_IO_STREAMS; s++) {
         stream = &wreq->io_streams[s];
-        stream->submit_off = foff;
+        stream->submit_off = 0;
         stream->submit_len = flen;
         if (!stream->avail ||
             (stream->source == NETFS_WRITE_TO_CACHE && streamw) ||
@@ -489,15 +549,11 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
             break;
         stream = &wreq->io_streams[choose_s];
 
-        /* Advance the iterator(s). */
-        if (stream->submit_off > iter_off) {
-            rolling_buffer_advance(&wreq->buffer, stream->submit_off - iter_off);
-            iter_off = stream->submit_off;
-        }
+        /* Advance the cursor. */
+        wreq->dispatch_cursor.offset = stream->submit_off;
 
-        atomic64_set(&wreq->issued_to, fpos + stream->submit_off);
-        stream->submit_extendable_to = fsize - stream->submit_off;
-        part = netfs_advance_write(wreq, stream, fpos + stream->submit_off,
+        atomic64_set(&wreq->issued_to, fpos + foff + stream->submit_off);
+        part = netfs_advance_write(wreq, stream, fpos + foff + stream->submit_off,
                        stream->submit_len, to_eof);
         stream->submit_off += part;
         if (part > stream->submit_len)
@@ -508,9 +564,9 @@ static int netfs_write_folio(struct netfs_io_request *wreq,
             debug = true;
     }
 
-    if (fsize > iter_off)
-        rolling_buffer_advance(&wreq->buffer, fsize - iter_off);
-    atomic64_set(&wreq->issued_to, fpos + fsize);
+    bvecq_pos_step(&wreq->dispatch_cursor);
+    /* Order loading the queue before updating the issue_to point */
+    atomic64_set_release(&wreq->issued_to, fpos + fsize);
 
     if (!debug)
         kdebug("R=%x: No submit", wreq->debug_id);
@@ -578,6 +634,11 @@ int netfs_writepages(struct address_space *mapping,
         goto couldnt_start;
     }
 
+    if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0)
+        goto nomem;
+    bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+    bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
     __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION,
          &wreq->flags);
     trace_netfs_write(wreq, netfs_write_trace_writeback);
     netfs_stat(&netfs_n_wh_writepages);
@@ -602,12 +663,17 @@ int netfs_writepages(struct address_space *mapping,
     netfs_end_issue_write(wreq);
 
     mutex_unlock(&ictx->wb_lock);
+    bvecq_pos_unset(&wreq->load_cursor);
+    bvecq_pos_unset(&wreq->dispatch_cursor);
     netfs_wake_collector(wreq);
 
     netfs_put_request(wreq, netfs_rreq_trace_put_return);
     _leave(" = %d", error);
     return error;
 
+nomem:
+    error = -ENOMEM;
+    netfs_put_failed_request(wreq);
couldnt_start:
     netfs_kill_dirty_pages(mapping, wbc, folio);
out:
@@ -634,6 +700,15 @@ struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len
         return wreq;
     }
 
+    if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0) {
+        netfs_put_failed_request(wreq);
+        mutex_unlock(&ictx->wb_lock);
+        return ERR_PTR(-ENOMEM);
+    }
+
+    bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
+    bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+
     wreq->io_streams[0].avail = true;
     trace_netfs_write(wreq, netfs_write_trace_writethrough);
     return wreq;
@@ -649,8 +724,8 @@ int netfs_advance_writethrough(struct netfs_io_request *wreq, struct writeback_c
                    struct folio *folio, size_t copied, bool to_page_end,
                    struct folio **writethrough_cache)
 {
-    _enter("R=%x ic=%zu ws=%u cp=%zu tp=%u",
-           wreq->debug_id, wreq->buffer.iter.count, wreq->wsize, copied, to_page_end);
+    _enter("R=%x ws=%u cp=%zu tp=%u",
+           wreq->debug_id, wreq->wsize, copied, to_page_end);
 
     if (!*writethrough_cache) {
         if (folio_test_dirty(folio))
@@ -692,6 +767,9 @@ ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_c
 
     mutex_unlock(&ictx->wb_lock);
 
+    bvecq_pos_unset(&wreq->load_cursor);
+    bvecq_pos_unset(&wreq->dispatch_cursor);
+
     if (wreq->iocb)
         ret = -EIOCBQUEUED;
     else
@@ -707,7 +785,7 @@ ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_c
  * @iter: Data to write.
 *
 * Write a monolithic, non-pagecache object back to the server and/or
- * the cache.
+ * the cache.  There's a maximum of one subrequest per stream.
 */
 int netfs_writeback_single(struct address_space *mapping,
                struct writeback_control *wbc,
@@ -731,10 +809,18 @@ int netfs_writeback_single(struct address_space *mapping,
         ret = PTR_ERR(wreq);
         goto couldnt_start;
     }
-
-    wreq->buffer.iter = *iter;
     wreq->len = iov_iter_count(iter);
 
+    ret = netfs_extract_iter(iter, wreq->len, INT_MAX, 0, &wreq->dispatch_cursor.bvecq, 0);
+    if (ret < 0)
+        goto cleanup_free;
+    if (ret < wreq->len) {
+        ret = -EIO;
+        goto cleanup_free;
+    }
+
+    bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+    __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
     trace_netfs_write(wreq, netfs_write_trace_writeback_single);
     netfs_stat(&netfs_n_wh_writepages);
@@ -754,11 +840,11 @@ int netfs_writeback_single(struct address_space *mapping,
         subreq = stream->construct;
         subreq->len = wreq->len;
         stream->submit_len = subreq->len;
-        stream->submit_extendable_to = round_up(wreq->len, PAGE_SIZE);
 
         netfs_issue_write(wreq, stream);
     }
 
+    wreq->submitted = wreq->len;
     smp_wmb(); /* Write lists before ALL_QUEUED.
*/ set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags); =20 @@ -774,6 +860,8 @@ int netfs_writeback_single(struct address_space *mappin= g, _leave(" =3D %d", ret); return ret; =20 +cleanup_free: + netfs_put_failed_request(wreq); couldnt_start: mutex_unlock(&ictx->wb_lock); _leave(" =3D %d", ret); diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c index 29489a23a220..5df5c34d4610 100644 --- a/fs/netfs/write_retry.c +++ b/fs/netfs/write_retry.c @@ -17,6 +17,7 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq, struct netfs_io_stream *stream) { + struct bvecq_pos dispatch_cursor =3D {}; struct list_head *next; =20 _enter("R=3D%x[%x:]", wreq->debug_id, stream->stream_nr); @@ -39,12 +40,8 @@ static void netfs_retry_write_stream(struct netfs_io_req= uest *wreq, if (test_bit(NETFS_SREQ_FAILED, &subreq->flags)) break; if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) { - struct iov_iter source; - - netfs_reset_iter(subreq); - source =3D subreq->io_iter; netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit); - netfs_reissue_write(stream, subreq, &source); + netfs_reissue_write(stream, subreq); } } return; @@ -54,11 +51,12 @@ static void netfs_retry_write_stream(struct netfs_io_re= quest *wreq, =20 do { struct netfs_io_subrequest *subreq =3D NULL, *from, *to, *tmp; - struct iov_iter source; unsigned long long start, len; size_t part; bool boundary =3D false; =20 + bvecq_pos_unset(&dispatch_cursor); + /* Go through the stream and find the next span of contiguous * data that we then rejig (cifs, for example, needs the wsize * renegotiating) and reissue. 
@@ -70,7 +68,7 @@ static void netfs_retry_write_stream(struct netfs_io_requ= est *wreq, =20 if (test_bit(NETFS_SREQ_FAILED, &from->flags) || !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) - return; + goto out; =20 list_for_each_continue(next, &stream->subrequests) { subreq =3D list_entry(next, struct netfs_io_subrequest, rreq_link); @@ -85,9 +83,8 @@ static void netfs_retry_write_stream(struct netfs_io_requ= est *wreq, /* Determine the set of buffers we're going to use. Each * subreq gets a subset of a single overall contiguous buffer. */ - netfs_reset_iter(from); - source =3D from->io_iter; - source.count =3D len; + bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos); + bvecq_pos_advance(&dispatch_cursor, from->transferred); =20 /* Work through the sublist. */ subreq =3D from; @@ -100,14 +97,20 @@ static void netfs_retry_write_stream(struct netfs_io_r= equest *wreq, __clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags); trace_netfs_sreq(subreq, netfs_sreq_trace_retry); =20 + bvecq_pos_unset(&subreq->dispatch_pos); + bvecq_pos_unset(&subreq->content); + /* Renegotiate max_len (wsize) */ stream->sreq_max_len =3D len; + stream->sreq_max_segs =3D INT_MAX; stream->prepare_write(subreq); =20 - part =3D umin(len, stream->sreq_max_len); - if (unlikely(stream->sreq_max_segs)) - part =3D netfs_limit_iter(&source, 0, part, stream->sreq_max_segs); - subreq->len =3D part; + bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor); + part =3D bvecq_slice(&dispatch_cursor, + umin(len, stream->sreq_max_len), + stream->sreq_max_segs, + &subreq->nr_segs); + subreq->len =3D subreq->transferred + part; subreq->transferred =3D 0; len -=3D part; start +=3D part; @@ -116,7 +119,7 @@ static void netfs_retry_write_stream(struct netfs_io_re= quest *wreq, boundary =3D true; =20 netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit); - netfs_reissue_write(stream, subreq, &source); + netfs_reissue_write(stream, subreq); if (subreq =3D=3D to) break; } @@ -173,8 +176,13 @@ static void 
netfs_retry_write_stream(struct netfs_io_r= equest *wreq, =20 stream->prepare_write(subreq); =20 - part =3D umin(len, stream->sreq_max_len); + bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor); + part =3D bvecq_slice(&dispatch_cursor, + umin(len, stream->sreq_max_len), + stream->sreq_max_segs, + &subreq->nr_segs); subreq->len =3D subreq->transferred + part; + len -=3D part; start +=3D part; if (!len && boundary) { @@ -182,13 +190,16 @@ static void netfs_retry_write_stream(struct netfs_io_= request *wreq, boundary =3D false; } =20 - netfs_reissue_write(stream, subreq, &source); + netfs_reissue_write(stream, subreq); if (!len) break; =20 } while (len); =20 } while (!list_is_head(next, &stream->subrequests)); + +out: + bvecq_pos_unset(&dispatch_cursor); } =20 /* diff --git a/include/linux/netfs.h b/include/linux/netfs.h index b4602f7b6431..3345c88bbd8e 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -19,12 +19,13 @@ #include #include #include -#include =20 enum netfs_sreq_ref_trace; typedef struct mempool mempool_t; +struct readahead_control; +struct netfs_io_request; +struct netfs_io_subrequest; struct fscache_occupancy; -struct folio_queue; =20 /** * folio_start_private_2 - Start an fscache write on a folio. 
[DEPRECATED] @@ -137,7 +138,6 @@ struct netfs_io_stream { unsigned int sreq_max_segs; /* 0 or max number of segments in an iterato= r */ unsigned int submit_off; /* Folio offset we're submitting from */ unsigned int submit_len; /* Amount of data left to submit */ - unsigned int submit_extendable_to; /* Amount I/O can be rounded up to */ void (*prepare_write)(struct netfs_io_subrequest *subreq); void (*issue_write)(struct netfs_io_subrequest *subreq); /* Collection tracking */ @@ -178,6 +178,8 @@ struct netfs_io_subrequest { struct netfs_io_request *rreq; /* Supervising I/O request */ struct work_struct work; struct list_head rreq_link; /* Link in rreq->subrequests */ + struct bvecq_pos dispatch_pos; /* Bookmark in the combined queue of the s= tart */ + struct bvecq_pos content; /* The (copied) content of the subrequest */ struct iov_iter io_iter; /* Iterator for this subrequest */ unsigned long long start; /* Where to start the I/O */ size_t len; /* Size of the I/O */ @@ -239,13 +241,13 @@ struct netfs_io_request { struct netfs_io_stream io_streams[2]; /* Streams of parallel I/O operatio= ns */ #define NR_IO_STREAMS 2 //wreq->nr_io_streams struct netfs_group *group; /* Writeback group being written back */ - struct rolling_buffer buffer; /* Unencrypted buffer */ -#define NETFS_ROLLBUF_PUT_MARK ROLLBUF_MARK_1 -#define NETFS_ROLLBUF_PAGECACHE_MARK ROLLBUF_MARK_2 + struct bvecq_pos collect_cursor; /* Clear-up point of I/O buffer */ + struct bvecq_pos load_cursor; /* Point at which new folios are loaded in = */ + struct bvecq_pos dispatch_cursor; /* Point from which buffers are dispatc= hed */ wait_queue_head_t waitq; /* Processor waiter */ void *netfs_priv; /* Private data for the netfs */ void *netfs_priv2; /* Private data for the netfs */ - struct bio_vec *direct_bv; /* DIO buffer list (when handling iovec-iter)= */ + unsigned long long last_end; /* End pos of last folio submitted */ unsigned long long submitted; /* Amount submitted for I/O so far */ unsigned long 
long len; /* Length of the request */ size_t transferred; /* Amount to be indicated as transferred */ @@ -258,7 +260,6 @@ struct netfs_io_request { unsigned long long cleaned_to; /* Position we've cleaned folios to */ unsigned long long abandon_to; /* Position to abandon folios to */ pgoff_t no_unlock_folio; /* Don't unlock this folio after read */ - unsigned int direct_bv_count; /* Number of elements in direct_bv[] */ unsigned int debug_id; unsigned int rsize; /* Maximum read size (0 for none) */ unsigned int wsize; /* Maximum write size (0 for none) */ @@ -267,7 +268,6 @@ struct netfs_io_request { spinlock_t lock; /* Lock for queuing subreqs */ unsigned char front_folio_order; /* Order (size) of front folio */ enum netfs_io_origin origin; /* Origin of the request */ - bool direct_bv_unpin; /* T if direct_bv[] must be unpinned */ refcount_t ref; unsigned long flags; #define NETFS_RREQ_IN_PROGRESS 0 /* Unlocked when the request completes (= has ref) */ @@ -463,12 +463,6 @@ void netfs_end_io_write(struct inode *inode); int netfs_start_io_direct(struct inode *inode); void netfs_end_io_direct(struct inode *inode); =20 -/* Miscellaneous APIs. */ -struct folio_queue *netfs_folioq_alloc(unsigned int rreq_id, gfp_t gfp, - unsigned int trace /*enum netfs_folioq_trace*/); -void netfs_folioq_free(struct folio_queue *folioq, - unsigned int trace /*enum netfs_trace_folioq*/); - /* Buffer wrangling helpers API. 
*/ int netfs_alloc_folioq_buffer(struct address_space *mapping, struct folio_queue **_buffer, diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h index fbb094231659..df3d440563ec 100644 --- a/include/trace/events/netfs.h +++ b/include/trace/events/netfs.h @@ -213,7 +213,9 @@ EM(netfs_folio_trace_store_copy, "store-copy") \ EM(netfs_folio_trace_store_plus, "store+") \ EM(netfs_folio_trace_wthru, "wthru") \ - E_(netfs_folio_trace_wthru_plus, "wthru+") + EM(netfs_folio_trace_wthru_plus, "wthru+") \ + EM(netfs_folio_trace_zero, "zero") \ + E_(netfs_folio_trace_zero_ra, "zero-ra") =20 #define netfs_collect_contig_traces \ EM(netfs_contig_trace_collect, "Collect") \ @@ -226,13 +228,13 @@ EM(netfs_trace_donate_to_next, "to-next") \ E_(netfs_trace_donate_to_deferred_next, "defer-next") =20 -#define netfs_folioq_traces \ - EM(netfs_trace_folioq_alloc_buffer, "alloc-buf") \ - EM(netfs_trace_folioq_clear, "clear") \ - EM(netfs_trace_folioq_delete, "delete") \ - EM(netfs_trace_folioq_make_space, "make-space") \ - EM(netfs_trace_folioq_rollbuf_init, "roll-init") \ - E_(netfs_trace_folioq_read_progress, "r-progress") +#define netfs_bvecq_traces \ + EM(netfs_trace_bvecq_alloc_buffer, "alloc-buf") \ + EM(netfs_trace_bvecq_clear, "clear") \ + EM(netfs_trace_bvecq_delete, "delete") \ + EM(netfs_trace_bvecq_make_space, "make-space") \ + EM(netfs_trace_bvecq_rollbuf_init, "roll-init") \ + E_(netfs_trace_bvecq_read_progress, "r-progress") =20 #ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY #define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY @@ -252,7 +254,7 @@ enum netfs_sreq_ref_trace { netfs_sreq_ref_traces } __m= ode(byte); enum netfs_folio_trace { netfs_folio_traces } __mode(byte); enum netfs_collect_contig_trace { netfs_collect_contig_traces } __mode(byt= e); enum netfs_donate_trace { netfs_donate_traces } __mode(byte); -enum netfs_folioq_trace { netfs_folioq_traces } __mode(byte); +enum netfs_bvecq_trace { netfs_bvecq_traces } __mode(byte); =20 #endif =20 @@ -276,7 
+278,7 @@ netfs_sreq_ref_traces; netfs_folio_traces; netfs_collect_contig_traces; netfs_donate_traces; -netfs_folioq_traces; +netfs_bvecq_traces; =20 /* * Now redefine the EM() and E_() macros to map the enums to the strings t= hat @@ -378,10 +380,10 @@ TRACE_EVENT(netfs_sreq, __entry->len =3D sreq->len; __entry->transferred =3D sreq->transferred; __entry->start =3D sreq->start; - __entry->slot =3D sreq->io_iter.folioq_slot; + __entry->slot =3D sreq->dispatch_pos.slot; ), =20 - TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx s=3D%u e=3D%d= ", + TP_printk("R=3D%08x[%x] %s %s f=3D%03x s=3D%llx %zx/%zx qs=3D%u e=3D%= d", __entry->rreq, __entry->index, __print_symbolic(__entry->source, netfs_sreq_sources), __print_symbolic(__entry->what, netfs_sreq_traces), @@ -756,27 +758,25 @@ TRACE_EVENT(netfs_collect_stream, __entry->collected_to, __entry->issued_to) ); =20 -TRACE_EVENT(netfs_folioq, - TP_PROTO(const struct folio_queue *fq, - enum netfs_folioq_trace trace), +TRACE_EVENT(netfs_bvecq, + TP_PROTO(const struct bvecq *bq, + enum netfs_bvecq_trace trace), =20 - TP_ARGS(fq, trace), + TP_ARGS(bq, trace), =20 TP_STRUCT__entry( - __field(unsigned int, rreq) __field(unsigned int, id) - __field(enum netfs_folioq_trace, trace) + __field(enum netfs_bvecq_trace, trace) ), =20 TP_fast_assign( - __entry->rreq =3D fq ? fq->rreq_id : 0; - __entry->id =3D fq ? fq->debug_id : 0; + __entry->id =3D bq ? 
bq->priv : 0; __entry->trace =3D trace; ), =20 - TP_printk("R=3D%08x fq=3D%x %s", - __entry->rreq, __entry->id, - __print_symbolic(__entry->trace, netfs_folioq_traces)) + TP_printk("fq=3D%x %s", + __entry->id, + __print_symbolic(__entry->trace, netfs_bvecq_traces)) ); =20 TRACE_EVENT(netfs_bv_slot, From nobody Thu Apr 2 22:23:34 2026 From: David Howells To: Christian Brauner , Matthew Wilcox , Christoph Hellwig Cc: David Howells , Paulo Alcantara , Jens Axboe , Leon Romanovsky , Steve French , ChenXiaoSong , Marc Dionne , Eric Van Hensbergen , Dominique Martinet , Ilya Dryomov , Trond Myklebust , netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara , Shyam Prasad N , Tom Talpey Subject: [PATCH 19/26] cifs: Remove support for ITER_KVEC/BVEC/FOLIOQ from smb_extract_iter_to_rdma() Date: Thu, 26 Mar 2026 10:45:34 +0000 Message-ID: <20260326104544.509518-20-dhowells@redhat.com> In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com> References: <20260326104544.509518-1-dhowells@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" netfslib now only presents a bvecq queue and an associated ITER_BVECQ iterator to the filesystem, so it isn't going to see ITER_KVEC, ITER_BVEC or ITER_FOLIOQ iterators. So remove that code. Signed-off-by: David Howells cc: Steve French cc: Paulo Alcantara cc: Shyam Prasad N cc: Tom Talpey cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org --- fs/smb/client/smbdirect.c | 165 -------------------------------------- 1 file changed, 165 deletions(-) diff --git a/fs/smb/client/smbdirect.c b/fs/smb/client/smbdirect.c index f8a6be83db98..d9e026d5e9f9 100644 --- a/fs/smb/client/smbdirect.c +++ b/fs/smb/client/smbdirect.c @@ -3142,162 +3142,6 @@ static bool smb_set_sge(struct smb_extract_to_rdma = *rdma, return true; } =20 -/* - * Extract page fragments from a BVEC-class iterator and add them to an RD= MA - * element list. The pages are not pinned.
- */ -static ssize_t smb_extract_bvec_to_rdma(struct iov_iter *iter, - struct smb_extract_to_rdma *rdma, - ssize_t maxsize) -{ - const struct bio_vec *bv =3D iter->bvec; - unsigned long start =3D iter->iov_offset; - unsigned int i; - ssize_t ret =3D 0; - - for (i =3D 0; i < iter->nr_segs; i++) { - size_t off, len; - - len =3D bv[i].bv_len; - if (start >=3D len) { - start -=3D len; - continue; - } - - len =3D min_t(size_t, maxsize, len - start); - off =3D bv[i].bv_offset + start; - - if (!smb_set_sge(rdma, bv[i].bv_page, off, len)) - return -EIO; - - ret +=3D len; - maxsize -=3D len; - if (rdma->nr_sge >=3D rdma->max_sge || maxsize <=3D 0) - break; - start =3D 0; - } - - if (ret > 0) - iov_iter_advance(iter, ret); - return ret; -} - -/* - * Extract fragments from a KVEC-class iterator and add them to an RDMA li= st. - * This can deal with vmalloc'd buffers as well as kmalloc'd or static buf= fers. - * The pages are not pinned. - */ -static ssize_t smb_extract_kvec_to_rdma(struct iov_iter *iter, - struct smb_extract_to_rdma *rdma, - ssize_t maxsize) -{ - const struct kvec *kv =3D iter->kvec; - unsigned long start =3D iter->iov_offset; - unsigned int i; - ssize_t ret =3D 0; - - for (i =3D 0; i < iter->nr_segs; i++) { - struct page *page; - unsigned long kaddr; - size_t off, len, seg; - - len =3D kv[i].iov_len; - if (start >=3D len) { - start -=3D len; - continue; - } - - kaddr =3D (unsigned long)kv[i].iov_base + start; - off =3D kaddr & ~PAGE_MASK; - len =3D min_t(size_t, maxsize, len - start); - kaddr &=3D PAGE_MASK; - - maxsize -=3D len; - do { - seg =3D min_t(size_t, len, PAGE_SIZE - off); - - if (is_vmalloc_or_module_addr((void *)kaddr)) - page =3D vmalloc_to_page((void *)kaddr); - else - page =3D virt_to_page((void *)kaddr); - - if (!smb_set_sge(rdma, page, off, seg)) - return -EIO; - - ret +=3D seg; - len -=3D seg; - kaddr +=3D PAGE_SIZE; - off =3D 0; - } while (len > 0 && rdma->nr_sge < rdma->max_sge); - - if (rdma->nr_sge >=3D rdma->max_sge || maxsize <=3D 0) 
- break; - start =3D 0; - } - - if (ret > 0) - iov_iter_advance(iter, ret); - return ret; -} - -/* - * Extract folio fragments from a FOLIOQ-class iterator and add them to an= RDMA - * list. The folios are not pinned. - */ -static ssize_t smb_extract_folioq_to_rdma(struct iov_iter *iter, - struct smb_extract_to_rdma *rdma, - ssize_t maxsize) -{ - const struct folio_queue *folioq =3D iter->folioq; - unsigned int slot =3D iter->folioq_slot; - ssize_t ret =3D 0; - size_t offset =3D iter->iov_offset; - - BUG_ON(!folioq); - - if (slot >=3D folioq_nr_slots(folioq)) { - folioq =3D folioq->next; - if (WARN_ON_ONCE(!folioq)) - return -EIO; - slot =3D 0; - } - - do { - struct folio *folio =3D folioq_folio(folioq, slot); - size_t fsize =3D folioq_folio_size(folioq, slot); - - if (offset < fsize) { - size_t part =3D umin(maxsize, fsize - offset); - - if (!smb_set_sge(rdma, folio_page(folio, 0), offset, part)) - return -EIO; - - offset +=3D part; - ret +=3D part; - maxsize -=3D part; - } - - if (offset >=3D fsize) { - offset =3D 0; - slot++; - if (slot >=3D folioq_nr_slots(folioq)) { - if (!folioq->next) { - WARN_ON_ONCE(ret < iter->count); - break; - } - folioq =3D folioq->next; - slot =3D 0; - } - } - } while (rdma->nr_sge < rdma->max_sge && maxsize > 0); - - iter->folioq =3D folioq; - iter->folioq_slot =3D slot; - iter->iov_offset =3D offset; - iter->count -=3D ret; - return ret; -} - /* * Extract memory fragments from a BVECQ-class iterator and add them to an= RDMA * list. The folios are not pinned. 
@@ -3373,15 +3217,6 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_i= ter *iter, size_t len, int before =3D rdma->nr_sge; =20 switch (iov_iter_type(iter)) { - case ITER_BVEC: - ret =3D smb_extract_bvec_to_rdma(iter, rdma, len); - break; - case ITER_KVEC: - ret =3D smb_extract_kvec_to_rdma(iter, rdma, len); - break; - case ITER_FOLIOQ: - ret =3D smb_extract_folioq_to_rdma(iter, rdma, len); - break; case ITER_BVECQ: ret =3D smb_extract_bvecq_to_rdma(iter, rdma, len); break; From nobody Thu Apr 2 22:23:34 2026 From: David Howells To: Christian Brauner , Matthew Wilcox , Christoph Hellwig Cc: David Howells , Paulo Alcantara , Jens Axboe , Leon Romanovsky , Steve French , ChenXiaoSong , Marc Dionne , Eric Van Hensbergen , Dominique Martinet , Ilya Dryomov , Trond Myklebust , netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara Subject: [PATCH 20/26] netfs: Remove netfs_alloc/free_folioq_buffer() Date: Thu, 26 Mar 2026 10:45:35 +0000 Message-ID: <20260326104544.509518-21-dhowells@redhat.com> In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com> References: <20260326104544.509518-1-dhowells@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Remove netfs_alloc/free_folioq_buffer() as these have been replaced with netfs_alloc/free_bvecq_buffer(). Signed-off-by: David Howells cc: Paulo Alcantara cc: Matthew Wilcox cc: Christoph Hellwig cc: Steve French cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org --- fs/afs/dir_edit.c | 1 - fs/netfs/misc.c | 98 --------------------------------------- fs/smb/client/smb2ops.c | 1 - fs/smb/client/smbdirect.c | 1 - include/linux/netfs.h | 6 --- 5 files changed, 107 deletions(-) diff --git a/fs/afs/dir_edit.c b/fs/afs/dir_edit.c index 59d3decf7692..d6a9bb4e2039 100644 --- a/fs/afs/dir_edit.c +++ b/fs/afs/dir_edit.c @@ -10,7 +10,6 @@ #include #include #include -#include #include "internal.h" #include "xdr_fs.h" =20 diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c index ab142cbaad35..a19724389147 100644 --- a/fs/netfs/misc.c +++ b/fs/netfs/misc.c @@ -8,104 +8,6 @@ #include #include "internal.h" =20 -#if 0 -/** - * netfs_alloc_folioq_buffer - Allocate buffer space into a folio queue - * @mapping: Address space to set on the folio (or NULL).
- * @_buffer: Pointer to the folio queue to add to (may point to a NULL; up= dated). - * @_cur_size: Current size of the buffer (updated). - * @size: Target size of the buffer. - * @gfp: The allocation constraints. - */ -int netfs_alloc_folioq_buffer(struct address_space *mapping, - struct folio_queue **_buffer, - size_t *_cur_size, ssize_t size, gfp_t gfp) -{ - struct folio_queue *tail =3D *_buffer, *p; - - size =3D round_up(size, PAGE_SIZE); - if (*_cur_size >=3D size) - return 0; - - if (tail) - while (tail->next) - tail =3D tail->next; - - do { - struct folio *folio; - int order =3D 0, slot; - - if (!tail || folioq_full(tail)) { - p =3D netfs_folioq_alloc(0, GFP_NOFS, netfs_trace_folioq_alloc_buffer); - if (!p) - return -ENOMEM; - if (tail) { - tail->next =3D p; - p->prev =3D tail; - } else { - *_buffer =3D p; - } - tail =3D p; - } - - if (size - *_cur_size > PAGE_SIZE) - order =3D umin(ilog2(size - *_cur_size) - PAGE_SHIFT, - MAX_PAGECACHE_ORDER); - - folio =3D folio_alloc(gfp, order); - if (!folio && order > 0) - folio =3D folio_alloc(gfp, 0); - if (!folio) - return -ENOMEM; - - folio->mapping =3D mapping; - folio->index =3D *_cur_size / PAGE_SIZE; - trace_netfs_folio(folio, netfs_folio_trace_alloc_buffer); - slot =3D folioq_append_mark(tail, folio); - *_cur_size +=3D folioq_folio_size(tail, slot); - } while (*_cur_size < size); - - return 0; -} -EXPORT_SYMBOL(netfs_alloc_folioq_buffer); - -/** - * netfs_free_folioq_buffer - Free a folio queue. - * @fq: The start of the folio queue to free - * - * Free up a chain of folio_queues and, if marked, the marked folios they = point - * to. 
- */ -void netfs_free_folioq_buffer(struct folio_queue *fq) -{ - struct folio_queue *next; - struct folio_batch fbatch; - - folio_batch_init(&fbatch); - - for (; fq; fq =3D next) { - for (int slot =3D 0; slot < folioq_count(fq); slot++) { - struct folio *folio =3D folioq_folio(fq, slot); - - if (!folio || - !folioq_is_marked(fq, slot)) - continue; - - trace_netfs_folio(folio, netfs_folio_trace_put); - if (folio_batch_add(&fbatch, folio)) - folio_batch_release(&fbatch); - } - - netfs_stat_d(&netfs_n_folioq); - next =3D fq->next; - kfree(fq); - } - - folio_batch_release(&fbatch); -} -EXPORT_SYMBOL(netfs_free_folioq_buffer); -#endif - /** * netfs_dirty_folio - Mark folio dirty and pin a cache object for writeba= ck * @mapping: The mapping the folio belongs to. diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c index 173acca17af7..0d19c8fc4c3d 100644 --- a/fs/smb/client/smb2ops.c +++ b/fs/smb/client/smb2ops.c @@ -13,7 +13,6 @@ #include #include #include -#include #include #include "cifsfs.h" #include "cifsglob.h" diff --git a/fs/smb/client/smbdirect.c b/fs/smb/client/smbdirect.c index d9e026d5e9f9..252e7757d21c 100644 --- a/fs/smb/client/smbdirect.c +++ b/fs/smb/client/smbdirect.c @@ -6,7 +6,6 @@ */ #include #include -#include #define __SMBDIRECT_SOCKET_DISCONNECT(__sc) smbd_disconnect_rdma_connectio= n(__sc) #include "../common/smbdirect/smbdirect_pdu.h" #include "smbdirect.h" diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 3345c88bbd8e..9d8576a62868 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -463,12 +463,6 @@ void netfs_end_io_write(struct inode *inode); int netfs_start_io_direct(struct inode *inode); void netfs_end_io_direct(struct inode *inode); =20 -/* Buffer wrangling helpers API. 
*/ -int netfs_alloc_folioq_buffer(struct address_space *mapping, - struct folio_queue **_buffer, - size_t *_cur_size, ssize_t size, gfp_t gfp); -void netfs_free_folioq_buffer(struct folio_queue *fq); - /** * netfs_inode - Get the netfs inode context from the inode * @inode: The inode to query From nobody Thu Apr 2 22:23:34 2026 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C29F33E8674 for ; Thu, 26 Mar 2026 10:49:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522160; cv=none; b=jqXej9mmZjYO7ZZF8KMzUvPghRGlq7KFCEiQoFCPXRk6KUOfB6X4DcRyj29QcrbKKdjTxnmnuW9FgJuar9FHyP1Q54yx4PzUds5myVL7+tCMNmiSrTrctdiFpHQGJOMY89atDdhD2YjM8kmFnIbD50HPU7JfIjgsbJ7XyHRPwfA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1774522160; c=relaxed/simple; bh=SpZ0VaxBSEp+1ZOqzyTVy/Rn9WvmDJmje0biyVbE8tE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=drpmgm9jVqnXvaN7dNKV5tbFNpdp6kEvmeihX/uR6I3Zd0yxa3SC1A1hgZaFFknpL6zMpLKoqljXaOuXLn+hjso1Fn6xGXmpcUB9lCFcTT2ZNNwhta/eB6/qbVK5kdGS4BB371KUmFrfLFuLAZZlliG+Gn2Fey/9hPo+BwT793E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=IT9vbJ+/; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com 
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
	ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 21/26] netfs: Remove netfs_extract_user_iter()
Date: Thu, 26 Mar 2026 10:45:36 +0000
Message-ID: <20260326104544.509518-22-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Remove netfs_extract_user_iter() as it has been replaced with
netfs_extract_iter().

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/iterator.c   | 96 -------------------------------------------
 include/linux/netfs.h |  3 --
 2 files changed, 99 deletions(-)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 581dbf650a19..442f893a0d65 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -137,102 +137,6 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_se
 EXPORT_SYMBOL_GPL(netfs_extract_iter);
 
 #if 0
-/**
- * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
- * @orig: The original iterator
- * @orig_len: The amount of iterator to copy
- * @new: The iterator to be set up
- * @extraction_flags: Flags to qualify the request
- *
- * Extract the page fragments from the given amount of the source iterator and
- * build up a second iterator that refers to all of those bits.  This allows
- * the original iterator to be disposed of.
- *
- * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
- * allowed on the pages extracted.
- *
- * On success, the number of elements in the bvec is returned, the original
- * iterator will have been advanced by the amount extracted.
- *
- * The iov_iter_extract_mode() function should be used to query how cleanup
- * should be performed.
- */
-ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
-				struct iov_iter *new,
-				iov_iter_extraction_t extraction_flags)
-{
-	struct bio_vec *bv = NULL;
-	struct page **pages;
-	unsigned int cur_npages;
-	unsigned int max_pages;
-	unsigned int npages = 0;
-	unsigned int i;
-	ssize_t ret;
-	size_t count = orig_len, offset, len;
-	size_t bv_size, pg_size;
-
-	if (WARN_ON_ONCE(!iter_is_ubuf(orig) && !iter_is_iovec(orig)))
-		return -EIO;
-
-	max_pages = iov_iter_npages(orig, INT_MAX);
-	bv_size = array_size(max_pages, sizeof(*bv));
-	bv = kvmalloc(bv_size, GFP_KERNEL);
-	if (!bv)
-		return -ENOMEM;
-
-	/* Put the page list at the end of the bvec list storage.  bvec
-	 * elements are larger than page pointers, so as long as we work
-	 * 0->last, we should be fine.
-	 */
-	pg_size = array_size(max_pages, sizeof(*pages));
-	pages = (void *)bv + bv_size - pg_size;
-
-	while (count && npages < max_pages) {
-		ret = iov_iter_extract_pages(orig, &pages, count,
-					     max_pages - npages, extraction_flags,
-					     &offset);
-		if (unlikely(ret <= 0)) {
-			ret = ret ?: -EIO;
-			break;
-		}
-
-		if (ret > count) {
-			pr_err("get_pages rc=%zd more than %zu\n", ret, count);
-			break;
-		}
-
-		count -= ret;
-		ret += offset;
-		cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
-
-		if (npages + cur_npages > max_pages) {
-			pr_err("Out of bvec array capacity (%u vs %u)\n",
-			       npages + cur_npages, max_pages);
-			break;
-		}
-
-		for (i = 0; i < cur_npages; i++) {
-			len = ret > PAGE_SIZE ?
PAGE_SIZE : ret;
-			bvec_set_page(bv + npages + i, *pages++, len - offset, offset);
-			ret -= len;
-			offset = 0;
-		}
-
-		npages += cur_npages;
-	}
-
-	if (ret < 0 && (ret == -ENOMEM || npages == 0)) {
-		for (i = 0; i < npages; i++)
-			unpin_user_page(bv[i].bv_page);
-		kvfree(bv);
-		return ret;
-	}
-
-	iov_iter_bvec(new, orig->data_source, bv, npages, orig_len - count);
-	return npages;
-}
-EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
-
 /*
  * Select the span of a bvec iterator we're going to use.  Limit it by both maximum
  * size and maximum number of segments.  Returns the size of the span in bytes.
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 9d8576a62868..65e39f9b0c10 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -448,9 +448,6 @@ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_segs,
			   unsigned long long fpos, struct bvecq **_bvecq_head,
			   iov_iter_extraction_t extraction_flags);
-ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
-				struct iov_iter *new,
-				iov_iter_extraction_t extraction_flags);
 size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
			size_t max_size, size_t max_segs);
 void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 22/26] iov_iter: Remove ITER_FOLIOQ
Date: Thu, 26 Mar 2026 10:45:37 +0000
Message-ID: <20260326104544.509518-23-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Remove ITER_FOLIOQ as it's no longer used.
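For reference, the slot-advance walk that the removed iov_iter_folioq_advance() performed (deleted from lib/iov_iter.c below) can be modelled in userspace roughly as follows. This is a toy sketch with made-up types (toy_queue, toy_iter, toy_advance, NR_SLOTS) standing in for struct folio_queue and struct iov_iter, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 3	/* toy stand-in for folioq_nr_slots() */

/* Toy stand-in for a folio_queue link: a fixed number of slots,
 * each holding a segment of a known size, chained via ->next. */
struct toy_queue {
	struct toy_queue *next;
	size_t slot_size[NR_SLOTS];
};

/* Toy stand-in for the folioq fields of struct iov_iter. */
struct toy_iter {
	const struct toy_queue *q;	/* current queue segment */
	unsigned int slot;		/* current slot in that segment */
	size_t offset;			/* offset into the current slot */
	size_t count;			/* total bytes remaining */
};

/* Mirror of the removed advance logic: consume "size" bytes,
 * rolling the slot pointer into the next queue segment whenever
 * the current slot is fully consumed. */
static void toy_advance(struct toy_iter *i, size_t size)
{
	const struct toy_queue *q = i->q;
	unsigned int slot = i->slot;

	if (!i->count)
		return;
	i->count -= size;

	size += i->offset;	/* count from the start of the current slot */
	while (size) {
		size_t fsize = q->slot_size[slot];

		if (size < fsize)
			break;
		size -= fsize;
		slot++;
		if (slot >= NR_SLOTS && q->next) {
			q = q->next;
			slot = 0;
		}
	}
	i->offset = size;
	i->slot = slot;
	i->q = q;
}
```

Advancing by 10 bytes over 4-byte slots lands in slot 2 at offset 2; a further 4-byte advance rolls over into the next queue segment, which is the behaviour the bvecq-based code now provides instead.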
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/iov_iter.h   |  65 +---------
 include/linux/uio.h        |  12 --
 lib/iov_iter.c             | 235 +--------------------------------
 lib/scatterlist.c          |  67 +---------
 lib/tests/kunit_iov_iter.c | 257 -------------------------------------
 5 files changed, 5 insertions(+), 631 deletions(-)

diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index 309642b3901f..9f3a4497c5c3 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -10,7 +10,6 @@
 
 #include 
 #include 
-#include 
 
 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
			     void *priv, void *priv2);
@@ -194,62 +193,6 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 	return progress;
 }
 
-/*
- * Handle ITER_FOLIOQ.
- */
-static __always_inline
-size_t iterate_folioq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
-		      iov_step_f step)
-{
-	const struct folio_queue *folioq = iter->folioq;
-	unsigned int slot = iter->folioq_slot;
-	size_t progress = 0, skip = iter->iov_offset;
-
-	if (slot == folioq_nr_slots(folioq)) {
-		/* The iterator may have been extended.
		 */
-		folioq = folioq->next;
-		slot = 0;
-	}
-
-	do {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t part, remain = 0, consumed;
-		size_t fsize;
-		void *base;
-
-		if (!folio)
-			break;
-
-		fsize = folioq_folio_size(folioq, slot);
-		if (skip < fsize) {
-			base = kmap_local_folio(folio, skip);
-			part = umin(len, PAGE_SIZE - skip % PAGE_SIZE);
-			remain = step(base, progress, part, priv, priv2);
-			kunmap_local(base);
-			consumed = part - remain;
-			len -= consumed;
-			progress += consumed;
-			skip += consumed;
-		}
-		if (skip >= fsize) {
-			skip = 0;
-			slot++;
-			if (slot == folioq_nr_slots(folioq) && folioq->next) {
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
-		if (remain)
-			break;
-	} while (len);
-
-	iter->folioq_slot = slot;
-	iter->folioq = folioq;
-	iter->iov_offset = skip;
-	iter->count -= progress;
-	return progress;
-}
-
 /*
  * Handle ITER_XARRAY.
  */
@@ -361,8 +304,6 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_bvecq(iter))
 		return iterate_bvecq(iter, len, priv, priv2, step);
-	if (iov_iter_is_folioq(iter))
-		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
 		return iterate_xarray(iter, len, priv, priv2, step);
 	return iterate_discard(iter, len, priv, priv2, step);
@@ -397,8 +338,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
  * buffer is presented in segments, which for kernel iteration are broken up by
  * physical pages and mapped, with the mapped address being presented.
  *
- * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
- * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, XARRAY and DISCARD-type
+ * iterators; it will not handle UBUF or IOVEC-type iterators.
  *
  * A step functions, @step, must be provided, one for handling mapped kernel
  * addresses and the other is given user addresses which have the potential to
@@ -427,8 +368,6 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_kvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_bvecq(iter))
 		return iterate_bvecq(iter, len, priv, priv2, step);
-	if (iov_iter_is_folioq(iter))
-		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
 		return iterate_xarray(iter, len, priv, priv2, step);
 	return iterate_discard(iter, len, priv, priv2, step);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index aa50d348dfcc..e84a0c4f28c6 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,7 +11,6 @@
 #include 
 
 struct page;
-struct folio_queue;
 
 typedef unsigned int __bitwise iov_iter_extraction_t;
 
@@ -26,7 +25,6 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_BVEC,
 	ITER_KVEC,
-	ITER_FOLIOQ,
 	ITER_BVECQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
@@ -69,7 +67,6 @@ struct iov_iter {
 		const struct iovec *__iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
-		const struct folio_queue *folioq;
 		const struct bvecq *bvecq;
 		struct xarray *xarray;
 		void __user *ubuf;
@@ -79,7 +76,6 @@ struct iov_iter {
 	};
 	union {
 		unsigned long nr_segs;
-		u8 folioq_slot;
 		u16 bvecq_slot;
 		loff_t xarray_start;
 	};
@@ -148,11 +144,6 @@ static inline bool iov_iter_is_discard(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_DISCARD;
 }
 
-static inline bool iov_iter_is_folioq(const struct iov_iter *i)
-{
-	return iov_iter_type(i) == ITER_FOLIOQ;
-}
-
 static inline bool iov_iter_is_bvecq(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_BVECQ;
@@ -303,9 +294,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
		   unsigned long nr_segs, size_t count);
 void iov_iter_discard(struct
 iov_iter *i, unsigned int direction, size_t count);
-void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-			  const struct folio_queue *folioq,
-			  unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
			 const struct bvecq *bvecq, unsigned int first_slot,
			 unsigned int offset, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4f091e6d4a22..d203088dbf5a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -538,39 +538,6 @@ static void iov_iter_iovec_advance(struct iov_iter *i, size_t size)
 	i->__iov = iov;
 }
 
-static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
-{
-	const struct folio_queue *folioq = i->folioq;
-	unsigned int slot = i->folioq_slot;
-
-	if (!i->count)
-		return;
-	i->count -= size;
-
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		slot = 0;
-	}
-
-	size += i->iov_offset; /* From beginning of current segment. */
-	do {
-		size_t fsize = folioq_folio_size(folioq, slot);
-
-		if (likely(size < fsize))
-			break;
-		size -= fsize;
-		slot++;
-		if (slot >= folioq_nr_slots(folioq) && folioq->next) {
-			folioq = folioq->next;
-			slot = 0;
-		}
-	} while (size);
-
-	i->iov_offset = size;
-	i->folioq_slot = slot;
-	i->folioq = folioq;
-}
-
 static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
 {
 	const struct bvecq *bq = i->bvecq;
@@ -616,8 +583,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
-	} else if (iov_iter_is_folioq(i)) {
-		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_bvecq(i)) {
 		iov_iter_bvecq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
@@ -626,32 +591,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
-static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
-{
-	const struct
 folio_queue *folioq = i->folioq;
-	unsigned int slot = i->folioq_slot;
-
-	for (;;) {
-		size_t fsize;
-
-		if (slot == 0) {
-			folioq = folioq->prev;
-			slot = folioq_nr_slots(folioq);
-		}
-		slot--;
-
-		fsize = folioq_folio_size(folioq, slot);
-		if (unroll <= fsize) {
-			i->iov_offset = fsize - unroll;
-			break;
-		}
-		unroll -= fsize;
-	}
-
-	i->folioq_slot = slot;
-	i->folioq = folioq;
-}
-
 static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
 {
 	const struct bvecq *bq = i->bvecq;
@@ -709,9 +648,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			}
 			unroll -= n;
 		}
-	} else if (iov_iter_is_folioq(i)) {
-		i->iov_offset = 0;
-		iov_iter_folioq_revert(i, unroll);
 	} else if (iov_iter_is_bvecq(i)) {
 		i->iov_offset = 0;
 		iov_iter_bvecq_revert(i, unroll);
@@ -744,8 +680,6 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 	}
 	if (!i->count)
 		return 0;
-	if (unlikely(iov_iter_is_folioq(i)))
-		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
 	if (unlikely(iov_iter_is_bvecq(i)))
 		return min(i->count, i->bvecq->bv[i->bvecq_slot].bv_len - i->iov_offset);
 	return i->count;
@@ -784,36 +718,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-/**
- * iov_iter_folio_queue - Initialise an I/O iterator to use the folios in a folio queue
- * @i: The iterator to initialise.
- * @direction: The direction of the transfer.
- * @folioq: The starting point in the folio queue.
- * @first_slot: The first slot in the folio queue to use
- * @offset: The offset into the folio in the first slot to start at
- * @count: The size of the I/O buffer in bytes.
- *
- * Set up an I/O iterator to either draw data out of the pages attached to an
- * inode or to inject data into those pages.  The pages *must* be prevented
- * from evaporation, either by taking a ref on them or locking them by the
- * caller.
- */
-void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-			  const struct folio_queue *folioq, unsigned int first_slot,
-			  unsigned int offset, size_t count)
-{
-	BUG_ON(direction & ~1);
-	*i = (struct iov_iter) {
-		.iter_type = ITER_FOLIOQ,
-		.data_source = direction,
-		.folioq = folioq,
-		.folioq_slot = first_slot,
-		.count = count,
-		.iov_offset = offset,
-	};
-}
-EXPORT_SYMBOL(iov_iter_folio_queue);
-
 /**
  * iov_iter_bvec_queue - Initialise an I/O iterator to use a segmented bvec queue
  * @i: The iterator to initialise.
@@ -982,9 +886,6 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
 
-	/* With both xarray and folioq types, we're dealing with whole folios. */
-	if (iov_iter_is_folioq(i))
-		return i->iov_offset | i->count;
 	if (iov_iter_is_bvecq(i))
 		return iov_iter_alignment_bvecq(i);
 	if (iov_iter_is_xarray(i))
@@ -1039,65 +940,6 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t iter_folioq_get_pages(struct iov_iter *iter,
-				     struct page ***ppages, size_t maxsize,
-				     unsigned maxpages, size_t *_start_offset)
-{
-	const struct folio_queue *folioq = iter->folioq;
-	struct page **pages;
-	unsigned int slot = iter->folioq_slot;
-	size_t extracted = 0, count = iter->count, iov_offset = iter->iov_offset;
-
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		slot = 0;
-		if (WARN_ON(iov_offset != 0))
-			return -EIO;
-	}
-
-	maxpages = want_pages_array(ppages, maxsize, iov_offset & ~PAGE_MASK, maxpages);
-	if (!maxpages)
-		return -ENOMEM;
-	*_start_offset = iov_offset & ~PAGE_MASK;
-	pages = *ppages;
-
-	for (;;) {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t offset = iov_offset, fsize = folioq_folio_size(folioq, slot);
-		size_t part = PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part = umin(part, umin(maxsize - extracted, fsize -
offset));
-			count -= part;
-			iov_offset += part;
-			extracted += part;
-
-			*pages = folio_page(folio, offset / PAGE_SIZE);
-			get_page(*pages);
-			pages++;
-			maxpages--;
-		}
-
-		if (maxpages == 0 || extracted >= maxsize)
-			break;
-
-		if (iov_offset >= fsize) {
-			iov_offset = 0;
-			slot++;
-			if (slot == folioq_nr_slots(folioq) && folioq->next) {
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
-	}
-
-	iter->count = count;
-	iter->iov_offset = iov_offset;
-	iter->folioq = folioq;
-	iter->folioq_slot = slot;
-	return extracted;
-}
-
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
					  pgoff_t index, unsigned int nr_pages)
 {
@@ -1249,8 +1091,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		}
 		return maxsize;
 	}
-	if (iov_iter_is_folioq(i))
-		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	WARN_ON_ONCE(iov_iter_is_bvecq(i));
@@ -1366,11 +1206,6 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
-	if (iov_iter_is_folioq(i)) {
-		unsigned offset = i->iov_offset % PAGE_SIZE;
-		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
-		return min(npages, maxpages);
-	}
 	if (iov_iter_is_bvecq(i))
 		return iov_npages_bvecq(i, maxpages);
 	if (iov_iter_is_xarray(i)) {
@@ -1654,68 +1489,6 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 	i->nr_segs = state->nr_segs;
 }
 
-/*
- * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This does
- * not get references on the pages, nor does it get a pin on them.
- */
-static ssize_t iov_iter_extract_folioq_pages(struct iov_iter *i,
-					     struct page ***pages, size_t maxsize,
-					     unsigned int maxpages,
-					     iov_iter_extraction_t extraction_flags,
-					     size_t *offset0)
-{
-	const struct folio_queue *folioq = i->folioq;
-	struct page **p;
-	unsigned int nr = 0;
-	size_t extracted = 0, offset, slot = i->folioq_slot;
-
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		slot = 0;
-		if (WARN_ON(i->iov_offset != 0))
-			return -EIO;
-	}
-
-	offset = i->iov_offset & ~PAGE_MASK;
-	*offset0 = offset;
-
-	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
-	if (!maxpages)
-		return -ENOMEM;
-	p = *pages;
-
-	for (;;) {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t offset = i->iov_offset, fsize = folioq_folio_size(folioq, slot);
-		size_t part = PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part = umin(part, umin(maxsize - extracted, fsize - offset));
-			i->count -= part;
-			i->iov_offset += part;
-			extracted += part;
-
-			p[nr++] = folio_page(folio, offset / PAGE_SIZE);
-		}
-
-		if (nr >= maxpages || extracted >= maxsize)
-			break;
-
-		if (i->iov_offset >= fsize) {
-			i->iov_offset = 0;
-			slot++;
-			if (slot == folioq_nr_slots(folioq) && folioq->next) {
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
-	}
-
-	i->folioq = folioq;
-	i->folioq_slot = slot;
-	return extracted;
-}
-
 /*
  * Extract a list of virtually contiguous pages from an ITER_BVECQ iterator.
  * This does not get references on the pages, nor does it get a pin on them.
@@ -2078,8 +1851,8 @@ static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
  *     added to the pages, but refs will not be taken.
  *     iov_iter_extract_will_pin() will return true.
  *
- *  (*) If the iterator is ITER_KVEC, ITER_BVEC, ITER_FOLIOQ or ITER_XARRAY, the
- *      pages are merely listed; no extra refs or pins are obtained.
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC, ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
  *     iov_iter_extract_will_pin() will return 0.
  *
  * Note also:
@@ -2114,10 +1887,6 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_bvec_pages(i, pages, maxsize,
						   maxpages, extraction_flags,
						   offset0);
-	if (iov_iter_is_folioq(i))
-		return iov_iter_extract_folioq_pages(i, pages, maxsize,
-						     maxpages, extraction_flags,
-						     offset0);
 	if (iov_iter_is_bvecq(i))
 		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
						    maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 93a3d194a914..25f64272839e 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /**
  * sg_nents - return total count of entries in scatterlist
@@ -1268,67 +1267,6 @@ static ssize_t extract_kvec_to_sg(struct iov_iter *iter,
 	return ret;
 }
 
-/*
- * Extract up to sg_max folios from an FOLIOQ-type iterator and add them to
- * the scatterlist.  The pages are not pinned.
- */
-static ssize_t extract_folioq_to_sg(struct iov_iter *iter,
-				    ssize_t maxsize,
-				    struct sg_table *sgtable,
-				    unsigned int sg_max,
-				    iov_iter_extraction_t extraction_flags)
-{
-	const struct folio_queue *folioq = iter->folioq;
-	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
-	unsigned int slot = iter->folioq_slot;
-	ssize_t ret = 0;
-	size_t offset = iter->iov_offset;
-
-	BUG_ON(!folioq);
-
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		if (WARN_ON_ONCE(!folioq))
-			return 0;
-		slot = 0;
-	}
-
-	do {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t fsize = folioq_folio_size(folioq, slot);
-
-		if (offset < fsize) {
-			size_t part = umin(maxsize - ret, fsize - offset);
-
-			sg_set_page(sg, folio_page(folio, 0), part, offset);
-			sgtable->nents++;
-			sg++;
-			sg_max--;
-			offset += part;
-			ret += part;
-		}
-
-		if (offset >= fsize) {
-			offset = 0;
-			slot++;
-			if (slot >= folioq_nr_slots(folioq)) {
-				if (!folioq->next) {
-					WARN_ON_ONCE(ret < iter->count);
-					break;
-				}
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
-	} while (sg_max > 0 && ret < maxsize);
-
-	iter->folioq = folioq;
-	iter->folioq_slot = slot;
-	iter->iov_offset = offset;
-	iter->count -= ret;
-	return ret;
-}
-
 /*
  * Extract up to sg_max folios from an BVECQ-type iterator and add them to
  * the scatterlist.  The pages are not pinned.
@@ -1453,7 +1391,7 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *iter,
  * addition of @sg_max elements.
  *
  * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
- * pinned; BVEC-, KVEC-, FOLIOQ- and XARRAY-type are extracted but aren't
+ * pinned; BVEC-, KVEC-, BVECQ- and XARRAY-type are extracted but aren't
  * pinned; DISCARD-type is not supported.
  *
  * No end mark is placed on the scatterlist; that's left to the caller.
@@ -1486,9 +1424,6 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
 	case ITER_KVEC:
 		return extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
					  extraction_flags);
-	case ITER_FOLIOQ:
-		return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max,
-					    extraction_flags);
 	case ITER_BVECQ:
 		return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max,
					   extraction_flags);
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index ff0621636ff1..7011f0ff7396 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -11,9 +11,7 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 
 MODULE_DESCRIPTION("iov_iter testing");
@@ -364,179 +362,6 @@ static void __init iov_kunit_copy_from_bvec(struct kunit *test)
 	KUNIT_SUCCEED(test);
 }
 
-static void iov_kunit_destroy_folioq(void *data)
-{
-	struct folio_queue *folioq, *next;
-
-	for (folioq = data; folioq; folioq = next) {
-		next = folioq->next;
-		for (int i = 0; i < folioq_nr_slots(folioq); i++)
-			if (folioq_folio(folioq, i))
-				folio_put(folioq_folio(folioq, i));
-		kfree(folioq);
-	}
-}
-
-static void __init iov_kunit_load_folioq(struct kunit *test,
-					 struct iov_iter *iter, int dir,
-					 struct folio_queue *folioq,
-					 struct page **pages, size_t npages)
-{
-	struct folio_queue *p = folioq;
-	size_t size = 0;
-	int i;
-
-	for (i = 0; i < npages; i++) {
-		if (folioq_full(p)) {
-			p->next = kzalloc_obj(struct folio_queue);
-			KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p->next);
-			folioq_init(p->next, 0);
-			p->next->prev = p;
-			p = p->next;
-		}
-		folioq_append(p, page_folio(pages[i]));
-		size += PAGE_SIZE;
-	}
-	iov_iter_folio_queue(iter, dir, folioq, 0, 0, size);
-}
-
-static struct folio_queue *iov_kunit_create_folioq(struct kunit *test)
-{
-	struct folio_queue *folioq;
-
-	folioq = kzalloc_obj(struct folio_queue);
-	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, folioq);
-	kunit_add_action_or_reset(test, iov_kunit_destroy_folioq, folioq);
-	folioq_init(folioq, 0);
-
-	return folioq;
-}
-
-/*
- * Test copying to a ITER_FOLIOQ-type iterator.
- */
-static void __init iov_kunit_copy_to_folioq(struct kunit *test)
-{
-	const struct kvec_test_range *pr;
-	struct iov_iter iter;
-	struct folio_queue *folioq;
-	struct page **spages, **bpages;
-	u8 *scratch, *buffer;
-	size_t bufsize, npages, size, copied;
-	int i, patt;
-
-	bufsize = 0x100000;
-	npages = bufsize / PAGE_SIZE;
-
-	folioq = iov_kunit_create_folioq(test);
-
-	scratch = iov_kunit_create_buffer(test, &spages, npages);
-	for (i = 0; i < bufsize; i++)
-		scratch[i] = pattern(i);
-
-	buffer = iov_kunit_create_buffer(test, &bpages, npages);
-	memset(buffer, 0, bufsize);
-
-	iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages);
-
-	i = 0;
-	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
-		size = pr->to - pr->from;
-		KUNIT_ASSERT_LE(test, pr->to, bufsize);
-
-		iov_iter_folio_queue(&iter, READ, folioq, 0, 0, pr->to);
-		iov_iter_advance(&iter, pr->from);
-		copied = copy_to_iter(scratch + i, size, &iter);
-
-		KUNIT_EXPECT_EQ(test, copied, size);
-		KUNIT_EXPECT_EQ(test, iter.count, 0);
-		KUNIT_EXPECT_EQ(test, iter.iov_offset, pr->to % PAGE_SIZE);
-		i += size;
-		if (test->status == KUNIT_FAILURE)
-			goto stop;
-	}
-
-	/* Build the expected image in the scratch buffer. */
-	patt = 0;
-	memset(scratch, 0, bufsize);
-	for (pr = kvec_test_ranges; pr->from >= 0; pr++)
-		for (i = pr->from; i < pr->to; i++)
-			scratch[i] = pattern(patt++);
-
-	/* Compare the images */
-	for (i = 0; i < bufsize; i++) {
-		KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=%x", i);
-		if (buffer[i] != scratch[i])
-			return;
-	}
-
-stop:
-	KUNIT_SUCCEED(test);
-}
-
-/*
- * Test copying from a ITER_FOLIOQ-type iterator.
- */ -static void __init iov_kunit_copy_from_folioq(struct kunit *test) -{ - const struct kvec_test_range *pr; - struct iov_iter iter; - struct folio_queue *folioq; - struct page **spages, **bpages; - u8 *scratch, *buffer; - size_t bufsize, npages, size, copied; - int i, j; - - bufsize =3D 0x100000; - npages =3D bufsize / PAGE_SIZE; - - folioq =3D iov_kunit_create_folioq(test); - - buffer =3D iov_kunit_create_buffer(test, &bpages, npages); - for (i =3D 0; i < bufsize; i++) - buffer[i] =3D pattern(i); - - scratch =3D iov_kunit_create_buffer(test, &spages, npages); - memset(scratch, 0, bufsize); - - iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages); - - i =3D 0; - for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) { - size =3D pr->to - pr->from; - KUNIT_ASSERT_LE(test, pr->to, bufsize); - - iov_iter_folio_queue(&iter, WRITE, folioq, 0, 0, pr->to); - iov_iter_advance(&iter, pr->from); - copied =3D copy_from_iter(scratch + i, size, &iter); - - KUNIT_EXPECT_EQ(test, copied, size); - KUNIT_EXPECT_EQ(test, iter.count, 0); - KUNIT_EXPECT_EQ(test, iter.iov_offset, pr->to % PAGE_SIZE); - i +=3D size; - } - - /* Build the expected image in the main buffer. */ - i =3D 0; - memset(buffer, 0, bufsize); - for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) { - for (j =3D pr->from; j < pr->to; j++) { - buffer[i++] =3D pattern(j); - if (i >=3D bufsize) - goto stop; - } - } -stop: - - /* Compare the images */ - for (i =3D 0; i < bufsize; i++) { - KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=3D%x", i); - if (scratch[i] !=3D buffer[i]) - return; - } - - KUNIT_SUCCEED(test); -} - static void iov_kunit_destroy_bvecq(void *data) { struct bvecq *bq, *next; @@ -1029,85 +854,6 @@ static void __init iov_kunit_extract_pages_bvec(struc= t kunit *test) KUNIT_SUCCEED(test); } =20 -/* - * Test the extraction of ITER_FOLIOQ-type iterators. 
- */ -static void __init iov_kunit_extract_pages_folioq(struct kunit *test) -{ - const struct kvec_test_range *pr; - struct folio_queue *folioq; - struct iov_iter iter; - struct page **bpages, *pagelist[8], **pages =3D pagelist; - ssize_t len; - size_t bufsize, size =3D 0, npages; - int i, from; - - bufsize =3D 0x100000; - npages =3D bufsize / PAGE_SIZE; - - folioq =3D iov_kunit_create_folioq(test); - - iov_kunit_create_buffer(test, &bpages, npages); - iov_kunit_load_folioq(test, &iter, READ, folioq, bpages, npages); - - for (pr =3D kvec_test_ranges; pr->from >=3D 0; pr++) { - from =3D pr->from; - size =3D pr->to - from; - KUNIT_ASSERT_LE(test, pr->to, bufsize); - - iov_iter_folio_queue(&iter, WRITE, folioq, 0, 0, pr->to); - iov_iter_advance(&iter, from); - - do { - size_t offset0 =3D LONG_MAX; - - for (i =3D 0; i < ARRAY_SIZE(pagelist); i++) - pagelist[i] =3D (void *)(unsigned long)0xaa55aa55aa55aa55ULL; - - len =3D iov_iter_extract_pages(&iter, &pages, 100 * 1024, - ARRAY_SIZE(pagelist), 0, &offset0); - KUNIT_EXPECT_GE(test, len, 0); - if (len < 0) - break; - KUNIT_EXPECT_LE(test, len, size); - KUNIT_EXPECT_EQ(test, iter.count, size - len); - if (len =3D=3D 0) - break; - size -=3D len; - KUNIT_EXPECT_GE(test, (ssize_t)offset0, 0); - KUNIT_EXPECT_LT(test, offset0, PAGE_SIZE); - - for (i =3D 0; i < ARRAY_SIZE(pagelist); i++) { - struct page *p; - ssize_t part =3D min_t(ssize_t, len, PAGE_SIZE - offset0); - int ix; - - KUNIT_ASSERT_GE(test, part, 0); - ix =3D from / PAGE_SIZE; - KUNIT_ASSERT_LT(test, ix, npages); - p =3D bpages[ix]; - KUNIT_EXPECT_PTR_EQ(test, pagelist[i], p); - KUNIT_EXPECT_EQ(test, offset0, from % PAGE_SIZE); - from +=3D part; - len -=3D part; - KUNIT_ASSERT_GE(test, len, 0); - if (len =3D=3D 0) - break; - offset0 =3D 0; - } - - if (test->status =3D=3D KUNIT_FAILURE) - goto stop; - } while (iov_iter_count(&iter) > 0); - - KUNIT_EXPECT_EQ(test, size, 0); - KUNIT_EXPECT_EQ(test, iter.count, 0); - } - -stop: - KUNIT_SUCCEED(test); -} - /* * Test the 
extraction of ITER_XARRAY-type iterators.
 */
@@ -1192,15 +938,12 @@ static struct kunit_case __refdata iov_kunit_cases[] = {
 	KUNIT_CASE(iov_kunit_copy_from_kvec),
 	KUNIT_CASE(iov_kunit_copy_to_bvec),
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
-	KUNIT_CASE(iov_kunit_copy_to_folioq),
-	KUNIT_CASE(iov_kunit_copy_from_folioq),
 	KUNIT_CASE(iov_kunit_copy_to_bvecq),
 	KUNIT_CASE(iov_kunit_copy_from_bvecq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvec),
-	KUNIT_CASE(iov_kunit_extract_pages_folioq),
 	KUNIT_CASE(iov_kunit_extract_pages_xarray),
 	{}
 };

From nobody Thu Apr 2 22:23:34 2026
From: David Howells
To:
Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky, Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen, Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs@lists.linux.dev, linux-afs@lists.infradead.org, linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org, v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 23/26] netfs: Remove folio_queue and rolling_buffer
Date: Thu, 26 Mar 2026 10:45:38 +0000
Message-ID: <20260326104544.509518-24-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Remove folio_queue and rolling_buffer as they're no longer used.
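[Editor's illustration, not part of the patch.] The structure this commit deletes is documented at length in the removed folio_queue.rst below. As a quick orientation before reading those hunks, here is a hedged, userspace-only model of the semantics that documentation describes: a fixed-capacity segment holding an ordered run of folio pointers, an occupancy count that never decreases, and per-slot 1-bit marks. All names (`fq_segment`, `FQ_SLOTS`, etc.) are illustrative; this is not kernel code and not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of one folio_queue segment: an array of "folio"
 * pointers plus two per-slot mark bitmaps.  Capacity and field names
 * are illustrative, not the kernel's. */
#define FQ_SLOTS 31

struct fq_segment {
	void *folios[FQ_SLOTS];     /* slot -> folio pointer */
	unsigned long marks;        /* first 1-bit mark per slot */
	unsigned long marks2;       /* second 1-bit mark per slot */
	unsigned int count;         /* slots used; never decremented */
	struct fq_segment *next, *prev;
};

static void fq_init(struct fq_segment *s)
{
	*s = (struct fq_segment){0};
}

static bool fq_full(const struct fq_segment *s)
{
	return s->count >= FQ_SLOTS;
}

/* Append into the next unused slot and return the slot number.  As in
 * the removed API, no capacity check is made on the caller's behalf. */
static unsigned int fq_append(struct fq_segment *s, void *folio)
{
	unsigned int slot = s->count++;

	s->folios[slot] = folio;
	return slot;
}

static void fq_mark(struct fq_segment *s, unsigned int slot)
{
	s->marks |= 1UL << slot;
}

static bool fq_is_marked(const struct fq_segment *s, unsigned int slot)
{
	return s->marks & (1UL << slot);
}

/* Clearing a slot drops the pointer and its marks but, as the removed
 * documentation notes, does not decrease the occupancy count. */
static void fq_clear(struct fq_segment *s, unsigned int slot)
{
	s->folios[slot] = NULL;
	s->marks &= ~(1UL << slot);
	s->marks2 &= ~(1UL << slot);
}
```

The point of the model is the asymmetry the docs call out: `fq_append` raises the count, but `fq_clear` does not lower it, so count/fullness describe how many slots were ever initialised rather than how many are occupied.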
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: Steve French
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 Documentation/core-api/folio_queue.rst | 209 -----------------
 Documentation/core-api/index.rst       |   1 -
 fs/netfs/iterator.c                    | 192 ----------------
 fs/netfs/rolling_buffer.c              | 297 -------------------------
 include/linux/folio_queue.h            | 282 -----------------------
 include/linux/rolling_buffer.h         |  64 ------
 6 files changed, 1045 deletions(-)
 delete mode 100644 Documentation/core-api/folio_queue.rst
 delete mode 100644 fs/netfs/rolling_buffer.c
 delete mode 100644 include/linux/folio_queue.h
 delete mode 100644 include/linux/rolling_buffer.h

diff --git a/Documentation/core-api/folio_queue.rst b/Documentation/core-api/folio_queue.rst
deleted file mode 100644
index b7628896d2b6..000000000000
--- a/Documentation/core-api/folio_queue.rst
+++ /dev/null
@@ -1,209 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-===========
-Folio Queue
-===========
-
-:Author: David Howells
-
-.. Contents:
-
- * Overview
- * Initialisation
- * Adding and removing folios
- * Querying information about a folio
- * Querying information about a folio_queue
- * Folio queue iteration
- * Folio marks
- * Lockless simultaneous production/consumption issues
-
-
-Overview
-========
-
-The folio_queue struct forms a single segment in a segmented list of folios
-that can be used to form an I/O buffer.  As such, the list can be iterated over
-using the ITER_FOLIOQ iov_iter type.
-
-The publicly accessible members of the structure are::
-
-    struct folio_queue {
-        struct folio_queue *next;
-        struct folio_queue *prev;
-        ...
-    };
-
-A pair of pointers are provided, ``next`` and ``prev``, that point to the
-segments on either side of the segment being accessed.
Whilst this is a
-doubly-linked list, it is intentionally not a circular list; the outward
-sibling pointers in terminal segments should be NULL.
-
-Each segment in the list also stores:
-
- * an ordered sequence of folio pointers,
- * the size of each folio and
- * three 1-bit marks per folio,
-
-but these should not be accessed directly as the underlying data structure may
-change; rather, the access functions outlined below should be used.
-
-The facility can be made accessible by::
-
-    #include <linux/folio_queue.h>
-
-and to use the iterator::
-
-    #include <linux/uio.h>
-
-
-Initialisation
-==============
-
-A segment should be initialised by calling::
-
-    void folioq_init(struct folio_queue *folioq);
-
-with a pointer to the segment to be initialised.  Note that this will not
-necessarily initialise all the folio pointers, so care must be taken to check
-the number of folios added.
-
-
-Adding and removing folios
-==========================
-
-Folios can be set in the next unused slot in a segment struct by calling one
-of::
-
-    unsigned int folioq_append(struct folio_queue *folioq,
-                               struct folio *folio);
-
-    unsigned int folioq_append_mark(struct folio_queue *folioq,
-                                    struct folio *folio);
-
-Both functions update the stored folio count, store the folio and note its
-size.  The second function also sets the first mark for the folio added.  Both
-functions return the number of the slot used.  [!] Note that no attempt is made
-to check that the capacity wasn't overrun and the list will not be extended
-automatically.
-
-A folio can be excised by calling::
-
-    void folioq_clear(struct folio_queue *folioq, unsigned int slot);
-
-This clears the slot in the array and also clears all the marks for that folio,
-but doesn't change the folio count - so future accesses of that slot must check
-if the slot is occupied.
-
-
-Querying information about a folio
-==================================
-
-Information about the folio in a particular slot may be queried by the
-following function::
-
-    struct folio *folioq_folio(const struct folio_queue *folioq,
-                               unsigned int slot);
-
-If a folio has not yet been set in that slot, this may yield an undefined
-pointer.  The size of the folio in a slot may be queried with either of::
-
-    unsigned int folioq_folio_order(const struct folio_queue *folioq,
-                                    unsigned int slot);
-
-    size_t folioq_folio_size(const struct folio_queue *folioq,
-                             unsigned int slot);
-
-The first function returns the size as an order and the second as a number of
-bytes.
-
-
-Querying information about a folio_queue
-========================================
-
-Information may be retrieved about a particular segment with the following
-functions::
-
-    unsigned int folioq_nr_slots(const struct folio_queue *folioq);
-
-    unsigned int folioq_count(struct folio_queue *folioq);
-
-    bool folioq_full(struct folio_queue *folioq);
-
-The first function returns the maximum capacity of a segment.  It must not be
-assumed that this won't vary between segments.  The second returns the number
-of folios added to a segment and the third is a shorthand to indicate if the
-segment has been filled to capacity.
-
-Note that the count and fullness are not affected by clearing folios from the
-segment.  These are more about indicating how many slots in the array have been
-initialised, and it is assumed that slots won't get reused; rather, the segment
-will get discarded as the queue is consumed.
-
-
-Folio marks
-===========
-
-Folios within a queue can also have marks assigned to them.  These marks can be
-used to note information such as if a folio needs folio_put() calling upon it.
-There are three marks available to be set for each folio.
-
-The marks can be set by::
-
-    void folioq_mark(struct folio_queue *folioq, unsigned int slot);
-    void folioq_mark2(struct folio_queue *folioq, unsigned int slot);
-
-Cleared by::
-
-    void folioq_unmark(struct folio_queue *folioq, unsigned int slot);
-    void folioq_unmark2(struct folio_queue *folioq, unsigned int slot);
-
-And the marks can be queried by::
-
-    bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot);
-    bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slot);
-
-The marks can be used for any purpose and are not interpreted by this API.
-
-
-Folio queue iteration
-=====================
-
-A list of segments may be iterated over using the I/O iterator facility using
-an ``iov_iter`` iterator of ``ITER_FOLIOQ`` type.  The iterator may be
-initialised with::
-
-    void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
-                              const struct folio_queue *folioq,
-                              unsigned int first_slot, unsigned int offset,
-                              size_t count);
-
-This may be told to start at a particular segment, slot and offset within a
-queue.  The iov iterator functions will follow the next pointers when advancing
-and prev pointers when reverting when needed.
-
-
-Lockless simultaneous production/consumption issues
-===================================================
-
-If properly managed, the list can be extended by the producer at the head end
-and shortened by the consumer at the tail end simultaneously without the need
-to take locks.  The ITER_FOLIOQ iterator inserts appropriate barriers to aid
-with this.
-
-Care must be taken when simultaneously producing and consuming a list.
If= the -last segment is reached and the folios it refers to are entirely consumed = by -the IOV iterators, an iov_iter struct will be left pointing to the last se= gment -with a slot number equal to the capacity of that segment. The iterator wi= ll -try to continue on from this if there's another segment available when it = is -used again, but care must be taken lest the segment got removed and freed = by -the consumer before the iterator was advanced. - -It is recommended that the queue always contain at least one segment, even= if -that segment has never been filled or is entirely spent. This prevents the -head and tail pointers from collapsing. - - -API Function Reference -=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D - -.. kernel-doc:: include/linux/folio_queue.h diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/inde= x.rst index 13769d5c40bf..16c529a33ac4 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -39,7 +39,6 @@ Library functionality that is used throughout the kernel. kref cleanup assoc_array - folio_queue xarray maple_tree idr diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c index 442f893a0d65..7969c0b1f9a9 100644 --- a/fs/netfs/iterator.c +++ b/fs/netfs/iterator.c @@ -135,195 +135,3 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, siz= e_t orig_len, size_t max_se return extracted ?: ret; } EXPORT_SYMBOL_GPL(netfs_extract_iter); - -#if 0 -/* - * Select the span of a bvec iterator we're going to use. Limit it by bot= h maximum - * size and maximum number of segments. Returns the size of the span in b= ytes. 
- */ -static size_t netfs_limit_bvec(const struct iov_iter *iter, size_t start_o= ffset, - size_t max_size, size_t max_segs) -{ - const struct bio_vec *bvecs =3D iter->bvec; - unsigned int nbv =3D iter->nr_segs, ix =3D 0, nsegs =3D 0; - size_t len, span =3D 0, n =3D iter->count; - size_t skip =3D iter->iov_offset + start_offset; - - if (WARN_ON(!iov_iter_is_bvec(iter)) || - WARN_ON(start_offset > n) || - n =3D=3D 0) - return 0; - - while (n && ix < nbv && skip) { - len =3D bvecs[ix].bv_len; - if (skip < len) - break; - skip -=3D len; - n -=3D len; - ix++; - } - - while (n && ix < nbv) { - len =3D min3(n, bvecs[ix].bv_len - skip, max_size); - span +=3D len; - nsegs++; - ix++; - if (span >=3D max_size || nsegs >=3D max_segs) - break; - skip =3D 0; - n -=3D len; - } - - return min(span, max_size); -} - -/* - * Select the span of a kvec iterator we're going to use. Limit it by both - * maximum size and maximum number of segments. Returns the size of the s= pan - * in bytes. - */ -static size_t netfs_limit_kvec(const struct iov_iter *iter, size_t start_o= ffset, - size_t max_size, size_t max_segs) -{ - const struct kvec *kvecs =3D iter->kvec; - unsigned int nkv =3D iter->nr_segs, ix =3D 0, nsegs =3D 0; - size_t len, span =3D 0, n =3D iter->count; - size_t skip =3D iter->iov_offset + start_offset; - - if (WARN_ON(!iov_iter_is_kvec(iter)) || - WARN_ON(start_offset > n) || - n =3D=3D 0) - return 0; - - while (n && ix < nkv && skip) { - len =3D kvecs[ix].iov_len; - if (skip < len) - break; - skip -=3D len; - n -=3D len; - ix++; - } - - while (n && ix < nkv) { - len =3D min3(n, kvecs[ix].iov_len - skip, max_size); - span +=3D len; - nsegs++; - ix++; - if (span >=3D max_size || nsegs >=3D max_segs) - break; - skip =3D 0; - n -=3D len; - } - - return min(span, max_size); -} - -/* - * Select the span of an xarray iterator we're going to use. Limit it by = both - * maximum size and maximum number of segments. 
It is assumed that segmen= ts - * can be larger than a page in size, provided they're physically contiguo= us. - * Returns the size of the span in bytes. - */ -static size_t netfs_limit_xarray(const struct iov_iter *iter, size_t start= _offset, - size_t max_size, size_t max_segs) -{ - struct folio *folio; - unsigned int nsegs =3D 0; - loff_t pos =3D iter->xarray_start + iter->iov_offset; - pgoff_t index =3D pos / PAGE_SIZE; - size_t span =3D 0, n =3D iter->count; - - XA_STATE(xas, iter->xarray, index); - - if (WARN_ON(!iov_iter_is_xarray(iter)) || - WARN_ON(start_offset > n) || - n =3D=3D 0) - return 0; - max_size =3D min(max_size, n - start_offset); - - rcu_read_lock(); - xas_for_each(&xas, folio, ULONG_MAX) { - size_t offset, flen, len; - if (xas_retry(&xas, folio)) - continue; - if (WARN_ON(xa_is_value(folio))) - break; - if (WARN_ON(folio_test_hugetlb(folio))) - break; - - flen =3D folio_size(folio); - offset =3D offset_in_folio(folio, pos); - len =3D min(max_size, flen - offset); - span +=3D len; - nsegs++; - if (span >=3D max_size || nsegs >=3D max_segs) - break; - } - - rcu_read_unlock(); - return min(span, max_size); -} - -/* - * Select the span of a folio queue iterator we're going to use. Limit it= by - * both maximum size and maximum number of segments. Returns the size of = the - * span in bytes. 
- */ -static size_t netfs_limit_folioq(const struct iov_iter *iter, size_t start= _offset, - size_t max_size, size_t max_segs) -{ - const struct folio_queue *folioq =3D iter->folioq; - unsigned int nsegs =3D 0; - unsigned int slot =3D iter->folioq_slot; - size_t span =3D 0, n =3D iter->count; - - if (WARN_ON(!iov_iter_is_folioq(iter)) || - WARN_ON(start_offset > n) || - n =3D=3D 0) - return 0; - max_size =3D umin(max_size, n - start_offset); - - if (slot >=3D folioq_nr_slots(folioq)) { - folioq =3D folioq->next; - slot =3D 0; - } - - start_offset +=3D iter->iov_offset; - do { - size_t flen =3D folioq_folio_size(folioq, slot); - - if (start_offset < flen) { - span +=3D flen - start_offset; - nsegs++; - start_offset =3D 0; - } else { - start_offset -=3D flen; - } - if (span >=3D max_size || nsegs >=3D max_segs) - break; - - slot++; - if (slot >=3D folioq_nr_slots(folioq)) { - folioq =3D folioq->next; - slot =3D 0; - } - } while (folioq); - - return umin(span, max_size); -} - -size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset, - size_t max_size, size_t max_segs) -{ - if (iov_iter_is_folioq(iter)) - return netfs_limit_folioq(iter, start_offset, max_size, max_segs); - if (iov_iter_is_bvec(iter)) - return netfs_limit_bvec(iter, start_offset, max_size, max_segs); - if (iov_iter_is_xarray(iter)) - return netfs_limit_xarray(iter, start_offset, max_size, max_segs); - if (iov_iter_is_kvec(iter)) - return netfs_limit_kvec(iter, start_offset, max_size, max_segs); - BUG(); -} -EXPORT_SYMBOL(netfs_limit_iter); -#endif diff --git a/fs/netfs/rolling_buffer.c b/fs/netfs/rolling_buffer.c deleted file mode 100644 index 292011c1cacb..000000000000 --- a/fs/netfs/rolling_buffer.c +++ /dev/null @@ -1,297 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* Rolling buffer helpers - * - * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved. 
- * Written by David Howells (dhowells@redhat.com) - */ - -#include -#include -#include -#include -#include "internal.h" - -static atomic_t debug_ids; - -/** - * netfs_folioq_alloc - Allocate a folio_queue struct - * @rreq_id: Associated debugging ID for tracing purposes - * @gfp: Allocation constraints - * @trace: Trace tag to indicate the purpose of the allocation - * - * Allocate, initialise and account the folio_queue struct and log a trace= line - * to mark the allocation. - */ -struct folio_queue *netfs_folioq_alloc(unsigned int rreq_id, gfp_t gfp, - unsigned int /*enum netfs_folioq_trace*/ trace) -{ - struct folio_queue *fq; - - fq =3D kmalloc_obj(*fq, gfp); - if (fq) { - netfs_stat(&netfs_n_folioq); - folioq_init(fq, rreq_id); - fq->debug_id =3D atomic_inc_return(&debug_ids); - trace_netfs_folioq(fq, trace); - } - return fq; -} -EXPORT_SYMBOL(netfs_folioq_alloc); - -/** - * netfs_folioq_free - Free a folio_queue struct - * @folioq: The object to free - * @trace: Trace tag to indicate which free - * - * Free and unaccount the folio_queue struct. - */ -void netfs_folioq_free(struct folio_queue *folioq, - unsigned int /*enum netfs_trace_folioq*/ trace) -{ - trace_netfs_folioq(folioq, trace); - netfs_stat_d(&netfs_n_folioq); - kfree(folioq); -} -EXPORT_SYMBOL(netfs_folioq_free); - -/* - * Initialise a rolling buffer. We allocate an empty folio queue struct t= o so - * that the pointers can be independently driven by the producer and the - * consumer. - */ -int rolling_buffer_init(struct rolling_buffer *roll, unsigned int rreq_id, - unsigned int direction) -{ - struct folio_queue *fq; - - fq =3D netfs_folioq_alloc(rreq_id, GFP_NOFS, netfs_trace_folioq_rollbuf_i= nit); - if (!fq) - return -ENOMEM; - - roll->head =3D fq; - roll->tail =3D fq; - iov_iter_folio_queue(&roll->iter, direction, fq, 0, 0, 0); - return 0; -} - -/* - * Add another folio_queue to a rolling buffer if there's no space left. 
- */ -int rolling_buffer_make_space(struct rolling_buffer *roll) -{ - struct folio_queue *fq, *head =3D roll->head; - - if (!folioq_full(head)) - return 0; - - fq =3D netfs_folioq_alloc(head->rreq_id, GFP_NOFS, netfs_trace_folioq_mak= e_space); - if (!fq) - return -ENOMEM; - fq->prev =3D head; - - roll->head =3D fq; - if (folioq_full(head)) { - /* Make sure we don't leave the master iterator pointing to a - * block that might get immediately consumed. - */ - if (roll->iter.folioq =3D=3D head && - roll->iter.folioq_slot =3D=3D folioq_nr_slots(head)) { - roll->iter.folioq =3D fq; - roll->iter.folioq_slot =3D 0; - } - } - - /* Make sure the initialisation is stored before the next pointer. - * - * [!] NOTE: After we set head->next, the consumer is at liberty to - * immediately delete the old head. - */ - smp_store_release(&head->next, fq); - return 0; -} - -/* - * Decant the list of folios to read into a rolling buffer. - */ -ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll, - struct readahead_control *ractl, - struct folio_batch *put_batch) -{ - struct folio_queue *fq; - struct page **vec; - int nr, ix, to; - ssize_t size =3D 0; - - if (rolling_buffer_make_space(roll) < 0) - return -ENOMEM; - - fq =3D roll->head; - vec =3D (struct page **)fq->vec.folios; - nr =3D __readahead_batch(ractl, vec + folio_batch_count(&fq->vec), - folio_batch_space(&fq->vec)); - ix =3D fq->vec.nr; - to =3D ix + nr; - fq->vec.nr =3D to; - for (; ix < to; ix++) { - struct folio *folio =3D folioq_folio(fq, ix); - unsigned int order =3D folio_order(folio); - - fq->orders[ix] =3D order; - size +=3D PAGE_SIZE << order; - trace_netfs_folio(folio, netfs_folio_trace_read); - if (!folio_batch_add(put_batch, folio)) - folio_batch_release(put_batch); - } - WRITE_ONCE(roll->iter.count, roll->iter.count + size); - - /* Store the counter after setting the slot. 
*/ - smp_store_release(&roll->next_head_slot, to); - return size; -} - -/* - * Decant the entire list of folios to read into a rolling buffer. - */ -ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll, - struct readahead_control *ractl, - unsigned int rreq_id) -{ - XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index); - struct folio_queue *fq; - struct folio *folio; - ssize_t loaded =3D 0; - int nr, slot =3D 0, npages =3D 0; - - /* First allocate all the folioqs we're going to need to avoid having - * to deal with ENOMEM later. - */ - nr =3D ractl->_nr_folios; - do { - fq =3D netfs_folioq_alloc(rreq_id, GFP_KERNEL, - netfs_trace_folioq_make_space); - if (!fq) { - rolling_buffer_clear(roll); - return -ENOMEM; - } - fq->prev =3D roll->head; - if (!roll->tail) - roll->tail =3D fq; - else - roll->head->next =3D fq; - roll->head =3D fq; - =09 - nr -=3D folioq_nr_slots(fq); - } while (nr > 0); - - rcu_read_lock(); - - fq =3D roll->tail; - xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) { - unsigned int order; - - if (xas_retry(&xas, folio)) - continue; - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - - order =3D folio_order(folio); - fq->orders[slot] =3D order; - fq->vec.folios[slot] =3D folio; - loaded +=3D PAGE_SIZE << order; - npages +=3D 1 << order; - trace_netfs_folio(folio, netfs_folio_trace_read); - - slot++; - if (slot >=3D folioq_nr_slots(fq)) { - fq->vec.nr =3D slot; - fq =3D fq->next; - if (!fq) { - WARN_ON_ONCE(npages < readahead_count(ractl)); - break; - } - slot =3D 0; - } - } - - rcu_read_unlock(); - - if (fq) - fq->vec.nr =3D slot; - - WRITE_ONCE(roll->iter.count, loaded); - iov_iter_folio_queue(&roll->iter, ITER_DEST, roll->tail, 0, 0, loaded); - ractl->_index +=3D npages; - ractl->_nr_pages -=3D npages; - return loaded; -} - -/* - * Append a folio to the rolling buffer. 
- */ -ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *f= olio, - unsigned int flags) -{ - ssize_t size =3D folio_size(folio); - int slot; - - if (rolling_buffer_make_space(roll) < 0) - return -ENOMEM; - - slot =3D folioq_append(roll->head, folio); - if (flags & ROLLBUF_MARK_1) - folioq_mark(roll->head, slot); - if (flags & ROLLBUF_MARK_2) - folioq_mark2(roll->head, slot); - - WRITE_ONCE(roll->iter.count, roll->iter.count + size); - - /* Store the counter after setting the slot. */ - smp_store_release(&roll->next_head_slot, slot); - return size; -} - -/* - * Delete a spent buffer from a rolling queue and return the next in line.= We - * don't return the last buffer to keep the pointers independent, but retu= rn - * NULL instead. - */ -struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *rol= l) -{ - struct folio_queue *spent =3D roll->tail, *next =3D READ_ONCE(spent->next= ); - - if (!next) - return NULL; - next->prev =3D NULL; - netfs_folioq_free(spent, netfs_trace_folioq_delete); - roll->tail =3D next; - return next; -} - -/* - * Clear out a rolling queue. Folios that have mark 1 set are put. 
- */ -void rolling_buffer_clear(struct rolling_buffer *roll) -{ - struct folio_batch fbatch; - struct folio_queue *p; - - folio_batch_init(&fbatch); - - while ((p =3D roll->tail)) { - roll->tail =3D p->next; - for (int slot =3D 0; slot < folioq_count(p); slot++) { - struct folio *folio =3D folioq_folio(p, slot); - - if (!folio) - continue; - if (folioq_is_marked(p, slot)) { - trace_netfs_folio(folio, netfs_folio_trace_put); - if (!folio_batch_add(&fbatch, folio)) - folio_batch_release(&fbatch); - } - } - - netfs_folioq_free(p, netfs_trace_folioq_clear); - } - - folio_batch_release(&fbatch); -} diff --git a/include/linux/folio_queue.h b/include/linux/folio_queue.h deleted file mode 100644 index adab609c972e..000000000000 --- a/include/linux/folio_queue.h +++ /dev/null @@ -1,282 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-or-later */ -/* Queue of folios definitions - * - * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved. - * Written by David Howells (dhowells@redhat.com) - * - * See: - * - * Documentation/core-api/folio_queue.rst - * - * for a description of the API. - */ - -#ifndef _LINUX_FOLIO_QUEUE_H -#define _LINUX_FOLIO_QUEUE_H - -#include -#include - -/* - * Segment in a queue of running buffers. Each segment can hold a number = of - * folios and a portion of the queue can be referenced with the ITER_FOLIOQ - * iterator. The possibility exists of inserting non-folio elements into = the - * queue (such as gaps). - * - * Explicit prev and next pointers are used instead of a list_head to make= it - * easier to add segments to tail and remove them from the head without the - * need for a lock. 
- */
-struct folio_queue {
-	struct folio_batch	vec;		/* Folios in the queue segment */
-	u8			orders[PAGEVEC_SIZE]; /* Order of each folio */
-	struct folio_queue	*next;		/* Next queue segment or NULL */
-	struct folio_queue	*prev;		/* Previous queue segment or NULL */
-	unsigned long		marks;		/* 1-bit mark per folio */
-	unsigned long		marks2;		/* Second 1-bit mark per folio */
-#if PAGEVEC_SIZE > BITS_PER_LONG
-#error marks is not big enough
-#endif
-	unsigned int		rreq_id;
-	unsigned int		debug_id;
-};
-
-/**
- * folioq_init - Initialise a folio queue segment
- * @folioq: The segment to initialise
- * @rreq_id: The request identifier to use in tracelines.
- *
- * Initialise a folio queue segment and set an identifier to be used in traces.
- *
- * Note that the folio pointers are left uninitialised.
- */
-static inline void folioq_init(struct folio_queue *folioq, unsigned int rreq_id)
-{
-	folio_batch_init(&folioq->vec);
-	folioq->next = NULL;
-	folioq->prev = NULL;
-	folioq->marks = 0;
-	folioq->marks2 = 0;
-	folioq->rreq_id = rreq_id;
-	folioq->debug_id = 0;
-}
-
-/**
- * folioq_nr_slots: Query the capacity of a folio queue segment
- * @folioq: The segment to query
- *
- * Query the number of folios that a particular folio queue segment might hold.
- * [!] NOTE: This must not be assumed to be the same for every segment!
- */
-static inline unsigned int folioq_nr_slots(const struct folio_queue *folioq)
-{
-	return PAGEVEC_SIZE;
-}
-
-/**
- * folioq_count: Query the occupancy of a folio queue segment
- * @folioq: The segment to query
- *
- * Query the number of folios that have been added to a folio queue segment.
- * Note that this is not decreased as folios are removed from a segment.
- */
-static inline unsigned int folioq_count(struct folio_queue *folioq)
-{
-	return folio_batch_count(&folioq->vec);
-}
-
-/**
- * folioq_full: Query if a folio queue segment is full
- * @folioq: The segment to query
- *
- * Query if a folio queue segment is fully occupied.  Note that this does not
- * change if folios are removed from a segment.
- */
-static inline bool folioq_full(struct folio_queue *folioq)
-{
-	//return !folio_batch_space(&folioq->vec);
-	return folioq_count(folioq) >= folioq_nr_slots(folioq);
-}
-
-/**
- * folioq_is_marked: Check first folio mark in a folio queue segment
- * @folioq: The segment to query
- * @slot: The slot number of the folio to query
- *
- * Determine if the first mark is set for the folio in the specified slot in a
- * folio queue segment.
- */
-static inline bool folioq_is_marked(const struct folio_queue *folioq, unsigned int slot)
-{
-	return test_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_mark: Set the first mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Set the first mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_mark(struct folio_queue *folioq, unsigned int slot)
-{
-	set_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_unmark: Clear the first mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Clear the first mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_unmark(struct folio_queue *folioq, unsigned int slot)
-{
-	clear_bit(slot, &folioq->marks);
-}
-
-/**
- * folioq_is_marked2: Check second folio mark in a folio queue segment
- * @folioq: The segment to query
- * @slot: The slot number of the folio to query
- *
- * Determine if the second mark is set for the folio in the specified slot in a
- * folio queue segment.
- */
-static inline bool folioq_is_marked2(const struct folio_queue *folioq, unsigned int slot)
-{
-	return test_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_mark2: Set the second mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Set the second mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_mark2(struct folio_queue *folioq, unsigned int slot)
-{
-	set_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_unmark2: Clear the second mark on a folio in a folio queue segment
- * @folioq: The segment to modify
- * @slot: The slot number of the folio to modify
- *
- * Clear the second mark for the folio in the specified slot in a folio queue
- * segment.
- */
-static inline void folioq_unmark2(struct folio_queue *folioq, unsigned int slot)
-{
-	clear_bit(slot, &folioq->marks2);
-}
-
-/**
- * folioq_append: Add a folio to a folio queue segment
- * @folioq: The segment to add to
- * @folio: The folio to add
- *
- * Add a folio to the tail of the sequence in a folio queue segment, increasing
- * the occupancy count and returning the slot number for the folio just added.
- * The folio size is extracted and stored in the queue and the marks are left
- * unmodified.
- *
- * Note that it's left up to the caller to check that the segment capacity will
- * not be exceeded and to extend the queue.
- */
-static inline unsigned int folioq_append(struct folio_queue *folioq, struct folio *folio)
-{
-	unsigned int slot = folioq->vec.nr++;
-
-	folioq->vec.folios[slot] = folio;
-	folioq->orders[slot] = folio_order(folio);
-	return slot;
-}
-
-/**
- * folioq_append_mark: Add a folio to a folio queue segment
- * @folioq: The segment to add to
- * @folio: The folio to add
- *
- * Add a folio to the tail of the sequence in a folio queue segment, increasing
- * the occupancy count and returning the slot number for the folio just added.
- * The folio size is extracted and stored in the queue, the first mark is set
- * and the second and third marks are left unmodified.
- *
- * Note that it's left up to the caller to check that the segment capacity will
- * not be exceeded and to extend the queue.
- */
-static inline unsigned int folioq_append_mark(struct folio_queue *folioq, struct folio *folio)
-{
-	unsigned int slot = folioq->vec.nr++;
-
-	folioq->vec.folios[slot] = folio;
-	folioq->orders[slot] = folio_order(folio);
-	folioq_mark(folioq, slot);
-	return slot;
-}
-
-/**
- * folioq_folio: Get a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the folio in the specified slot from a folio queue segment.  Note
- * that no bounds check is made and if the slot hasn't been added into yet, the
- * pointer will be undefined.  If the slot has been cleared, NULL will be
- * returned.
- */
-static inline struct folio *folioq_folio(const struct folio_queue *folioq, unsigned int slot)
-{
-	return folioq->vec.folios[slot];
-}
-
-/**
- * folioq_folio_order: Get the order of a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the order of the folio in the specified slot from a folio queue
- * segment.  Note that no bounds check is made and if the slot hasn't been
- * added into yet, the order returned will be 0.
- */
-static inline unsigned int folioq_folio_order(const struct folio_queue *folioq, unsigned int slot)
-{
-	return folioq->orders[slot];
-}
-
-/**
- * folioq_folio_size: Get the size of a folio from a folio queue segment
- * @folioq: The segment to access
- * @slot: The folio slot to access
- *
- * Retrieve the size of the folio in the specified slot from a folio queue
- * segment.  Note that no bounds check is made and if the slot hasn't been
- * added into yet, the size returned will be PAGE_SIZE.
- */
-static inline size_t folioq_folio_size(const struct folio_queue *folioq, unsigned int slot)
-{
-	return PAGE_SIZE << folioq_folio_order(folioq, slot);
-}
-
-/**
- * folioq_clear: Clear a folio from a folio queue segment
- * @folioq: The segment to clear
- * @slot: The folio slot to clear
- *
- * Clear a folio from a sequence in a folio queue segment and clear its marks.
- * The occupancy count is left unchanged.
- */
-static inline void folioq_clear(struct folio_queue *folioq, unsigned int slot)
-{
-	folioq->vec.folios[slot] = NULL;
-	folioq_unmark(folioq, slot);
-	folioq_unmark2(folioq, slot);
-}
-
-#endif /* _LINUX_FOLIO_QUEUE_H */
diff --git a/include/linux/rolling_buffer.h b/include/linux/rolling_buffer.h
deleted file mode 100644
index b35ef43f325f..000000000000
--- a/include/linux/rolling_buffer.h
+++ /dev/null
@@ -1,64 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-/* Rolling buffer of folios
- *
- * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- */
-
-#ifndef _ROLLING_BUFFER_H
-#define _ROLLING_BUFFER_H
-
-#include
-#include
-
-/*
- * Rolling buffer.  Whilst the buffer is live and in use, folios and folio
- * queue segments can be added to one end by one thread and removed from the
- * other end by another thread.  The buffer isn't allowed to be empty; it must
- * always have at least one folio_queue in it so that neither side has to
- * modify both queue pointers.
- *
- * The iterator in the buffer is extended as buffers are inserted.  It can be
- * snapshotted to use a segment of the buffer.
- */
-struct rolling_buffer {
-	struct folio_queue	*head;		/* Producer's insertion point */
-	struct folio_queue	*tail;		/* Consumer's removal point */
-	struct iov_iter		iter;		/* Iterator tracking what's left in the buffer */
-	u8			next_head_slot;	/* Next slot in ->head */
-	u8			first_tail_slot; /* First slot in ->tail */
-};
-
-/*
- * Snapshot of a rolling buffer.
- */
-struct rolling_buffer_snapshot {
-	struct folio_queue	*curr_folioq;	/* Queue segment in which current folio resides */
-	unsigned char		curr_slot;	/* Folio currently being read */
-	unsigned char		curr_order;	/* Order of folio */
-};
-
-/* Marks to store per-folio in the internal folio_queue structs. */
-#define ROLLBUF_MARK_1	BIT(0)
-#define ROLLBUF_MARK_2	BIT(1)
-
-int rolling_buffer_init(struct rolling_buffer *roll, unsigned int rreq_id,
-			unsigned int direction);
-int rolling_buffer_make_space(struct rolling_buffer *roll);
-ssize_t rolling_buffer_load_from_ra(struct rolling_buffer *roll,
-				    struct readahead_control *ractl,
-				    struct folio_batch *put_batch);
-ssize_t rolling_buffer_bulk_load_from_ra(struct rolling_buffer *roll,
-					 struct readahead_control *ractl,
-					 unsigned int rreq_id);
-ssize_t rolling_buffer_append(struct rolling_buffer *roll, struct folio *folio,
-			      unsigned int flags);
-struct folio_queue *rolling_buffer_delete_spent(struct rolling_buffer *roll);
-void rolling_buffer_clear(struct rolling_buffer *roll);
-
-static inline void rolling_buffer_advance(struct rolling_buffer *roll, size_t amount)
-{
-	iov_iter_advance(&roll->iter, amount);
-}
-
-#endif /* _ROLLING_BUFFER_H */

From nobody Thu Apr  2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH 24/26] netfs: Check for too much data being read
Date: Thu, 26 Mar 2026 10:45:39 +0000
Message-ID: <20260326104544.509518-25-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Add a check in read subrequest termination to detect more data being read
for a subrequest than was requested.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/read_collect.c      | 8 ++++++++
 include/trace/events/netfs.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index c7180680226c..bacf047029fa 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -545,6 +545,14 @@ void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq)
 		break;
 	}
 
+	if (subreq->transferred > subreq->len) {
+		subreq->transferred = 0;
+		__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+		trace_netfs_sreq(subreq, netfs_sreq_trace_too_much);
+		subreq->error = -EIO;
+	}
+
 	/* Deal with retry requests, short reads and errors.  If we retry
 	 * but don't make progress, we abandon the attempt.
 	 */
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index df3d440563ec..eeb8386e0709 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -125,6 +125,7 @@
 	EM(netfs_sreq_trace_submit,		"SUBMT")	\
 	EM(netfs_sreq_trace_superfluous,	"SPRFL")	\
 	EM(netfs_sreq_trace_terminated,		"TERM ")	\
+	EM(netfs_sreq_trace_too_much,		"!TOOM")	\
 	EM(netfs_sreq_trace_wait_for,		"_WAIT")	\
 	EM(netfs_sreq_trace_write,		"WRITE")	\
 	EM(netfs_sreq_trace_write_skip,		"SKIP ")	\
From nobody Thu Apr  2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH 25/26] netfs: Limit the minimum trigger for progress
 reporting
Date: Thu, 26 Mar 2026 10:45:40 +0000
Message-ID: <20260326104544.509518-26-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

For really big read RPC ops that span multiple folios, netfslib allows the
filesystem to give progress notifications that wake up the collector thread
to collect the folios that have now been fetched, even if the RPC is still
ongoing, thereby allowing the application to make progress.

The trigger for this is that at least one folio has been downloaded since
the last clean point.  If the folios are small, however, this means the
collector thread is constantly being woken up, which has a negative
performance impact on the system.

Set a minimum trigger of 256KiB or the size of the folio at the front of
the queue, whichever is larger.

Also fix the base to be the stream collection point, not the point at which
the collector has cleaned up to (which is currently 0 until something has
been collected).
Signed-off-by: David Howells
cc: Paulo Alcantara
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/read_collect.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index bacf047029fa..6d49f9a6b1f0 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -494,15 +494,15 @@ void netfs_read_collection_worker(struct work_struct *work)
 void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq)
 {
 	struct netfs_io_request *rreq = subreq->rreq;
-	struct netfs_io_stream *stream = &rreq->io_streams[0];
-	size_t fsize = PAGE_SIZE << rreq->front_folio_order;
+	struct netfs_io_stream *stream = &rreq->io_streams[subreq->stream_nr];
+	size_t fsize = umax(PAGE_SIZE << rreq->front_folio_order, 256 * 1024);
 
 	trace_netfs_sreq(subreq, netfs_sreq_trace_progress);
 
 	/* If we are at the head of the queue, wake up the collector,
 	 * getting a ref to it if we were the ones to do so.
 	 */
-	if (subreq->start + subreq->transferred > rreq->cleaned_to + fsize &&
+	if (subreq->start + subreq->transferred >= stream->collected_to + fsize &&
 	    (rreq->origin == NETFS_READAHEAD ||
 	     rreq->origin == NETFS_READPAGE ||
 	     rreq->origin == NETFS_READ_FOR_WRITE) &&

From nobody Thu Apr  2 22:23:34 2026
From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: [PATCH 26/26] netfs: Combine prepare and issue ops and grab the
 buffers on request
Date: Thu, 26 Mar 2026 10:45:41 +0000
Message-ID: <20260326104544.509518-27-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Modify the way subrequests are generated in netfslib to try and simplify
the code.

The issue, primarily, is in writeback: the code has to create multiple
streams of write requests to disparate targets with different properties
(e.g. server and fscache), where not every folio needs to go to every
target (e.g. data just read from the server may only need writing to the
cache).

The current model in writeback, at least, is to go carefully through every
folio, preparing a subrequest for each stream when it is detected that part
of the current folio needs to go to that stream, repeating this within and
across contiguous folios; subrequests are then issued as they become full
or hit boundaries, after first setting up the buffer.  However, this is
quite difficult to follow - and makes it tricky to handle discontiguous
folios in a request.

This is changed such that netfs now accumulates buffers and attaches them
to each stream when they become valid for that stream, then flushes the
stream when a limit or a boundary is hit.  The issuing code in netfs then
loops around creating and issuing subrequests without calling a separate
prepare stage (though a function is provided to get an estimate of when
flushing should occur).
The filesystem (or cache) then gets to take a slice of the master bvec
chain as its I/O buffer for each subrequest, including discontiguities if
it can support a sparse/vectored RPC (as Ceph can).

Similar-ish changes also apply to buffered read and unbuffered read and
write, though in each of those cases there is only a single contiguous
stream - although for buffered read this consists of interwoven requests
from multiple sources (server or cache).

To this end, netfslib is changed in the following ways:

 (1) ->prepare_xxx(), buffer selection and ->issue_xxx() are now collapsed
     together such that one ->issue_xxx() call is made with the subrequest
     defined to the maximum extent; the filesystem/cache then reduces the
     length of the subrequest and calls back to netfslib to grab a slice of
     the buffer, which may reduce the subrequest further if a maximum
     segment limit is set.  The filesystem/cache then dispatches the
     operation.

 (2) Retry buffer tracking is added to the netfs_io_request struct.  This
     is then selected by the subrequest retry counter being non-zero.

 (3) The use of iov_iter is pushed down to the filesystem.  Netfslib now
     provides the filesystem with a bvecq holding the buffer rather than an
     iov_iter.  The bvecq can be duplicated and headers/trailers attached
     to hold protocol data, and several bvecqs can be linked together to
     create a compound operation.

 (4) The ->issue_xxx() functions now return an error code that allows them
     to return an error without having to terminate the subrequest.
     Netfslib will handle the error immediately if it can, but may request
     termination and punt responsibility to the result collector.
     ->issue_xxx() can return 0 if synchronously complete and -EIOCBQUEUED
     if the operation will complete (or already has completed)
     asynchronously.

 (5) During writeback, netfslib now builds up an accumulation of buffered
     data before issuing writes on each stream (one server, one cache).
     It asks each stream for an estimate of how much data to accumulate
     before it next generates subrequests on the stream.  The filesystem or
     cache is not required to use up all the data accumulated on a stream
     at that time unless the end of the pagecache is hit.

 (6) During read-gaps, in which there are two gaps on either end of a dirty
     streaming-write page that need to be filled, a buffer is constructed
     consisting of the two ends plus a sink page repeated to cover the
     middle portion.  This is passed to the server as a single write.  For
     something like Ceph, this should probably be done either as a
     vectored/sparse read or as two separate reads (if different Ceph
     objects are involved).

 (7) During unbuffered/DIO read/write, there is a single contiguous file
     region to be read or written as a single stream.  The dispatching
     function just creates subrequests and calls ->issue_xxx() repeatedly
     to eat through the bufferage.

 (8) At the start of buffered read, the entire set of folios allocated by
     VM readahead is loaded into a bvecq chain, rather than trying to do it
     piecemeal as needed.  As the pages were already added and locked by
     the VM, this is slightly more efficient, as only a single iteration of
     the xarray is required.

 (9) During buffered read, there is a single contiguous file region to read
     as a single stream - however, this stream may be stitched together
     from subrequests to multiple sources.  Which sources are used where is
     now determined by querying the cache to find the next couple of
     extents in which it has data; netfslib uses this to direct the
     subrequests towards the appropriate sources.  Each subrequest is given
     the maximum length in the current extent and then ->issue_read() is
     called.  The filesystem then limits the size and slices off a piece of
     the buffer for that extent.

(10) Cachefiles now provides an estimation function that indicates the
     standard maxima for doing DIO (MAX_RW_COUNT and BIO_MAX_VECS).
Note that sparse cachefiles still rely on the backing filesystem for
content mapping.  That will need to be addressed in a future patch and is
not trivial to fix.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/9p/vfs_addr.c                  |  49 +-
 fs/afs/dir.c                      |   8 +-
 fs/afs/file.c                     |  26 +-
 fs/afs/fsclient.c                 |   8 +-
 fs/afs/internal.h                 |   8 +-
 fs/afs/write.c                    |  35 +-
 fs/afs/yfsclient.c                |   6 +-
 fs/cachefiles/io.c                | 237 ++---
 fs/ceph/Kconfig                   |   1 +
 fs/ceph/addr.c                    | 127 ++---
 fs/netfs/Kconfig                  |   3 +
 fs/netfs/Makefile                 |   2 +-
 fs/netfs/buffered_read.c          | 236 +++++----
 fs/netfs/buffered_write.c         |  27 +-
 fs/netfs/direct_read.c            |  91 ++--
 fs/netfs/direct_write.c           | 145 +++---
 fs/netfs/fscache_io.c             |   6 -
 fs/netfs/internal.h               |  78 ++-
 fs/netfs/iterator.c               |   6 +-
 fs/netfs/misc.c                   |  33 +-
 fs/netfs/objects.c                |   7 +-
 fs/netfs/read_collect.c           |  33 +-
 fs/netfs/read_pgpriv2.c           | 116 +++--
 fs/netfs/read_retry.c             | 207 ++++----
 fs/netfs/read_single.c            | 150 +++---
 fs/netfs/write_collect.c          |  41 +-
 fs/netfs/write_issue.c            | 805 ++++++++++++++++++------------
 fs/netfs/write_retry.c            | 136 +++--
 fs/nfs/Kconfig                    |   1 +
 fs/nfs/fscache.c                  |  24 +-
 fs/smb/client/cifssmb.c           |  13 +-
 fs/smb/client/file.c              | 146 +++---
 fs/smb/client/smb2ops.c           |   9 +-
 fs/smb/client/smb2pdu.c           |  28 +-
 fs/smb/client/transport.c         |  15 +-
 include/linux/netfs.h             |  96 ++--
 include/trace/events/cachefiles.h |   2 +
 include/trace/events/netfs.h      |  51 +-
 net/9p/client.c                   |   8 +-
 39 files changed, 1790 insertions(+), 1230 deletions(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 862164181bac..0db56cc00467 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -48,32 +48,71 @@ static void v9fs_begin_writeback(struct netfs_io_request *wreq)
 	wreq->io_streams[0].avail = true;
 }
 
+/*
+ * Estimate how much data should be accumulated before we start issuing
+ * write subrequests.
+ */ +static int v9fs_estimate_write(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_write_estimate *estimate) +{ + struct p9_fid *fid =3D wreq->netfs_priv; + unsigned long long limit =3D ULLONG_MAX - stream->issue_from; + unsigned long long max_len =3D fid->clnt->msize - P9_IOHDRSZ; + + estimate->issue_at =3D stream->issue_from + umin(max_len, limit); + return 0; +} + /* * Issue a subrequest to write to the server. */ -static void v9fs_issue_write(struct netfs_io_subrequest *subreq) +static int v9fs_issue_write(struct netfs_io_subrequest *subreq) { + struct iov_iter iter; struct p9_fid *fid =3D subreq->rreq->netfs_priv; int err, len; =20 - len =3D p9_client_write(fid, subreq->start, &subreq->io_iter, &err); + subreq->len =3D umin(subreq->len, fid->clnt->msize - P9_IOHDRSZ); + + err =3D netfs_prepare_write_buffer(subreq, INT_MAX); + if (err < 0) + return err; + + iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + + len =3D p9_client_write(fid, subreq->start, &iter, &err); if (len > 0) __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags); netfs_write_subrequest_terminated(subreq, len ?: err); + return err; } =20 /** * v9fs_issue_read - Issue a read from 9P * @subreq: The read to make + * @rctx: Read generation context */ -static void v9fs_issue_read(struct netfs_io_subrequest *subreq) +static int v9fs_issue_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; + struct iov_iter iter; struct p9_fid *fid =3D rreq->netfs_priv; unsigned long long pos =3D subreq->start + subreq->transferred; int total, err; =20 - total =3D p9_client_read(fid, pos, &subreq->io_iter, &err); + err =3D netfs_prepare_read_buffer(subreq, INT_MAX); + if (err < 0) + return err; + + iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + + /* After this point, we're not allowed to return 
an error. */ + netfs_mark_read_submission(subreq); + + total =3D p9_client_read(fid, pos, &iter, &err); =20 /* if we just extended the file size, any portion not in * cache won't be on server and is zeroes */ @@ -89,6 +128,7 @@ static void v9fs_issue_read(struct netfs_io_subrequest *= subreq) =20 subreq->error =3D err; netfs_read_subreq_terminated(subreq); + return -EIOCBQUEUED; } =20 /** @@ -154,6 +194,7 @@ const struct netfs_request_ops v9fs_req_ops =3D { .free_request =3D v9fs_free_request, .issue_read =3D v9fs_issue_read, .begin_writeback =3D v9fs_begin_writeback, + .estimate_write =3D v9fs_estimate_write, .issue_write =3D v9fs_issue_write, }; =20 diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 6627a0d38e73..52ab84ab8c1f 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -255,7 +255,8 @@ static ssize_t afs_do_read_single(struct afs_vnode *dvn= ode, struct file *file) if (dvnode->directory_size < i_size) { size_t cur_size =3D dvnode->directory_size; =20 - ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size, i_size, + ret =3D bvecq_expand_buffer(&dvnode->directory, &cur_size, + round_up(i_size, PAGE_SIZE), mapping_gfp_mask(dvnode->netfs.inode.i_mapping)); dvnode->directory_size =3D cur_size; if (ret < 0) @@ -2210,9 +2211,10 @@ int afs_single_writepages(struct address_space *mapp= ing, if (is_dir ? 
test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags) : atomic64_read(&dvnode->cb_expires_at) !=3D AFS_NO_CB_PROMISE) { + size_t len =3D i_size_read(&dvnode->netfs.inode); iov_iter_bvec_queue(&iter, ITER_SOURCE, dvnode->directory, 0, 0, - i_size_read(&dvnode->netfs.inode)); - ret =3D netfs_writeback_single(mapping, wbc, &iter); + round_up(len, PAGE_SIZE)); + ret =3D netfs_writeback_single(mapping, wbc, &iter, len); } =20 up_read(&dvnode->validate_lock); diff --git a/fs/afs/file.c b/fs/afs/file.c index 424e0c98d67f..42131fe450af 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -329,11 +329,12 @@ void afs_fetch_data_immediate_cancel(struct afs_call = *call) /* * Fetch file data from the volume. */ -static void afs_issue_read(struct netfs_io_subrequest *subreq) +static int afs_issue_read(struct netfs_io_subrequest *subreq) { struct afs_operation *op; struct afs_vnode *vnode =3D AFS_FS_I(subreq->rreq->inode); struct key *key =3D subreq->rreq->netfs_priv; + int ret; =20 _enter("%s{%llx:%llu.%u},%x,,,", vnode->volume->name, @@ -342,19 +343,21 @@ static void afs_issue_read(struct netfs_io_subrequest= *subreq) vnode->fid.unique, key_serial(key)); =20 + ret =3D netfs_prepare_read_buffer(subreq, INT_MAX); + if (ret < 0) + return ret; + op =3D afs_alloc_operation(key, vnode->volume); - if (IS_ERR(op)) { - subreq->error =3D PTR_ERR(op); - netfs_read_subreq_terminated(subreq); - return; - } + if (IS_ERR(op)) + return PTR_ERR(op); =20 afs_op_set_vnode(op, 0, vnode); =20 op->fetch.subreq =3D subreq; op->ops =3D &afs_fetch_data_operation; =20 - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + /* After this point, we're not allowed to return an error. 
*/ + netfs_mark_read_submission(subreq); =20 if (subreq->rreq->origin =3D=3D NETFS_READAHEAD || subreq->rreq->iocb) { @@ -363,18 +366,19 @@ static void afs_issue_read(struct netfs_io_subrequest= *subreq) if (!afs_begin_vnode_operation(op)) { subreq->error =3D afs_put_operation(op); netfs_read_subreq_terminated(subreq); - return; + return -EIOCBQUEUED; } =20 if (!afs_select_fileserver(op)) { - afs_end_read(op); - return; + afs_end_read(op); /* Error recorded here. */ + return -EIOCBQUEUED; } =20 afs_issue_read_call(op); } else { afs_do_sync_operation(op); } + return -EIOCBQUEUED; } =20 static int afs_init_request(struct netfs_io_request *rreq, struct file *fi= le) @@ -453,7 +457,7 @@ const struct netfs_request_ops afs_req_ops =3D { .update_i_size =3D afs_update_i_size, .invalidate_cache =3D afs_netfs_invalidate_cache, .begin_writeback =3D afs_begin_writeback, - .prepare_write =3D afs_prepare_write, + .estimate_write =3D afs_estimate_write, .issue_write =3D afs_issue_write, .retry_request =3D afs_retry_request, }; diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c index 95494d5f2b8a..f59a9db4bb0e 100644 --- a/fs/afs/fsclient.c +++ b/fs/afs/fsclient.c @@ -339,7 +339,9 @@ static int afs_deliver_fs_fetch_data(struct afs_call *c= all) if (call->remaining =3D=3D 0) goto no_more_data; =20 - call->iter =3D &subreq->io_iter; + iov_iter_bvec_queue(&call->def_iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + call->iov_len =3D umin(call->remaining, subreq->len - subreq->transferre= d); call->unmarshall++; fallthrough; @@ -1085,7 +1087,7 @@ static void afs_fs_store_data64(struct afs_operation = *op) if (!call) return afs_op_nomem(op); =20 - call->write_iter =3D op->store.write_iter; + call->write_iter =3D &op->store.write_iter; =20 /* marshall the parameters */ bp =3D call->request; @@ -1139,7 +1141,7 @@ void afs_fs_store_data(struct afs_operation *op) if (!call) return afs_op_nomem(op); =20 - call->write_iter =3D 
op->store.write_iter; + call->write_iter =3D &op->store.write_iter; =20 /* marshall the parameters */ bp =3D call->request; diff --git a/fs/afs/internal.h b/fs/afs/internal.h index 9bf5d2f1dbc4..a60df9357a4f 100644 --- a/fs/afs/internal.h +++ b/fs/afs/internal.h @@ -906,7 +906,7 @@ struct afs_operation { afs_lock_type_t type; } lock; struct { - struct iov_iter *write_iter; + struct iov_iter write_iter; loff_t pos; loff_t size; loff_t i_size; @@ -1680,8 +1680,10 @@ extern int afs_check_volume_status(struct afs_volume= *, struct afs_operation *); /* * write.c */ -void afs_prepare_write(struct netfs_io_subrequest *subreq); -void afs_issue_write(struct netfs_io_subrequest *subreq); +int afs_estimate_write(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_write_estimate *estimate); +int afs_issue_write(struct netfs_io_subrequest *subreq); void afs_begin_writeback(struct netfs_io_request *wreq); void afs_retry_request(struct netfs_io_request *wreq, struct netfs_io_stre= am *stream); extern int afs_writepages(struct address_space *, struct writeback_control= *); diff --git a/fs/afs/write.c b/fs/afs/write.c index 93ad86ff3345..1f6045bfeecc 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -84,17 +84,20 @@ static const struct afs_operation_ops afs_store_data_op= eration =3D { }; =20 /* - * Prepare a subrequest to write to the server. This sets the max_len - * parameter. + * Estimate the maximum size of a write we can send to the server. 
*/ -void afs_prepare_write(struct netfs_io_subrequest *subreq) +int afs_estimate_write(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_write_estimate *estimate) { - struct netfs_io_stream *stream =3D &subreq->rreq->io_streams[subreq->stre= am_nr]; + unsigned long long limit =3D ULLONG_MAX - stream->issue_from; + unsigned long long max_len =3D 256 * 1024 * 1024; =20 //if (test_bit(NETFS_SREQ_RETRYING, &subreq->flags)) - // subreq->max_len =3D 512 * 1024; - //else - stream->sreq_max_len =3D 256 * 1024 * 1024; + // max_len =3D 512 * 1024; + + estimate->issue_at =3D stream->issue_from + umin(max_len, limit); + return 0; } =20 /* @@ -140,12 +143,15 @@ static void afs_issue_write_worker(struct work_struct= *work) op->flags |=3D AFS_OPERATION_UNINTR; op->ops =3D &afs_store_data_operation; =20 + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); afs_begin_vnode_operation(op); =20 - op->store.write_iter =3D &subreq->io_iter; op->store.i_size =3D umax(pos + len, vnode->netfs.remote_i_size); op->mtime =3D inode_get_mtime(&vnode->netfs.inode); =20 + iov_iter_bvec_queue(&op->store.write_iter, ITER_SOURCE, subreq->content.b= vecq, + subreq->content.slot, subreq->content.offset, subreq->len); + afs_wait_for_operation(op); ret =3D afs_put_operation(op); switch (ret) { @@ -169,11 +175,20 @@ static void afs_issue_write_worker(struct work_struct= *work) netfs_write_subrequest_terminated(subreq, ret < 0 ? 
ret : subreq->len); } =20 -void afs_issue_write(struct netfs_io_subrequest *subreq) +int afs_issue_write(struct netfs_io_subrequest *subreq) { + int ret; + + if (subreq->len > 256 * 1024 * 1024) + subreq->len =3D 256 * 1024 * 1024; + ret =3D netfs_prepare_write_buffer(subreq, INT_MAX); + if (ret < 0) + return ret; + subreq->work.func =3D afs_issue_write_worker; if (!queue_work(system_dfl_wq, &subreq->work)) WARN_ON_ONCE(1); + return -EIOCBQUEUED; } =20 /* @@ -184,6 +199,8 @@ void afs_begin_writeback(struct netfs_io_request *wreq) { if (S_ISREG(wreq->inode->i_mode)) afs_get_writeback_key(wreq); + + wreq->io_streams[0].avail =3D true; } =20 /* diff --git a/fs/afs/yfsclient.c b/fs/afs/yfsclient.c index 24fb562ebd33..ffd1d4c87290 100644 --- a/fs/afs/yfsclient.c +++ b/fs/afs/yfsclient.c @@ -385,7 +385,9 @@ static int yfs_deliver_fs_fetch_data64(struct afs_call = *call) if (call->remaining =3D=3D 0) goto no_more_data; =20 - call->iter =3D &subreq->io_iter; + iov_iter_bvec_queue(&call->def_iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + call->iov_len =3D min(call->remaining, subreq->len - subreq->transferred= ); call->unmarshall++; fallthrough; @@ -1357,7 +1359,7 @@ void yfs_fs_store_data(struct afs_operation *op) if (!call) return afs_op_nomem(op); =20 - call->write_iter =3D op->store.write_iter; + call->write_iter =3D &op->store.write_iter; =20 /* marshall the parameters */ bp =3D call->request; diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c index 2af55a75b511..05a37b4bdf10 100644 --- a/fs/cachefiles/io.c +++ b/fs/cachefiles/io.c @@ -26,7 +26,10 @@ struct cachefiles_kiocb { }; struct cachefiles_object *object; netfs_io_terminated_t term_func; - void *term_func_priv; + union { + struct netfs_io_subrequest *subreq; + void *term_func_priv; + }; bool was_async; unsigned int inval_counter; /* Copy of cookie->inval_counter */ u64 b_writing; @@ -194,6 +197,125 @@ static int cachefiles_read(struct 
netfs_cache_resourc= es *cres, return ret; } =20 +/* + * Handle completion of a read from the cache issued by netfslib. + */ +static void cachefiles_issue_read_complete(struct kiocb *iocb, long ret) +{ + struct cachefiles_kiocb *ki =3D container_of(iocb, struct cachefiles_kioc= b, iocb); + struct netfs_io_subrequest *subreq =3D ki->subreq; + struct inode *inode =3D file_inode(ki->iocb.ki_filp); + + _enter("%ld", ret); + + if (ret < 0) { + subreq->error =3D -ESTALE; + trace_cachefiles_io_error(ki->object, inode, ret, + cachefiles_trace_read_error); + } + + if (ret >=3D 0) { + if (ki->object->cookie->inval_counter =3D=3D ki->inval_counter) { + subreq->error =3D 0; + if (ret > 0) { + subreq->transferred +=3D ret; + __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags); + } + } else { + subreq->error =3D -ESTALE; + } + } + + netfs_read_subreq_terminated(subreq); + cachefiles_put_kiocb(ki); +} + +/* + * Issue a read operation to the cache. + */ +static int cachefiles_issue_read(struct netfs_io_subrequest *subreq) +{ + struct netfs_cache_resources *cres =3D &subreq->rreq->cache_resources; + struct cachefiles_object *object; + struct cachefiles_kiocb *ki; + struct iov_iter iter; + struct file *file; + unsigned int old_nofs; + ssize_t ret =3D -ENOBUFS; + + if (!fscache_wait_for_operation(cres, FSCACHE_WANT_READ)) + return -ENOBUFS; + + fscache_count_read(); + object =3D cachefiles_cres_object(cres); + file =3D cachefiles_cres_file(cres); + + _enter("%pD,%li,%llx,%zx/%llx", + file, file_inode(file)->i_ino, subreq->start, subreq->len, + i_size_read(file_inode(file))); + + if (subreq->len > MAX_RW_COUNT) + subreq->len =3D MAX_RW_COUNT; + + ret =3D netfs_prepare_read_buffer(subreq, BIO_MAX_VECS); + if (ret < 0) + return ret; + + iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, subreq->len); + + ki =3D kzalloc_obj(struct cachefiles_kiocb); + if (!ki) + return -ENOMEM; + + refcount_set(&ki->ki_refcnt, 2); + 
ki->iocb.ki_filp =3D file; + ki->iocb.ki_pos =3D subreq->start; + ki->iocb.ki_flags =3D IOCB_DIRECT; + ki->iocb.ki_ioprio =3D get_current_ioprio(); + ki->iocb.ki_complete =3D cachefiles_issue_read_complete; + ki->object =3D object; + ki->inval_counter =3D cres->inval_counter; + ki->subreq =3D subreq; + ki->was_async =3D true; + + /* After this point, we're not allowed to return an error. */ + netfs_mark_read_submission(subreq); + + get_file(ki->iocb.ki_filp); + cachefiles_grab_object(object, cachefiles_obj_get_ioreq); + + trace_cachefiles_read(object, file_inode(file), ki->iocb.ki_pos, subreq->= len); + old_nofs =3D memalloc_nofs_save(); + ret =3D cachefiles_inject_read_error(); + if (ret =3D=3D 0) + ret =3D vfs_iocb_iter_read(file, &ki->iocb, &iter); + memalloc_nofs_restore(old_nofs); + + switch (ret) { + case -EIOCBQUEUED: + break; + + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. + */ + ret =3D -EINTR; + fallthrough; + default: + ki->was_async =3D false; + cachefiles_issue_read_complete(&ki->iocb, ret); + break; + } + + cachefiles_put_kiocb(ki); + _leave(" =3D %zd", ret); + return -EIOCBQUEUED; +} + /* * Query the occupancy of the cache in a region, returning the extent of t= he * next two chunks of cached data and the next hole. 
@@ -610,104 +732,67 @@ int __cachefiles_prepare_write(struct cachefiles_obj= ect *object, cachefiles_has_space_for_write); } =20 -static int cachefiles_prepare_write(struct netfs_cache_resources *cres, - loff_t *_start, size_t *_len, size_t upper_len, - loff_t i_size, bool no_space_allocated_yet) -{ - struct cachefiles_object *object =3D cachefiles_cres_object(cres); - struct cachefiles_cache *cache =3D object->volume->cache; - const struct cred *saved_cred; - int ret; - - if (!cachefiles_cres_file(cres)) { - if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE)) - return -ENOBUFS; - if (!cachefiles_cres_file(cres)) - return -ENOBUFS; - } - - cachefiles_begin_secure(cache, &saved_cred); - ret =3D __cachefiles_prepare_write(object, cachefiles_cres_file(cres), - _start, _len, upper_len, - no_space_allocated_yet); - cachefiles_end_secure(cache, saved_cred); - return ret; -} - -static void cachefiles_prepare_write_subreq(struct netfs_io_subrequest *su= breq) +static int cachefiles_estimate_write(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_write_estimate *estimate) { - struct netfs_io_request *wreq =3D subreq->rreq; - struct netfs_cache_resources *cres =3D &wreq->cache_resources; - struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr]; - - _enter("W=3D%x[%x] %llx", wreq->debug_id, subreq->debug_index, subreq->st= art); - - stream->sreq_max_len =3D MAX_RW_COUNT; - stream->sreq_max_segs =3D BIO_MAX_VECS; - - if (!cachefiles_cres_file(cres)) { - if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE)) - return netfs_prepare_write_failed(subreq); - if (!cachefiles_cres_file(cres)) - return netfs_prepare_write_failed(subreq); - } + estimate->issue_at =3D stream->issue_from + MAX_RW_COUNT; + estimate->max_segs =3D BIO_MAX_VECS; + return 0; } =20 -static void cachefiles_issue_write(struct netfs_io_subrequest *subreq) +static int cachefiles_issue_write(struct netfs_io_subrequest *subreq) { struct netfs_io_request *wreq 
=3D subreq->rreq; struct netfs_cache_resources *cres =3D &wreq->cache_resources; struct cachefiles_object *object =3D cachefiles_cres_object(cres); struct cachefiles_cache *cache =3D object->volume->cache; + struct iov_iter iter; const struct cred *saved_cred; - size_t off, pre, post, len =3D subreq->len; loff_t start =3D subreq->start; + size_t len =3D subreq->len; int ret; =20 _enter("W=3D%x[%x] %llx-%llx", wreq->debug_id, subreq->debug_index, start, start + len - 1); =20 - /* We need to start on the cache granularity boundary */ - off =3D start & (cache->bsize - 1); - if (off) { - pre =3D cache->bsize - off; - if (pre >=3D len) { - fscache_count_dio_misfit(); - netfs_write_subrequest_terminated(subreq, len); - return; - } - subreq->transferred +=3D pre; - start +=3D pre; - len -=3D pre; - iov_iter_advance(&subreq->io_iter, pre); - } - - /* We also need to end on the cache granularity boundary */ - post =3D len & (cache->bsize - 1); - if (post) { - len -=3D post; - if (len =3D=3D 0) { - fscache_count_dio_misfit(); - netfs_write_subrequest_terminated(subreq, post); - return; - } - iov_iter_truncate(&subreq->io_iter, len); + if (!cachefiles_cres_file(cres)) { + if (!fscache_wait_for_operation(cres, FSCACHE_WANT_WRITE)) + return -EINVAL; + if (!cachefiles_cres_file(cres)) + return -EINVAL; + } + + ret =3D netfs_prepare_write_buffer(subreq, BIO_MAX_VECS); + if (ret < 0) + return ret; + + /* The buffer extraction func may round out start and end. */ + start =3D subreq->start; + len =3D subreq->len; + + /* We need to start and end on cache granularity boundaries. 
*/ + if (WARN_ON_ONCE(start & (cache->bsize - 1)) || + WARN_ON_ONCE(len & (cache->bsize - 1))) { + fscache_count_dio_misfit(); + return -EIO; } =20 + iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, len); + trace_netfs_sreq(subreq, netfs_sreq_trace_cache_prepare); cachefiles_begin_secure(cache, &saved_cred); ret =3D __cachefiles_prepare_write(object, cachefiles_cres_file(cres), &start, &len, len, true); cachefiles_end_secure(cache, saved_cred); - if (ret < 0) { - netfs_write_subrequest_terminated(subreq, ret); - return; - } + if (ret < 0) + return ret; =20 trace_netfs_sreq(subreq, netfs_sreq_trace_cache_write); - cachefiles_write(&subreq->rreq->cache_resources, - subreq->start, &subreq->io_iter, + cachefiles_write(&subreq->rreq->cache_resources, subreq->start, &iter, netfs_write_subrequest_terminated, subreq); + return -EIOCBQUEUED; } =20 /* @@ -854,9 +939,9 @@ static const struct netfs_cache_ops cachefiles_netfs_ca= che_ops =3D { .end_operation =3D cachefiles_end_operation, .read =3D cachefiles_read, .write =3D cachefiles_write, + .issue_read =3D cachefiles_issue_read, .issue_write =3D cachefiles_issue_write, - .prepare_write =3D cachefiles_prepare_write, - .prepare_write_subreq =3D cachefiles_prepare_write_subreq, + .estimate_write =3D cachefiles_estimate_write, .prepare_ondemand_read =3D cachefiles_prepare_ondemand_read, .query_occupancy =3D cachefiles_query_occupancy, .collect_write =3D cachefiles_collect_write, diff --git a/fs/ceph/Kconfig b/fs/ceph/Kconfig index 3d64a316ca31..aa6ccd7794d2 100644 --- a/fs/ceph/Kconfig +++ b/fs/ceph/Kconfig @@ -4,6 +4,7 @@ config CEPH_FS depends on INET select CEPH_LIB select NETFS_SUPPORT + select NETFS_PGPRIV2 select FS_ENCRYPTION_ALGS if FS_ENCRYPTION default n help diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index e87b3bb94ee8..8aab4f7c516f 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -269,7 +269,7 @@ static void finish_netfs_read(struct ceph_osd_request 
*= req) ceph_dec_osd_stopping_blocker(fsc->mdsc); } =20 -static bool ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq) +static int ceph_netfs_issue_op_inline(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; struct inode *inode =3D rreq->inode; @@ -278,7 +278,8 @@ static bool ceph_netfs_issue_op_inline(struct netfs_io_= subrequest *subreq) struct ceph_mds_request *req; struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_inode_info *ci =3D ceph_inode(inode); - ssize_t err =3D 0; + struct iov_iter iter; + ssize_t err; size_t len; int mode; =20 @@ -287,21 +288,33 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i= o_subrequest *subreq) __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); =20 - if (subreq->start >=3D inode->i_size) + if (subreq->start >=3D inode->i_size) { + __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags); + err =3D 0; goto out; + } + + err =3D netfs_prepare_read_buffer(subreq, INT_MAX); + if (err < 0) + return err; + + iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, + subreq->len); =20 /* We need to fetch the inline data. */ mode =3D ceph_try_to_choose_auth_mds(inode, CEPH_STAT_CAP_INLINE_DATA); req =3D ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, mode); - if (IS_ERR(req)) { - err =3D PTR_ERR(req); - goto out; - } + if (IS_ERR(req)) + return PTR_ERR(req); + req->r_ino1 =3D ci->i_vino; req->r_args.getattr.mask =3D cpu_to_le32(CEPH_STAT_CAP_INLINE_DATA); req->r_num_caps =3D 2; =20 - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + /* After this point, we're not allowed to return an error. 
*/ + netfs_mark_read_submission(subreq); + err =3D ceph_mdsc_do_request(mdsc, NULL, req); if (err < 0) goto out; @@ -311,11 +324,11 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i= o_subrequest *subreq) if (iinfo->inline_version =3D=3D CEPH_INLINE_NONE) { /* The data got uninlined */ ceph_mdsc_put_request(req); - return false; + return 1; } =20 len =3D min_t(size_t, iinfo->inline_len - subreq->start, subreq->len); - err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io= _iter); + err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &iter); if (err =3D=3D 0) { err =3D -EFAULT; } else { @@ -328,26 +341,10 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i= o_subrequest *subreq) subreq->error =3D err; trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress); netfs_read_subreq_terminated(subreq); - return true; + return -EIOCBQUEUED; } =20 -static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq) -{ - struct netfs_io_request *rreq =3D subreq->rreq; - struct inode *inode =3D rreq->inode; - struct ceph_inode_info *ci =3D ceph_inode(inode); - struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode); - u64 objno, objoff; - u32 xlen; - - /* Truncate the extent at the end of the current block */ - ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, - &objno, &objoff, &xlen); - rreq->io_streams[0].sreq_max_len =3D umin(xlen, fsc->mount_options->rsize= ); - return 0; -} - -static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) +static int ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; struct inode *inode =3D rreq->inode; @@ -356,48 +353,65 @@ static void ceph_netfs_issue_read(struct netfs_io_sub= request *subreq) struct ceph_client *cl =3D fsc->client; struct ceph_osd_request *req =3D NULL; struct ceph_vino vino =3D ceph_vino(inode); + struct iov_iter iter; + u64 objno, objoff, len, off =3D subreq->start; + u32 maxlen; 
int err; - u64 len; bool sparse =3D IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREA= D); - u64 off =3D subreq->start; int extent_cnt; =20 - if (ceph_inode_is_shutdown(inode)) { - err =3D -EIO; - goto out; + if (ceph_inode_is_shutdown(inode)) + return -EIO; + + if (ceph_has_inline_data(ci)) { + err =3D ceph_netfs_issue_op_inline(subreq); + if (err !=3D 1) + return err; } =20 - if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq)) - return; + /* Truncate the extent at the end of the current block */ + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, + &objno, &objoff, &maxlen); + maxlen =3D umin(maxlen, fsc->mount_options->rsize); + len =3D umin(subreq->len, maxlen); + subreq->len =3D len; =20 // TODO: This rounding here is slightly dodgy. It *should* work, for // now, as the cache only deals in blocks that are a multiple of // PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE. What needs to // happen is for the fscrypt driving to be moved into netfslib and the // data in the cache also to be stored encrypted. - len =3D subreq->len; ceph_fscrypt_adjust_off_and_len(inode, &off, &len); =20 req =3D ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, off, &len, 0, 1, sparse ? 
CEPH_OSD_OP_SPARSE_READ : CEPH_OSD_OP_READ, CEPH_OSD_FLAG_READ, NULL, ci->i_truncate_seq, ci->i_truncate_size, false); - if (IS_ERR(req)) { - err =3D PTR_ERR(req); - req =3D NULL; - goto out; - } + if (IS_ERR(req)) + return PTR_ERR(req); =20 if (sparse) { extent_cnt =3D __ceph_sparse_read_ext_count(inode, len); err =3D ceph_alloc_sparse_ext_map(&req->r_ops[0], extent_cnt); - if (err) - goto out; + if (err) { + ceph_osdc_put_request(req); + return err; + } } =20 doutc(cl, "%llx.%llx pos=3D%llu orig_len=3D%zu len=3D%llu\n", ceph_vinop(inode), subreq->start, subreq->len, len); =20 + err =3D netfs_prepare_read_buffer(subreq, INT_MAX); + if (err < 0) { + ceph_osdc_put_request(req); + return err; + } + + iov_iter_bvec_queue(&iter, ITER_DEST, subreq->content.bvecq, + subreq->content.slot, subreq->content.offset, + subreq->len); + /* * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for * encrypted inodes. We'd need infrastructure that handles an iov_iter @@ -416,13 +430,12 @@ static void ceph_netfs_issue_read(struct netfs_io_sub= request *subreq) * ceph_msg_data_cursor_init() triggers BUG_ON() in the case * if msg->sparse_read_total > msg->data_length. 
*/ - subreq->io_iter.count =3D len; - - err =3D iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_o= ff); + err =3D iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off); if (err < 0) { doutc(cl, "%llx.%llx failed to allocate pages, %d\n", ceph_vinop(inode), err); - goto out; + ceph_osdc_put_request(req); + return -EIO; } =20 /* should always give us a page-aligned read */ @@ -433,32 +446,28 @@ static void ceph_netfs_issue_read(struct netfs_io_sub= request *subreq) osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false, false); } else { - osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter); + osd_req_op_extent_osd_iter(req, 0, &iter); } if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { - err =3D -EIO; - goto out; + ceph_osdc_put_request(req); + return -EIO; } req->r_callback =3D finish_netfs_read; req->r_priv =3D subreq; req->r_inode =3D inode; ihold(inode); =20 - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + /* After this point, we're not allowed to return an error. 
*/ + netfs_mark_read_submission(subreq); ceph_osdc_start_request(req->r_osdc, req); -out: ceph_osdc_put_request(req); - if (err) { - subreq->error =3D err; - netfs_read_subreq_terminated(subreq); - } - doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err); + doutc(cl, "%llx.%llx result -EIOCBQUEUED\n", ceph_vinop(inode)); + return -EIOCBQUEUED; } =20 static int ceph_init_request(struct netfs_io_request *rreq, struct file *f= ile) { struct inode *inode =3D rreq->inode; - struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode); struct ceph_client *cl =3D ceph_inode_to_client(inode); int got =3D 0, want =3D CEPH_CAP_FILE_CACHE; struct ceph_netfs_request_data *priv; @@ -510,7 +519,6 @@ static int ceph_init_request(struct netfs_io_request *r= req, struct file *file) =20 priv->caps =3D got; rreq->netfs_priv =3D priv; - rreq->io_streams[0].sreq_max_len =3D fsc->mount_options->rsize; =20 out: if (ret < 0) { @@ -538,7 +546,6 @@ static void ceph_netfs_free_request(struct netfs_io_req= uest *rreq) const struct netfs_request_ops ceph_netfs_ops =3D { .init_request =3D ceph_init_request, .free_request =3D ceph_netfs_free_request, - .prepare_read =3D ceph_netfs_prepare_read, .issue_read =3D ceph_netfs_issue_read, .expand_readahead =3D ceph_netfs_expand_readahead, .check_write_begin =3D ceph_netfs_check_write_begin, diff --git a/fs/netfs/Kconfig b/fs/netfs/Kconfig index 7701c037c328..d0e7b0971fa3 100644 --- a/fs/netfs/Kconfig +++ b/fs/netfs/Kconfig @@ -22,6 +22,9 @@ config NETFS_STATS between CPUs. On the other hand, the stats are very useful for debugging purposes. Saying 'Y' here is recommended. 
 
+config NETFS_PGPRIV2
+	bool
+
 config NETFS_DEBUG
 	bool "Enable dynamic debugging netfslib and FS-Cache"
 	depends on NETFS_SUPPORT
diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index 0621e6870cbd..421dd0be413b 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -12,13 +12,13 @@ netfs-y := \
 	misc.o \
 	objects.o \
 	read_collect.o \
-	read_pgpriv2.o \
 	read_retry.o \
 	read_single.o \
 	write_collect.o \
 	write_issue.o \
 	write_retry.o
 
+netfs-$(CONFIG_NETFS_PGPRIV2) += read_pgpriv2.o
 netfs-$(CONFIG_NETFS_STATS) += stats.o
 
 netfs-$(CONFIG_FSCACHE) += \
diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c
index 2cfd33abfb80..81aa99910e5d 100644
--- a/fs/netfs/buffered_read.c
+++ b/fs/netfs/buffered_read.c
@@ -98,51 +98,68 @@ static int netfs_begin_cache_read(struct netfs_io_request *rreq, struct netfs_in
 }
 
 /*
- * netfs_prepare_read_iterator - Prepare the subreq iterator for I/O
- * @subreq: The subrequest to be set up
- *
- * Prepare the I/O iterator representing the read buffer on a subrequest for
- * the filesystem to use for I/O (it can be passed directly to a socket).  This
- * is intended to be called from the ->issue_read() method once the filesystem
- * has trimmed the request to the size it wants.
- *
- * Returns the limited size if successful and -ENOMEM if insufficient memory
- * available.
+ * Prepare the I/O buffer on a buffered read subrequest for the filesystem to
+ * use as a bvec queue.
  */
-static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *subreq)
+static int netfs_prepare_buffered_read_buffer(struct netfs_io_subrequest *subreq,
+					      unsigned int max_segs)
 {
 	struct netfs_io_request *rreq = subreq->rreq;
 	struct netfs_io_stream *stream = &rreq->io_streams[0];
 	ssize_t extracted;
-	size_t rsize = subreq->len;
 
-	if (subreq->source == NETFS_DOWNLOAD_FROM_SERVER)
-		rsize = umin(rsize, stream->sreq_max_len);
+	_enter("R=%08x[%x] l=%zx s=%u",
+	       rreq->debug_id, subreq->debug_index, subreq->len, max_segs);
 
-	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-	extracted = bvecq_slice(&rreq->dispatch_cursor, subreq->len,
-				stream->sreq_max_segs, &subreq->nr_segs);
-	if (extracted < rsize) {
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+	extracted = bvecq_slice(&stream->dispatch_cursor, subreq->len,
+				max_segs, &subreq->nr_segs);
+
+	stream->buffered -= extracted;
+	if (extracted < subreq->len) {
 		subreq->len = extracted;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
 	}
 
-	return subreq->len;
+	return 0;
 }
 
-/*
- * Issue a read against the cache.
- * - Eats the caller's ref on subreq.
+/**
+ * netfs_prepare_read_buffer - Get the buffer for a subrequest
+ * @subreq: The subrequest to get the buffer for
+ * @max_segs: Maximum number of segments in buffer (or INT_MAX)
+ *
+ * Extract a slice of buffer from the stream and attach it to the subrequest as
+ * a bio_vec queue.  The maximum amount of data attached is set by
+ * @subreq->len, but this may be shortened if @max_segs would be exceeded.
+ *
+ * [!] NOTE: This must be run in the same thread as ->issue_read() was called
+ * in as we access the readahead_control struct if there is one.
  */
-static void netfs_read_cache_to_pagecache(struct netfs_io_request *rreq,
-					  struct netfs_io_subrequest *subreq)
+int netfs_prepare_read_buffer(struct netfs_io_subrequest *subreq,
+			      unsigned int max_segs)
 {
-	struct netfs_cache_resources *cres = &rreq->cache_resources;
-
-	netfs_stat(&netfs_n_rh_read);
-	cres->ops->read(cres, subreq->start, &subreq->io_iter, NETFS_READ_HOLE_IGNORE,
-			netfs_cache_read_terminated, subreq);
+	switch (subreq->rreq->origin) {
+	case NETFS_READAHEAD:
+	case NETFS_READPAGE:
+	case NETFS_READ_FOR_WRITE:
+		if (subreq->retry_count)
+			return netfs_prepare_buffered_read_retry_buffer(subreq, max_segs);
+		return netfs_prepare_buffered_read_buffer(subreq, max_segs);
+
+	case NETFS_UNBUFFERED_READ:
+	case NETFS_DIO_READ:
+	case NETFS_READ_GAPS:
+		return netfs_prepare_unbuffered_read_buffer(subreq, max_segs);
+	case NETFS_READ_SINGLE:
+		return netfs_prepare_read_single_buffer(subreq, max_segs);
+	default:
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
 }
+EXPORT_SYMBOL(netfs_prepare_read_buffer);
 
 int netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_occupancy *occ)
 {
@@ -157,12 +174,22 @@ int netfs_read_query_cache(struct netfs_io_request *rreq, struct fscache_occupan
 	return cres->ops->query_occupancy(cres, occ);
 }
 
-static void netfs_queue_read(struct netfs_io_request *rreq,
-			     struct netfs_io_subrequest *subreq,
-			     bool last_subreq)
+/**
+ * netfs_mark_read_submission - Mark a read subrequest as being ready for submission
+ * @subreq: The subrequest to be marked
+ *
+ * Calling this marks a read subrequest as being ready for submission and makes
+ * it available to the collection thread.  After calling this, the filesystem's
+ * ->issue_read() method must invoke netfs_read_subreq_terminated() to end the
+ * subrequest and must return -EIOCBQUEUED.
+ */
+void netfs_mark_read_submission(struct netfs_io_subrequest *subreq)
 {
+	struct netfs_io_request *rreq = subreq->rreq;
 	struct netfs_io_stream *stream = &rreq->io_streams[0];
 
+	_enter("R=%08x[%x]", rreq->debug_id, subreq->debug_index);
+
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 
 	/* We add to the end of the list whilst the collector may be walking
@@ -170,45 +197,57 @@ static void netfs_queue_read(struct netfs_io_request *rreq,
 	 * remove entries off of the front.
 	 */
 	spin_lock(&rreq->lock);
-	list_add_tail(&subreq->rreq_link, &stream->subrequests);
-	if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-		if (!stream->active) {
-			stream->collected_to = subreq->start;
-			/* Store list pointers before active flag */
-			smp_store_release(&stream->active, true);
+	if (list_empty(&subreq->rreq_link)) {
+		list_add_tail(&subreq->rreq_link, &stream->subrequests);
+		if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
+			if (!stream->active) {
+				stream->collected_to = subreq->start;
+				/* Store list pointers before active flag */
+				smp_store_release(&stream->active, true);
+			}
 		}
 	}
 
-	if (last_subreq) {
+	rreq->submitted += subreq->len;
+	stream->issue_from = subreq->start + subreq->len;
+	if (!stream->buffered) {
 		smp_wmb(); /* Write lists before ALL_QUEUED. */
 		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+		trace_netfs_rreq(rreq, netfs_rreq_trace_all_queued);
 	}
 
 	spin_unlock(&rreq->lock);
+
+	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
 }
+EXPORT_SYMBOL(netfs_mark_read_submission);
 
-static void netfs_issue_read(struct netfs_io_request *rreq,
-			     struct netfs_io_subrequest *subreq)
+static int netfs_issue_read(struct netfs_io_request *rreq,
+			    struct netfs_io_subrequest *subreq)
 {
-	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
+
+	_enter("R=%08x[%x]", rreq->debug_id, subreq->debug_index);
 
 	switch (subreq->source) {
 	case NETFS_DOWNLOAD_FROM_SERVER:
-		rreq->netfs_ops->issue_read(subreq);
-		break;
-	case NETFS_READ_FROM_CACHE:
-		netfs_read_cache_to_pagecache(rreq, subreq);
-		break;
+		return rreq->netfs_ops->issue_read(subreq);
+	case NETFS_READ_FROM_CACHE: {
+		struct netfs_cache_resources *cres = &rreq->cache_resources;
+
+		netfs_stat(&netfs_n_rh_read);
+		cres->ops->issue_read(subreq);
+		return -EIOCBQUEUED;
+	}
 	default:
-		bvecq_zero(&rreq->dispatch_cursor, subreq->len);
+		stream->issue_from = subreq->start + subreq->len;
+		stream->buffered = 0;
+		netfs_mark_read_submission(subreq);
+		bvecq_zero(&stream->dispatch_cursor, subreq->len);
 		subreq->transferred = subreq->len;
 		subreq->error = 0;
-		iov_iter_zero(subreq->len, &subreq->io_iter);
-		subreq->transferred = subreq->len;
 		netfs_read_subreq_terminated(subreq);
-		break;
+		return 0;
 	}
 }
 
@@ -228,21 +267,18 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 		.cached_to[1] = ULLONG_MAX,
 	};
 	struct fscache_occupancy *occ = &_occ;
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
 	struct netfs_inode *ictx = netfs_inode(rreq->inode);
-	unsigned long long start = rreq->start;
-	ssize_t size = rreq->len;
 	int ret = 0;
 
 	_enter("R=%08x", rreq->debug_id);
 
-	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
-	bvecq_pos_set(&rreq->collect_cursor, &rreq->dispatch_cursor);
+	bvecq_pos_set(&stream->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_set(&rreq->collect_cursor, &rreq->load_cursor);
 
 	do {
-		int (*prepare_read)(struct netfs_io_subrequest *subreq) = NULL;
 		struct netfs_io_subrequest *subreq;
-		unsigned long long hole_to, cache_to;
-		ssize_t slice;
+		unsigned long long hole_to, cache_to, stop;
 
 		/* If we don't have any, find out the next couple of data
 		 * extents from the cache, containing of following the
@@ -251,7 +287,7 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 		 */
 		hole_to = occ->cached_from[0];
 		cache_to = occ->cached_to[0];
-		if (start >= cache_to) {
+		if (stream->issue_from >= cache_to) {
			/* Extent exhausted; shuffle down. */
			int i;
 
@@ -279,36 +315,33 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 			break;
 		}
 
-		subreq->start = start;
-		subreq->len = size;
+		subreq->start = stream->issue_from;
+		stop = stream->issue_from + stream->buffered;
 
 		_debug("rsub %llx %llx-%llx", subreq->start, hole_to, cache_to);
 
-		if (start >= hole_to && start < cache_to) {
+		if (stream->issue_from >= hole_to && stream->issue_from < cache_to) {
 			/* Overlap with a cached region, where the cache may
 			 * record a block of zeroes.
 			 */
-			_debug("cached s=%llx c=%llx l=%zx", start, cache_to, size);
-			subreq->len = umin(cache_to - start, size);
+			_debug("cached s=%llx c=%llx l=%zx",
+			       stream->issue_from, cache_to, stream->buffered);
+			subreq->len = umin(cache_to - stream->issue_from, stream->buffered);
 			subreq->len = round_up(subreq->len, occ->granularity);
 			if (occ->cached_type[0] == FSCACHE_EXTENT_ZERO) {
 				subreq->source = NETFS_FILL_WITH_ZEROES;
 				netfs_stat(&netfs_n_rh_zero);
 			} else {
 				subreq->source = NETFS_READ_FROM_CACHE;
-				prepare_read = rreq->cache_resources.ops->prepare_read;
 			}
-
-			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
 		} else if ((subreq->start >= ictx->zero_point || subreq->start >= rreq->i_size) &&
-			   size > 0) {
+			   subreq->start < stop) {
 			/* If this range lies beyond the zero-point, that part
 			 * can just be cleared locally.
 			 */
-			_debug("zero %llx-%llx", start, start + size);
-			subreq->len = size;
+			_debug("zero %llx-%llx", subreq->start, stop);
+			subreq->len = stream->buffered;
 			subreq->source = NETFS_FILL_WITH_ZEROES;
 			if (rreq->cache_resources.ops)
 				__set_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags);
@@ -319,10 +352,10 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 			 * that part can just be cleared locally.
 			 */
 			unsigned long long zlimit = umin(rreq->i_size, ictx->zero_point);
-			unsigned long long limit = min3(zlimit, start + size, hole_to);
+			unsigned long long limit = min3(zlimit, stop, hole_to);
 
 			_debug("limit %llx %llx", rreq->i_size, ictx->zero_point);
-			_debug("download %llx-%llx", start, start + size);
+			_debug("download %llx-%llx", subreq->start, stop);
 			subreq->len = umin(limit - subreq->start, ULONG_MAX);
 			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
 			if (rreq->cache_resources.ops)
@@ -330,10 +363,10 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 			netfs_stat(&netfs_n_rh_download);
 		}
 
-		if (size == 0) {
+		if (subreq->len == 0) {
 			pr_err("ZERO-LEN READ: R=%08x[%x] l=%zx/%zx s=%llx z=%llx i=%llx",
 			       rreq->debug_id, subreq->debug_index,
-			       subreq->len, size,
+			       subreq->len, stream->buffered,
 			       subreq->start, ictx->zero_point, rreq->i_size);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_cancel);
 			/* Not queued - release both refs. */
@@ -342,24 +375,8 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 			break;
 		}
 
-		rreq->io_streams[0].sreq_max_len = MAX_RW_COUNT;
-		rreq->io_streams[0].sreq_max_segs = INT_MAX;
-
-		if (prepare_read) {
-			ret = prepare_read(subreq);
-			if (ret < 0) {
-				subreq->error = ret;
-				/* Not queued - release both refs. */
-				netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
-				netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
-				break;
-			}
-			trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
-		}
-
-		slice = netfs_prepare_read_iterator(subreq);
-		if (slice < 0) {
-			ret = slice;
+		ret = netfs_issue_read(rreq, subreq);
+		if (ret != 0 && ret != -EIOCBQUEUED) {
 			subreq->error = ret;
 			trace_netfs_sreq(subreq, netfs_sreq_trace_cancel);
 			/* Not queued - release both refs. */
@@ -367,18 +384,12 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 			netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
 			break;
 		}
-		size -= slice;
-		start += slice;
+		ret = 0;
 
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-
-		netfs_queue_read(rreq, subreq, size <= 0);
-		netfs_issue_read(rreq, subreq);
-
 		netfs_maybe_bulk_drop_ra_refs(rreq);
 		cond_resched();
-	} while (size > 0);
+	} while (stream->buffered > 0);
 
-	if (unlikely(size > 0)) {
+	if (unlikely(stream->buffered > 0)) {
 		smp_wmb(); /* Write lists before ALL_QUEUED. */
 		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		netfs_wake_collector(rreq);
@@ -388,7 +399,7 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 		cmpxchg(&rreq->error, 0, ret);
 
 	bvecq_pos_unset(&rreq->load_cursor);
-	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&stream->dispatch_cursor);
 }
 
 /**
@@ -409,17 +420,22 @@ static void netfs_read_to_pagecache(struct netfs_io_request *rreq)
 void netfs_readahead(struct readahead_control *ractl)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	struct netfs_inode *ictx = netfs_inode(ractl->mapping->host);
 	unsigned long long start = readahead_pos(ractl);
 	ssize_t added;
 	size_t size = readahead_length(ractl);
 	int ret;
 
+	_enter("");
+
 	rreq = netfs_alloc_request(ractl->mapping, ractl->file, start, size,
 				   NETFS_READAHEAD);
 	if (IS_ERR(rreq))
 		return;
 
+	stream = &rreq->io_streams[0];
+
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags);
 
 	ret = netfs_begin_cache_read(rreq, ictx);
@@ -446,6 +462,8 @@ void netfs_readahead(struct readahead_control *ractl)
 	rreq->submitted = rreq->start + added;
 	rreq->cleaned_to = rreq->start;
 	rreq->front_folio_order = get_order(rreq->load_cursor.bvecq->bv[0].bv_len);
+	stream->issue_from = rreq->start;
+	stream->buffered = added;
 
 	netfs_read_to_pagecache(rreq);
 	netfs_maybe_bulk_drop_ra_refs(rreq);
@@ -461,6 +479,7 @@ EXPORT_SYMBOL(netfs_readahead);
  */
 static int netfs_create_singular_buffer(struct netfs_io_request *rreq, struct folio *folio)
 {
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
 	struct bvecq *bq;
 	size_t fsize = folio_size(folio);
 
@@ -470,6 +489,8 @@ static int netfs_create_singular_buffer(struct netfs_io_request *rreq, struct fo
 	bq = rreq->load_cursor.bvecq;
 	bvec_set_folio(&bq->bv[bq->nr_slots++], folio, fsize, 0);
 	rreq->submitted = rreq->start + fsize;
+	stream->issue_from = rreq->start;
+	stream->buffered = fsize;
 	return 0;
 }
 
@@ -479,6 +500,7 @@ static int netfs_create_singular_buffer(struct netfs_io_request *rreq, struct fo
 static int netfs_read_gaps(struct file *file, struct folio *folio)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	struct address_space *mapping = folio->mapping;
 	struct netfs_folio *finfo = netfs_folio_info(folio);
 	struct netfs_inode *ctx = netfs_inode(mapping->host);
@@ -499,6 +521,7 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
 		ret = PTR_ERR(rreq);
 		goto alloc_error;
 	}
+	stream = &rreq->io_streams[0];
 
 	ret = netfs_begin_cache_read(rreq, ctx);
 	if (ret == -ENOMEM || ret == -EINTR || ret == -ERESTARTSYS)
@@ -546,6 +569,8 @@ static int netfs_read_gaps(struct file *file, struct folio *folio)
 	}
 
 	rreq->submitted = rreq->start + flen;
+	stream->issue_from = rreq->start;
+	stream->buffered = flen;
 
 	netfs_read_to_pagecache(rreq);
 
@@ -618,6 +643,7 @@ int netfs_read_folio(struct file *file, struct folio *folio)
 		goto discard;
 
 	netfs_read_to_pagecache(rreq);
+
 	ret = netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret < 0 ? ret : 0;
diff --git a/fs/netfs/buffered_write.c b/fs/netfs/buffered_write.c
index bce3e7109ec1..855c14118c53 100644
--- a/fs/netfs/buffered_write.c
+++ b/fs/netfs/buffered_write.c
@@ -114,8 +114,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		.range_start = iocb->ki_pos,
 		.range_end = iocb->ki_pos + iter->count,
 	};
-	struct netfs_io_request *wreq = NULL;
-	struct folio *folio = NULL, *writethrough = NULL;
+	struct netfs_writethrough *writethrough = NULL;
+	struct folio *folio = NULL;
 	unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
 	ssize_t written = 0, ret, ret2;
 	loff_t pos = iocb->ki_pos;
@@ -132,15 +132,13 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 			goto out;
 		}
 
-		wreq = netfs_begin_writethrough(iocb, iter->count);
-		if (IS_ERR(wreq)) {
+		writethrough = netfs_begin_writethrough(iocb, iter->count);
+		if (IS_ERR(writethrough)) {
 			wbc_detach_inode(&wbc);
-			ret = PTR_ERR(wreq);
-			wreq = NULL;
+			ret = PTR_ERR(writethrough);
+			writethrough = NULL;
 			goto out;
 		}
-		if (!is_sync_kiocb(iocb))
-			wreq->iocb = iocb;
 		netfs_stat(&netfs_n_wh_writethrough);
 	} else {
 		netfs_stat(&netfs_n_wh_buffered_write);
@@ -264,7 +262,7 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		 * a file that's open for reading as ->read_folio() then has to
 		 * be able to flush it.
 		 */
-		if ((file->f_mode & FMODE_READ) ||
+		if (//(file->f_mode & FMODE_READ) ||
 		    netfs_is_cache_enabled(ctx)) {
 			if (finfo) {
 				netfs_stat(&netfs_n_wh_wstream_conflict);
@@ -355,13 +353,12 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 		pos += copied;
 		written += copied;
 
-		if (likely(!wreq)) {
+		if (likely(!writethrough)) {
 			folio_mark_dirty(folio);
 			folio_unlock(folio);
 		} else {
-			netfs_advance_writethrough(wreq, &wbc, folio, copied,
-						   offset + copied == flen,
-						   &writethrough);
+			netfs_advance_writethrough(writethrough, &wbc, folio, copied,
						   offset + copied == flen);
 			/* Folio unlocked */
 		}
 	retry:
@@ -385,8 +382,8 @@ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
 			ctx->ops->post_modify(inode);
 	}
 
-	if (unlikely(wreq)) {
-		ret2 = netfs_end_writethrough(wreq, &wbc, writethrough);
+	if (unlikely(writethrough)) {
+		ret2 = netfs_end_writethrough(writethrough, &wbc);
 		wbc_detach_inode(&wbc);
 		if (ret2 == -EIOCBQUEUED)
 			return ret2;
diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c
index 05d09ba3d0d0..872e44227368 100644
--- a/fs/netfs/direct_read.c
+++ b/fs/netfs/direct_read.c
@@ -16,6 +16,28 @@
 #include
 #include "internal.h"
 
+int netfs_prepare_unbuffered_read_buffer(struct netfs_io_subrequest *subreq,
+					 unsigned int max_segs)
+{
+	struct netfs_io_request *rreq = subreq->rreq;
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len = bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len = len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	stream->buffered -= subreq->len;
+	stream->issue_from += subreq->len;
+	return 0;
+}
+
 /*
  * Perform a read to a buffer from the server, slicing up the region to be read
  * according to the network rsize.
  */
@@ -23,11 +45,9 @@
 static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 {
 	struct netfs_io_stream *stream = &rreq->io_streams[0];
-	unsigned long long start = rreq->start;
-	ssize_t size = rreq->len;
 	int ret = 0;
 
-	bvecq_pos_set(&rreq->dispatch_cursor, &rreq->load_cursor);
+	bvecq_pos_transfer(&stream->dispatch_cursor, &rreq->load_cursor);
 
 	do {
 		struct netfs_io_subrequest *subreq;
@@ -39,66 +59,36 @@ static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq)
 		}
 
 		subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
-		subreq->start = start;
-		subreq->len = size;
-
-		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-
-		spin_lock(&rreq->lock);
-		list_add_tail(&subreq->rreq_link, &stream->subrequests);
-		if (list_is_first(&subreq->rreq_link, &stream->subrequests)) {
-			if (!stream->active) {
-				stream->collected_to = subreq->start;
-				/* Store list pointers before active flag */
-				smp_store_release(&stream->active, true);
-			}
-		}
-		trace_netfs_sreq(subreq, netfs_sreq_trace_added);
-		spin_unlock(&rreq->lock);
+		subreq->start = stream->issue_from;
+		subreq->len = stream->buffered;
 
 		netfs_stat(&netfs_n_rh_download);
-		if (rreq->netfs_ops->prepare_read) {
-			ret = rreq->netfs_ops->prepare_read(subreq);
-			if (ret < 0) {
-				netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
-				break;
-			}
-		}
 
-		bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-		bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
-		subreq->len = bvecq_slice(&rreq->dispatch_cursor,
-					  umin(size, stream->sreq_max_len),
-					  stream->sreq_max_segs,
-					  &subreq->nr_segs);
-
-		size -= subreq->len;
-		start += subreq->len;
-		rreq->submitted += subreq->len;
-		if (size <= 0) {
-			smp_wmb(); /* Write lists before ALL_QUEUED. */
-			set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
+		ret = rreq->netfs_ops->issue_read(subreq);
+		if (ret != 0 && ret != -EIOCBQUEUED) {
+			subreq->error = ret;
+			trace_netfs_sreq(subreq, netfs_sreq_trace_cancel);
+			/* Not queued - release both refs. */
+			netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
+			netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
+			break;
 		}
 
-		iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-				    subreq->content.slot, subreq->content.offset, subreq->len);
-
-		rreq->netfs_ops->issue_read(subreq);
-
+		ret = 0;
 		if (test_bit(NETFS_RREQ_PAUSE, &rreq->flags))
 			netfs_wait_for_paused_read(rreq);
 		if (test_bit(NETFS_RREQ_FAILED, &rreq->flags))
 			break;
 		cond_resched();
-	} while (size > 0);
+	} while (stream->buffered > 0);
 
-	if (unlikely(size > 0)) {
+	if (unlikely(stream->buffered > 0)) {
 		smp_wmb(); /* Write lists before ALL_QUEUED. */
 		set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
 		netfs_wake_collector(rreq);
 	}
 
-	bvecq_pos_unset(&rreq->dispatch_cursor);
+	bvecq_pos_unset(&stream->dispatch_cursor);
 	return ret;
 }
 
@@ -154,6 +144,7 @@ static ssize_t netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync)
 ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct netfs_io_request *rreq;
+	struct netfs_io_stream *stream;
 	ssize_t ret;
 	size_t orig_count = iov_iter_count(iter);
 	bool sync = is_sync_kiocb(iocb);
@@ -178,6 +169,8 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
 	netfs_stat(&netfs_n_rh_dio_read);
 	trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_dio_read);
 
+	stream = &rreq->io_streams[0];
+
 	/* If this is an async op, we have to keep track of the destination
 	 * buffer for ourselves as the caller's iterator will be trashed when
 	 * we return.
	 */
@@ -192,6 +185,10 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *i
 	if (ret < 0)
 		goto error_put;
 
+	rreq->len = ret;
+	stream->buffered = ret;
+	stream->issue_from = rreq->start;
+
 	// TODO: Set up bounce buffer if needed
 
 	if (!sync) {
diff --git a/fs/netfs/direct_write.c b/fs/netfs/direct_write.c
index a61c6d6fd17f..b04b16d35c38 100644
--- a/fs/netfs/direct_write.c
+++ b/fs/netfs/direct_write.c
@@ -9,6 +9,32 @@
 #include
 #include "internal.h"
 
+/*
+ * Prepare the buffer for an unbuffered/DIO write.
+ */
+int netfs_prepare_unbuffered_write_buffer(struct netfs_io_subrequest *subreq,
+					  unsigned int max_segs)
+{
+	struct netfs_io_stream *stream = &subreq->rreq->io_streams[subreq->stream_nr];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len = bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len = len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	// TODO: Wait here for completion of prev subreq
+
+	stream->issue_from += subreq->len;
+	stream->buffered -= subreq->len;
+	return 0;
+}
+
 /*
  * Perform the cleanup rituals after an unbuffered write is complete.
  */
@@ -74,9 +100,9 @@ static void netfs_unbuffered_write_collect(struct netfs_io_request *wreq,
 
 	wreq->transferred += subreq->transferred;
 	if (subreq->transferred < subreq->len) {
-		bvecq_pos_unset(&wreq->dispatch_cursor);
-		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
-		bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+		bvecq_pos_unset(&stream->dispatch_cursor);
+		bvecq_pos_transfer(&stream->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_advance(&stream->dispatch_cursor, subreq->transferred);
 	}
 
 	stream->collected_to = subreq->start + subreq->transferred;
@@ -85,6 +111,7 @@ static void netfs_unbuffered_write_collect(struct netfs_io_request *wreq,
 
 	trace_netfs_collect_stream(wreq, stream);
 	trace_netfs_collect_state(wreq, wreq->collected_to, 0);
+	/* TODO: Progressively clean up wreq->direct_bq */
 }
 
 /*
@@ -103,60 +130,60 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 
 	_enter("%llx", wreq->len);
 
-	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	stream->issue_from = wreq->start;
+	stream->buffered = wreq->len;
+	bvecq_pos_set(&stream->dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &stream->dispatch_cursor);
 
 	if (wreq->origin == NETFS_DIO_WRITE)
 		inode_dio_begin(wreq->inode);
 
-	stream->collected_to = wreq->start;
-
 	for (;;) {
 		bool retry = false;
 
 		if (!subreq) {
-			netfs_prepare_write(wreq, stream, wreq->start + wreq->transferred);
-			subreq = stream->construct;
-			stream->construct = NULL;
-		} else {
-			bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor);
+			subreq = netfs_alloc_write_subreq(wreq, stream);
+			if (!subreq)
+				return -ENOMEM;
 		}
 
-		/* Check if (re-)preparation failed. */
-		if (unlikely(test_bit(NETFS_SREQ_FAILED, &subreq->flags))) {
-			netfs_write_subrequest_terminated(subreq, subreq->error);
-			wreq->error = subreq->error;
+		ret = stream->issue_write(subreq);
+		switch (ret) {
+		case 0:
+			/* Already completed synchronously. */
 			break;
-		}
-
-		subreq->len = bvecq_slice(&wreq->dispatch_cursor, stream->sreq_max_len,
-					  stream->sreq_max_segs, &subreq->nr_segs);
-		bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-
-		iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE,
-				    subreq->content.bvecq, subreq->content.slot,
-				    subreq->content.offset,
-				    subreq->len);
-
-		if (!iov_iter_count(&subreq->io_iter))
+		case -EIOCBQUEUED:
+			/* Async, need to wait. */
+			ret = netfs_wait_for_in_progress_subreq(wreq, subreq);
+			if (ret < 0) {
+				if (ret == -EAGAIN) {
+					retry = true;
+					break;
+				}
+
+				list_del_init(&subreq->rreq_link);
+				ret = subreq->error;
+				netfs_put_subrequest(subreq, netfs_sreq_trace_put_failed);
+				subreq = NULL;
+				goto failed;
+			}
 			break;
-
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-		stream->issue_write(subreq);
-
-		/* Async, need to wait. */
-		netfs_wait_for_in_progress_stream(wreq, stream);
-
-		if (test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
+		case -EAGAIN:
+			/* Need to retry. */
+			__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
 			retry = true;
-		} else if (test_bit(NETFS_SREQ_FAILED, &subreq->flags)) {
-			ret = subreq->error;
+			break;
+		default:
+			/* Probably failed before dispatch. */
+			subreq->error = ret;
 			wreq->error = ret;
-			netfs_see_subrequest(subreq, netfs_sreq_trace_see_failed);
+			__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+			trace_netfs_sreq(subreq, netfs_sreq_trace_cancel);
+			list_del_init(&subreq->rreq_link);
+			netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
 			subreq = NULL;
-			break;
+			goto failed;
 		}
-		ret = 0;
 
 		if (!retry) {
 			netfs_unbuffered_write_collect(wreq, stream, subreq);
@@ -171,20 +198,21 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 			continue;
 		}
 
-		/* We need to retry the last subrequest, so first reset the
-		 * iterator, taking into account what, if anything, we managed
-		 * to transfer.
+		/* We need to retry the last subrequest, so first wind back the
+		 * buffer position.
 		 */
 		subreq->error = -EAGAIN;
 		trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 
 		bvecq_pos_unset(&subreq->content);
-		bvecq_pos_unset(&wreq->dispatch_cursor);
-		bvecq_pos_transfer(&wreq->dispatch_cursor, &subreq->dispatch_pos);
+		bvecq_pos_unset(&stream->dispatch_cursor);
+		bvecq_pos_transfer(&stream->dispatch_cursor, &subreq->dispatch_pos);
 
 		if (subreq->transferred > 0) {
-			wreq->transferred += subreq->transferred;
-			bvecq_pos_advance(&wreq->dispatch_cursor, subreq->transferred);
+			wreq->transferred += subreq->transferred;
+			stream->issue_from -= subreq->len - subreq->transferred;
+			stream->buffered += subreq->len - subreq->transferred;
+			bvecq_pos_advance(&stream->dispatch_cursor, subreq->transferred);
 		}
 
 		if (stream->source == NETFS_UPLOAD_TO_SERVER &&
@@ -192,25 +220,21 @@ static int netfs_unbuffered_write(struct netfs_io_request *wreq)
 			wreq->netfs_ops->retry_request(wreq, stream);
 
 		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-		__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
 		__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
-		subreq->start = wreq->start + wreq->transferred;
-		subreq->len = wreq->len - wreq->transferred;
+		__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+		subreq->start = stream->issue_from;
+		subreq->len = stream->buffered;
 		subreq->transferred = 0;
 		subreq->retry_count += 1;
-		stream->sreq_max_len = UINT_MAX;
-		stream->sreq_max_segs = INT_MAX;
 
 		netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
 
-		if (stream->prepare_write)
-			stream->prepare_write(subreq);
 		__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
 		netfs_stat(&netfs_n_wh_retry_write_subreq);
 	}
 
-	bvecq_pos_unset(&wreq->dispatch_cursor);
-	bvecq_pos_unset(&wreq->load_cursor);
+failed:
+	bvecq_pos_unset(&stream->dispatch_cursor);
 	netfs_unbuffered_write_done(wreq);
 	_leave(" = %d", ret);
 	return ret;
@@ -254,6 +278,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *
 	if (IS_ERR(wreq))
 		return PTR_ERR(wreq);
 
+	wreq->len = iov_iter_count(iter);
 	wreq->io_streams[0].avail = true;
 	trace_netfs_write(wreq, (iocb->ki_flags & IOCB_DIRECT ?
 				 netfs_write_trace_dio_write :
@@ -264,9 +289,7 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *
 		 * we have to save the source buffer as the iterator is only
 		 * good until we return.  In such a case, extract an iterator
 		 * to represent as much of the the output buffer as we can
-		 * manage.  Note that the extraction might not be able to
-		 * allocate a sufficiently large bvec array and may shorten the
-		 * request.
+		 * manage.  Note that the extraction may shorten the request.
 		 */
 		ssize_t n = netfs_extract_iter(iter, len, INT_MAX, iocb->ki_pos,
					       &wreq->load_cursor.bvecq, 0);
@@ -281,8 +304,6 @@ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *
 			wreq->load_cursor.bvecq->max_slots);
 	}
 
-	__set_bit(NETFS_RREQ_USE_IO_ITER, &wreq->flags);
-
 	/* Copy the data into the bounce buffer and encrypt it. */
 	// TODO
 
diff --git a/fs/netfs/fscache_io.c b/fs/netfs/fscache_io.c
index 37f05b4d3469..70b10ac23a27 100644
--- a/fs/netfs/fscache_io.c
+++ b/fs/netfs/fscache_io.c
@@ -239,10 +239,6 @@ void __fscache_write_to_cache(struct fscache_cookie *cookie,
			     fscache_access_io_write) < 0)
 		goto abandon_free;
 
-	ret = cres->ops->prepare_write(cres, &start, &len, len, i_size, false);
-	if (ret < 0)
-		goto abandon_end;
-
 	/* TODO: Consider clearing page bits now for space the write isn't
	 * covering.  This is more complicated than it appears when THPs are
	 * taken into account.
@@ -252,8 +248,6 @@ void __fscache_write_to_cache(struct fscache_cookie *cookie,
 	fscache_write(cres, start, &iter, fscache_wreq_done, wreq);
 	return;
 
-abandon_end:
-	return fscache_wreq_done(wreq, ret);
 abandon_free:
 	kfree(wreq);
 abandon:
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index ddae82f94ce0..ecf7cd5b5ca1 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -34,6 +34,18 @@ int netfs_prefetch_for_write(struct file *file, struct folio *folio,
 void netfs_update_i_size(struct netfs_inode *ctx, struct inode *inode,
			 loff_t pos, size_t copied);
 
+/*
+ * direct_read.c
+ */
+int netfs_prepare_unbuffered_read_buffer(struct netfs_io_subrequest *subreq,
+					 unsigned int max_segs);
+
+/*
+ * direct_write.c
+ */
+int netfs_prepare_unbuffered_write_buffer(struct netfs_io_subrequest *subreq,
+					  unsigned int max_segs);
+
 /*
  * main.c
  */
@@ -70,6 +82,8 @@ struct bvecq *netfs_buffer_make_space(struct netfs_io_request *rreq,
				      enum netfs_bvecq_trace trace);
 void netfs_wake_collector(struct netfs_io_request *rreq);
 void netfs_subreq_clear_in_progress(struct netfs_io_subrequest *subreq);
+int netfs_wait_for_in_progress_subreq(struct netfs_io_request *rreq,
+				      struct netfs_io_subrequest *subreq);
 void netfs_wait_for_in_progress_stream(struct netfs_io_request *rreq,
				       struct netfs_io_stream *stream);
 ssize_t netfs_wait_for_read(struct netfs_io_request *rreq);
@@ -113,16 +127,53 @@ void netfs_cache_read_terminated(void *priv, ssize_t transferred_or_error);
 /*
  * read_pgpriv2.c
  */
+#ifdef CONFIG_NETFS_PGPRIV2
+int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subrequest *subreq,
+				       unsigned int max_segs);
 void netfs_pgpriv2_copy_to_cache(struct netfs_io_request *rreq, struct folio *folio);
 void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq);
 bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *wreq);
+static inline bool netfs_using_pgpriv2(const struct netfs_io_request *rreq)
+{
+	return test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
+}
+#else
+static inline int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subrequest *subreq,
+						     unsigned int max_segs)
+{
+	return -EIO;
+}
+static inline void netfs_pgpriv2_copy_to_cache(struct netfs_io_request *rreq, struct folio *folio)
+{
+}
+static inline void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
+{
+}
+static inline bool netfs_pgpriv2_unlock_copied_folios(struct netfs_io_request *wreq)
+{
+	return true;
+}
+static inline bool netfs_using_pgpriv2(const struct netfs_io_request *rreq)
+{
+	return false;
+}
+#endif
 
 /*
  * read_retry.c
  */
+int netfs_prepare_buffered_read_retry_buffer(struct netfs_io_subrequest *subreq,
+					     unsigned int max_segs);
+int netfs_reset_for_read_retry(struct netfs_io_subrequest *subreq);
 void netfs_retry_reads(struct netfs_io_request *rreq);
 void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq);
 
+/*
+ * read_single.c
+ */
+int netfs_prepare_read_single_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs);
+
 /*
  * stats.c
  */
@@ -194,30 +245,25 @@ void netfs_write_collection_worker(struct work_struct *work);
 /*
  * write_issue.c
  */
+struct netfs_writethrough;
 struct netfs_io_request *netfs_create_write_req(struct address_space *mapping,
						struct file *file,
						loff_t start,
						enum netfs_io_origin origin);
-void netfs_prepare_write(struct netfs_io_request *wreq,
-			 struct netfs_io_stream *stream,
-			 loff_t start);
-void netfs_reissue_write(struct netfs_io_stream *stream,
-			 struct netfs_io_subrequest *subreq);
-void netfs_issue_write(struct netfs_io_request *wreq,
-		       struct netfs_io_stream *stream);
-size_t netfs_advance_write(struct netfs_io_request *wreq,
-			   struct netfs_io_stream *stream,
-			   loff_t start, size_t len, bool to_eof);
-struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len);
-int netfs_advance_writethrough(struct netfs_io_request *wreq, struct writeback_control *wbc,
-			       struct folio *folio, size_t copied, bool to_page_end,
-			       struct folio **writethrough_cache);
-ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_control *wbc,
-			       struct folio *writethrough_cache);
+struct netfs_io_subrequest *netfs_alloc_write_subreq(struct netfs_io_request *wreq,
+						     struct netfs_io_stream *stream);
+struct netfs_writethrough *netfs_begin_writethrough(struct kiocb *iocb, size_t len);
+int netfs_advance_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc,
+			       struct folio *folio, size_t copied, bool to_page_end);
+ssize_t netfs_end_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc);
 
 /*
  * write_retry.c
  */
+int netfs_prepare_write_retry_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs);
 void netfs_retry_writes(struct netfs_io_request *wreq);
 
 /*
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 7969c0b1f9a9..69164e8b8e57 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -102,14 +102,14 @@ ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_se
 	}
 
 	if (got == 0) {
-		pr_err("extract_pages gave nothing from %zu, %zu\n",
+		pr_err("extract_pages gave nothing from %zx, %zx\n",
		       extracted, orig_len);
		ret = -EIO;
		goto out;
	}
 
-	if (got > orig_len - extracted) {
-		pr_err("extract_pages rc=%zd more than %zu\n",
+	if (got > orig_len) {
+		pr_err("extract_pages rc=%zx more than %zx\n",
		       got, orig_len);
		goto out;
	}
diff --git a/fs/netfs/misc.c b/fs/netfs/misc.c
index a19724389147..796dc227c2b2 100644
--- a/fs/netfs/misc.c
+++ b/fs/netfs/misc.c
@@ -232,6 +232,37 @@ void netfs_subreq_clear_in_progress(struct netfs_io_subrequest *subreq)
 	netfs_wake_collector(rreq);
 }
 
+/*
+ * Wait for a subrequest to come to completion.
+ */
+int netfs_wait_for_in_progress_subreq(struct netfs_io_request *rreq,
+				      struct netfs_io_subrequest *subreq)
+{
+	if (netfs_check_subreq_in_progress(subreq)) {
+		DEFINE_WAIT(myself);
+
+		trace_netfs_rreq(rreq, netfs_rreq_trace_wait_quiesce);
+		for (;;) {
+			prepare_to_wait(&rreq->waitq, &myself, TASK_UNINTERRUPTIBLE);
+
+			if (!netfs_check_subreq_in_progress(subreq))
+				break;
+
+			trace_netfs_sreq(subreq, netfs_sreq_trace_wait_for);
+			schedule();
+		}
+
+		trace_netfs_rreq(rreq, netfs_rreq_trace_waited_quiesce);
+		finish_wait(&rreq->waitq, &myself);
+	}
+
+	if (test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
+		return -EAGAIN;
+	if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
+		return subreq->error;
+	return 0;
+}
+
 /*
  * Wait for all outstanding I/O in a stream to quiesce.
 */
@@ -361,7 +392,7 @@ static ssize_t netfs_wait_for_in_progress(struct netfs_io_request *rreq,
 	case NETFS_UNBUFFERED_WRITE:
 		break;
 	default:
-		if (rreq->submitted < rreq->len) {
+		if (rreq->transferred < rreq->len) {
 			trace_netfs_failure(rreq, NULL, ret, netfs_fail_short_read);
 			ret = -EIO;
 		}
diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c
index eff431cd7d6a..3db79943762d 100644
--- a/fs/netfs/objects.c
+++ b/fs/netfs/objects.c
@@ -46,8 +46,6 @@ struct netfs_io_request *netfs_alloc_request(struct address_space *mapping,
 	rreq->i_size = i_size_read(inode);
 	rreq->debug_id = atomic_inc_return(&debug_ids);
 	rreq->wsize = INT_MAX;
-	rreq->io_streams[0].sreq_max_len = ULONG_MAX;
-	rreq->io_streams[0].sreq_max_segs = 0;
 	spin_lock_init(&rreq->lock);
 	INIT_LIST_HEAD(&rreq->io_streams[0].subrequests);
 	INIT_LIST_HEAD(&rreq->io_streams[1].subrequests);
@@ -134,8 +132,10 @@ static void netfs_deinit_request(struct netfs_io_request *rreq)
 	if (rreq->cache_resources.ops)
 		rreq->cache_resources.ops->end_operation(&rreq->cache_resources);
 	bvecq_pos_unset(&rreq->load_cursor);
-	bvecq_pos_unset(&rreq->dispatch_cursor);
 	bvecq_pos_unset(&rreq->collect_cursor);
+	bvecq_pos_unset(&rreq->retry_cursor);
+	for (int i = 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&rreq->io_streams[i].dispatch_cursor);
 
 	if (atomic_dec_and_test(&ictx->io_count))
 		wake_up_var(&ictx->io_count);
@@ -226,6 +226,7 @@ static void netfs_free_subrequest(struct netfs_io_subrequest *subreq)
 	struct netfs_io_request *rreq = subreq->rreq;
 
 	trace_netfs_sreq(subreq, netfs_sreq_trace_free);
+	WARN_ON_ONCE(!list_empty(&subreq->rreq_link));
 	if (rreq->netfs_ops->free_subrequest)
 		rreq->netfs_ops->free_subrequest(subreq);
 	bvecq_pos_unset(&subreq->dispatch_pos);
diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index 6d49f9a6b1f0..fbb0425ecb89 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -36,6 +36,7 @@ static void netfs_clear_unread(struct netfs_io_subrequest *subreq)
 
 	if (subreq->start + subreq->transferred >= subreq->rreq->i_size)
 		__set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
+	trace_netfs_rreq(subreq->rreq, netfs_rreq_trace_zero_unread);
 }
 
 /*
@@ -58,7 +59,7 @@ static void netfs_unlock_read_folio(struct netfs_io_request *rreq,
 	flush_dcache_folio(folio);
 	folio_mark_uptodate(folio);
 
-	if (!test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags)) {
+	if (!netfs_using_pgpriv2(rreq)) {
 		finfo = netfs_folio_info(folio);
 		if (finfo) {
 			trace_netfs_folio(folio, netfs_folio_trace_filled_gaps);
@@ -264,8 +265,7 @@ static void netfs_collect_read_results(struct netfs_io_request *rreq)
 				transferred = front->len;
 				trace_netfs_rreq(rreq, netfs_rreq_trace_set_abandon);
 			}
-			if (front->start + transferred >= rreq->cleaned_to + fsize ||
-			    test_bit(NETFS_SREQ_HIT_EOF, &front->flags))
+			if (front->start + transferred >= rreq->cleaned_to + fsize)
 				netfs_read_unlock_folios(rreq, &notes);
 		} else {
 			stream->collected_to = front->start + transferred;
@@ -381,31 +381,6 @@ static void netfs_rreq_assess_dio(struct netfs_io_request *rreq)
 	inode_dio_end(rreq->inode);
 }
 
-/*
- * Do processing after reading a monolithic single object.
- */
-static void netfs_rreq_assess_single(struct netfs_io_request *rreq)
-{
-	struct netfs_io_stream *stream = &rreq->io_streams[0];
-
-	if (!rreq->error && stream->source == NETFS_DOWNLOAD_FROM_SERVER &&
-	    fscache_resources_valid(&rreq->cache_resources)) {
-		trace_netfs_rreq(rreq, netfs_rreq_trace_dirty);
-		netfs_single_mark_inode_dirty(rreq->inode);
-	}
-
-	if (rreq->iocb) {
-		rreq->iocb->ki_pos += rreq->transferred;
-		if (rreq->iocb->ki_complete) {
-			trace_netfs_rreq(rreq, netfs_rreq_trace_ki_complete);
-			rreq->iocb->ki_complete(
-				rreq->iocb, rreq->error ? rreq->error : rreq->transferred);
-		}
-	}
-	if (rreq->netfs_ops->done)
-		rreq->netfs_ops->done(rreq);
-}
-
 /*
  * Perform the collection of subrequests and folios.
 */
@@ -441,7 +416,7 @@ bool netfs_read_collection(struct netfs_io_request *rreq)
 		netfs_rreq_assess_dio(rreq);
 		break;
 	case NETFS_READ_SINGLE:
-		netfs_rreq_assess_single(rreq);
+		WARN_ON_ONCE(1);
 		break;
 	default:
 		break;
diff --git a/fs/netfs/read_pgpriv2.c b/fs/netfs/read_pgpriv2.c
index fb783318318e..5f4d1a21afc5 100644
--- a/fs/netfs/read_pgpriv2.c
+++ b/fs/netfs/read_pgpriv2.c
@@ -13,8 +13,39 @@
 #include
 #include "internal.h"
 
+int netfs_prepare_pgpriv2_write_buffer(struct netfs_io_subrequest *subreq,
+				       unsigned int max_segs)
+{
+	struct netfs_io_request *creq = subreq->rreq;
+	struct netfs_io_stream *stream = &creq->io_streams[1];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &stream->dispatch_cursor);
+	len = bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len = len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	// TODO: Wait here for completion of prev subreq
+
+	stream->issue_from += subreq->len;
+	stream->buffered -= subreq->len;
+	if (stream->buffered == 0) {
+		smp_wmb(); /* Write lists before ALL_QUEUED. */
+		set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	}
+	return 0;
+}
+
 /*
- * [DEPRECATED] Copy a folio to the cache with PG_private_2 set.
+ * [DEPRECATED] Copy a folio to the cache with PG_private_2 set.  Note that the
+ * folio won't necessarily be contiguous with the previous one as there might
+ * be a mixture of folios read from the cache and downloaded from the server
+ * (or just zeroed).
 */
static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio *folio)
{
@@ -24,7 +55,6 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
 	size_t dio_size = PAGE_SIZE;
 	size_t fsize = folio_size(folio), flen = fsize;
 	loff_t fpos = folio_pos(folio), i_size;
-	bool to_eof = false;
 
 	_enter("");
 
@@ -44,12 +74,8 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
 	if (fpos + fsize > creq->i_size)
 		creq->i_size = i_size;
 
-	if (flen > i_size - fpos) {
+	if (flen > i_size - fpos)
 		flen = i_size - fpos;
-		to_eof = true;
-	} else if (flen == i_size - fpos) {
-		to_eof = true;
-	}
 
 	flen = round_up(flen, dio_size);
 
@@ -57,7 +83,6 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
 
 	trace_netfs_folio(folio, netfs_folio_trace_store_copy);
 
-	/* Institute a new bvec queue segment if the current one is full or if
 	 * we encounter a discontiguity.  The discontiguity break is important
 	 * when it comes to bulk unlocking folios by file range.
@@ -79,40 +104,13 @@ static void netfs_pgpriv2_copy_folio(struct netfs_io_request *creq, struct folio
 	/* Attach the folio to the rolling buffer. */
 	slot = queue->nr_slots;
 	bvec_set_folio(&queue->bv[slot], folio, fsize, 0);
-	/* Order incrementing the slot counter after the slot is filled. */
-	smp_store_release(&queue->nr_slots, slot + 1);
+	queue->nr_slots = slot + 1;
 	creq->load_cursor.slot = slot + 1;
 	creq->load_cursor.offset = 0;
 	trace_netfs_bv_slot(queue, slot);
+	trace_netfs_wback(creq, folio, 0);
 
-	cache->submit_off = 0;
-	cache->submit_len = flen;
-
-	/* Attach the folio to one or more subrequests.  For a big folio, we
-	 * could end up with thousands of subrequests if the wsize is small -
-	 * but we might need to wait during the creation of subrequests for
-	 * network resources (eg. SMB credits).
-	 */
-	do {
-		ssize_t part;
-
-		creq->dispatch_cursor.offset = cache->submit_off;
-
-		atomic64_set(&creq->issued_to, fpos + cache->submit_off);
-		part = netfs_advance_write(creq, cache, fpos + cache->submit_off,
-					   cache->submit_len, to_eof);
-		cache->submit_off += part;
-		if (part > cache->submit_len)
-			cache->submit_len = 0;
-		else
-			cache->submit_len -= part;
-	} while (cache->submit_len > 0);
-
-	bvecq_pos_step(&creq->dispatch_cursor);
-	atomic64_set(&creq->issued_to, fpos + fsize);
-
-	if (flen < fsize)
-		netfs_issue_write(creq, cache);
+	cache->buffered += flen;
 }
 
 /*
@@ -122,6 +120,7 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 	struct netfs_io_request *rreq, struct folio *folio)
 {
 	struct netfs_io_request *creq;
+	struct netfs_io_stream *cache;
 
 	if (!fscache_resources_valid(&rreq->cache_resources))
 		goto cancel;
@@ -131,12 +130,15 @@ static struct netfs_io_request *netfs_pgpriv2_begin_copy_to_cache(
 	if (IS_ERR(creq))
 		goto cancel;
 
-	if (!creq->io_streams[1].avail)
+	cache = &creq->io_streams[1];
+	if (!cache->avail)
+		goto cancel_put;
+
+	if (bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL) < 0)
 		goto cancel_put;
 
-	bvecq_buffer_init(&creq->load_cursor, GFP_KERNEL);
-	bvecq_pos_set(&creq->dispatch_cursor, &creq->load_cursor);
-	bvecq_pos_set(&creq->collect_cursor, &creq->dispatch_cursor);
+	bvecq_pos_set(&cache->dispatch_cursor, &creq->load_cursor);
+	bvecq_pos_set(&creq->collect_cursor, &creq->load_cursor);
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &creq->flags);
 	trace_netfs_copy2cache(rreq, creq);
@@ -171,19 +173,43 @@ void netfs_pgpriv2_copy_to_cache(struct netfs_io_request *rreq, struct folio *fo
 	netfs_pgpriv2_copy_folio(creq, folio);
 }
 
+/*
+ * Issue all pending writes on the cache stream.
+ */
+static int netfs_pgpriv2_issue_stream(struct netfs_io_request *wreq,
+				      struct netfs_io_stream *stream)
+{
+	int ret;
+
+	atomic64_set_release(&stream->issued_to, wreq->start);
+
+	do {
+		struct netfs_io_subrequest *subreq;
+
+		subreq = netfs_alloc_write_subreq(wreq, stream);
+		if (!subreq)
+			return -ENOMEM;
+
+		ret = stream->issue_write(subreq);
+		if (ret < 0 && ret != -EIOCBQUEUED)
+			break;
+	} while (stream->buffered > 0);
+
+	return ret;
+}
+
 /*
  * [DEPRECATED] End writing to the cache, flushing out any outstanding writes.
  */
 void netfs_pgpriv2_end_copy_to_cache(struct netfs_io_request *rreq)
 {
 	struct netfs_io_request *creq = rreq->copy_to_cache;
+	struct netfs_io_stream *stream = &creq->io_streams[1];
 
 	if (IS_ERR_OR_NULL(creq))
 		return;
 
-	netfs_issue_write(creq, &creq->io_streams[1]);
-	smp_wmb(); /* Write lists before ALL_QUEUED. */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &creq->flags);
+	netfs_pgpriv2_issue_stream(creq, stream);
 	trace_netfs_rreq(rreq, netfs_rreq_trace_end_copy_to_cache);
 	if (list_empty_careful(&creq->io_streams[1].subrequests))
 		netfs_wake_collector(creq);
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 6f2eb14aac72..b3bc924ffe8e 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -9,19 +9,55 @@
 #include
 #include "internal.h"
 
-static void netfs_reissue_read(struct netfs_io_request *rreq,
-			       struct netfs_io_subrequest *subreq)
+/*
+ * Prepare the I/O buffer on a buffered read subrequest for the filesystem to
+ * use as a bvec queue.
+ */
+int netfs_prepare_buffered_read_retry_buffer(struct netfs_io_subrequest *subreq,
+					     unsigned int max_segs)
 {
+	struct netfs_io_request *rreq = subreq->rreq;
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->retry_cursor);
 	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
-	iov_iter_advance(&subreq->io_iter, subreq->transferred);
+	len = bvecq_slice(&rreq->retry_cursor, subreq->len, max_segs,
+			  &subreq->nr_segs);
+	if (len < subreq->len) {
+		subreq->len = len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+	rreq->retry_buffered -= subreq->len;
+	rreq->retry_start += subreq->len;
+	return 0;
+}
 
-	subreq->error = 0;
+/*
+ * Reset the state of the subrequest and discard any buffering so that we can
+ * retry (where this may include sending it to the server instead of the
+ * cache).
+ */
+int netfs_reset_for_read_retry(struct netfs_io_subrequest *subreq)
+{
+	trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
+
+	if (subreq->retry_count > 3) {
+		trace_netfs_sreq(subreq, netfs_sreq_trace_too_many_retries);
+		return subreq->error;
+	}
+
+	subreq->retry_count++;
 	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+	__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+	__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
-	netfs_stat(&netfs_n_rh_retry_read_subreq);
-	subreq->rreq->netfs_ops->issue_read(subreq);
+	bvecq_pos_unset(&subreq->content);
+	bvecq_pos_unset(&subreq->dispatch_pos);
+	subreq->error = 0;
+	subreq->transferred = 0;
+	netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
+	netfs_stat(&netfs_n_wh_retry_write_subreq);
+	return 0;
 }
 
 /*
@@ -32,8 +68,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 {
 	struct netfs_io_subrequest *subreq;
 	struct netfs_io_stream *stream = &rreq->io_streams[0];
-	struct bvecq_pos dispatch_cursor = {};
 	struct list_head *next;
+	int ret;
 
 	_enter("R=%x", rreq->debug_id);
 
@@ -43,47 +79,19 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 	if (rreq->netfs_ops->retry_request)
 		rreq->netfs_ops->retry_request(rreq, NULL);
 
-	/* If there's no renegotiation to do, just resend each retryable subreq
-	 * up to the first permanently failed one.
-	 */
-	if (!rreq->netfs_ops->prepare_read &&
-	    !rreq->cache_resources.ops) {
-		list_for_each_entry(subreq, &stream->subrequests, rreq_link) {
-			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
-				break;
-			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
-				subreq->retry_count++;
-				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-				netfs_reissue_read(rreq, subreq);
-			}
-		}
-		return;
-	}
-
 	/* Okay, we need to renegotiate all the download requests and flip any
 	 * failed cache reads over to being download requests and negotiate
-	 * those also.  All fully successful subreqs have been removed from the
-	 * list and any spare data from those has been donated.
-	 *
-	 * What we do is decant the list and rebuild it one subreq at a time so
-	 * that we don't end up with donations jumping over a gap we're busy
-	 * populating with smaller subrequests.  In the event that the subreq
-	 * we just launched finishes before we insert the next subreq, it'll
-	 * fill in rreq->prev_donated instead.
-	 *
-	 * Note: Alternatively, we could split the tail subrequest right before
-	 * we reissue it and fix up the donations under lock.
+	 * those also.
 	 */
 	next = stream->subrequests.next;
 
 	do {
 		struct netfs_io_subrequest *from, *to, *tmp;
-		unsigned long long start, len;
-		size_t part;
-		bool boundary = false, subreq_superfluous = false;
+		unsigned long long start;
+		size_t len;
+		bool subreq_superfluous = false;
 
-		bvecq_pos_unset(&dispatch_cursor);
+		bvecq_pos_unset(&rreq->retry_cursor);
 
 		/* Go through the subreqs and find the next span of contiguous
 		 * buffer that we then rejig (cifs, for example, needs the
@@ -98,8 +106,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 		       rreq->debug_id, from->debug_index,
 		       from->start, from->transferred, from->len);
 
-		if (test_bit(NETFS_SREQ_FAILED, &from->flags) ||
-		    !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) {
+		if (!test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) {
 			subreq = from;
 			goto abandon;
 		}
@@ -107,68 +114,53 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 		list_for_each_continue(next, &stream->subrequests) {
 			subreq = list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start + subreq->transferred != start + len ||
-			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
 			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
 			to = subreq;
 			len += to->len;
 		}
 
-		_debug(" - range: %llx-%llx %llx", start, start + len - 1, len);
+		_debug(" - range: %llx-%llx %zx", start, start + len - 1, len);
 
 		/* Determine the set of buffers we're going to use.  Each
-		 * subreq gets a subset of a single overall contiguous buffer.
+		 * subreq takes a subset of a single overall contiguous buffer.
 		 */
-		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
-		bvecq_pos_advance(&dispatch_cursor, from->transferred);
+		bvecq_pos_transfer(&rreq->retry_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&rreq->retry_cursor, from->transferred);
+		rreq->retry_start = start;
+		rreq->retry_buffered = len;
 
 		/* Work through the sublist.
 		 */
 		subreq = from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
-			if (!len) {
+			if (rreq->retry_buffered == 0) {
 				subreq_superfluous = true;
 				break;
 			}
 			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
-			subreq->start = start - subreq->transferred;
-			subreq->len = len + subreq->transferred;
-			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-			__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
-			subreq->retry_count++;
+			subreq->start = rreq->retry_start;
+			subreq->len = rreq->retry_buffered;
 
-			bvecq_pos_unset(&subreq->dispatch_pos);
-			bvecq_pos_unset(&subreq->content);
-
-			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-
-			/* Renegotiate max_len (rsize) */
-			stream->sreq_max_len = subreq->len;
-			stream->sreq_max_segs = INT_MAX;
-			if (rreq->netfs_ops->prepare_read &&
-			    rreq->netfs_ops->prepare_read(subreq) < 0) {
-				trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
+			ret = netfs_reset_for_read_retry(subreq);
+			if (ret < 0) {
 				__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+				rreq->error = ret;
 				goto abandon;
 			}
 
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part = bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len = subreq->transferred + part;
-
-			len -= part;
-			start += part;
-			if (!len) {
-				if (boundary)
-					__set_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
-			} else {
-				__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
+			netfs_stat(&netfs_n_rh_download);
+			ret = rreq->netfs_ops->issue_read(subreq);
+			if (ret < 0 && ret != -EIOCBQUEUED) {
+				if (ret == -ENOMEM)
+					goto abandon;
+				subreq->error = ret;
+				if (ret != -EAGAIN) {
+					__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+					goto abandon_after;
+				}
+				__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+				netfs_read_subreq_terminated(subreq);
 			}
-
-			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			netfs_reissue_read(rreq, subreq);
 			if (subreq == to) {
 				subreq_superfluous = false;
 				break;
@@ -178,7 +170,7 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 		/* If we managed to use fewer subreqs, we can discard the
 		 * excess; if we used the same number, then we're done.
 		 */
-		if (!len) {
+		if (rreq->retry_buffered == 0) {
 			if (!subreq_superfluous)
 				continue;
 			list_for_each_entry_safe_from(subreq, tmp,
@@ -194,7 +186,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 		}
 
 		/* We ran out of subrequests, so we need to allocate some more
-		 * and insert them after.
+		 * and insert them after.  They must start with being marked
+		 * for retry to switch to the retry cursor.
 		 */
 		do {
 			subreq = netfs_alloc_subrequest(rreq);
@@ -203,8 +196,8 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 				goto abandon_after;
 			}
 			subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
-			subreq->start = start;
-			subreq->len = len;
+			subreq->start = rreq->retry_start;
+			subreq->len = rreq->retry_buffered;
 			subreq->stream_nr = stream->stream_nr;
 			subreq->retry_count = 1;
 
@@ -216,37 +209,26 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 				to = list_next_entry(to, rreq_link);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 
-			stream->sreq_max_len = umin(len, rreq->rsize);
-			stream->sreq_max_segs = INT_MAX;
-			netfs_stat(&netfs_n_rh_download);
-			if (rreq->netfs_ops->prepare_read(subreq) < 0) {
-				trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed);
-				__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
-				goto abandon;
+			ret = rreq->netfs_ops->issue_read(subreq);
+			if (ret < 0 && ret != -EIOCBQUEUED) {
+				if (ret == -ENOMEM)
+					goto abandon;
+				subreq->error = ret;
+				if (ret != -EAGAIN) {
+					__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+					goto abandon_after;
+				}
+				__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+				netfs_read_subreq_terminated(subreq);
 			}
 
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part = bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len = subreq->transferred + part;
-
-			len -= part;
-			start += part;
-			if (!len && boundary) {
-				__set_bit(NETFS_SREQ_BOUNDARY, &to->flags);
-				boundary = false;
-			}
-
-			netfs_reissue_read(rreq, subreq);
-		} while (len);
+		} while (rreq->retry_buffered > 0);
 
 	} while (!list_is_head(next, &stream->subrequests));
 
out:
-	bvecq_pos_unset(&dispatch_cursor);
+	bvecq_pos_unset(&rreq->retry_cursor);
 	return;
 
 	/* If we hit an error, fail all remaining incomplete subrequests */
@@ -295,8 +277,6 @@ void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
 	struct bvecq *p;
 
 	for (p = rreq->collect_cursor.bvecq; p; p = p->next) {
-		if (!p->free)
-			continue;
 		for (int slot = 0; slot < p->nr_slots; slot++) {
 			if (!p->bv[slot].bv_page)
 				continue;
@@ -310,6 +290,7 @@ void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
 			}
 			trace_netfs_folio(folio, netfs_folio_trace_abandon);
 			folio_unlock(folio);
+			p->bv[slot].bv_page = NULL;
 		}
 	}
 }
diff --git a/fs/netfs/read_single.c b/fs/netfs/read_single.c
index b386cae77ece..52b9e12a820a 100644
--- a/fs/netfs/read_single.c
+++ b/fs/netfs/read_single.c
@@ -16,6 +16,19 @@
 #include
 #include "internal.h"
 
+int netfs_prepare_read_single_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs)
+{
+	struct netfs_io_request *rreq = subreq->rreq;
+	struct netfs_io_stream *stream = &rreq->io_streams[0];
+
+	bvecq_pos_set(&subreq->dispatch_pos, &rreq->load_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+	stream->issue_from += subreq->len;
+	return 0;
+}
+
 /**
 * netfs_single_mark_inode_dirty - Mark a single, monolithic object inode dirty
 * @inode: The inode to mark
@@ -58,24 +71,12 @@ static int netfs_single_begin_cache_read(struct netfs_io_request *rreq, struct n
 	return fscache_begin_read_operation(&rreq->cache_resources, netfs_i_cookie(ctx));
 }
 
-static void netfs_single_read_cache(struct netfs_io_request *rreq,
-				    struct netfs_io_subrequest *subreq)
-{
-	struct netfs_cache_resources *cres = &rreq->cache_resources;
-
-	_enter("R=%08x[%x]", rreq->debug_id, subreq->debug_index);
-	netfs_stat(&netfs_n_rh_read);
-	cres->ops->read(cres, subreq->start, &subreq->io_iter, NETFS_READ_HOLE_FAIL,
-			netfs_cache_read_terminated, subreq);
-}
-
 /*
 * Perform a read to a buffer from the cache or the server.  Only a single
 * subreq is permitted as the object must be fetched in a single transaction.
 */
static int netfs_single_dispatch_read(struct netfs_io_request *rreq)
{
-	struct netfs_io_stream *stream = &rreq->io_streams[0];
 	struct fscache_occupancy occ = {
 		.query_from	= 0,
 		.query_to	= rreq->len,
@@ -85,76 +86,79 @@ static int netfs_single_dispatch_read(struct netfs_io_request *rreq)
 		.cached_to[1]	= ULLONG_MAX,
 	};
 	struct netfs_io_subrequest *subreq;
-	int ret = 0;
+	int ret;
+
+	ret = netfs_read_query_cache(rreq, &occ);
+	if (ret < 0)
+		return ret;
 
 	subreq = netfs_alloc_subrequest(rreq);
 	if (!subreq)
 		return -ENOMEM;
 
-	subreq->source	= NETFS_DOWNLOAD_FROM_SERVER;
 	subreq->start	= 0;
 	subreq->len	= rreq->len;
 
-	bvecq_pos_set(&subreq->dispatch_pos, &rreq->dispatch_cursor);
-	bvecq_pos_set(&subreq->content, &rreq->dispatch_cursor);
-
-	iov_iter_bvec_queue(&subreq->io_iter, ITER_DEST, subreq->content.bvecq,
-			    subreq->content.slot, subreq->content.offset, subreq->len);
+	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
 
 	/* Try to use the cache if the cache content matches the size of the
 	 * remote file.
 	 */
-	netfs_read_query_cache(rreq, &occ);
 	if (occ.cached_from[0] == 0 &&
-	    occ.cached_to[0] == rreq->len)
-		subreq->source = NETFS_READ_FROM_CACHE;
+	    occ.cached_to[0] == rreq->len) {
+		struct netfs_cache_resources *cres = &rreq->cache_resources;
 
-	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+		subreq->source = NETFS_READ_FROM_CACHE;
+		netfs_stat(&netfs_n_rh_read);
+		ret = cres->ops->issue_read(subreq);
+		if (ret == -EIOCBQUEUED)
+			ret = netfs_wait_for_in_progress_subreq(rreq, subreq);
+		if (ret == -ENOMEM)
+			goto cancel;
+		if (ret == 0)
+			goto success;
+
+		/* Didn't manage to retrieve from the cache, so toss it to the
+		 * server instead.
+		 */
+		if (netfs_reset_for_read_retry(subreq) < 0)
+			goto cancel;
+	}
 
-	spin_lock(&rreq->lock);
-	list_add_tail(&subreq->rreq_link, &stream->subrequests);
-	trace_netfs_sreq(subreq, netfs_sreq_trace_added);
-	/* Store list pointers before active flag */
-	smp_store_release(&stream->active, true);
-	spin_unlock(&rreq->lock);
+	__set_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags);
 
-	switch (subreq->source) {
-	case NETFS_DOWNLOAD_FROM_SERVER:
+	/* Try to send it to the cache.
 	 */
+	for (;;) {
+		subreq->source = NETFS_DOWNLOAD_FROM_SERVER;
 		netfs_stat(&netfs_n_rh_download);
-		if (rreq->netfs_ops->prepare_read) {
-			ret = rreq->netfs_ops->prepare_read(subreq);
-			if (ret < 0)
-				goto cancel;
-		}
-
-		rreq->netfs_ops->issue_read(subreq);
-		rreq->submitted += subreq->len;
-		break;
-	case NETFS_READ_FROM_CACHE:
-		if (rreq->cache_resources.ops->prepare_read) {
-			ret = rreq->cache_resources.ops->prepare_read(subreq);
-			if (ret < 0)
-				goto cancel;
-		}
-
-		trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
-		netfs_single_read_cache(rreq, subreq);
-		rreq->submitted += subreq->len;
-		ret = 0;
-		break;
-	default:
-		pr_warn("Unexpected single-read source %u\n", subreq->source);
-		WARN_ON_ONCE(true);
-		ret = -EIO;
-		break;
+		ret = rreq->netfs_ops->issue_read(subreq);
+		if (ret == -EIOCBQUEUED)
+			ret = netfs_wait_for_in_progress_subreq(rreq, subreq);
+		if (ret == 0)
+			goto success;
+		if (ret == -ENOMEM)
+			goto cancel;
+		if (ret != -EAGAIN)
+			goto failed;
+		if (netfs_reset_for_read_retry(subreq) < 0)
+			goto cancel;
 	}
 
-	smp_wmb(); /* Write lists before ALL_QUEUED.
 */
-	set_bit(NETFS_RREQ_ALL_QUEUED, &rreq->flags);
-	return ret;
+success:
+	rreq->transferred = subreq->transferred;
+	list_del_init(&subreq->rreq_link);
+	netfs_put_subrequest(subreq, netfs_sreq_trace_put_consumed);
+	return 0;
cancel:
+	rreq->error = ret;
+	list_del_init(&subreq->rreq_link);
 	netfs_put_subrequest(subreq, netfs_sreq_trace_put_cancel);
 	return ret;
+failed:
+	rreq->error = ret;
+	list_del_init(&subreq->rreq_link);
+	netfs_put_subrequest(subreq, netfs_sreq_trace_put_failed);
+	return ret;
}
 
 /**
@@ -185,7 +189,7 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
 	if (IS_ERR(rreq))
 		return PTR_ERR(rreq);
 
-	ret = netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->dispatch_cursor.bvecq, 0);
+	ret = netfs_extract_iter(iter, rreq->len, INT_MAX, 0, &rreq->load_cursor.bvecq, 0);
 	if (ret < 0)
 		goto cleanup_free;
 
@@ -196,9 +200,29 @@ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_ite
 	netfs_stat(&netfs_n_rh_read_single);
 	trace_netfs_read(rreq, 0, rreq->len, netfs_read_trace_read_single);
 
-	netfs_single_dispatch_read(rreq);
+	ret = netfs_single_dispatch_read(rreq);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_complete);
+	if (ret == 0) {
+		task_io_account_read(rreq->transferred);
+
+		if (test_bit(NETFS_RREQ_FOLIO_COPY_TO_CACHE, &rreq->flags) &&
+		    fscache_resources_valid(&rreq->cache_resources)) {
+			trace_netfs_rreq(rreq, netfs_rreq_trace_dirty);
+			netfs_single_mark_inode_dirty(rreq->inode);
+		}
+		ret = rreq->transferred;
+	}
+
+	if (rreq->netfs_ops->done)
+		rreq->netfs_ops->done(rreq);
+
+	netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, netfs_rreq_trace_wake_ip);
+	/* As we cleared NETFS_RREQ_IN_PROGRESS, we acquired its ref.
 */
+	netfs_put_request(rreq, netfs_rreq_trace_put_work_ip);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
 
-	ret = netfs_wait_for_read(rreq);
 	netfs_put_request(rreq, netfs_rreq_trace_put_return);
 	return ret;
 
diff --git a/fs/netfs/write_collect.c b/fs/netfs/write_collect.c
index fb8daf50c86d..bfca6d48361f 100644
--- a/fs/netfs/write_collect.c
+++ b/fs/netfs/write_collect.c
@@ -28,8 +28,8 @@ static void netfs_dump_request(const struct netfs_io_request *rreq)
 	       rreq->origin, rreq->error);
 	pr_err("  st=%llx tsl=%zx/%llx/%llx\n",
 	       rreq->start, rreq->transferred, rreq->submitted, rreq->len);
-	pr_err("  cci=%llx/%llx/%llx\n",
-	       rreq->cleaned_to, rreq->collected_to, atomic64_read(&rreq->issued_to));
+	pr_err("  cci=%llx/%llx\n",
+	       rreq->cleaned_to, rreq->collected_to);
 	pr_err("  iw=%pSR\n", rreq->netfs_ops->issue_write);
 	for (int i = 0; i < NR_IO_STREAMS; i++) {
 		const struct netfs_io_subrequest *sreq;
@@ -38,8 +38,9 @@ static void netfs_dump_request(const struct netfs_io_request *rreq)
 		pr_err("  str[%x] s=%x e=%d acnf=%u,%u,%u,%u\n",
 		       s->stream_nr, s->source, s->error,
 		       s->avail, s->active, s->need_retry, s->failed);
-		pr_err("  str[%x] ct=%llx t=%zx\n",
-		       s->stream_nr, s->collected_to, s->transferred);
+		pr_err("  str[%x] it=%llx ct=%llx t=%zx\n",
+		       s->stream_nr, atomic64_read(&s->issued_to),
+		       s->collected_to, s->transferred);
 		list_for_each_entry(sreq, &s->subrequests, rreq_link) {
 			pr_err("  sreq[%x:%x] sc=%u s=%llx t=%zx/%zx r=%d f=%lx\n",
 			       sreq->stream_nr, sreq->debug_index, sreq->source,
@@ -56,7 +57,7 @@ static void netfs_dump_request(const struct netfs_io_request *rreq)
 */
 int netfs_folio_written_back(struct folio *folio)
 {
-	enum netfs_folio_trace why = netfs_folio_trace_clear;
+	enum netfs_folio_trace why = netfs_folio_trace_endwb;
 	struct netfs_inode *ictx = netfs_inode(folio->mapping->host);
 	struct netfs_folio *finfo;
 	struct netfs_group *group = NULL;
@@ -76,13 +77,13 @@ int netfs_folio_written_back(struct folio *folio)
 		group = finfo->netfs_group;
 		gcount++;
 		kfree(finfo);
-		why = netfs_folio_trace_clear_s;
+		why = netfs_folio_trace_endwb_s;
 		goto end_wb;
 	}
 
 	if ((group = netfs_folio_group(folio))) {
 		if (group == NETFS_FOLIO_COPY_TO_CACHE) {
-			why = netfs_folio_trace_clear_cc;
+			why = netfs_folio_trace_endwb_cc;
 			folio_detach_private(folio);
 			goto end_wb;
 		}
@@ -95,7 +96,7 @@ int netfs_folio_written_back(struct folio *folio)
 	if (!folio_test_dirty(folio)) {
 		folio_detach_private(folio);
 		gcount++;
-		why = netfs_folio_trace_clear_g;
+		why = netfs_folio_trace_endwb_g;
 	}
 }
 
@@ -222,9 +223,7 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
 	trace_netfs_rreq(wreq, netfs_rreq_trace_collect);
 
reassess_streams:
-	/* Order reading the issued_to point before reading the queue it refers to. */
-	issued_to = atomic64_read_acquire(&wreq->issued_to);
-	smp_rmb();
+	issued_to = ULLONG_MAX;
 	collected_to = ULLONG_MAX;
 	if (wreq->origin == NETFS_WRITEBACK ||
 	    wreq->origin == NETFS_WRITETHROUGH ||
@@ -239,14 +238,26 @@ static void netfs_collect_write_results(struct netfs_io_request *wreq)
 	 * to the tail whilst we're doing this.
 	 */
 	for (s = 0; s < NR_IO_STREAMS; s++) {
+		unsigned long long s_issued_to;
+
 		stream = &wreq->io_streams[s];
-		/* Read active flag before list pointers */
+		/* Read active flag before issued_to */
 		if (!smp_load_acquire(&stream->active))
			continue;
 
-		front = list_first_entry_or_null(&stream->subrequests,
-						 struct netfs_io_subrequest, rreq_link);
-		while (front) {
+		for (;;) {
+			/* Order reading the issued_to point before reading the
+			 * queue it refers to.
+			 */
+			s_issued_to = atomic64_read_acquire(&stream->issued_to);
+			if (s_issued_to < issued_to)
+				issued_to = s_issued_to;
+
+			front = list_first_entry_or_null(&stream->subrequests,
+							 struct netfs_io_subrequest, rreq_link);
+			if (!front)
+				break;
+
 			trace_netfs_collect_sreq(wreq, front);
 			//_debug("sreq [%x] %llx %zx/%zx",
 			//       front->debug_index, front->start, front->transferred, front->len);
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index d4c4bee4299e..ec84d2bcabeb 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -36,6 +36,39 @@
 #include
 #include "internal.h"
 
+#define NOTE_UPLOAD_AVAIL	0x001	/* Upload is available */
+#define NOTE_CACHE_AVAIL	0x002	/* Local cache is available */
+#define NOTE_CACHE_COPY		0x004	/* Copy folio to cache */
+#define NOTE_UPLOAD		0x008	/* Upload folio to server */
+#define NOTE_UPLOAD_STARTED	0x010	/* Upload started */
+#define NOTE_STREAMW		0x020	/* Folio is from a streaming write */
+#define NOTE_DISCONTIG_BEFORE	0x040	/* Folio discontiguous with the previous folio */
+#define NOTE_DISCONTIG_AFTER	0x080	/* Folio discontiguous with the next folio */
+#define NOTE_TO_EOF		0x100	/* Data in folio ends at EOF */
+#define NOTE_FLUSH_ANYWAY	0x200	/* Flush data, even if not hit estimated limit */
+
+#define NOTES__KEEP_MASK	(NOTE_UPLOAD_AVAIL | NOTE_CACHE_AVAIL | NOTE_UPLOAD_STARTED)
+
+struct netfs_wb_params {
+	unsigned long long	last_end;	/* End file pos of previous folio */
+	unsigned long long	folio_start;	/* File pos of folio */
+	unsigned int		folio_len;	/* Length of folio */
+	unsigned int		dirty_offset;	/* Offset of dirty region in folio */
+	unsigned int		dirty_len;	/* Length of dirty region in folio */
+	unsigned int		notes;		/* Notes on applicability */
+	struct bvecq_pos	dispatch_cursor; /* Folio queue anchor for issue_at */
+	struct netfs_write_estimate estimates[2];
+};
+
+struct netfs_writethrough {
+	struct netfs_wb_params	params;
+	struct netfs_io_request	*wreq;
+	struct folio
*in_progress; +}; + +static int netfs_prepare_write_single_buffer(struct netfs_io_subrequest *s= ubreq, + unsigned int max_segs); + /* * Kill all dirty folios in the event of an unrecoverable error, starting = with * a locked folio we've already obtained from writeback_iter(). @@ -115,65 +148,48 @@ struct netfs_io_request *netfs_create_write_req(struc= t address_space *mapping, =20 wreq->io_streams[0].stream_nr =3D 0; wreq->io_streams[0].source =3D NETFS_UPLOAD_TO_SERVER; - wreq->io_streams[0].prepare_write =3D ictx->ops->prepare_write; + wreq->io_streams[0].applicable =3D NOTE_UPLOAD; + wreq->io_streams[0].estimate_write =3D ictx->ops->estimate_write; wreq->io_streams[0].issue_write =3D ictx->ops->issue_write; wreq->io_streams[0].collected_to =3D start; wreq->io_streams[0].transferred =3D 0; =20 wreq->io_streams[1].stream_nr =3D 1; wreq->io_streams[1].source =3D NETFS_WRITE_TO_CACHE; + wreq->io_streams[1].applicable =3D NOTE_CACHE_COPY; wreq->io_streams[1].collected_to =3D start; wreq->io_streams[1].transferred =3D 0; if (fscache_resources_valid(&wreq->cache_resources)) { wreq->io_streams[1].avail =3D true; wreq->io_streams[1].active =3D true; - wreq->io_streams[1].prepare_write =3D wreq->cache_resources.ops->prepare= _write_subreq; + wreq->io_streams[1].estimate_write =3D wreq->cache_resources.ops->estima= te_write; wreq->io_streams[1].issue_write =3D wreq->cache_resources.ops->issue_wri= te; } =20 return wreq; } =20 -/** - * netfs_prepare_write_failed - Note write preparation failed - * @subreq: The subrequest to mark - * - * Mark a subrequest to note that preparation for write failed. - */ -void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq) -{ - __set_bit(NETFS_SREQ_FAILED, &subreq->flags); - trace_netfs_sreq(subreq, netfs_sreq_trace_prep_failed); -} -EXPORT_SYMBOL(netfs_prepare_write_failed); - /* - * Prepare a write subrequest. We need to allocate a new subrequest - * if we don't have one. + * Allocate and prepare a write subrequest. 
*/ -void netfs_prepare_write(struct netfs_io_request *wreq, - struct netfs_io_stream *stream, - loff_t start) +struct netfs_io_subrequest *netfs_alloc_write_subreq(struct netfs_io_reque= st *wreq, + struct netfs_io_stream *stream) { struct netfs_io_subrequest *subreq; =20 subreq =3D netfs_alloc_subrequest(wreq); subreq->source =3D stream->source; - subreq->start =3D start; + subreq->start =3D stream->issue_from; + subreq->len =3D stream->buffered; subreq->stream_nr =3D stream->stream_nr; =20 - bvecq_pos_set(&subreq->dispatch_pos, &wreq->dispatch_cursor); - _enter("R=3D%x[%x]", wreq->debug_id, subreq->debug_index); =20 trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); =20 - stream->sreq_max_len =3D UINT_MAX; - stream->sreq_max_segs =3D INT_MAX; switch (stream->source) { case NETFS_UPLOAD_TO_SERVER: netfs_stat(&netfs_n_wh_upload); - stream->sreq_max_len =3D wreq->wsize; break; case NETFS_WRITE_TO_CACHE: netfs_stat(&netfs_n_wh_write); @@ -183,9 +199,6 @@ void netfs_prepare_write(struct netfs_io_request *wreq, break; } =20 - if (stream->prepare_write) - stream->prepare_write(subreq); - __set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags); =20 /* We add to the end of the list whilst the collector may be walking @@ -194,84 +207,46 @@ void netfs_prepare_write(struct netfs_io_request *wre= q, */ spin_lock(&wreq->lock); list_add_tail(&subreq->rreq_link, &stream->subrequests); - if (list_is_first(&subreq->rreq_link, &stream->subrequests)) { - if (!stream->active) { - stream->collected_to =3D subreq->start; - /* Write list pointers before active flag */ - smp_store_release(&stream->active, true); - } - } + if (list_is_first(&subreq->rreq_link, &stream->subrequests) && + stream->collected_to =3D=3D 0) + stream->collected_to =3D subreq->start; =20 spin_unlock(&wreq->lock); - - stream->construct =3D subreq; + return subreq; } =20 /* - * Set the I/O iterator for the filesystem/cache to use and dispatch the I= /O - * operation. 
The operation may be asynchronous and should call - * netfs_write_subrequest_terminated() when complete. + * Prepare the buffer for a buffered write. */ -static void netfs_do_issue_write(struct netfs_io_stream *stream, - struct netfs_io_subrequest *subreq) +static int netfs_prepare_buffered_write_buffer(struct netfs_io_subrequest = *subreq, + unsigned int max_segs) { struct netfs_io_request *wreq =3D subreq->rreq; + struct netfs_io_stream *stream =3D &wreq->io_streams[subreq->stream_nr]; + ssize_t len; =20 - _enter("R=3D%x[%x],%zx", wreq->debug_id, subreq->debug_index, subreq->len= ); - - if (test_bit(NETFS_SREQ_FAILED, &subreq->flags)) - return netfs_write_subrequest_terminated(subreq, subreq->error); - - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); - stream->issue_write(subreq); -} - -void netfs_reissue_write(struct netfs_io_stream *stream, - struct netfs_io_subrequest *subreq) -{ - // TODO: Use encrypted buffer - bvecq_pos_set(&subreq->content, &subreq->dispatch_pos); - iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE, - subreq->content.bvecq, subreq->content.slot, - subreq->content.offset, - subreq->len); - iov_iter_advance(&subreq->io_iter, subreq->transferred); - - subreq->retry_count++; - subreq->error =3D 0; - __clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags); - __set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags); - netfs_stat(&netfs_n_wh_retry_write_subreq); - netfs_do_issue_write(stream, subreq); -} - -void netfs_issue_write(struct netfs_io_request *wreq, - struct netfs_io_stream *stream) -{ - struct netfs_io_subrequest *subreq =3D stream->construct; + _enter("%zx,{,%u,%u},%u", + subreq->len, stream->dispatch_cursor.slot, stream->dispatch_cursor= .offset, max_segs); =20 - if (!subreq) - return; + bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor); =20 /* If we have a write to the cache, we need to round out the first and * last entries (only those as the data will be on virtually contiguous * folios) to cache DIO boundaries. 
*/ if (subreq->source =3D=3D NETFS_WRITE_TO_CACHE) { - struct bvecq_pos tmp_pos; struct bio_vec *bv; struct bvecq *bq; size_t dio_size =3D wreq->cache_resources.dio_size; - size_t disp, len; - int ret; + size_t disp, dlen; =20 - bvecq_pos_set(&tmp_pos, &subreq->dispatch_pos); - ret =3D bvecq_extract(&tmp_pos, subreq->len, INT_MAX, &subreq->content.b= vecq); - bvecq_pos_unset(&tmp_pos); - if (ret < 0) { - netfs_write_subrequest_terminated(subreq, -ENOMEM); - return; - } + len =3D bvecq_extract(&stream->dispatch_cursor, subreq->len, max_segs, + &subreq->content.bvecq); + if (len < 0) + return -ENOMEM; + + _debug("extract %zx/%zx", len, subreq->len); + subreq->len =3D len; =20 /* Round the first entry down. */ bq =3D subreq->content.bvecq; @@ -289,96 +264,276 @@ void netfs_issue_write(struct netfs_io_request *wreq, while (bq->next) bq =3D bq->next; bv =3D &bq->bv[bq->nr_slots - 1]; - len =3D round_up(bv->bv_len, dio_size); - if (len > bv->bv_len) { - subreq->len +=3D len - bv->bv_len; - bv->bv_len =3D len; + dlen =3D round_up(bv->bv_len, dio_size); + if (dlen > bv->bv_len) { + subreq->len +=3D dlen - bv->bv_len; + bv->bv_len =3D dlen; } } else { - bvecq_pos_set(&subreq->content, &subreq->dispatch_pos); + bvecq_pos_set(&subreq->content, &stream->dispatch_cursor); + len =3D bvecq_slice(&stream->dispatch_cursor, subreq->len, max_segs, + &subreq->nr_segs); + + if (len < subreq->len) { + subreq->len =3D len; + trace_netfs_sreq(subreq, netfs_sreq_trace_limited); + } } =20 - iov_iter_bvec_queue(&subreq->io_iter, ITER_SOURCE, - subreq->content.bvecq, subreq->content.slot, - subreq->content.offset, - subreq->len); + stream->issue_from +=3D len; + stream->buffered -=3D len; + if (stream->buffered =3D=3D 0) { + stream->buffering =3D false; + bvecq_pos_unset(&stream->dispatch_cursor); + } + /* Order loading the queue before updating the issue_to point */ + atomic64_set_release(&stream->issued_to, stream->issue_from); + return 0; +} + +/** + * netfs_prepare_write_buffer - Get the 
buffer for a subrequest + * @subreq: The subrequest to get the buffer for + * @max_segs: Maximum number of segments in buffer (or INT_MAX) + * + * Extract a slice of buffer from the stream and attach it to the subreque= st as + * a bio_vec queue. The maximum amount of data attached is set by + * @subreq->len, but this may be shortened if @max_segs would be exceeded. + */ +int netfs_prepare_write_buffer(struct netfs_io_subrequest *subreq, + unsigned int max_segs) +{ + struct netfs_io_request *rreq =3D subreq->rreq; + + switch (rreq->origin) { + case NETFS_WRITEBACK: + case NETFS_WRITETHROUGH: + if (test_bit(NETFS_RREQ_RETRYING, &rreq->flags)) + return netfs_prepare_write_retry_buffer(subreq, max_segs); + return netfs_prepare_buffered_write_buffer(subreq, max_segs); + + case NETFS_UNBUFFERED_WRITE: + case NETFS_DIO_WRITE: + return netfs_prepare_unbuffered_write_buffer(subreq, max_segs); =20 - stream->construct =3D NULL; - netfs_do_issue_write(stream, subreq); + case NETFS_WRITEBACK_SINGLE: + return netfs_prepare_write_single_buffer(subreq, max_segs); + + case NETFS_PGPRIV2_COPY_TO_CACHE: + return netfs_prepare_pgpriv2_write_buffer(subreq, max_segs); + + default: + WARN_ON_ONCE(1); + return -EIO; + } } +EXPORT_SYMBOL(netfs_prepare_write_buffer); =20 /* - * Add data to the write subrequest, dispatching each as we fill it up or = if it - * is discontiguous with the previous. We only fill one part at a time so= that - * we can avoid overrunning the credits obtained (cifs) and try to paralle= lise - * content-crypto preparation with network writes. + * Issue writes for a stream. 
*/ -size_t netfs_advance_write(struct netfs_io_request *wreq, - struct netfs_io_stream *stream, - loff_t start, size_t len, bool to_eof) +static int netfs_issue_writes(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_wb_params *params) { - struct netfs_io_subrequest *subreq =3D stream->construct; - size_t part; + struct netfs_write_estimate *estimate =3D ¶ms->estimates[stream->stre= am_nr]; + + for (;;) { + struct netfs_io_subrequest *subreq; + int ret; + + subreq =3D netfs_alloc_write_subreq(wreq, stream); + if (!subreq) + return -ENOMEM; =20 - if (!stream->avail) { - _leave("no write"); - return len; + ret =3D stream->issue_write(subreq); + if (ret < 0 && ret !=3D -EIOCBQUEUED) + return ret; + + if (stream->buffered =3D=3D 0) { + if (stream->stream_nr =3D=3D 0) + params->notes &=3D ~NOTE_UPLOAD_STARTED; + return 0; + } + + if (!(params->notes & NOTE_FLUSH_ANYWAY)) { + estimate->issue_at =3D ULLONG_MAX; + estimate->max_segs =3D INT_MAX; + stream->estimate_write(wreq, stream, estimate); + if (stream->issue_from + stream->buffered < estimate->issue_at && + estimate->max_segs > 0) + return 0; + } + } +} + +/* + * Issue pending writes on a stream. + */ +static int netfs_issue_stream(struct netfs_io_request *wreq, + struct netfs_wb_params *params, int s) +{ + struct netfs_write_estimate *estimate =3D ¶ms->estimates[s]; + struct netfs_io_stream *stream =3D &wreq->io_streams[s]; + unsigned long long dirty_start; + bool discontig_before =3D params->notes & NOTE_DISCONTIG_BEFORE; + int ret; + + _enter("%x", params->notes); + + /* If the current folio doesn't contribute to this stream, see if we + * need to flush it. + */ + if (!(params->notes & stream->applicable)) { + if (!stream->buffering) { + atomic64_set_release(&stream->issued_to, + params->folio_start + params->folio_len); + return 0; + } + discontig_before =3D true; + } + + /* Issue writes if we meet a discontiguity before the current folio. 
+ * Even if the filesystem can do sparse/vectored writes, we still + * generate a subreq per contiguous region rather than generating + * separate extent lists. + */ + if (stream->buffering && discontig_before) { + params->notes |=3D NOTE_FLUSH_ANYWAY; + ret =3D netfs_issue_writes(wreq, stream, params); + if (ret < 0) + return ret; + stream->buffering =3D false; + params->notes &=3D ~NOTE_FLUSH_ANYWAY; + } + + if (!(params->notes & stream->applicable)) { + atomic64_set_release(&stream->issued_to, + params->folio_start + params->folio_len); + return 0; + } + + /* If we're not currently buffering on this stream, we need to get an + * estimate of when we need to issue a write. It might be within the + * starting folio. + */ + dirty_start =3D params->folio_start + params->dirty_offset; + if (!stream->buffering) { + stream->buffering =3D true; + stream->issue_from =3D dirty_start; + bvecq_pos_set(&stream->dispatch_cursor, ¶ms->dispatch_cursor); + estimate->issue_at =3D ULLONG_MAX; + estimate->max_segs =3D INT_MAX; + stream->estimate_write(wreq, stream, estimate); + } + + stream->buffered +=3D params->dirty_len; + estimate->max_segs--; + + /* Poke the filesystem to issue writes when we hit the limit it set or + * if the data ends before the end of the page. + */ + if (params->notes & NOTE_DISCONTIG_AFTER) + params->notes |=3D NOTE_FLUSH_ANYWAY; + _debug("[%u] %llx + %zx >=3D %llx, %u %x", + s, stream->issue_from, stream->buffered, estimate->issue_at, + estimate->max_segs, params->notes); + if (stream->issue_from + stream->buffered >=3D estimate->issue_at || + estimate->max_segs <=3D 0 || + (params->notes & NOTE_FLUSH_ANYWAY)) { + ret =3D netfs_issue_writes(wreq, stream, params); + if (ret < 0) + return ret; } =20 - _enter("R=3D%x[%x]", wreq->debug_id, subreq ? subreq->debug_index : 0); + return 0; +} + +/* + * See which streams need writes issuing and issue them. 
+ */ +static int netfs_issue_streams(struct netfs_io_request *wreq, + struct netfs_wb_params *params) +{ + int ret =3D 0, ret2; + + _enter("%x", params->notes); =20 - if (subreq && start !=3D subreq->start + subreq->len) { - netfs_issue_write(wreq, stream); - subreq =3D NULL; + for (int s =3D 0; s < NR_IO_STREAMS; s++) { + ret2 =3D netfs_issue_stream(wreq, params, s); + if (ret2 < 0) + ret =3D ret2; } + return ret; +} =20 - if (!stream->construct) - netfs_prepare_write(wreq, stream, start); - subreq =3D stream->construct; +/* + * End the issuing of writes, let the collector know we're done. + */ +static void netfs_end_issue_write(struct netfs_io_request *wreq, + struct netfs_wb_params *params) +{ + bool needs_poke =3D true; =20 - part =3D umin(stream->sreq_max_len - subreq->len, len); - _debug("part %zx/%zx %zx/%zx", subreq->len, stream->sreq_max_len, part, l= en); - subreq->len +=3D part; - subreq->nr_segs++; + params->notes |=3D NOTE_FLUSH_ANYWAY; + + for (int s =3D 0; s < NR_IO_STREAMS; s++) { + struct netfs_io_stream *stream =3D &wreq->io_streams[s]; + int ret; + + if (stream->buffering) { + ret =3D netfs_issue_writes(wreq, stream, params); + if (ret < 0) { + /* Leave the error somewhere the completion + * path can pick it up if there isn't already + * another error logged. + */ + cmpxchg(&wreq->error, 0, ret); + } + stream->buffering =3D false; + } + } + + smp_wmb(); /* Write subreq lists before ALL_QUEUED. */ + set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags); + + for (int s =3D 0; s < NR_IO_STREAMS; s++) { + struct netfs_io_stream *stream =3D &wreq->io_streams[s]; =20 - if (subreq->len >=3D stream->sreq_max_len || - subreq->nr_segs >=3D stream->sreq_max_segs || - to_eof) { - netfs_issue_write(wreq, stream); - subreq =3D NULL; + if (!stream->active) + continue; + if (!list_empty(&stream->subrequests)) + needs_poke =3D false; } =20 - return part; + if (needs_poke) + netfs_wake_collector(wreq); } =20 /* - * Write some of a pending folio data back to the server. 
+ * Queue a folio for writeback. */ -static int netfs_write_folio(struct netfs_io_request *wreq, - struct writeback_control *wbc, - struct folio *folio) +static int netfs_queue_wb_folio(struct netfs_io_request *wreq, + struct writeback_control *wbc, + struct folio *folio, + struct netfs_wb_params *params) { - struct netfs_io_stream *upload =3D &wreq->io_streams[0]; - struct netfs_io_stream *cache =3D &wreq->io_streams[1]; - struct netfs_io_stream *stream; struct netfs_group *fgroup; /* TODO: Use this with ceph */ struct netfs_folio *finfo; struct bvecq *queue =3D wreq->load_cursor.bvecq; unsigned int slot; size_t fsize =3D folio_size(folio), flen =3D fsize, foff =3D 0; loff_t fpos =3D folio_pos(folio), i_size; - bool to_eof =3D false, streamw =3D false; - bool debug =3D false; int ret; =20 - _enter(""); + _enter("%x", params->notes); =20 /* Institute a new bvec queue segment if the current one is full or if * we encounter a discontiguity. The discontiguity break is important * when it comes to bulk unlocking folios by file range. 
*/ if (bvecq_is_full(queue) || - (fpos !=3D wreq->last_end && wreq->last_end > 0)) { + (fpos !=3D params->last_end && params->last_end > 0)) { ret =3D bvecq_buffer_make_space(&wreq->load_cursor, GFP_NOFS); if (ret < 0) { folio_unlock(folio); @@ -387,10 +542,10 @@ static int netfs_write_folio(struct netfs_io_request = *wreq, =20 queue =3D wreq->load_cursor.bvecq; queue->fpos =3D fpos; - if (fpos !=3D wreq->last_end) + if (fpos !=3D params->last_end) queue->discontig =3D true; - bvecq_pos_move(&wreq->dispatch_cursor, queue); - wreq->dispatch_cursor.slot =3D 0; + bvecq_pos_move(¶ms->dispatch_cursor, queue); + params->dispatch_cursor.slot =3D 0; } =20 /* netfs_perform_write() may shift i_size around the page or from out @@ -418,23 +573,36 @@ static int netfs_write_folio(struct netfs_io_request = *wreq, if (finfo) { foff =3D finfo->dirty_offset; flen =3D foff + finfo->dirty_len; - streamw =3D true; + params->notes |=3D NOTE_STREAMW; + if (foff > 0) + params->notes |=3D NOTE_DISCONTIG_BEFORE; + if (flen < fsize) + params->notes |=3D NOTE_DISCONTIG_AFTER; } =20 + if (params->last_end && fpos !=3D params->last_end) + params->notes |=3D NOTE_DISCONTIG_BEFORE; + params->last_end =3D fpos + fsize; + if (wreq->origin =3D=3D NETFS_WRITETHROUGH) { - to_eof =3D false; if (flen > i_size - fpos) flen =3D i_size - fpos; + /* EOF may be changing. */ } else if (flen > i_size - fpos) { flen =3D i_size - fpos; - if (!streamw) + if (!(params->notes & NOTE_STREAMW)) folio_zero_segment(folio, flen, fsize); - to_eof =3D true; + params->notes |=3D NOTE_TO_EOF; } else if (flen =3D=3D i_size - fpos) { - to_eof =3D true; + params->notes |=3D NOTE_TO_EOF; } flen -=3D foff; =20 + params->folio_start =3D fpos; + params->folio_len =3D fsize; + params->dirty_offset =3D foff; + params->dirty_len =3D flen; + _debug("folio %zx %zx %zx", foff, flen, fsize); =20 /* Deal with discontinuities in the stream of dirty pages. 
These can @@ -454,22 +622,31 @@ static int netfs_write_folio(struct netfs_io_request = *wreq, * write-back group. */ if (fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE) { - netfs_issue_write(wreq, upload); + if (!(params->notes & NOTE_CACHE_AVAIL)) { + trace_netfs_folio(folio, netfs_folio_trace_cancel_copy); + goto cancel_folio; + } + params->notes |=3D NOTE_CACHE_COPY; + trace_netfs_folio(folio, netfs_folio_trace_store_copy); } else if (fgroup !=3D wreq->group) { /* We can't write this page to the server yet. */ kdebug("wrong group"); - folio_redirty_for_writepage(wbc, folio); - folio_unlock(folio); - netfs_issue_write(wreq, upload); - netfs_issue_write(wreq, cache); - return 0; + goto skip_folio; + } else if (!(params->notes & (NOTE_UPLOAD_AVAIL | NOTE_CACHE_AVAIL))) { + trace_netfs_folio(folio, netfs_folio_trace_cancel_store); + goto cancel_folio_discard; + } else { + if (params->notes & NOTE_UPLOAD_STARTED) { + params->notes |=3D NOTE_UPLOAD; + trace_netfs_folio(folio, netfs_folio_trace_store_plus); + } else { + params->notes |=3D NOTE_UPLOAD | NOTE_UPLOAD_STARTED; + trace_netfs_folio(folio, netfs_folio_trace_store); + } + if (params->notes & NOTE_CACHE_AVAIL) + params->notes |=3D NOTE_CACHE_COPY; } =20 - if (foff > 0) - netfs_issue_write(wreq, upload); - if (streamw) - netfs_issue_write(wreq, cache); - /* Flip the page to the writeback state and unlock. If we're called * from write-through, then the page has already been put into the wb * state. 
@@ -478,129 +655,37 @@ static int netfs_write_folio(struct netfs_io_request= *wreq, folio_start_writeback(folio); folio_unlock(folio); =20 - if (fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE) { - if (!cache->avail) { - trace_netfs_folio(folio, netfs_folio_trace_cancel_copy); - netfs_issue_write(wreq, upload); - netfs_folio_written_back(folio); - return 0; - } - trace_netfs_folio(folio, netfs_folio_trace_store_copy); - } else if (!upload->avail && !cache->avail) { - trace_netfs_folio(folio, netfs_folio_trace_cancel_store); - netfs_folio_written_back(folio); - return 0; - } else if (!upload->construct) { - trace_netfs_folio(folio, netfs_folio_trace_store); - } else { - trace_netfs_folio(folio, netfs_folio_trace_store_plus); - } - /* Attach the folio to the rolling buffer. */ slot =3D queue->nr_slots; - bvec_set_folio(&queue->bv[slot], folio, flen, 0); + bvec_set_folio(&queue->bv[slot], folio, flen, foff); queue->nr_slots =3D slot + 1; wreq->load_cursor.slot =3D slot + 1; wreq->load_cursor.offset =3D 0; - wreq->last_end =3D fpos + foff + flen; trace_netfs_bv_slot(queue, slot); + trace_netfs_wback(wreq, folio, params->notes); =20 - /* Move the submission point forward to allow for write-streaming data - * not starting at the front of the page. We don't do write-streaming - * with the cache as the cache requires DIO alignment. - * - * Also skip uploading for data that's been read and just needs copying - * to the cache. - */ - for (int s =3D 0; s < NR_IO_STREAMS; s++) { - stream =3D &wreq->io_streams[s]; - stream->submit_off =3D 0; - stream->submit_len =3D flen; - if (!stream->avail || - (stream->source =3D=3D NETFS_WRITE_TO_CACHE && streamw) || - (stream->source =3D=3D NETFS_UPLOAD_TO_SERVER && - fgroup =3D=3D NETFS_FOLIO_COPY_TO_CACHE)) { - stream->submit_off =3D UINT_MAX; - stream->submit_len =3D 0; - } - } - - /* Attach the folio to one or more subrequests. 
For a big folio, we - * could end up with thousands of subrequests if the wsize is small - - * but we might need to wait during the creation of subrequests for - * network resources (eg. SMB credits). - */ - for (;;) { - ssize_t part; - size_t lowest_off =3D ULONG_MAX; - int choose_s =3D -1; - - /* Always add to the lowest-submitted stream first. */ - for (int s =3D 0; s < NR_IO_STREAMS; s++) { - stream =3D &wreq->io_streams[s]; - if (stream->submit_len > 0 && - stream->submit_off < lowest_off) { - lowest_off =3D stream->submit_off; - choose_s =3D s; - } - } - - if (choose_s < 0) - break; - stream =3D &wreq->io_streams[choose_s]; - - /* Advance the cursor. */ - wreq->dispatch_cursor.offset =3D stream->submit_off; - - atomic64_set(&wreq->issued_to, fpos + foff + stream->submit_off); - part =3D netfs_advance_write(wreq, stream, fpos + foff + stream->submit_= off, - stream->submit_len, to_eof); - stream->submit_off +=3D part; - if (part > stream->submit_len) - stream->submit_len =3D 0; - else - stream->submit_len -=3D part; - if (part > 0) - debug =3D true; - } - - bvecq_pos_step(&wreq->dispatch_cursor); - /* Order loading the queue before updating the issue_to point */ - atomic64_set_release(&wreq->issued_to, fpos + fsize); - - if (!debug) - kdebug("R=3D%x: No submit", wreq->debug_id); - - if (foff + flen < fsize) - for (int s =3D 0; s < NR_IO_STREAMS; s++) - netfs_issue_write(wreq, &wreq->io_streams[s]); - - _leave(" =3D 0"); +out: + _leave(" =3D %x", params->notes); return 0; -} =20 -/* - * End the issuing of writes, letting the collector know we're done. - */ -static void netfs_end_issue_write(struct netfs_io_request *wreq) -{ - bool needs_poke =3D true; - - smp_wmb(); /* Write subreq lists before ALL_QUEUED. 
*/ - set_bit(NETFS_RREQ_ALL_QUEUED, &wreq->flags); - - for (int s =3D 0; s < NR_IO_STREAMS; s++) { - struct netfs_io_stream *stream =3D &wreq->io_streams[s]; - - if (!stream->active) - continue; - if (!list_empty(&stream->subrequests)) - needs_poke =3D false; - netfs_issue_write(wreq, stream); - } - - if (needs_poke) - netfs_wake_collector(wreq); +skip_folio: + ret =3D folio_redirty_for_writepage(wbc, folio); + folio_unlock(folio); + if (ret < 0) + return ret; + params->notes |=3D NOTE_DISCONTIG_BEFORE; + goto out; +cancel_folio_discard: + netfs_put_group(fgroup); +cancel_folio: + folio_detach_private(folio); + kfree(finfo); + folio_unlock(folio); + folio_cancel_dirty(folio); + if (wreq->origin =3D=3D NETFS_WRITETHROUGH) + folio_end_writeback(folio); + params->notes |=3D NOTE_DISCONTIG_BEFORE; + goto out; } =20 /* @@ -611,6 +696,7 @@ int netfs_writepages(struct address_space *mapping, { struct netfs_inode *ictx =3D netfs_inode(mapping->host); struct netfs_io_request *wreq =3D NULL; + struct netfs_wb_params params =3D {}; struct folio *folio; int error =3D 0; =20 @@ -636,35 +722,48 @@ int netfs_writepages(struct address_space *mapping, =20 if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0) goto nomem; - bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor); - bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor); + bvecq_pos_set(¶ms.dispatch_cursor, &wreq->load_cursor); + bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor); =20 __set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags); trace_netfs_write(wreq, netfs_write_trace_writeback); netfs_stat(&netfs_n_wh_writepages); =20 - do { - _debug("wbiter %lx %llx", folio->index, atomic64_read(&wreq->issued_to)); + if (wreq->io_streams[1].avail) + params.notes |=3D NOTE_CACHE_AVAIL; =20 - /* It appears we don't have to handle cyclic writeback wrapping. 
	 */
-	WARN_ON_ONCE(wreq && folio_pos(folio) < atomic64_read(&wreq->issued_to));
+	do {
+		_debug("wbiter %lx", folio->index);
 
 		if (netfs_folio_group(folio) != NETFS_FOLIO_COPY_TO_CACHE &&
 		    unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
 			set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
 			wreq->netfs_ops->begin_writeback(wreq);
+			if (wreq->io_streams[0].avail) {
+				params.notes |= NOTE_UPLOAD_AVAIL;
+				/* Order setting the active flag after other fields. */
+				smp_store_release(&wreq->io_streams[0].active, true);
+			}
 		}
 
-		error = netfs_write_folio(wreq, wbc, folio);
+		params.notes &= NOTES__KEEP_MASK;
+		error = netfs_queue_wb_folio(wreq, wbc, folio, &params);
+		if (error < 0)
+			break;
+		error = netfs_issue_streams(wreq, &params);
 		if (error < 0)
 			break;
+
+		bvecq_pos_step(&params.dispatch_cursor);
 	} while ((folio = writeback_iter(mapping, wbc, folio, &error)));
 
-	netfs_end_issue_write(wreq);
+	netfs_end_issue_write(wreq, &params);
 
 	mutex_unlock(&ictx->wb_lock);
 	bvecq_pos_unset(&wreq->load_cursor);
-	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&params.dispatch_cursor);
+	for (int i = 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&wreq->io_streams[i].dispatch_cursor);
 	netfs_wake_collector(wreq);
 
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
@@ -686,32 +785,55 @@ EXPORT_SYMBOL(netfs_writepages);
 
 /*
  * Begin a write operation for writing through the pagecache.
 */
-struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len)
+struct netfs_writethrough *netfs_begin_writethrough(struct kiocb *iocb, size_t len)
 {
+	struct netfs_writethrough *wthru = NULL;
 	struct netfs_io_request *wreq = NULL;
 	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
 
+	wthru = kzalloc_obj(struct netfs_writethrough);
+	if (!wthru)
+		return ERR_PTR(-ENOMEM);
+
 	mutex_lock(&ictx->wb_lock);
 
 	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
				      iocb->ki_pos, NETFS_WRITETHROUGH);
 	if (IS_ERR(wreq)) {
 		mutex_unlock(&ictx->wb_lock);
-		return wreq;
+		kfree(wthru);
+		return ERR_CAST(wreq);
 	}
+	wthru->wreq = wreq;
 
 	if (bvecq_buffer_init(&wreq->load_cursor, GFP_NOFS) < 0) {
 		netfs_put_failed_request(wreq);
 		mutex_unlock(&ictx->wb_lock);
+		kfree(wthru);
 		return ERR_PTR(-ENOMEM);
 	}
 
-	bvecq_pos_set(&wreq->dispatch_cursor, &wreq->load_cursor);
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	bvecq_pos_set(&wthru->params.dispatch_cursor, &wreq->load_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor);
+
+	if (wreq->io_streams[1].avail)
+		wthru->params.notes |= NOTE_CACHE_AVAIL;
 
 	wreq->io_streams[0].avail = true;
 	trace_netfs_write(wreq, netfs_write_trace_writethrough);
-	return wreq;
+	if (!is_sync_kiocb(iocb))
+		wreq->iocb = iocb;
+
+	if (unlikely(!test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))) {
+		set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags);
+		/* Don't call ->begin_writeback() as ->init_request() gets file*. */
+		if (wreq->io_streams[0].avail) {
+			wthru->params.notes |= NOTE_UPLOAD_AVAIL;
+			/* Order setting the active flag after other fields. */
+			smp_store_release(&wreq->io_streams[0].active, true);
+		}
+	}
+	return wthru;
 }
 
 /*
@@ -720,14 +842,17 @@ struct netfs_io_request *netfs_begin_writethrough(struct kiocb *iocb, size_t len
  * to the request.  If we've added more than wsize then we need to create a new
  * subrequest.
 */
-int netfs_advance_writethrough(struct netfs_io_request *wreq, struct writeback_control *wbc,
-			       struct folio *folio, size_t copied, bool to_page_end,
-			       struct folio **writethrough_cache)
+int netfs_advance_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc,
+			       struct folio *folio, size_t copied, bool to_page_end)
 {
+	struct netfs_io_request *wreq = wthru->wreq;
+	int ret;
+
 	_enter("R=%x ws=%u cp=%zu tp=%u",
	       wreq->debug_id, wreq->wsize, copied, to_page_end);
 
-	if (!*writethrough_cache) {
+	if (!wthru->in_progress) {
 		if (folio_test_dirty(folio))
 			/* Sigh.  mmap. */
 			folio_clear_dirty_for_io(folio);
@@ -738,63 +863,113 @@ int netfs_advance_writethrough(struct netfs_io_request *wreq, struct writeback_c
 			trace_netfs_folio(folio, netfs_folio_trace_wthru);
 		else
 			trace_netfs_folio(folio, netfs_folio_trace_wthru_plus);
-		*writethrough_cache = folio;
+		wthru->in_progress = folio;
 	}
 
 	wreq->len += copied;
 	if (!to_page_end)
 		return 0;
 
-	*writethrough_cache = NULL;
-	return netfs_write_folio(wreq, wbc, folio);
+	wthru->in_progress = NULL;
+	wthru->params.notes &= NOTES__KEEP_MASK;
+	ret = netfs_queue_wb_folio(wreq, wbc, folio, &wthru->params);
+	if (ret < 0)
+		return ret;
+	return netfs_issue_streams(wreq, &wthru->params);
 }
 
 /*
  * End a write operation used when writing through the pagecache.
 */
-ssize_t netfs_end_writethrough(struct netfs_io_request *wreq, struct writeback_control *wbc,
-			       struct folio *writethrough_cache)
+ssize_t netfs_end_writethrough(struct netfs_writethrough *wthru,
+			       struct writeback_control *wbc)
 {
+	struct netfs_io_request *wreq = wthru->wreq;
 	struct netfs_inode *ictx = netfs_inode(wreq->inode);
 	ssize_t ret;
 
 	_enter("R=%x", wreq->debug_id);
 
-	if (writethrough_cache)
-		netfs_write_folio(wreq, wbc, writethrough_cache);
+	if (wthru->in_progress) {
+		wthru->params.notes &= NOTES__KEEP_MASK;
+		ret = netfs_queue_wb_folio(wreq, wbc, wthru->in_progress, &wthru->params);
+		if (ret == 0)
+			ret = netfs_issue_streams(wreq, &wthru->params);
+		wthru->in_progress = NULL;
+	}
 
-	netfs_end_issue_write(wreq);
+	netfs_end_issue_write(wreq, &wthru->params);
 
 	mutex_unlock(&ictx->wb_lock);
 
 	bvecq_pos_unset(&wreq->load_cursor);
-	bvecq_pos_unset(&wreq->dispatch_cursor);
+	bvecq_pos_unset(&wthru->params.dispatch_cursor);
+	for (int i = 0; i < NR_IO_STREAMS; i++)
+		bvecq_pos_unset(&wreq->io_streams[i].dispatch_cursor);
 
 	if (wreq->iocb)
 		ret = -EIOCBQUEUED;
 	else
 		ret = netfs_wait_for_write(wreq);
 	netfs_put_request(wreq, netfs_rreq_trace_put_return);
+	kfree(wthru);
 	return ret;
 }
 
+/*
+ * Prepare a buffer for a single monolithic write.
+ */
+static int netfs_prepare_write_single_buffer(struct netfs_io_subrequest *subreq,
+					     unsigned int max_segs)
+{
+	struct netfs_io_request *wreq = subreq->rreq;
+	struct netfs_io_stream *stream = &wreq->io_streams[subreq->stream_nr];
+	struct bio_vec *bv;
+	struct bvecq *bq;
+	size_t dio_size = wreq->cache_resources.dio_size;
+	size_t dlen;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &stream->dispatch_cursor);
+	bvecq_pos_set(&subreq->content, &subreq->dispatch_pos);
+
+	/* Round the end of the last entry up.
+	 */
+	bq = subreq->content.bvecq;
+	while (bq->next)
+		bq = bq->next;
+	bv = &bq->bv[bq->nr_slots - 1];
+	dlen = round_up(bv->bv_len, dio_size);
+	if (dlen > bv->bv_len) {
+		subreq->len += dlen - bv->bv_len;
+		bv->bv_len = dlen;
+	}
+
+	stream->buffered = 0;
+	stream->issue_from = subreq->len;
+	wreq->submitted = subreq->len;
+	return 0;
+}
+
 /**
  * netfs_writeback_single - Write back a monolithic payload
  * @mapping: The mapping to write from
  * @wbc: Hints from the VM
- * @iter: Data to write.
+ * @iter: Data to write
+ * @len: Amount of data to write
  *
  * Write a monolithic, non-pagecache object back to the server and/or
  * the cache.  There's a maximum of one subrequest per stream.
 */
 int netfs_writeback_single(struct address_space *mapping,
			   struct writeback_control *wbc,
-			   struct iov_iter *iter)
+			   struct iov_iter *iter,
+			   size_t len)
 {
 	struct netfs_io_request *wreq;
 	struct netfs_inode *ictx = netfs_inode(mapping->host);
 	int ret;
 
+	_enter("%zx,%zx", iov_iter_count(iter), len);
+
 	if (!mutex_trylock(&ictx->wb_lock)) {
 		if (wbc->sync_mode == WB_SYNC_NONE) {
 			netfs_stat(&netfs_n_wb_lock_skip);
@@ -809,23 +984,24 @@ int netfs_writeback_single(struct address_space *mapping,
 		ret = PTR_ERR(wreq);
 		goto couldnt_start;
 	}
-	wreq->len = iov_iter_count(iter);
 
-	ret = netfs_extract_iter(iter, wreq->len, INT_MAX, 0, &wreq->dispatch_cursor.bvecq, 0);
+	wreq->len = len;
+
+	ret = netfs_extract_iter(iter, len, INT_MAX, 0, &wreq->load_cursor.bvecq, 0);
 	if (ret < 0)
 		goto cleanup_free;
-	if (ret < wreq->len) {
+	if (ret < len) {
 		ret = -EIO;
 		goto cleanup_free;
 	}
 
-	bvecq_pos_set(&wreq->collect_cursor, &wreq->dispatch_cursor);
+	bvecq_pos_set(&wreq->collect_cursor, &wreq->load_cursor);
 
 	__set_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &wreq->flags);
 	trace_netfs_write(wreq, netfs_write_trace_writeback_single);
 	netfs_stat(&netfs_n_wh_writepages);
 
-	if (__test_and_set_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
+	if (test_bit(NETFS_RREQ_UPLOAD_TO_SERVER, &wreq->flags))
 		wreq->netfs_ops->begin_writeback(wreq);
 
 	for (int s = 0; s < NR_IO_STREAMS; s++) {
@@ -835,13 +1011,22 @@ int netfs_writeback_single(struct address_space *mapping,
 		if (!stream->avail)
 			continue;
 
-		netfs_prepare_write(wreq, stream, 0);
+		stream->issue_from = 0;
+		stream->buffered = len;
+
+		subreq = netfs_alloc_write_subreq(wreq, stream);
+		if (!subreq) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bvecq_pos_set(&stream->dispatch_cursor, &wreq->load_cursor);
 
-		subreq = stream->construct;
-		subreq->len = wreq->len;
-		stream->submit_len = subreq->len;
+		ret = stream->issue_write(subreq);
+		if (ret < 0 && ret != -EIOCBQUEUED)
+			netfs_write_subrequest_terminated(subreq, ret);
 
-		netfs_issue_write(wreq, stream);
+		bvecq_pos_unset(&stream->dispatch_cursor);
 	}
 
 	wreq->submitted = wreq->len;
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index 5df5c34d4610..096ddf7a2e5c 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -12,12 +12,43 @@
 #include "internal.h"
 
 /*
- * Perform retries on the streams that need it.
+ * Prepare the write buffer for a retry.  We can't necessarily reuse the write
+ * buffer from the previous run of a subrequest because the filesystem is
+ * permitted to modify it (add headers/trailers, encrypt it).  Further, the
+ * subrequest may now be a different size (e.g. cifs has to negotiate for
+ * maximum transfer size).  Also, we can't look at *stream as that may still
+ * refer to the source material being broken up into original subrequests.
+ */
+int netfs_prepare_write_retry_buffer(struct netfs_io_subrequest *subreq,
+				     unsigned int max_segs)
+{
+	struct netfs_io_request *wreq = subreq->rreq;
+	struct netfs_io_stream *stream = &wreq->io_streams[subreq->stream_nr];
+	size_t len;
+
+	bvecq_pos_set(&subreq->dispatch_pos, &wreq->retry_cursor);
+	bvecq_pos_set(&subreq->content, &wreq->retry_cursor);
+	len = bvecq_slice(&wreq->retry_cursor, subreq->len, max_segs, &subreq->nr_segs);
+
+	if (len < subreq->len) {
+		subreq->len = len;
+		trace_netfs_sreq(subreq, netfs_sreq_trace_limited);
+	}
+
+	stream->issue_from += len;
+	stream->buffered -= len;
+	if (stream->buffered == 0)
+		bvecq_pos_unset(&wreq->retry_cursor);
+	return 0;
+}
+
+/*
+ * Perform retries on the streams that need it.  This only has to deal with
+ * buffered writes; unbuffered write retry is handled in direct_write.c.
 */
 static void netfs_retry_write_stream(struct netfs_io_request *wreq,
				     struct netfs_io_stream *stream)
 {
-	struct bvecq_pos dispatch_cursor = {};
 	struct list_head *next;
 
 	_enter("R=%x[%x:]", wreq->debug_id, stream->stream_nr);
@@ -32,30 +63,14 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 	if (unlikely(stream->failed))
 		return;
 
-	/* If there's no renegotiation to do, just resend each failed subreq.
-	 */
-	if (!stream->prepare_write) {
-		struct netfs_io_subrequest *subreq;
-
-		list_for_each_entry(subreq, &stream->subrequests, rreq_link) {
-			if (test_bit(NETFS_SREQ_FAILED, &subreq->flags))
-				break;
-			if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) {
-				netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-				netfs_reissue_write(stream, subreq);
-			}
-		}
-		return;
-	}
-
 	next = stream->subrequests.next;
 
 	do {
 		struct netfs_io_subrequest *subreq = NULL, *from, *to, *tmp;
 		unsigned long long start, len;
-		size_t part;
-		bool boundary = false;
+		int ret;
 
-		bvecq_pos_unset(&dispatch_cursor);
+		bvecq_pos_unset(&wreq->retry_cursor);
 
 		/* Go through the stream and find the next span of contiguous
 		 * data that we then rejig (cifs, for example, needs the wsize
@@ -73,7 +88,6 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 		list_for_each_continue(next, &stream->subrequests) {
 			subreq = list_entry(next, struct netfs_io_subrequest, rreq_link);
 			if (subreq->start + subreq->transferred != start + len ||
-			    test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags) ||
			    !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags))
 				break;
 			to = subreq;
@@ -83,43 +97,40 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 		/* Determine the set of buffers we're going to use.  Each
 		 * subreq gets a subset of a single overall contiguous buffer.
 		 */
-		bvecq_pos_transfer(&dispatch_cursor, &from->dispatch_pos);
-		bvecq_pos_advance(&dispatch_cursor, from->transferred);
+		bvecq_pos_transfer(&wreq->retry_cursor, &from->dispatch_pos);
+		bvecq_pos_advance(&wreq->retry_cursor, from->transferred);
+		wreq->retry_start = start;
+		wreq->retry_buffered = len;
 
 		/* Work through the sublist.
		 */
 		subreq = from;
 		list_for_each_entry_from(subreq, &stream->subrequests, rreq_link) {
-			if (!len)
+			if (!wreq->retry_buffered)
 				break;
 
-			subreq->start = start;
-			subreq->len = len;
-			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
-			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
-
 			bvecq_pos_unset(&subreq->dispatch_pos);
 			bvecq_pos_unset(&subreq->content);
+			subreq->content.bvecq = NULL;
+			subreq->content.slot = 0;
+			subreq->content.offset = 0;
 
-			/* Renegotiate max_len (wsize) */
-			stream->sreq_max_len = len;
-			stream->sreq_max_segs = INT_MAX;
-			stream->prepare_write(subreq);
-
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part = bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len = subreq->transferred + part;
-			subreq->transferred = 0;
-			len -= part;
-			start += part;
-			if (len && subreq == to &&
-			    __test_and_clear_bit(NETFS_SREQ_BOUNDARY, &to->flags))
-				boundary = true;
-
+			__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+			__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+			__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
+			__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+			subreq->start = wreq->retry_start;
+			subreq->len = wreq->retry_buffered;
+			subreq->transferred = 0;
+			subreq->retry_count += 1;
+			subreq->error = 0;
+
+			netfs_stat(&netfs_n_wh_retry_write_subreq);
+			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
-			netfs_reissue_write(stream, subreq);
+			ret = stream->issue_write(subreq);
+			if (ret < 0 && ret != -EIOCBQUEUED)
+				netfs_write_subrequest_terminated(subreq, ret);
+
 			if (subreq == to)
 				break;
 		}
@@ -160,12 +171,9 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 			to = list_next_entry(to, rreq_link);
 			trace_netfs_sreq(subreq, netfs_sreq_trace_retry);
 
-			stream->sreq_max_len = len;
-			stream->sreq_max_segs = INT_MAX;
 			switch (stream->source) {
 			case NETFS_UPLOAD_TO_SERVER:
 				netfs_stat(&netfs_n_wh_upload);
-				stream->sreq_max_len = umin(len, wreq->wsize);
 				break;
 			case NETFS_WRITE_TO_CACHE:
 				netfs_stat(&netfs_n_wh_write);
@@ -174,32 +182,16 @@ static void netfs_retry_write_stream(struct netfs_io_request *wreq,
 				WARN_ON_ONCE(1);
 			}
 
-			stream->prepare_write(subreq);
-
-			bvecq_pos_set(&subreq->dispatch_pos, &dispatch_cursor);
-			part = bvecq_slice(&dispatch_cursor,
-					   umin(len, stream->sreq_max_len),
-					   stream->sreq_max_segs,
-					   &subreq->nr_segs);
-			subreq->len = subreq->transferred + part;
-
-			len -= part;
-			start += part;
-			if (!len && boundary) {
-				__set_bit(NETFS_SREQ_BOUNDARY, &to->flags);
-				boundary = false;
-			}
-
-			netfs_reissue_write(stream, subreq);
-			if (!len)
-				break;
+			ret = stream->issue_write(subreq);
+			if (ret < 0 && ret != -EIOCBQUEUED)
+				netfs_write_subrequest_terminated(subreq, ret);
 
 		} while (len);
 
 	} while (!list_is_head(next, &stream->subrequests));
 
out:
-	bvecq_pos_unset(&dispatch_cursor);
+	bvecq_pos_unset(&wreq->retry_cursor);
 }
 
 /*
@@ -237,4 +229,6 @@ void netfs_retry_writes(struct netfs_io_request *wreq)
 			netfs_retry_write_stream(wreq, stream);
 		}
 	}
+
+	pr_notice("Retrying\n");
 }
diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 12cb0ca738af..ae463867cf01 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -173,6 +173,7 @@ config NFS_FSCACHE
 	bool "Provide NFS client caching support"
 	depends on NFS_FS
 	select NETFS_SUPPORT
+	select NETFS_PGPRIV2
 	select FSCACHE
 	help
 	  Say Y here if you want NFS data to be cached locally on disc through
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 9b7fdad4a920..bc82821d77a3 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -273,8 +273,6 @@ static int nfs_netfs_init_request(struct netfs_io_request *rreq, struct file *fi
 	rreq->debug_id = atomic_inc_return(&nfs_netfs_debug_id);
 	/* [DEPRECATED] Use PG_private_2 to mark folio being written to the cache.
	 */
 	__set_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags);
-	rreq->io_streams[0].sreq_max_len = NFS_SB(rreq->inode->i_sb)->rsize;
-
 	return 0;
 }
 
@@ -296,8 +294,9 @@ static struct nfs_netfs_io_data *nfs_netfs_alloc(struct netfs_io_subrequest *sre
 	return netfs;
 }
 
-static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
+static int nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 {
+	struct netfs_io_request *rreq = sreq->rreq;
 	struct nfs_netfs_io_data *netfs;
 	struct nfs_pageio_descriptor pgio;
 	struct inode *inode = sreq->rreq->inode;
@@ -307,6 +306,13 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 	pgoff_t start, last;
 	int err;
 
+	if (sreq->len > NFS_SB(rreq->inode->i_sb)->rsize)
+		sreq->len = NFS_SB(rreq->inode->i_sb)->rsize;
+
+	err = netfs_prepare_read_buffer(sreq, INT_MAX);
+	if (err < 0)
+		return err;
+
 	start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
 	last = ((sreq->start + sreq->len - sreq->transferred - 1) >> PAGE_SHIFT);
 
@@ -314,14 +320,15 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
			     &nfs_async_read_completion_ops);
 
 	netfs = nfs_netfs_alloc(sreq);
-	if (!netfs) {
-		sreq->error = -ENOMEM;
-		return netfs_read_subreq_terminated(sreq);
-	}
+	if (!netfs)
+		return -ENOMEM;
+
+	/* After this point, we're not allowed to return an error.
	 */
+	netfs_mark_read_submission(sreq);
 
 	pgio.pg_netfs = netfs; /* used in completion */
 
-	xa_for_each_range(&sreq->rreq->mapping->i_pages, idx, page, start, last) {
+	xa_for_each_range(&rreq->mapping->i_pages, idx, page, start, last) {
 		/* nfs_read_add_folio() may schedule() due to pNFS layout and other RPCs */
 		err = nfs_read_add_folio(&pgio, ctx, page_folio(page));
 		if (err < 0) {
@@ -332,6 +339,7 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 out:
 	nfs_pageio_complete_read(&pgio);
 	nfs_netfs_put(netfs);
+	return -EIOCBQUEUED;
 }
 
 void nfs_netfs_initiate_read(struct nfs_pgio_header *hdr)
diff --git a/fs/smb/client/cifssmb.c b/fs/smb/client/cifssmb.c
index 3990a9012264..dc9120802edb 100644
--- a/fs/smb/client/cifssmb.c
+++ b/fs/smb/client/cifssmb.c
@@ -1466,8 +1466,7 @@ cifs_readv_callback(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 	struct netfs_inode *ictx = netfs_inode(rdata->rreq->inode);
 	struct cifs_tcon *tcon = tlink_tcon(rdata->req->cfile->tlink);
 	struct smb_rqst rqst = { .rq_iov = rdata->iov,
-				 .rq_nvec = 1,
-				 .rq_iter = rdata->subreq.io_iter };
+				 .rq_nvec = 1};
 	struct cifs_credits credits = {
 		.value = 1,
 		.instance = 0,
@@ -1481,6 +1480,11 @@ cifs_readv_callback(struct TCP_Server_Info *server, struct mid_q_entry *mid)
		 __func__, mid->mid, mid->mid_state, rdata->result,
		 rdata->subreq.len);
 
+	if (rdata->got_bytes)
+		iov_iter_bvec_queue(&rqst.rq_iter, ITER_DEST,
+				    rdata->subreq.content.bvecq, rdata->subreq.content.slot,
+				    rdata->subreq.content.offset, rdata->subreq.len);
+
 	switch (mid->mid_state) {
 	case MID_RESPONSE_RECEIVED:
 		/* result already set, check signature */
@@ -2002,7 +2006,10 @@ cifs_async_writev(struct cifs_io_subrequest *wdata)
 
 	rqst.rq_iov = iov;
 	rqst.rq_nvec = 1;
-	rqst.rq_iter = wdata->subreq.io_iter;
+
+	iov_iter_bvec_queue(&rqst.rq_iter, ITER_SOURCE,
+			    wdata->subreq.content.bvecq, wdata->subreq.content.slot,
+			    wdata->subreq.content.offset, wdata->subreq.len);
=20 cifs_dbg(FYI, "async write at %llu %zu bytes\n", wdata->subreq.start, wdata->subreq.len); diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index cffcf82c1b69..a933c12b39ea 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -44,18 +44,34 @@ static int cifs_reopen_file(struct cifsFileInfo *cfile,= bool can_flush); * Prepare a subrequest to upload to the server. We need to allocate cred= its * so that we know the maximum amount of data that we can include in it. */ -static void cifs_prepare_write(struct netfs_io_subrequest *subreq) +static int cifs_estimate_write(struct netfs_io_request *wreq, + struct netfs_io_stream *stream, + struct netfs_write_estimate *estimate) +{ + struct cifs_sb_info *cifs_sb =3D CIFS_SB(wreq->inode->i_sb); + + estimate->issue_at =3D stream->issue_from + cifs_sb->ctx->wsize; + return 0; +} + +/* + * Issue a subrequest to upload to the server. + */ +static int cifs_issue_write(struct netfs_io_subrequest *subreq) { struct cifs_io_subrequest *wdata =3D container_of(subreq, struct cifs_io_subrequest, subreq); struct cifs_io_request *req =3D wdata->req; - struct netfs_io_stream *stream =3D &req->rreq.io_streams[subreq->stream_n= r]; struct TCP_Server_Info *server; struct cifsFileInfo *open_file =3D req->cfile; - struct cifs_sb_info *cifs_sb =3D CIFS_SB(wdata->rreq->inode->i_sb); - size_t wsize =3D req->rreq.wsize; + struct cifs_sb_info *cifs_sb =3D CIFS_SB(subreq->rreq->inode->i_sb); + unsigned int max_segs =3D INT_MAX; + size_t len; int rc; =20 + if (cifs_forced_shutdown(cifs_sb)) + return smb_EIO(smb_eio_trace_forced_shutdown); + if (!wdata->have_xid) { wdata->xid =3D get_xid(); wdata->have_xid =3D true; @@ -74,18 +90,16 @@ static void cifs_prepare_write(struct netfs_io_subreque= st *subreq) if (rc < 0) { if (rc =3D=3D -EAGAIN) goto retry; - subreq->error =3D rc; - return netfs_prepare_write_failed(subreq); + return rc; } } =20 - rc =3D server->ops->wait_mtu_credits(server, wsize, &stream->sreq_max_len, - 
&wdata->credits); - if (rc < 0) { - subreq->error =3D rc; - return netfs_prepare_write_failed(subreq); - } + len =3D umin(subreq->len, cifs_sb->ctx->wsize); + rc =3D server->ops->wait_mtu_credits(server, len, &len, &wdata->credits); + if (rc < 0) + return rc; =20 + subreq->len =3D len; wdata->credits.rreq_debug_id =3D subreq->rreq->debug_id; wdata->credits.rreq_debug_index =3D subreq->debug_index; wdata->credits.in_flight_check =3D 1; @@ -101,39 +115,29 @@ static void cifs_prepare_write(struct netfs_io_subreq= uest *subreq) const struct smbdirect_socket_parameters *sp =3D smbd_get_parameters(server->smbd_conn); =20 - stream->sreq_max_segs =3D sp->max_frmr_depth; + max_segs =3D sp->max_frmr_depth; } #endif -} - -/* - * Issue a subrequest to upload to the server. - */ -static void cifs_issue_write(struct netfs_io_subrequest *subreq) -{ - struct cifs_io_subrequest *wdata =3D - container_of(subreq, struct cifs_io_subrequest, subreq); - struct cifs_sb_info *sbi =3D CIFS_SB(subreq->rreq->inode->i_sb); - int rc; =20 - if (cifs_forced_shutdown(sbi)) { - rc =3D smb_EIO(smb_eio_trace_forced_shutdown); - goto fail; + rc =3D netfs_prepare_write_buffer(subreq, max_segs); + if (rc < 0) { + add_credits_and_wake_if(wdata->server, &wdata->credits, 0); + return rc; } =20 - rc =3D adjust_credits(wdata->server, wdata, cifs_trace_rw_credits_issue_w= rite_adjust); + rc =3D adjust_credits(server, wdata, cifs_trace_rw_credits_issue_write_ad= just); if (rc) - goto fail; + goto fail_with_credits; =20 rc =3D -EAGAIN; if (wdata->req->cfile->invalidHandle) - goto fail; + goto fail_with_credits; =20 wdata->server->ops->async_writev(wdata); out: - return; + return -EIOCBQUEUED; =20 -fail: +fail_with_credits: if (rc =3D=3D -EAGAIN) trace_netfs_sreq(subreq, netfs_sreq_trace_retry); else @@ -149,17 +153,25 @@ static void cifs_netfs_invalidate_cache(struct netfs_= io_request *wreq) } =20 /* - * Negotiate the size of a read operation on behalf of the netfs library. 
+ * Issue a read operation on behalf of the netfs helper functions. We're = asked + * to make a read of a certain size at a point in the file. We are permit= ted + * to only read a portion of that, but as long as we read something, the n= etfs + * helper will call us again so that we can issue another read. */ -static int cifs_prepare_read(struct netfs_io_subrequest *subreq) +static int cifs_issue_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_= subrequest, subreq); struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io= _request, rreq); - struct TCP_Server_Info *server; + struct TCP_Server_Info *server =3D rdata->server; struct cifs_sb_info *cifs_sb =3D CIFS_SB(rreq->inode->i_sb); - size_t size; - int rc =3D 0; + unsigned int max_segs =3D INT_MAX; + size_t len; + int rc; + + cifs_dbg(FYI, "%s: op=3D%08x[%x] mapping=3D%p len=3D%zu/%zu\n", + __func__, rreq->debug_id, subreq->debug_index, rreq->mapping, + subreq->transferred, subreq->len); =20 if (!rdata->have_xid) { rdata->xid =3D get_xid(); @@ -173,17 +185,15 @@ static int cifs_prepare_read(struct netfs_io_subreque= st *subreq) cifs_negotiate_rsize(server, cifs_sb->ctx, tlink_tcon(req->cfile->tlink)); =20 - rc =3D server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize, - &size, &rdata->credits); + len =3D umin(subreq->len, cifs_sb->ctx->rsize); + rc =3D server->ops->wait_mtu_credits(server, len, &len, &rdata->credits); if (rc) return rc; =20 - rreq->io_streams[0].sreq_max_len =3D size; - - rdata->credits.in_flight_check =3D 1; + subreq->len =3D len; rdata->credits.rreq_debug_id =3D rreq->debug_id; rdata->credits.rreq_debug_index =3D subreq->debug_index; - + rdata->credits.in_flight_check =3D 1; trace_smb3_rw_credits(rdata->rreq->debug_id, rdata->subreq.debug_index, rdata->credits.value, @@ -195,33 +205,17 @@ static int cifs_prepare_read(struct netfs_io_subreque= st *subreq) 
const struct smbdirect_socket_parameters *sp =3D smbd_get_parameters(server->smbd_conn); =20 - rreq->io_streams[0].sreq_max_segs =3D sp->max_frmr_depth; + max_segs =3D sp->max_frmr_depth; } #endif - return 0; -} - -/* - * Issue a read operation on behalf of the netfs helper functions. We're = asked - * to make a read of a certain size at a point in the file. We are permit= ted - * to only read a portion of that, but as long as we read something, the n= etfs - * helper will call us again so that we can issue another read. - */ -static void cifs_issue_read(struct netfs_io_subrequest *subreq) -{ - struct netfs_io_request *rreq =3D subreq->rreq; - struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_= subrequest, subreq); - struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io= _request, rreq); - struct TCP_Server_Info *server =3D rdata->server; - int rc =3D 0; =20 - cifs_dbg(FYI, "%s: op=3D%08x[%x] mapping=3D%p len=3D%zu/%zu\n", - __func__, rreq->debug_id, subreq->debug_index, rreq->mapping, - subreq->transferred, subreq->len); + rc =3D netfs_prepare_read_buffer(subreq, max_segs); + if (rc < 0) + goto fail_with_credits; =20 rc =3D adjust_credits(server, rdata, cifs_trace_rw_credits_issue_read_adj= ust); if (rc) - goto failed; + goto fail_with_credits; =20 if (req->cfile->invalidHandle) { do { @@ -235,15 +229,24 @@ static void cifs_issue_read(struct netfs_io_subreques= t *subreq) subreq->rreq->origin !=3D NETFS_DIO_READ) __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); =20 - trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + /* After this point, we're not allowed to return an error. 
*/ + netfs_mark_read_submission(subreq); + rc =3D rdata->server->ops->async_readv(rdata); - if (rc) - goto failed; - return; + if (rc) { + subreq->error =3D rc; + netfs_read_subreq_terminated(subreq); + } + return -EIOCBQUEUED; =20 +fail_with_credits: + if (rc =3D=3D -EAGAIN) + trace_netfs_sreq(subreq, netfs_sreq_trace_retry); + else + trace_netfs_sreq(subreq, netfs_sreq_trace_fail); + add_credits_and_wake_if(rdata->server, &rdata->credits, 0); failed: - subreq->error =3D rc; - netfs_read_subreq_terminated(subreq); + return rc; } =20 /* @@ -353,11 +356,10 @@ const struct netfs_request_ops cifs_req_ops =3D { .init_request =3D cifs_init_request, .free_request =3D cifs_free_request, .free_subrequest =3D cifs_free_subrequest, - .prepare_read =3D cifs_prepare_read, .issue_read =3D cifs_issue_read, .done =3D cifs_rreq_done, .begin_writeback =3D cifs_begin_writeback, - .prepare_write =3D cifs_prepare_write, + .estimate_write =3D cifs_estimate_write, .issue_write =3D cifs_issue_write, .invalidate_cache =3D cifs_netfs_invalidate_cache, }; diff --git a/fs/smb/client/smb2ops.c b/fs/smb/client/smb2ops.c index 0d19c8fc4c3d..d15f196df1e7 100644 --- a/fs/smb/client/smb2ops.c +++ b/fs/smb/client/smb2ops.c @@ -4705,6 +4705,7 @@ handle_read_data(struct TCP_Server_Info *server, stru= ct mid_q_entry *mid, unsigned int cur_page_idx; unsigned int pad_len; struct cifs_io_subrequest *rdata =3D mid->callback_data; + struct iov_iter iter; struct smb2_hdr *shdr =3D (struct smb2_hdr *)buf; size_t copied; bool use_rdma_mr =3D false; @@ -4777,6 +4778,10 @@ handle_read_data(struct TCP_Server_Info *server, str= uct mid_q_entry *mid, =20 pad_len =3D data_offset - server->vals->read_rsp_size; =20 + iov_iter_bvec_queue(&iter, ITER_DEST, + rdata->subreq.content.bvecq, rdata->subreq.content.slot, + rdata->subreq.content.offset, rdata->subreq.len); + if (buf_len <=3D data_offset) { /* read response payload is in pages */ cur_page_idx =3D pad_len / PAGE_SIZE; @@ -4806,7 +4811,7 @@ 
handle_read_data(struct TCP_Server_Info *server, stru= ct mid_q_entry *mid, =20 /* Copy the data to the output I/O iterator. */ rdata->result =3D cifs_copy_bvecq_to_iter(buffer, buffer_len, - cur_off, &rdata->subreq.io_iter); + cur_off, &iter); if (rdata->result !=3D 0) { if (is_offloaded) mid->mid_state =3D MID_RESPONSE_MALFORMED; @@ -4819,7 +4824,7 @@ handle_read_data(struct TCP_Server_Info *server, stru= ct mid_q_entry *mid, } else if (buf_len >=3D data_offset + data_len) { /* read response payload is in buf */ WARN_ONCE(buffer, "read data can be either in buf or in buffer"); - copied =3D copy_to_iter(buf + data_offset, data_len, &rdata->subreq.io_i= ter); + copied =3D copy_to_iter(buf + data_offset, data_len, &iter); if (copied =3D=3D 0) return smb_EIO2(smb_eio_trace_rx_copy_to_iter, copied, data_len); rdata->got_bytes =3D copied; diff --git a/fs/smb/client/smb2pdu.c b/fs/smb/client/smb2pdu.c index c43ca74e8704..717d65d32dd3 100644 --- a/fs/smb/client/smb2pdu.c +++ b/fs/smb/client/smb2pdu.c @@ -4539,9 +4539,13 @@ smb2_new_read_req(void **buf, unsigned int *total_le= n, */ if (rdata && smb3_use_rdma_offload(io_parms)) { struct smbdirect_buffer_descriptor_v1 *v1; + struct iov_iter iter; bool need_invalidate =3D server->dialect =3D=3D SMB30_PROT_ID; =20 - rdata->mr =3D smbd_register_mr(server->smbd_conn, &rdata->subreq.io_iter, + iov_iter_bvec_queue(&iter, ITER_DEST, + rdata->subreq.content.bvecq, rdata->subreq.content.slot, + rdata->subreq.content.offset, rdata->subreq.len); + rdata->mr =3D smbd_register_mr(server->smbd_conn, &iter, true, need_invalidate); if (!rdata->mr) return -EAGAIN; @@ -4606,9 +4610,10 @@ smb2_readv_callback(struct TCP_Server_Info *server, = struct mid_q_entry *mid) unsigned int rreq_debug_id =3D rdata->rreq->debug_id; unsigned int subreq_debug_index =3D rdata->subreq.debug_index; =20 - if (rdata->got_bytes) { - rqst.rq_iter =3D rdata->subreq.io_iter; - } + if (rdata->got_bytes) + iov_iter_bvec_queue(&rqst.rq_iter, ITER_DEST, + 
rdata->subreq.content.bvecq, rdata->subreq.content.slot, + rdata->subreq.content.offset, rdata->subreq.len); =20 WARN_ONCE(rdata->server !=3D server, "rdata server %p !=3D mid server %p", @@ -5096,7 +5101,9 @@ smb2_async_writev(struct cifs_io_subrequest *wdata) goto out; =20 rqst.rq_iov =3D iov; - rqst.rq_iter =3D wdata->subreq.io_iter; + iov_iter_bvec_queue(&rqst.rq_iter, ITER_SOURCE, + wdata->subreq.content.bvecq, wdata->subreq.content.slot, + wdata->subreq.content.offset, wdata->subreq.len); =20 rqst.rq_iov[0].iov_len =3D total_len - 1; rqst.rq_iov[0].iov_base =3D (char *)req; @@ -5135,9 +5142,14 @@ smb2_async_writev(struct cifs_io_subrequest *wdata) */ if (smb3_use_rdma_offload(io_parms)) { struct smbdirect_buffer_descriptor_v1 *v1; + struct iov_iter iter; bool need_invalidate =3D server->dialect =3D=3D SMB30_PROT_ID; =20 - wdata->mr =3D smbd_register_mr(server->smbd_conn, &wdata->subreq.io_iter, + iov_iter_bvec_queue(&iter, ITER_SOURCE, + wdata->subreq.content.bvecq, wdata->subreq.content.slot, + wdata->subreq.content.offset, wdata->subreq.len); + + wdata->mr =3D smbd_register_mr(server->smbd_conn, &iter, false, need_invalidate); if (!wdata->mr) { rc =3D -EAGAIN; @@ -5176,8 +5188,8 @@ smb2_async_writev(struct cifs_io_subrequest *wdata) smb2_set_replay(server, &rqst); } =20 - cifs_dbg(FYI, "async write at %llu %u bytes iter=3D%zx\n", - io_parms->offset, io_parms->length, iov_iter_count(&wdata->subreq.io_it= er)); + cifs_dbg(FYI, "async write at %llu %u bytes len=3D%zx\n", + io_parms->offset, io_parms->length, wdata->subreq.len); =20 if (wdata->credits.value > 0) { shdr->CreditCharge =3D cpu_to_le16(DIV_ROUND_UP(wdata->subreq.len, diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c index 05f8099047e1..dd1313736fcb 100644 --- a/fs/smb/client/transport.c +++ b/fs/smb/client/transport.c @@ -1264,12 +1264,19 @@ cifs_readv_receive(struct TCP_Server_Info *server, = struct mid_q_entry *mid) } =20 #ifdef CONFIG_CIFS_SMB_DIRECT - if (rdata->mr) + if 
(rdata->mr) { length =3D data_len; /* An RDMA read is already done. */ - else + } else { +#endif + struct iov_iter iter; + + iov_iter_bvec_queue(&iter, ITER_DEST, rdata->subreq.content.bvecq, + rdata->subreq.content.slot, rdata->subreq.content.offset, + data_len); + length =3D cifs_read_iter_from_socket(server, &iter, data_len); +#ifdef CONFIG_CIFS_SMB_DIRECT + } #endif - length =3D cifs_read_iter_from_socket(server, &rdata->subreq.io_iter, - data_len); if (length > 0) rdata->got_bytes +=3D length; server->total_read +=3D length; diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 65e39f9b0c10..51c021975f0d 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -66,7 +66,7 @@ struct netfs_inode { #endif struct mutex wb_lock; /* Writeback serialisation */ loff_t remote_i_size; /* Size of the remote file */ - loff_t zero_point; /* Size after which we assume there's no data + unsigned long long zero_point; /* Size after which we assume there's no d= ata * on the server */ atomic_t io_count; /* Number of outstanding reqs */ unsigned long flags; @@ -126,25 +126,39 @@ static inline struct netfs_group *netfs_folio_group(s= truct folio *folio) return priv; } =20 +/* + * Estimate of maximum write subrequest for writeback. The filesystem is + * responsible for filling this in when called from ->estimate_write(), th= ough + * netfslib will preset infinite defaults. + */ +struct netfs_write_estimate { + unsigned long long issue_at; /* Point at which we must submit */ + int max_segs; /* Max number of segments in a single RPC */ +}; + /* * Stream of I/O subrequests going to a particular destination, such as the * server or the local cache. This is mainly intended for writing where w= e may * have to write to multiple destinations concurrently. 
 */
 struct netfs_io_stream {
-	/* Submission tracking */
-	struct netfs_io_subrequest *construct; /* Op being constructed */
-	size_t sreq_max_len; /* Maximum size of a subrequest */
-	unsigned int sreq_max_segs; /* 0 or max number of segments in an iterator */
-	unsigned int submit_off; /* Folio offset we're submitting from */
-	unsigned int submit_len; /* Amount of data left to submit */
-	void (*prepare_write)(struct netfs_io_subrequest *subreq);
-	void (*issue_write)(struct netfs_io_subrequest *subreq);
+	/* Submission tracking (main dispatch only; not retry) */
+	struct bvecq_pos dispatch_cursor; /* Point from which buffers are dispatched */
+	unsigned long long issue_from; /* Current issue point */
+	size_t buffered; /* Amount in buffer */
+	u8 applicable; /* What sources are applicable (NOTE_* mask) */
+	bool buffering; /* T if buffering on this stream */
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
+	int (*issue_write)(struct netfs_io_subrequest *subreq);
+	atomic64_t issued_to; /* Point to which can be considered issued */
+
 	/* Collection tracking */
 	struct list_head subrequests; /* Contributory I/O operations */
 	unsigned long long collected_to; /* Position we've collected results to */
 	size_t transferred; /* The amount transferred from this stream */
-	unsigned short error; /* Aggregate error for the stream */
+	short error; /* Aggregate error for the stream */
 	enum netfs_io_source source; /* Where to read from/write to */
 	unsigned char stream_nr; /* Index of stream in parent table */
 	bool avail; /* T if stream is available */
@@ -180,14 +194,13 @@ struct netfs_io_subrequest {
 	struct list_head rreq_link; /* Link in rreq->subrequests */
 	struct bvecq_pos dispatch_pos; /* Bookmark in the combined queue of the start */
 	struct bvecq_pos content; /* The (copied) content of the subrequest */
-	struct iov_iter io_iter; /* Iterator for this subrequest */
 	unsigned long long start; /* Where to start the I/O */
 	size_t len; /* Size of the I/O */
 	size_t transferred; /* Amount of data transferred */
+	unsigned int nr_segs; /* Number of segments in content */
 	refcount_t ref;
 	short error; /* 0 or error that occurred */
 	unsigned short debug_index; /* Index in list (for debugging output) */
-	unsigned int nr_segs; /* Number of segs in io_iter */
 	u8 retry_count; /* The number of retries (0 on initial pass) */
 	enum netfs_io_source source; /* Where to read from/write to */
 	unsigned char stream_nr; /* I/O stream this belongs to */
@@ -196,7 +209,6 @@ struct netfs_io_subrequest {
 #define NETFS_SREQ_CLEAR_TAIL 1 /* Set if the rest of the read should be cleared */
 #define NETFS_SREQ_MADE_PROGRESS 4 /* Set if we transferred at least some data */
 #define NETFS_SREQ_ONDEMAND 5 /* Set if it's from on-demand read mode */
-#define NETFS_SREQ_BOUNDARY 6 /* Set if ends on hard boundary (eg. ceph object) */
 #define NETFS_SREQ_HIT_EOF 7 /* Set if short due to EOF */
 #define NETFS_SREQ_IN_PROGRESS 8 /* Unlocked when the subrequest completes */
 #define NETFS_SREQ_NEED_RETRY 9 /* Set if the filesystem requests a retry */
@@ -243,22 +255,25 @@ struct netfs_io_request {
 	struct netfs_group *group; /* Writeback group being written back */
 	struct bvecq_pos collect_cursor; /* Clear-up point of I/O buffer */
 	struct bvecq_pos load_cursor; /* Point at which new folios are loaded in */
-	struct bvecq_pos dispatch_cursor; /* Point from which buffers are dispatched */
+	struct bvecq_pos retry_cursor; /* Point from which retries are dispatched */
 	wait_queue_head_t waitq; /* Processor waiter */
 	void *netfs_priv; /* Private data for the netfs */
 	void *netfs_priv2; /* Private data for the netfs */
-	unsigned long long last_end; /* End pos of last folio submitted */
 	unsigned long long submitted; /* Amount submitted for I/O so far */
 	unsigned long long len; /* Length of the request */
 	size_t transferred; /* Amount to be indicated as transferred */
 	long error; /* 0 or error that occurred */
 	unsigned long long i_size; /* Size of the file */
 	unsigned long long start; /* Start position */
-	atomic64_t issued_to; /* Write issuer folio cursor */
 	unsigned long long collected_to; /* Point we've collected to */
 	unsigned long long cache_coll_to; /* Point the cache has collected to */
 	unsigned long long cleaned_to; /* Position we've cleaned folios to */
 	unsigned long long abandon_to; /* Position to abandon folios to */
+#ifdef CONFIG_NETFS_PGPRIV2
+	unsigned long long last_end; /* End of last folio added */
+#endif
+	unsigned long long retry_start; /* Position to retry from */
+	size_t retry_buffered; /* Amount of data to retry */
 	pgoff_t no_unlock_folio; /* Don't unlock this folio after read */
 	unsigned int debug_id;
 	unsigned int rsize; /* Maximum read size (0 for none) */
@@ -282,8 +297,10 @@ struct netfs_io_request {
 #define NETFS_RREQ_UPLOAD_TO_SERVER 11 /* Need to write to the server */
 #define NETFS_RREQ_USE_IO_ITER 12 /* Use ->io_iter rather than ->i_pages */
 #define NETFS_RREQ_NEED_PUT_RA_REFS 13 /* Need to put the folio refs RA gave us */
+#ifdef CONFIG_NETFS_PGPRIV2
 #define NETFS_RREQ_USE_PGPRIV2 31 /* [DEPRECATED] Use PG_private_2 to mark
 				    * write to cache on read */
+#endif
 	const struct netfs_request_ops *netfs_ops;
 };
 
@@ -299,8 +316,7 @@ struct netfs_request_ops {
 
 	/* Read request handling */
 	void (*expand_readahead)(struct netfs_io_request *rreq);
-	int (*prepare_read)(struct netfs_io_subrequest *subreq);
-	void (*issue_read)(struct netfs_io_subrequest *subreq);
+	int (*issue_read)(struct netfs_io_subrequest *subreq);
 	bool (*is_still_valid)(struct netfs_io_request *rreq);
 	int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
 				 struct folio **foliop, void **_fsdata);
@@ -312,8 +328,10 @@ struct netfs_request_ops {
 
 	/* Write request handling */
 	void (*begin_writeback)(struct netfs_io_request *wreq);
-	void (*prepare_write)(struct netfs_io_subrequest *subreq);
-	void (*issue_write)(struct netfs_io_subrequest *subreq);
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
+	int (*issue_write)(struct netfs_io_subrequest *subreq);
 	void (*retry_request)(struct netfs_io_request *wreq, struct netfs_io_stream *stream);
 	void (*invalidate_cache)(struct netfs_io_request *wreq);
 };
@@ -348,8 +366,16 @@ struct netfs_cache_ops {
 			   netfs_io_terminated_t term_func,
 			   void *term_func_priv);
 
+	/* Estimate the amount of data that can be written in an op. */
+	int (*estimate_write)(struct netfs_io_request *wreq,
+			      struct netfs_io_stream *stream,
+			      struct netfs_write_estimate *estimate);
+
+	/* Read data from the cache for a netfs subrequest. */
+	int (*issue_read)(struct netfs_io_subrequest *subreq);
+
 	/* Write data to the cache from a netfs subrequest. */
-	void (*issue_write)(struct netfs_io_subrequest *subreq);
+	int (*issue_write)(struct netfs_io_subrequest *subreq);
 
 	/* Expand readahead request */
 	void (*expand_readahead)(struct netfs_cache_resources *cres,
@@ -357,25 +383,6 @@ struct netfs_cache_ops {
 				 unsigned long long *_len,
 				 unsigned long long i_size);
 
-	/* Prepare a read operation, shortening it to a cached/uncached
-	 * boundary as appropriate.
-	 */
-	int (*prepare_read)(struct netfs_io_subrequest *subreq);
-
-	/* Prepare a write subrequest, working out if we're allowed to do it
-	 * and finding out the maximum amount of data to gather before
-	 * attempting to submit. If we're not permitted to do it, the
-	 * subrequest should be marked failed.
-	 */
-	void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
-
-	/* Prepare a write operation, working out what part of the write we can
-	 * actually do.
-	 */
-	int (*prepare_write)(struct netfs_cache_resources *cres,
-			     loff_t *_start, size_t *_len, size_t upper_len,
-			     loff_t i_size, bool no_space_allocated_yet);
-
 	/* Prepare an on-demand read operation, shortening it to a cached/uncached
 	 * boundary as appropriate.
	 */
@@ -418,10 +425,9 @@ void netfs_single_mark_inode_dirty(struct inode *inode);
 ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
 int netfs_writeback_single(struct address_space *mapping,
			   struct writeback_control *wbc,
-			   struct iov_iter *iter);
+			   struct iov_iter *iter, size_t len);
 
 /* Address operations API */
-struct readahead_control;
 void netfs_readahead(struct readahead_control *);
 int netfs_read_folio(struct file *, struct folio *);
 int netfs_write_begin(struct netfs_inode *, struct file *,
@@ -439,6 +445,7 @@ bool netfs_release_folio(struct folio *folio, gfp_t gfp);
 vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
 
 /* (Sub)request management API. */
+void netfs_mark_read_submission(struct netfs_io_subrequest *subreq);
 void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);
 void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);
 void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
@@ -448,9 +455,8 @@ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 ssize_t netfs_extract_iter(struct iov_iter *orig, size_t orig_len, size_t max_segs,
			   unsigned long long fpos, struct bvecq **_bvecq_head,
			   iov_iter_extraction_t extraction_flags);
-size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset,
-			size_t max_size, size_t max_segs);
-void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);
+int netfs_prepare_read_buffer(struct netfs_io_subrequest *subreq, unsigned int max_segs);
+int netfs_prepare_write_buffer(struct netfs_io_subrequest *subreq, unsigned int max_segs);
 void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);
 
 int netfs_start_io_read(struct inode *inode);
diff --git a/include/trace/events/cachefiles.h b/include/trace/events/cachefiles.h
index 4bba6fda1f8b..c080167451ab 100644
--- a/include/trace/events/cachefiles.h
+++ b/include/trace/events/cachefiles.h
@@ -70,6 +70,7 @@ enum cachefiles_coherency_trace {
 enum cachefiles_trunc_trace {
	cachefiles_trunc_clear_padding,
	cachefiles_trunc_dio_adjust,
+	cachefiles_trunc_discard_tail,
	cachefiles_trunc_expand_tmpfile,
	cachefiles_trunc_shrink,
 };
@@ -160,6 +161,7 @@ enum cachefiles_error_trace {
 #define cachefiles_trunc_traces \
	EM(cachefiles_trunc_clear_padding, "CLRPAD") \
	EM(cachefiles_trunc_dio_adjust, "DIOADJ") \
+	EM(cachefiles_trunc_discard_tail, "DSCDTL") \
	EM(cachefiles_trunc_expand_tmpfile, "EXPTMP") \
	E_(cachefiles_trunc_shrink, "SHRINK")
 
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index eeb8386e0709..ba38cc102bd7 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -49,6 +49,7 @@
	E_(NETFS_PGPRIV2_COPY_TO_CACHE, "2C")
 
 #define netfs_rreq_traces \
+	EM(netfs_rreq_trace_all_queued, "ALL-Q ") \
	EM(netfs_rreq_trace_assess, "ASSESS ") \
	EM(netfs_rreq_trace_collect, "COLLECT") \
	EM(netfs_rreq_trace_complete, "COMPLET") \
@@ -77,7 +78,8 @@
	EM(netfs_rreq_trace_waited_quiesce, "DONE-QUIESCE") \
	EM(netfs_rreq_trace_wake_ip, "WAKE-IP") \
	EM(netfs_rreq_trace_wake_queue, "WAKE-Q ") \
-	E_(netfs_rreq_trace_write_done, "WR-DONE")
+	EM(netfs_rreq_trace_write_done, "WR-DONE") \
+	E_(netfs_rreq_trace_zero_unread, "ZERO-UR")
 
 #define netfs_sreq_sources \
	EM(NETFS_SOURCE_UNKNOWN, "----") \
@@ -126,6 +128,7 @@
	EM(netfs_sreq_trace_superfluous, "SPRFL") \
	EM(netfs_sreq_trace_terminated, "TERM ") \
	EM(netfs_sreq_trace_too_much, "!TOOM") \
+	EM(netfs_sreq_trace_too_many_retries, "!RETR") \
	EM(netfs_sreq_trace_wait_for, "_WAIT") \
	EM(netfs_sreq_trace_write, "WRITE") \
	EM(netfs_sreq_trace_write_skip, "SKIP ") \
@@ -189,12 +192,12 @@
	EM(netfs_folio_trace_alloc_buffer, "alloc-buf") \
	EM(netfs_folio_trace_cancel_copy, "cancel-copy") \
	EM(netfs_folio_trace_cancel_store, "cancel-store") \
-	EM(netfs_folio_trace_clear, "clear") \
-	EM(netfs_folio_trace_clear_cc, "clear-cc") \
-	EM(netfs_folio_trace_clear_g, "clear-g") \
-	EM(netfs_folio_trace_clear_s, "clear-s") \
	EM(netfs_folio_trace_copy_to_cache, "mark-copy") \
	EM(netfs_folio_trace_end_copy, "end-copy") \
+	EM(netfs_folio_trace_endwb, "endwb") \
+	EM(netfs_folio_trace_endwb_cc, "endwb-cc") \
+	EM(netfs_folio_trace_endwb_g, "endwb-g") \
+	EM(netfs_folio_trace_endwb_s, "endwb-s") \
	EM(netfs_folio_trace_filled_gaps, "filled-gaps") \
	EM(netfs_folio_trace_kill, "kill") \
	EM(netfs_folio_trace_kill_cc, "kill-cc") \
@@ -381,10 +384,10 @@ TRACE_EVENT(netfs_sreq,
		    __entry->len = sreq->len;
		    __entry->transferred = sreq->transferred;
		    __entry->start = sreq->start;
-		    __entry->slot = sreq->dispatch_pos.slot;
+		    __entry->slot = sreq->content.slot;
		    ),
 
-	    TP_printk("R=%08x[%x] %s %s f=%03x s=%llx %zx/%zx qs=%u e=%d",
+	    TP_printk("R=%08x[%x] %s %s f=%03x s=%llx %zx/%zx bv=%u e=%d",
		      __entry->rreq, __entry->index,
		      __print_symbolic(__entry->source, netfs_sreq_sources),
		      __print_symbolic(__entry->what, netfs_sreq_traces),
@@ -492,6 +495,7 @@ TRACE_EVENT(netfs_folio,
	    TP_STRUCT__entry(
		    __field(ino_t, ino)
		    __field(pgoff_t, index)
+		    __field(unsigned long, pfn)
		    __field(unsigned int, nr)
		    __field(enum netfs_folio_trace, why)
		    ),
@@ -502,13 +506,40 @@ TRACE_EVENT(netfs_folio,
		    __entry->why = why;
		    __entry->index = folio->index;
		    __entry->nr = folio_nr_pages(folio);
+		    __entry->pfn = folio_pfn(folio);
		    ),
 
-	    TP_printk("i=%05lx ix=%05lx-%05lx %s",
+	    TP_printk("p=%lx i=%05lx ix=%05lx-%05lx %s",
+		      __entry->pfn,
		      __entry->ino,
		      __entry->index, __entry->index + __entry->nr - 1,
		      __print_symbolic(__entry->why, netfs_folio_traces))
	    );
 
+TRACE_EVENT(netfs_wback,
+	    TP_PROTO(struct netfs_io_request *wreq, struct folio *folio, unsigned int notes),
+
+	    TP_ARGS(wreq, folio, notes),
+
+	    TP_STRUCT__entry(
+		    __field(pgoff_t, index)
+		    __field(unsigned int, wreq)
+		    __field(unsigned int, nr)
+		    __field(unsigned int, notes)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->wreq = wreq->debug_id;
+		    __entry->notes = notes;
+		    __entry->index = folio->index;
+		    __entry->nr = folio_nr_pages(folio);
+		    ),
+
+	    TP_printk("R=%08x ix=%05lx-%05lx n=%02x",
+		      __entry->wreq,
+		      __entry->index, __entry->index + __entry->nr - 1,
+		      __entry->notes)
+	    );
+
 TRACE_EVENT(netfs_write_iter,
	    TP_PROTO(const struct kiocb *iocb, const struct iov_iter *from),
 
@@ -751,7 +782,7 @@ TRACE_EVENT(netfs_collect_stream,
		    __entry->wreq = wreq->debug_id;
		    __entry->stream = stream->stream_nr;
		    __entry->collected_to = stream->collected_to;
-		    __entry->issued_to = atomic64_read(&wreq->issued_to);
+		    __entry->issued_to = atomic64_read(&stream->issued_to);
		    ),
 
	    TP_printk("R=%08x[%x:] cto=%llx ito=%llx",
@@ -775,7 +806,7 @@ TRACE_EVENT(netfs_bvecq,
		    __entry->trace = trace;
		    ),
 
-	    TP_printk("fq=%x %s",
+	    TP_printk("bq=%x %s",
		      __entry->id,
		      __print_symbolic(__entry->trace, netfs_bvecq_traces))
	    );
diff --git a/net/9p/client.c b/net/9p/client.c
index f0dcf252af7e..8d365c000553 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1561,6 +1561,7 @@ void
 p9_client_write_subreq(struct netfs_io_subrequest *subreq)
 {
	struct netfs_io_request *wreq = subreq->rreq;
+	struct iov_iter iter;
	struct p9_fid *fid = wreq->netfs_priv;
	struct p9_client *clnt = fid->clnt;
	struct p9_req_t *req;
@@ -1571,14 +1572,17 @@ p9_client_write_subreq(struct netfs_io_subrequest *subreq)
	p9_debug(P9_DEBUG_9P, ">>> TWRITE fid %d offset %llu len %d\n",
		 fid->fid, start, len);
 
+	iov_iter_bvec_queue(&iter, ITER_SOURCE, subreq->content.bvecq,
+			    subreq->content.slot, subreq->content.offset, subreq->len);
+
	/* Don't bother zerocopy for small IO (< 1024) */
	if (clnt->trans_mod->zc_request && len > 1024) {
-		req = p9_client_zc_rpc(clnt, P9_TWRITE, NULL, &subreq->io_iter,
+		req = p9_client_zc_rpc(clnt, P9_TWRITE, NULL, &iter,
				       0, wreq->len, P9_ZC_HDR_SZ, "dqd",
				       fid->fid, start, len);
	} else {
		req = p9_client_rpc(clnt, P9_TWRITE, "dqV", fid->fid,
-				    start, len, &subreq->io_iter);
+				    start, len, &iter);
	}
	if (IS_ERR(req)) {
netfs_write_subrequest_terminated(subreq, PTR_ERR(req));