From: David Howells
To: Christian Brauner, Steve French, Matthew Wilcox
Cc: David Howells, Jeff Layton, Gao Xiang, Dominique Martinet, Marc Dionne,
    Paulo Alcantara, Shyam Prasad N, Tom Talpey, Eric Van Hensbergen,
    Ilya Dryomov, netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org, ceph-devel@vger.kernel.org,
    v9fs@lists.linux.dev, linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH 18/24] netfs: Speed up buffered reading
Date: Mon, 29 Jul 2024 17:19:47 +0100
Message-ID: <20240729162002.3436763-19-dhowells@redhat.com>
In-Reply-To: <20240729162002.3436763-1-dhowells@redhat.com>
References: <20240729162002.3436763-1-dhowells@redhat.com>

Improve the efficiency of buffered reads in a number of ways:

 (1) Overhaul the algorithm in general so that it's a lot more compact and
     split the read submission code between buffered and unbuffered
     versions.  The unbuffered version can be vastly simplified.

 (2) Read-result collection is handed off to a work queue rather than being
     done in the I/O thread.  Multiple subrequests can be processed
     simultaneously.

 (3) When a subrequest is collected, any folios it fully spans are
     collected and "spare" data on either side is donated to either the
     previous or the next subrequest in the sequence.

Notes:

 (*) Readahead expansion massively slows down fio, presumably because it
     causes a load of extra allocations, both folio and xarray, up front
     before RPC requests can be transmitted.

 (*) RDMA with cifs does appear to work, both with SIW and RXE.

Signed-off-by: David Howells
cc: Jeff Layton
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/9p/vfs_addr.c             |   5 +-
 fs/afs/file.c                |  20 +-
 fs/afs/fsclient.c            |   9 +-
 fs/afs/yfsclient.c           |   9 +-
 fs/ceph/addr.c               |  72 ++--
 fs/netfs/Makefile            |   3 +-
 fs/netfs/buffered_read.c     | 677 ++++++++++++++++++++++++-----------
 fs/netfs/direct_read.c       | 147 +++++++-
 fs/netfs/internal.h          |  19 +-
 fs/netfs/iterator.c          |  50 +++
 fs/netfs/main.c              |   3 +-
 fs/netfs/objects.c           |   8 +-
 fs/netfs/read_collect.c      | 540 ++++++++++++++++++++++++++++
 fs/netfs/read_retry.c        | 256 +++++++++++++
 fs/netfs/write_issue.c       |   4 -
 fs/nfs/fscache.c             |  19 +-
 fs/nfs/fscache.h             |   7 +-
 fs/smb/client/cifssmb.c      |   6 +-
 fs/smb/client/file.c         |  57 +--
 fs/smb/client/smb2pdu.c      |  10 +-
 include/linux/netfs.h        |  24 +-
 include/trace/events/netfs.h |  99 ++++-
 22 files changed, 1707 insertions(+), 337 deletions(-)
 create mode 100644 fs/netfs/read_collect.c
 create mode 100644 fs/netfs/read_retry.c

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index a97ceb105cd8..ee2f24647303 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -77,7 +77,10 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq) * cache won't be on server and is zeroes */ __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - netfs_subreq_terminated(subreq, err ?: total, false); + if (!err) + subreq->transferred += total; + + netfs_read_subreq_terminated(subreq, err, false); } /** diff --git a/fs/afs/file.c b/fs/afs/file.c index addb106dba4c..492d857a3fa0 100644 --- a/fs/afs/file.c +++ b/fs/afs/file.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "internal.h" static int afs_file_mmap(struct file *file, struct vm_area_struct *vma); @@ -242,8 +243,10 @@ static void afs_fetch_data_notify(struct afs_operation *op) req->error = error; if (subreq) { - __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); - netfs_subreq_terminated(subreq, error ?: req->actual_len,
false); + subreq->rreq->i_size =3D req->file_size; + if (req->pos + req->actual_len >=3D req->file_size) + __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags); + netfs_read_subreq_terminated(subreq, error, false); req->subreq =3D NULL; } else if (req->done) { req->done(req); @@ -261,6 +264,12 @@ static void afs_fetch_data_success(struct afs_operatio= n *op) afs_fetch_data_notify(op); } =20 +static void afs_fetch_data_aborted(struct afs_operation *op) +{ + afs_check_for_remote_deletion(op); + afs_fetch_data_notify(op); +} + static void afs_fetch_data_put(struct afs_operation *op) { op->fetch.req->error =3D afs_op_error(op); @@ -271,7 +280,7 @@ static const struct afs_operation_ops afs_fetch_data_op= eration =3D { .issue_afs_rpc =3D afs_fs_fetch_data, .issue_yfs_rpc =3D yfs_fs_fetch_data, .success =3D afs_fetch_data_success, - .aborted =3D afs_check_for_remote_deletion, + .aborted =3D afs_fetch_data_aborted, .failed =3D afs_fetch_data_notify, .put =3D afs_fetch_data_put, }; @@ -293,7 +302,7 @@ int afs_fetch_data(struct afs_vnode *vnode, struct afs_= read *req) op =3D afs_alloc_operation(req->key, vnode->volume); if (IS_ERR(op)) { if (req->subreq) - netfs_subreq_terminated(req->subreq, PTR_ERR(op), false); + netfs_read_subreq_terminated(req->subreq, PTR_ERR(op), false); return PTR_ERR(op); } =20 @@ -312,7 +321,7 @@ static void afs_read_worker(struct work_struct *work) =20 fsreq =3D afs_alloc_read(GFP_NOFS); if (!fsreq) - return netfs_subreq_terminated(subreq, -ENOMEM, false); + return netfs_read_subreq_terminated(subreq, -ENOMEM, false); =20 fsreq->subreq =3D subreq; fsreq->pos =3D subreq->start + subreq->transferred; @@ -321,6 +330,7 @@ static void afs_read_worker(struct work_struct *work) fsreq->vnode =3D vnode; fsreq->iter =3D &subreq->io_iter; =20 + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); afs_fetch_data(fsreq->vnode, fsreq); afs_put_read(fsreq); } diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c index 79cd30775b7a..098fa034a1cc 100644 --- a/fs/afs/fsclient.c +++ b/fs/afs/fsclient.c @@ -304,6 +304,7 @@ static int afs_deliver_fs_fetch_data(struct afs_call *c= all) struct afs_vnode_param *vp =3D &op->file[0]; struct afs_read *req =3D op->fetch.req; const __be32 *bp; + size_t count_before; int ret; =20 _enter("{%u,%zu,%zu/%llu}", @@ -345,10 +346,14 @@ static int afs_deliver_fs_fetch_data(struct afs_call = *call) =20 /* extract the returned data */ case 2: - _debug("extract data %zu/%llu", - iov_iter_count(call->iter), req->actual_len); + count_before =3D call->iov_len; + _debug("extract data %zu/%llu", count_before, req->actual_len); =20 ret =3D afs_extract_data(call, true); + if (req->subreq) { + req->subreq->transferred +=3D count_before - call->iov_len; + netfs_read_subreq_progress(req->subreq, false); + } if (ret < 0) return ret; =20 diff --git a/fs/afs/yfsclient.c b/fs/afs/yfsclient.c index f521e66d3bf6..024227aba4cd 100644 --- a/fs/afs/yfsclient.c +++ b/fs/afs/yfsclient.c @@ -355,6 +355,7 @@ static int yfs_deliver_fs_fetch_data64(struct afs_call = *call) struct afs_vnode_param *vp =3D &op->file[0]; struct afs_read *req =3D op->fetch.req; const __be32 *bp; + size_t count_before; int ret; =20 _enter("{%u,%zu, %zu/%llu}", @@ -391,10 +392,14 @@ static int yfs_deliver_fs_fetch_data64(struct afs_cal= l *call) =20 /* extract the returned data */ case 2: - _debug("extract data %zu/%llu", - iov_iter_count(call->iter), req->actual_len); + count_before =3D call->iov_len; + _debug("extract data %zu/%llu", count_before, req->actual_len); =20 ret =3D afs_extract_data(call, true); + if 
(req->subreq) { + req->subreq->transferred +=3D count_before - call->iov_len; + netfs_read_subreq_progress(req->subreq, false); + } if (ret < 0) return ret; =20 diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 8c16bc5250ef..997843269304 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -205,21 +205,6 @@ static void ceph_netfs_expand_readahead(struct netfs_i= o_request *rreq) } } =20 -static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq) -{ - struct inode *inode =3D subreq->rreq->inode; - struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode); - struct ceph_inode_info *ci =3D ceph_inode(inode); - u64 objno, objoff; - u32 xlen; - - /* Truncate the extent at the end of the current block */ - ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, - &objno, &objoff, &xlen); - subreq->len =3D min(xlen, fsc->mount_options->rsize); - return true; -} - static void finish_netfs_read(struct ceph_osd_request *req) { struct inode *inode =3D req->r_inode; @@ -263,7 +248,11 @@ static void finish_netfs_read(struct ceph_osd_request = *req) calc_pages_for(osd_data->alignment, osd_data->length), false); } - netfs_subreq_terminated(subreq, err, false); + if (err > 0) { + subreq->transferred =3D err; + err =3D 0; + } + netfs_read_subreq_terminated(subreq, err, false); iput(req->r_inode); ceph_dec_osd_stopping_blocker(fsc->mdsc); } @@ -277,7 +266,6 @@ static bool ceph_netfs_issue_op_inline(struct netfs_io_= subrequest *subreq) struct ceph_mds_request *req; struct ceph_mds_client *mdsc =3D ceph_sb_to_mdsc(inode->i_sb); struct ceph_inode_info *ci =3D ceph_inode(inode); - struct iov_iter iter; ssize_t err =3D 0; size_t len; int mode; @@ -312,17 +300,36 @@ static bool ceph_netfs_issue_op_inline(struct netfs_i= o_subrequest *subreq) } =20 len =3D min_t(size_t, iinfo->inline_len - subreq->start, subreq->len); - iov_iter_xarray(&iter, ITER_DEST, &rreq->mapping->i_pages, subreq->start,= len); - err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &iter); - if (err =3D=3D 0) + err =3D copy_to_iter(iinfo->inline_data + subreq->start, len, &subreq->io= _iter); + if (err =3D=3D 0) { err =3D -EFAULT; + } else { + subreq->transferred +=3D err; + err =3D 0; + } =20 ceph_mdsc_put_request(req); out: - netfs_subreq_terminated(subreq, err, false); + netfs_read_subreq_terminated(subreq, err, false); return true; } =20 +static int ceph_netfs_prepare_read(struct netfs_io_subrequest *subreq) +{ + struct netfs_io_request *rreq =3D subreq->rreq; + struct inode *inode =3D rreq->inode; + struct ceph_inode_info *ci =3D ceph_inode(inode); + struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode); + u64 objno, objoff; + u32 xlen; + + /* Truncate the extent at the end of the current block */ + ceph_calc_file_object_mapping(&ci->i_layout, subreq->start, subreq->len, + &objno, &objoff, &xlen); + rreq->io_streams[0].sreq_max_len =3D umin(xlen, fsc->mount_options->rsize= ); + return 0; +} + static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; @@ -332,9 +339,8 @@ static void ceph_netfs_issue_read(struct netfs_io_subre= quest *subreq) struct ceph_client *cl =3D fsc->client; struct ceph_osd_request *req =3D NULL; struct ceph_vino vino =3D ceph_vino(inode); - struct iov_iter iter; - int err =3D 0; - u64 len =3D subreq->len; + int err; + u64 len; bool sparse =3D IS_ENCRYPTED(inode) || ceph_test_mount_opt(fsc, SPARSEREA= D); u64 off =3D subreq->start; int extent_cnt; @@ -347,6 +353,12 @@ static void 
ceph_netfs_issue_read(struct netfs_io_subr= equest *subreq) if (ceph_has_inline_data(ci) && ceph_netfs_issue_op_inline(subreq)) return; =20 + // TODO: This rounding here is slightly dodgy. It *should* work, for + // now, as the cache only deals in blocks that are a multiple of + // PAGE_SIZE and fscrypt blocks are at most PAGE_SIZE. What needs to + // happen is for the fscrypt driving to be moved into netfslib and the + // data in the cache also to be stored encrypted. + len =3D subreq->len; ceph_fscrypt_adjust_off_and_len(inode, &off, &len); =20 req =3D ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, vino, @@ -369,8 +381,6 @@ static void ceph_netfs_issue_read(struct netfs_io_subre= quest *subreq) doutc(cl, "%llx.%llx pos=3D%llu orig_len=3D%zu len=3D%llu\n", ceph_vinop(inode), subreq->start, subreq->len, len); =20 - iov_iter_xarray(&iter, ITER_DEST, &rreq->mapping->i_pages, subreq->start,= len); - /* * FIXME: For now, use CEPH_OSD_DATA_TYPE_PAGES instead of _ITER for * encrypted inodes. We'd need infrastructure that handles an iov_iter @@ -382,7 +392,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subre= quest *subreq) struct page **pages; size_t page_off; =20 - err =3D iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off); + err =3D iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_o= ff); if (err < 0) { doutc(cl, "%llx.%llx failed to allocate pages, %d\n", ceph_vinop(inode), err); @@ -397,7 +407,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subre= quest *subreq) osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false, false); } else { - osd_req_op_extent_osd_iter(req, 0, &iter); + osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter); } if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) { err =3D -EIO; @@ -412,13 +422,14 @@ static void ceph_netfs_issue_read(struct netfs_io_sub= request *subreq) out: ceph_osdc_put_request(req); if (err) - netfs_subreq_terminated(subreq, err, false); + netfs_read_subreq_terminated(subreq, err, false); doutc(cl, "%llx.%llx result %d\n", ceph_vinop(inode), err); } =20 static int ceph_init_request(struct netfs_io_request *rreq, struct file *f= ile) { struct inode *inode =3D rreq->inode; + struct ceph_fs_client *fsc =3D ceph_inode_to_fs_client(inode); struct ceph_client *cl =3D ceph_inode_to_client(inode); int got =3D 0, want =3D CEPH_CAP_FILE_CACHE; struct ceph_netfs_request_data *priv; @@ -467,6 +478,7 @@ static int ceph_init_request(struct netfs_io_request *r= req, struct file *file) =20 priv->caps =3D got; rreq->netfs_priv =3D priv; + rreq->io_streams[0].sreq_max_len =3D fsc->mount_options->rsize; =20 out: if (ret < 0) @@ -491,9 +503,9 @@ static void ceph_netfs_free_request(struct netfs_io_req= uest *rreq) const struct netfs_request_ops ceph_netfs_ops =3D { .init_request =3D ceph_init_request, .free_request =3D ceph_netfs_free_request, + .prepare_read =3D ceph_netfs_prepare_read, .issue_read =3D ceph_netfs_issue_read, .expand_readahead =3D ceph_netfs_expand_readahead, - .clamp_length =3D ceph_netfs_clamp_length, .check_write_begin =3D ceph_netfs_check_write_begin, }; =20 diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile index 8e6781e0b10b..cadeb28b8e47 100644 --- a/fs/netfs/Makefile +++ b/fs/netfs/Makefile @@ -5,12 +5,13 @@ netfs-y :=3D \ buffered_write.o \ direct_read.o \ direct_write.o \ - io.o \ iterator.o \ locking.o \ main.o \ misc.o \ objects.o \ + read_collect.o \ + read_retry.o \ write_collect.o \ write_issue.o =20 diff --git a/fs/netfs/buffered_read.c b/fs/netfs/buffered_read.c index 
a688d4c75d99..97d33614553e 100644 --- a/fs/netfs/buffered_read.c +++ b/fs/netfs/buffered_read.c @@ -9,126 +9,6 @@ #include #include "internal.h" =20 -/* - * Unlock the folios in a read operation. We need to set PG_writeback on = any - * folios we're going to write back before we unlock them. - * - * Note that if the deprecated NETFS_RREQ_USE_PGPRIV2 is set then we use - * PG_private_2 and do a direct write to the cache from here instead. - */ -void netfs_rreq_unlock_folios(struct netfs_io_request *rreq) -{ - struct netfs_io_subrequest *subreq; - struct netfs_folio *finfo; - struct folio *folio; - pgoff_t start_page =3D rreq->start / PAGE_SIZE; - pgoff_t last_page =3D ((rreq->start + rreq->len) / PAGE_SIZE) - 1; - size_t account =3D 0; - bool subreq_failed =3D false; - - XA_STATE(xas, &rreq->mapping->i_pages, start_page); - - if (test_bit(NETFS_RREQ_FAILED, &rreq->flags)) { - __clear_bit(NETFS_RREQ_COPY_TO_CACHE, &rreq->flags); - list_for_each_entry(subreq, &rreq->subrequests, rreq_link) { - __clear_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); - } - } - - /* Walk through the pagecache and the I/O request lists simultaneously. - * We may have a mixture of cached and uncached sections and we only - * really want to write out the uncached sections. This is slightly - * complicated by the possibility that we might have huge pages with a - * mixture inside. - */ - subreq =3D list_first_entry(&rreq->subrequests, - struct netfs_io_subrequest, rreq_link); - subreq_failed =3D (subreq->error < 0); - - trace_netfs_rreq(rreq, netfs_rreq_trace_unlock); - - rcu_read_lock(); - xas_for_each(&xas, folio, last_page) { - loff_t pg_end; - bool pg_failed =3D false; - bool wback_to_cache =3D false; - bool folio_started =3D false; - - if (xas_retry(&xas, folio)) - continue; - - pg_end =3D folio_pos(folio) + folio_size(folio) - 1; - - for (;;) { - loff_t sreq_end; - - if (!subreq) { - pg_failed =3D true; - break; - } - if (test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags)) { - if (!folio_started && test_bit(NETFS_SREQ_COPY_TO_CACHE, - &subreq->flags)) { - trace_netfs_folio(folio, netfs_folio_trace_copy_to_cache); - folio_start_private_2(folio); - folio_started =3D true; - } - } else { - wback_to_cache |=3D - test_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags); - } - pg_failed |=3D subreq_failed; - sreq_end =3D subreq->start + subreq->len - 1; - if (pg_end < sreq_end) - break; - - account +=3D subreq->transferred; - if (!list_is_last(&subreq->rreq_link, &rreq->subrequests)) { - subreq =3D list_next_entry(subreq, rreq_link); - subreq_failed =3D (subreq->error < 0); - } else { - subreq =3D NULL; - subreq_failed =3D false; - } - - if (pg_end =3D=3D sreq_end) - break; - } - - if (!pg_failed) { - flush_dcache_folio(folio); - finfo =3D netfs_folio_info(folio); - if (finfo) { - trace_netfs_folio(folio, netfs_folio_trace_filled_gaps); - if (finfo->netfs_group) - folio_change_private(folio, finfo->netfs_group); - else - folio_detach_private(folio); - kfree(finfo); - } - folio_mark_uptodate(folio); - if (wback_to_cache && !WARN_ON_ONCE(folio_get_private(folio) !=3D NULL)= ) { - trace_netfs_folio(folio, netfs_folio_trace_copy_to_cache); - folio_attach_private(folio, NETFS_FOLIO_COPY_TO_CACHE); - filemap_dirty_folio(folio->mapping, folio); - } - } - - if (!test_bit(NETFS_RREQ_DONT_UNLOCK_FOLIOS, &rreq->flags)) { - if (folio->index =3D=3D rreq->no_unlock_folio && - test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO, &rreq->flags)) - _debug("no unlock"); - else - folio_unlock(folio); - } - } - rcu_read_unlock(); - - 
task_io_account_read(account); - if (rreq->netfs_ops->done) - rreq->netfs_ops->done(rreq); -} - static void netfs_cache_expand_readahead(struct netfs_io_request *rreq, unsigned long long *_start, unsigned long long *_len, @@ -183,6 +63,323 @@ static int netfs_begin_cache_read(struct netfs_io_requ= est *rreq, struct netfs_in return fscache_begin_read_operation(&rreq->cache_resources, netfs_i_cooki= e(ctx)); } =20 +/* + * Decant the list of folios to read into a rolling buffer. + */ +static size_t netfs_load_buffer_from_ra(struct netfs_io_request *rreq, + struct folio_queue *folioq) +{ + unsigned int order, nr; + size_t size =3D 0; + + nr =3D __readahead_batch(rreq->ractl, (struct page **)folioq->vec.folios, + ARRAY_SIZE(folioq->vec.folios)); + folioq->vec.nr =3D nr; + for (int i =3D 0; i < nr; i++) { + struct folio *folio =3D folioq_folio(folioq, i); + + trace_netfs_folio(folio, netfs_folio_trace_read); + order =3D folio_order(folio); + folioq->orders[i] =3D order; + size +=3D PAGE_SIZE << order; + } + + for (int i =3D nr; i < folioq_nr_slots(folioq); i++) + folioq_clear(folioq, i); + + return size; +} + +/* + * netfs_prepare_read_iterator - Prepare the subreq iterator for I/O + * @subreq: The subrequest to be set up + * + * Prepare the I/O iterator representing the read buffer on a subrequest f= or + * the filesystem to use for I/O (it can be passed directly to a socket). = This + * is intended to be called from the ->issue_read() method once the filesy= stem + * has trimmed the request to the size it wants. + * + * Returns the limited size if successful and -ENOMEM if insufficient memo= ry + * available. + * + * [!] NOTE: This must be run in the same thread as ->issue_read() was cal= led + * in as we access the readahead_control struct. + */ +static ssize_t netfs_prepare_read_iterator(struct netfs_io_subrequest *sub= req) +{ + struct netfs_io_request *rreq =3D subreq->rreq; + size_t rsize; + + rsize =3D umin(subreq->len, rreq->io_streams[0].sreq_max_len); + + if (rreq->ractl) { + /* If we don't have sufficient folios in the rolling buffer, + * extract a folioq's worth from the readahead region at a time + * into the buffer. Note that this acquires a ref on each page + * that we will need to release later - but we don't want to do + * that until after we've started the I/O. 
+ */ + while (rreq->submitted < subreq->start + rsize) { + struct folio_queue *tail =3D rreq->buffer_tail, *new; + size_t added; + + new =3D kmalloc(sizeof(*new), GFP_NOFS); + if (!new) + return -ENOMEM; + netfs_stat(&netfs_n_folioq); + folioq_init(new); + new->prev =3D tail; + tail->next =3D new; + rreq->buffer_tail =3D new; + added =3D netfs_load_buffer_from_ra(rreq, new); + rreq->iter.count +=3D added; + rreq->submitted +=3D added; + } + } + + subreq->len =3D rsize; + if (unlikely(rreq->io_streams[0].sreq_max_segs)) { + size_t limit =3D netfs_limit_iter(&rreq->iter, 0, rsize, + rreq->io_streams[0].sreq_max_segs); + + if (limit < rsize) { + subreq->len =3D limit; + trace_netfs_sreq(subreq, netfs_sreq_trace_limited); + } + } + + subreq->io_iter =3D rreq->iter; + + if (iov_iter_is_folioq(&subreq->io_iter)) { + subreq->curr_folioq =3D (struct folio_queue *)subreq->io_iter.folioq; + subreq->curr_folioq_slot =3D subreq->io_iter.folioq_slot; + subreq->curr_folio_order =3D subreq->curr_folioq->orders[subreq->curr_fo= lioq_slot]; + } + + iov_iter_truncate(&subreq->io_iter, subreq->len); + iov_iter_advance(&rreq->iter, subreq->len); + return subreq->len; +} + +static enum netfs_io_source netfs_cache_prepare_read(struct netfs_io_reque= st *rreq, + struct netfs_io_subrequest *subreq, + loff_t i_size) +{ + struct netfs_cache_resources *cres =3D &rreq->cache_resources; + + if (!cres->ops) + return NETFS_DOWNLOAD_FROM_SERVER; + return cres->ops->prepare_read(subreq, i_size); +} + +static void netfs_cache_read_terminated(void *priv, ssize_t transferred_or= _error, + bool was_async) +{ + struct netfs_io_subrequest *subreq =3D priv; + + if (transferred_or_error < 0) { + netfs_read_subreq_terminated(subreq, transferred_or_error, was_async); + return; + } + + if (transferred_or_error > 0) + subreq->transferred +=3D transferred_or_error; + netfs_read_subreq_terminated(subreq, 0, was_async); +} + +/* + * Issue a read against the cache. + * - Eats the caller's ref on subreq. + */ +static void netfs_read_cache_to_pagecache(struct netfs_io_request *rreq, + struct netfs_io_subrequest *subreq) +{ + struct netfs_cache_resources *cres =3D &rreq->cache_resources; + + netfs_stat(&netfs_n_rh_read); + cres->ops->read(cres, subreq->start, &subreq->io_iter, NETFS_READ_HOLE_IG= NORE, + netfs_cache_read_terminated, subreq); +} + +/* + * Perform a read to the pagecache from a series of sources of different t= ypes, + * slicing up the region to be read according to available cache blocks and + * network rsize. 
+ */ +static void netfs_read_to_pagecache(struct netfs_io_request *rreq) +{ + struct netfs_inode *ictx =3D netfs_inode(rreq->inode); + unsigned long long start =3D rreq->start; + ssize_t size =3D rreq->len; + int ret =3D 0; + + atomic_inc(&rreq->nr_outstanding); + + do { + struct netfs_io_subrequest *subreq; + enum netfs_io_source source =3D NETFS_DOWNLOAD_FROM_SERVER; + ssize_t slice; + + subreq =3D netfs_alloc_subrequest(rreq); + if (!subreq) { + ret =3D -ENOMEM; + break; + } + + subreq->start =3D start; + subreq->len =3D size; + + atomic_inc(&rreq->nr_outstanding); + spin_lock_bh(&rreq->lock); + list_add_tail(&subreq->rreq_link, &rreq->subrequests); + subreq->prev_donated =3D rreq->prev_donated; + rreq->prev_donated =3D 0; + trace_netfs_sreq(subreq, netfs_sreq_trace_added); + spin_unlock_bh(&rreq->lock); + + source =3D netfs_cache_prepare_read(rreq, subreq, rreq->i_size); + subreq->source =3D source; + if (source =3D=3D NETFS_DOWNLOAD_FROM_SERVER) { + unsigned long long zp =3D umin(ictx->zero_point, rreq->i_size); + size_t len =3D subreq->len; + + if (subreq->start >=3D zp) { + subreq->source =3D source =3D NETFS_FILL_WITH_ZEROES; + goto fill_with_zeroes; + } + + if (len > zp - subreq->start) + len =3D zp - subreq->start; + if (len =3D=3D 0) { + pr_err("ZERO-LEN READ: R=3D%08x[%x] l=3D%zx/%zx s=3D%llx z=3D%llx i=3D= %llx", + rreq->debug_id, subreq->debug_index, + subreq->len, size, + subreq->start, ictx->zero_point, rreq->i_size); + break; + } + subreq->len =3D len; + + netfs_stat(&netfs_n_rh_download); + if (rreq->netfs_ops->prepare_read) { + ret =3D rreq->netfs_ops->prepare_read(subreq); + if (ret < 0) { + atomic_dec(&rreq->nr_outstanding); + netfs_put_subrequest(subreq, false, + netfs_sreq_trace_put_cancel); + break; + } + trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); + } + + slice =3D netfs_prepare_read_iterator(subreq); + if (slice < 0) { + atomic_dec(&rreq->nr_outstanding); + netfs_put_subrequest(subreq, false, netfs_sreq_trace_put_cancel); + ret =3D slice; + break; + } + + rreq->netfs_ops->issue_read(subreq); + goto done; + } + + fill_with_zeroes: + if (source =3D=3D NETFS_FILL_WITH_ZEROES) { + subreq->source =3D NETFS_FILL_WITH_ZEROES; + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + netfs_stat(&netfs_n_rh_zero); + slice =3D netfs_prepare_read_iterator(subreq); + __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); + netfs_read_subreq_terminated(subreq, 0, false); + goto done; + } + + if (source =3D=3D NETFS_READ_FROM_CACHE) { + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); + slice =3D netfs_prepare_read_iterator(subreq); + netfs_read_cache_to_pagecache(rreq, subreq); + goto done; + } + + if (source =3D=3D NETFS_INVALID_READ) + break; + + done: + size -=3D slice; + start +=3D slice; + cond_resched(); + } while (size > 0); + + if (atomic_dec_and_test(&rreq->nr_outstanding)) + netfs_rreq_terminated(rreq, false); + + /* Defer error return as we may need to wait for outstanding I/O. */ + cmpxchg(&rreq->error, 0, ret); +} + +/* + * Wait for the read operation to complete, successfully or otherwise. 
+ */ +static int netfs_wait_for_read(struct netfs_io_request *rreq) +{ + int ret; + + trace_netfs_rreq(rreq, netfs_rreq_trace_wait_ip); + wait_on_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS, TASK_UNINTERRUPTIBLE); + ret =3D rreq->error; + if (ret =3D=3D 0 && rreq->submitted < rreq->len) { + trace_netfs_failure(rreq, NULL, ret, netfs_fail_short_read); + ret =3D -EIO; + } + + return ret; +} + +/* + * Set up the initial folioq of buffer folios in the rolling buffer and se= t the + * iterator to refer to it. + */ +static int netfs_prime_buffer(struct netfs_io_request *rreq) +{ + struct folio_queue *folioq; + size_t added; + + folioq =3D kmalloc(sizeof(*folioq), GFP_KERNEL); + if (!folioq) + return -ENOMEM; + netfs_stat(&netfs_n_folioq); + folioq_init(folioq); + rreq->buffer =3D folioq; + rreq->buffer_tail =3D folioq; + rreq->submitted =3D rreq->start; + iov_iter_folio_queue(&rreq->iter, ITER_DEST, folioq, 0, 0, 0); + + added =3D netfs_load_buffer_from_ra(rreq, folioq); + rreq->iter.count +=3D added; + rreq->submitted +=3D added; + return 0; +} + +/* + * Drop the ref on each folio that we inherited from the VM readahead code= . We + * still have the folio locks to pin the page until we complete the I/O. + */ +static void netfs_put_ra_refs(struct folio_queue *folioq) +{ + while (folioq) { +#if 0 + for (unsigned int slot =3D 0; slot < folioq_count(folioq); slot++) { + if (!folioq_folio(folioq, slot)) + continue; + trace_netfs_folio(folioq_folio(folioq, slot), + netfs_folio_trace_read_put); + } +#endif + folio_batch_release(&folioq->vec); + folioq =3D folioq->next; + } +} + /** * netfs_readahead - Helper to manage a read request * @ractl: The description of the readahead request @@ -201,22 +398,17 @@ static int netfs_begin_cache_read(struct netfs_io_req= uest *rreq, struct netfs_in void netfs_readahead(struct readahead_control *ractl) { struct netfs_io_request *rreq; - struct netfs_inode *ctx =3D netfs_inode(ractl->mapping->host); + struct netfs_inode *ictx =3D netfs_inode(ractl->mapping->host); + unsigned long long start =3D readahead_pos(ractl); + size_t size =3D readahead_length(ractl); int ret; =20 - _enter("%lx,%x", readahead_index(ractl), readahead_count(ractl)); - - if (readahead_count(ractl) =3D=3D 0) - return; - - rreq =3D netfs_alloc_request(ractl->mapping, ractl->file, - readahead_pos(ractl), - readahead_length(ractl), + rreq =3D netfs_alloc_request(ractl->mapping, ractl->file, start, size, NETFS_READAHEAD); if (IS_ERR(rreq)) return; =20 - ret =3D netfs_begin_cache_read(rreq, ctx); + ret =3D netfs_begin_cache_read(rreq, ictx); if (ret =3D=3D -ENOMEM || ret =3D=3D -EINTR || ret =3D=3D -ERESTARTSYS) goto cleanup_free; =20 @@ -226,18 +418,15 @@ void netfs_readahead(struct readahead_control *ractl) =20 netfs_rreq_expand(rreq, ractl); =20 - /* Set up the output buffer */ - iov_iter_xarray(&rreq->iter, ITER_DEST, &ractl->mapping->i_pages, - rreq->start, rreq->len); + rreq->ractl =3D ractl; + if (netfs_prime_buffer(rreq) < 0) + goto cleanup_free; + netfs_read_to_pagecache(rreq); =20 - /* Drop the refs on the folios here rather than in the cache or - * filesystem. The locks will be dropped in netfs_rreq_unlock(). - */ - while (readahead_folio(ractl)) - ; + /* Release the folio refs whilst we're waiting for the I/O. 
*/ + netfs_put_ra_refs(rreq->buffer); =20 - netfs_begin_read(rreq, false); - netfs_put_request(rreq, false, netfs_rreq_trace_put_return); + netfs_put_request(rreq, true, netfs_rreq_trace_put_return); return; =20 cleanup_free: @@ -246,6 +435,117 @@ void netfs_readahead(struct readahead_control *ractl) } EXPORT_SYMBOL(netfs_readahead); =20 +/* + * Create a rolling buffer with a single occupying folio. + */ +static int netfs_create_singular_buffer(struct netfs_io_request *rreq, str= uct folio *folio) +{ + struct folio_queue *folioq; + + folioq =3D kmalloc(sizeof(*folioq), GFP_KERNEL); + if (!folioq) + return -ENOMEM; + + netfs_stat(&netfs_n_folioq); + folioq_init(folioq); + folioq_append(folioq, folio); + BUG_ON(folioq_folio(folioq, 0) !=3D folio); + BUG_ON(folioq_folio_order(folioq, 0) !=3D folio_order(folio)); + rreq->buffer =3D folioq; + rreq->buffer_tail =3D folioq; + rreq->submitted =3D rreq->start + rreq->len; + iov_iter_folio_queue(&rreq->iter, ITER_DEST, folioq, 0, 0, rreq->len); + rreq->ractl =3D (struct readahead_control *)1UL; + return 0; +} + +/* + * Read into gaps in a folio partially filled by a streaming write. + */ +static int netfs_read_gaps(struct file *file, struct folio *folio) +{ + struct netfs_io_request *rreq; + struct address_space *mapping =3D folio->mapping; + struct netfs_folio *finfo =3D netfs_folio_info(folio); + struct netfs_inode *ctx =3D netfs_inode(mapping->host); + struct folio *sink =3D NULL; + struct bio_vec *bvec; + unsigned int from =3D finfo->dirty_offset; + unsigned int to =3D from + finfo->dirty_len; + unsigned int off =3D 0, i =3D 0; + size_t flen =3D folio_size(folio); + size_t nr_bvec =3D flen / PAGE_SIZE + 2; + size_t part; + int ret; + + _enter("%lx", folio->index); + + rreq =3D netfs_alloc_request(mapping, file, folio_pos(folio), flen, NETFS= _READ_GAPS); + if (IS_ERR(rreq)) { + ret =3D PTR_ERR(rreq); + goto alloc_error; + } + + ret =3D netfs_begin_cache_read(rreq, ctx); + if (ret =3D=3D -ENOMEM || ret =3D=3D -EINTR || ret =3D=3D -ERESTARTSYS) + goto discard; + + netfs_stat(&netfs_n_rh_read_folio); + trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_read_gaps= ); + + /* Fiddle the buffer so that a gap at the beginning and/or a gap at the + * end get copied to, but the middle is discarded. + */ + ret =3D -ENOMEM; + bvec =3D kmalloc_array(nr_bvec, sizeof(*bvec), GFP_KERNEL); + if (!bvec) + goto discard; + + sink =3D folio_alloc(GFP_KERNEL, 0); + if (!sink) { + kfree(bvec); + goto discard; + } + + trace_netfs_folio(folio, netfs_folio_trace_read_gaps); + + rreq->direct_bv =3D bvec; + rreq->direct_bv_count =3D nr_bvec; + if (from > 0) { + bvec_set_folio(&bvec[i++], folio, from, 0); + off =3D from; + } + while (off < to) { + part =3D min_t(size_t, to - off, PAGE_SIZE); + bvec_set_folio(&bvec[i++], sink, part, 0); + off +=3D part; + } + if (to < flen) + bvec_set_folio(&bvec[i++], folio, flen - to, to); + iov_iter_bvec(&rreq->iter, ITER_DEST, bvec, i, rreq->len); + rreq->submitted =3D rreq->start + flen; + + netfs_read_to_pagecache(rreq); + + if (sink) + folio_put(sink); + + ret =3D netfs_wait_for_read(rreq); + if (ret =3D=3D 0) { + flush_dcache_folio(folio); + folio_mark_uptodate(folio); + } + folio_unlock(folio); + netfs_put_request(rreq, false, netfs_rreq_trace_put_return); + return ret < 0 ? 
ret : 0; + +discard: + netfs_put_request(rreq, false, netfs_rreq_trace_put_discard); +alloc_error: + folio_unlock(folio); + return ret; +} + /** * netfs_read_folio - Helper to manage a read_folio request * @file: The file to read from @@ -265,9 +565,13 @@ int netfs_read_folio(struct file *file, struct folio *= folio) struct address_space *mapping =3D folio->mapping; struct netfs_io_request *rreq; struct netfs_inode *ctx =3D netfs_inode(mapping->host); - struct folio *sink =3D NULL; int ret; =20 + if (folio_test_dirty(folio)) { + trace_netfs_folio(folio, netfs_folio_trace_read_gaps); + return netfs_read_gaps(file, folio); + } + _enter("%lx", folio->index); =20 rreq =3D netfs_alloc_request(mapping, file, @@ -286,54 +590,12 @@ int netfs_read_folio(struct file *file, struct folio = *folio) trace_netfs_read(rreq, rreq->start, rreq->len, netfs_read_trace_readpage); =20 /* Set up the output buffer */ - if (folio_test_dirty(folio)) { - /* Handle someone trying to read from an unflushed streaming - * write. We fiddle the buffer so that a gap at the beginning - * and/or a gap at the end get copied to, but the middle is - * discarded. - */ - struct netfs_folio *finfo =3D netfs_folio_info(folio); - struct bio_vec *bvec; - unsigned int from =3D finfo->dirty_offset; - unsigned int to =3D from + finfo->dirty_len; - unsigned int off =3D 0, i =3D 0; - size_t flen =3D folio_size(folio); - size_t nr_bvec =3D flen / PAGE_SIZE + 2; - size_t part; - - ret =3D -ENOMEM; - bvec =3D kmalloc_array(nr_bvec, sizeof(*bvec), GFP_KERNEL); - if (!bvec) - goto discard; - - sink =3D folio_alloc(GFP_KERNEL, 0); - if (!sink) - goto discard; - - trace_netfs_folio(folio, netfs_folio_trace_read_gaps); - - rreq->direct_bv =3D bvec; - rreq->direct_bv_count =3D nr_bvec; - if (from > 0) { - bvec_set_folio(&bvec[i++], folio, from, 0); - off =3D from; - } - while (off < to) { - part =3D min_t(size_t, to - off, PAGE_SIZE); - bvec_set_folio(&bvec[i++], sink, part, 0); - off +=3D part; - } - if (to < flen) - bvec_set_folio(&bvec[i++], folio, flen - to, to); - iov_iter_bvec(&rreq->iter, ITER_DEST, bvec, i, rreq->len); - } else { - iov_iter_xarray(&rreq->iter, ITER_DEST, &mapping->i_pages, - rreq->start, rreq->len); - } + ret =3D netfs_create_singular_buffer(rreq, folio); + if (ret < 0) + goto discard; =20 - ret =3D netfs_begin_read(rreq, true); - if (sink) - folio_put(sink); + netfs_read_to_pagecache(rreq); + ret =3D netfs_wait_for_read(rreq); netfs_put_request(rreq, false, netfs_rreq_trace_put_return); return ret < 0 ? ret : 0; =20 @@ -395,7 +657,7 @@ static bool netfs_skip_folio_read(struct folio *folio, = loff_t pos, size_t len, } =20 /** - * netfs_write_begin - Helper to prepare for writing + * netfs_write_begin - Helper to prepare for writing [DEPRECATED] * @ctx: The netfs context * @file: The file to read from * @mapping: The mapping to read from @@ -406,13 +668,10 @@ static bool netfs_skip_folio_read(struct folio *folio= , loff_t pos, size_t len, * * Pre-read data for a write-begin request by drawing data from the cache = if * possible, or the netfs if not. Space beyond the EOF is zero-filled. - * Multiple I/O requests from different sources will get munged together. = If - * necessary, the readahead window can be expanded in either direction to a - * more convenient alighment for RPC efficiency or to make storage in the = cache - * feasible. + * Multiple I/O requests from different sources will get munged together. * * The calling netfs must provide a table of operations, only one of which, - * issue_op, is mandatory. 
+ * issue_read, is mandatory. * * The check_write_begin() operation can be provided to check for and flush * conflicting writes once the folio is grabbed and locked. It is passed a @@ -437,8 +696,6 @@ int netfs_write_begin(struct netfs_inode *ctx, pgoff_t index =3D pos >> PAGE_SHIFT; int ret; =20 - DEFINE_READAHEAD(ractl, file, NULL, mapping, index); - retry: folio =3D __filemap_get_folio(mapping, index, FGP_WRITEBEGIN, mapping_gfp_mask(mapping)); @@ -486,22 +743,13 @@ int netfs_write_begin(struct netfs_inode *ctx, netfs_stat(&netfs_n_rh_write_begin); trace_netfs_read(rreq, pos, len, netfs_read_trace_write_begin); =20 - /* Expand the request to meet caching requirements and download - * preferences. - */ - ractl._nr_pages =3D folio_nr_pages(folio); - netfs_rreq_expand(rreq, &ractl); - /* Set up the output buffer */ - iov_iter_xarray(&rreq->iter, ITER_DEST, &mapping->i_pages, - rreq->start, rreq->len); - - /* We hold the folio locks, so we can drop the references */ - folio_get(folio); - while (readahead_folio(&ractl)) - ; + ret =3D netfs_create_singular_buffer(rreq, folio); + if (ret < 0) + goto error_put; =20 - ret =3D netfs_begin_read(rreq, true); + netfs_read_to_pagecache(rreq); + ret =3D netfs_wait_for_read(rreq); if (ret < 0) goto error; netfs_put_request(rreq, false, netfs_rreq_trace_put_return); @@ -557,10 +805,13 @@ int netfs_prefetch_for_write(struct file *file, struc= t folio *folio, trace_netfs_read(rreq, start, flen, netfs_read_trace_prefetch_for_write); =20 /* Set up the output buffer */ - iov_iter_xarray(&rreq->iter, ITER_DEST, &mapping->i_pages, - rreq->start, rreq->len); + ret =3D netfs_create_singular_buffer(rreq, folio); + if (ret < 0) + goto error_put; =20 - ret =3D netfs_begin_read(rreq, true); + folioq_mark2(rreq->buffer, 0); + netfs_read_to_pagecache(rreq); + ret =3D netfs_wait_for_read(rreq); netfs_put_request(rreq, false, netfs_rreq_trace_put_return); return ret; =20 diff --git a/fs/netfs/direct_read.c b/fs/netfs/direct_read.c index 10a1e4da6bda..b1a66a6e6bc2 100644 --- a/fs/netfs/direct_read.c +++ b/fs/netfs/direct_read.c @@ -16,6 +16,143 @@ #include #include "internal.h" =20 +static void netfs_prepare_dio_read_iterator(struct netfs_io_subrequest *su= breq) +{ + struct netfs_io_request *rreq =3D subreq->rreq; + size_t rsize; + + rsize =3D umin(subreq->len, rreq->io_streams[0].sreq_max_len); + subreq->len =3D rsize; + + if (unlikely(rreq->io_streams[0].sreq_max_segs)) { + size_t limit =3D netfs_limit_iter(&rreq->iter, 0, rsize, + rreq->io_streams[0].sreq_max_segs); + + if (limit < rsize) { + subreq->len =3D limit; + trace_netfs_sreq(subreq, netfs_sreq_trace_limited); + } + } + + trace_netfs_sreq(subreq, netfs_sreq_trace_prepare); + + subreq->io_iter =3D rreq->iter; + iov_iter_truncate(&subreq->io_iter, subreq->len); + iov_iter_advance(&rreq->iter, subreq->len); +} + +/* + * Perform a read to a buffer from the server, slicing up the region to be= read + * according to the network rsize. 
+ */ +static int netfs_dispatch_unbuffered_reads(struct netfs_io_request *rreq) +{ + unsigned long long start =3D rreq->start; + ssize_t size =3D rreq->len; + int ret =3D 0; + + atomic_set(&rreq->nr_outstanding, 1); + + do { + struct netfs_io_subrequest *subreq; + ssize_t slice; + + subreq =3D netfs_alloc_subrequest(rreq); + if (!subreq) { + ret =3D -ENOMEM; + break; + } + + subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; + subreq->start =3D start; + subreq->len =3D size; + + atomic_inc(&rreq->nr_outstanding); + spin_lock_bh(&rreq->lock); + list_add_tail(&subreq->rreq_link, &rreq->subrequests); + subreq->prev_donated =3D rreq->prev_donated; + rreq->prev_donated =3D 0; + trace_netfs_sreq(subreq, netfs_sreq_trace_added); + spin_unlock_bh(&rreq->lock); + + netfs_stat(&netfs_n_rh_download); + if (rreq->netfs_ops->prepare_read) { + ret =3D rreq->netfs_ops->prepare_read(subreq); + if (ret < 0) { + atomic_dec(&rreq->nr_outstanding); + netfs_put_subrequest(subreq, false, netfs_sreq_trace_put_cancel); + break; + } + } + + netfs_prepare_dio_read_iterator(subreq); + slice =3D subreq->len; + rreq->netfs_ops->issue_read(subreq); + + size -=3D slice; + start +=3D slice; + rreq->submitted +=3D slice; + + if (test_bit(NETFS_RREQ_BLOCKED, &rreq->flags) && + test_bit(NETFS_RREQ_NONBLOCK, &rreq->flags)) + break; + cond_resched(); + } while (size > 0); + + if (atomic_dec_and_test(&rreq->nr_outstanding)) + netfs_rreq_terminated(rreq, false); + return ret; +} + +/* + * Perform a read to an application buffer, bypassing the pagecache and the + * local disk cache. + */ +static int netfs_unbuffered_read(struct netfs_io_request *rreq, bool sync) +{ + int ret; + + _enter("R=3D%x %llx-%llx", + rreq->debug_id, rreq->start, rreq->start + rreq->len - 1); + + if (rreq->len =3D=3D 0) { + pr_err("Zero-sized read [R=3D%x]\n", rreq->debug_id); + return -EIO; + } + + // TODO: Use bounce buffer if requested + + inode_dio_begin(rreq->inode); + + ret =3D netfs_dispatch_unbuffered_reads(rreq); + + if (!rreq->submitted) { + netfs_put_request(rreq, false, netfs_rreq_trace_put_no_submit); + inode_dio_end(rreq->inode); + ret =3D 0; + goto out; + } + + if (sync) { + trace_netfs_rreq(rreq, netfs_rreq_trace_wait_ip); + wait_on_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS, + TASK_UNINTERRUPTIBLE); + + ret =3D rreq->error; + if (ret =3D=3D 0 && rreq->submitted < rreq->len && + rreq->origin !=3D NETFS_DIO_READ) { + trace_netfs_failure(rreq, NULL, ret, netfs_fail_short_read); + ret =3D -EIO; + } + } else { + ret =3D -EIOCBQUEUED; + } + +out: + _leave(" =3D %d", ret); + return ret; +} + /** * netfs_unbuffered_read_iter_locked - Perform an unbuffered or direct I/O= read * @iocb: The I/O control descriptor describing the read @@ -31,7 +168,7 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *= iocb, struct iov_iter *i struct netfs_io_request *rreq; ssize_t ret; size_t orig_count =3D iov_iter_count(iter); - bool async =3D !is_sync_kiocb(iocb); + bool sync =3D is_sync_kiocb(iocb); =20 _enter(""); =20 @@ -78,13 +215,13 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb= *iocb, struct iov_iter *i =20 // TODO: Set up bounce buffer if needed =20 - if (async) + if (!sync) rreq->iocb =3D iocb; =20 - ret =3D netfs_begin_read(rreq, is_sync_kiocb(iocb)); + ret =3D netfs_unbuffered_read(rreq, sync); if (ret < 0) goto out; /* May be -EIOCBQUEUED */ - if (!async) { + if (sync) { // TODO: Copy from bounce buffer iocb->ki_pos +=3D rreq->transferred; ret =3D rreq->transferred; @@ -94,8 +231,6 @@ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb 
*= iocb, struct iov_iter *i netfs_put_request(rreq, false, netfs_rreq_trace_put_return); if (ret > 0) orig_count -=3D ret; - if (ret !=3D -EIOCBQUEUED) - iov_iter_revert(iter, orig_count - iov_iter_count(iter)); return ret; } EXPORT_SYMBOL(netfs_unbuffered_read_iter_locked); diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h index 21a3c7d13585..a4ee0cebf014 100644 --- a/fs/netfs/internal.h +++ b/fs/netfs/internal.h @@ -23,16 +23,9 @@ /* * buffered_read.c */ -void netfs_rreq_unlock_folios(struct netfs_io_request *rreq); int netfs_prefetch_for_write(struct file *file, struct folio *folio, size_t offset, size_t len); =20 -/* - * io.c - */ -void netfs_rreq_work(struct work_struct *work); -int netfs_begin_read(struct netfs_io_request *rreq, bool sync); - /* * main.c */ @@ -90,6 +83,18 @@ static inline void netfs_see_request(struct netfs_io_req= uest *rreq, trace_netfs_rreq_ref(rreq->debug_id, refcount_read(&rreq->ref), what); } =20 +/* + * read_collect.c + */ +void netfs_read_termination_worker(struct work_struct *work); +void netfs_rreq_terminated(struct netfs_io_request *rreq, bool was_async); + +/* + * read_retry.c + */ +void netfs_retry_reads(struct netfs_io_request *rreq); +void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq); + /* * stats.c */ diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c index b781bbbf1d8d..72a435e5fc6d 100644 --- a/fs/netfs/iterator.c +++ b/fs/netfs/iterator.c @@ -188,9 +188,59 @@ static size_t netfs_limit_xarray(const struct iov_iter= *iter, size_t start_offse return min(span, max_size); } =20 +/* + * Select the span of a folio queue iterator we're going to use. Limit it= by + * both maximum size and maximum number of segments. Returns the size of = the + * span in bytes. + */ +static size_t netfs_limit_folioq(const struct iov_iter *iter, size_t start= _offset, + size_t max_size, size_t max_segs) +{ + const struct folio_queue *folioq =3D iter->folioq; + unsigned int nsegs =3D 0; + unsigned int slot =3D iter->folioq_slot; + size_t span =3D 0, n =3D iter->count; + + if (WARN_ON(!iov_iter_is_folioq(iter)) || + WARN_ON(start_offset > n) || + n =3D=3D 0) + return 0; + max_size =3D umin(max_size, n - start_offset); + + if (slot >=3D folioq_nr_slots(folioq)) { + folioq =3D folioq->next; + slot =3D 0; + } + + start_offset +=3D iter->iov_offset; + do { + size_t flen =3D folioq_folio_size(folioq, slot); + + if (start_offset < flen) { + span +=3D flen - start_offset; + nsegs++; + start_offset =3D 0; + } else { + start_offset -=3D flen; + } + if (span >=3D max_size || nsegs >=3D max_segs) + break; + + slot++; + if (slot >=3D folioq_nr_slots(folioq)) { + folioq =3D folioq->next; + slot =3D 0; + } + } while (folioq); + + return umin(span, max_size); +} + size_t netfs_limit_iter(const struct iov_iter *iter, size_t start_offset, size_t max_size, size_t max_segs) { + if (iov_iter_is_folioq(iter)) + return netfs_limit_folioq(iter, start_offset, max_size, max_segs); if (iov_iter_is_bvec(iter)) return netfs_limit_bvec(iter, start_offset, max_size, max_segs); if (iov_iter_is_xarray(iter)) diff --git a/fs/netfs/main.c b/fs/netfs/main.c index 1ee712bb3610..ca416235bdf7 100644 --- a/fs/netfs/main.c +++ b/fs/netfs/main.c @@ -36,6 +36,7 @@ DEFINE_SPINLOCK(netfs_proc_lock); static const char *netfs_origins[nr__netfs_io_origin] =3D { [NETFS_READAHEAD] =3D "RA", [NETFS_READPAGE] =3D "RP", + [NETFS_READ_GAPS] =3D "RG", [NETFS_READ_FOR_WRITE] =3D "RW", [NETFS_DIO_READ] =3D "DR", [NETFS_WRITEBACK] =3D "WB", @@ -61,7 +62,7 @@ static int netfs_requests_seq_show(struct 
seq_file *m, vo= id *v) =20 rreq =3D list_entry(v, struct netfs_io_request, proc_link); seq_printf(m, - "%08x %s %3d %2lx %4d %3d @%04llx %llx/%llx", + "%08x %s %3d %2lx %4ld %3d @%04llx %llx/%llx", rreq->debug_id, netfs_origins[rreq->origin], refcount_read(&rreq->ref), diff --git a/fs/netfs/objects.c b/fs/netfs/objects.c index 9d2563d4dab8..da8314c48483 100644 --- a/fs/netfs/objects.c +++ b/fs/netfs/objects.c @@ -40,7 +40,6 @@ struct netfs_io_request *netfs_alloc_request(struct addre= ss_space *mapping, memset(rreq, 0, kmem_cache_size(cache)); rreq->start =3D start; rreq->len =3D len; - rreq->upper_len =3D len; rreq->origin =3D origin; rreq->netfs_ops =3D ctx->ops; rreq->mapping =3D mapping; @@ -48,6 +47,8 @@ struct netfs_io_request *netfs_alloc_request(struct addre= ss_space *mapping, rreq->i_size =3D i_size_read(inode); rreq->debug_id =3D atomic_inc_return(&debug_ids); rreq->wsize =3D INT_MAX; + rreq->io_streams[0].sreq_max_len =3D ULONG_MAX; + rreq->io_streams[0].sreq_max_segs =3D 0; spin_lock_init(&rreq->lock); INIT_LIST_HEAD(&rreq->io_streams[0].subrequests); INIT_LIST_HEAD(&rreq->io_streams[1].subrequests); @@ -56,9 +57,10 @@ struct netfs_io_request *netfs_alloc_request(struct addr= ess_space *mapping, =20 if (origin =3D=3D NETFS_READAHEAD || origin =3D=3D NETFS_READPAGE || + origin =3D=3D NETFS_READ_GAPS || origin =3D=3D NETFS_READ_FOR_WRITE || origin =3D=3D NETFS_DIO_READ) - INIT_WORK(&rreq->work, netfs_rreq_work); + INIT_WORK(&rreq->work, netfs_read_termination_worker); else INIT_WORK(&rreq->work, netfs_write_collection_worker); =20 @@ -173,7 +175,7 @@ void netfs_put_request(struct netfs_io_request *rreq, b= ool was_async, if (was_async) { rreq->work.func =3D netfs_free_request; if (!queue_work(system_unbound_wq, &rreq->work)) - BUG(); + WARN_ON(1); } else { netfs_free_request(&rreq->work); } diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c new file mode 100644 index 000000000000..f9907907106c --- /dev/null +++ b/fs/netfs/read_collect.c @@ -0,0 +1,540 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Network filesystem read subrequest result collection, assessment and + * retrying. + * + * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved. + * Written by David Howells (dhowells@redhat.com) + */ + +#include +#include +#include +#include +#include +#include +#include "internal.h" + +/* + * Clear the unread part of an I/O request. + */ +static void netfs_clear_unread(struct netfs_io_subrequest *subreq) +{ + netfs_reset_iter(subreq); + WARN_ON_ONCE(subreq->len - subreq->transferred !=3D iov_iter_count(&subre= q->io_iter)); + iov_iter_zero(iov_iter_count(&subreq->io_iter), &subreq->io_iter); + if (subreq->start + subreq->transferred >=3D subreq->rreq->i_size) + __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags); +} + +/* + * Flush, mark and unlock a folio that's now completely read. If we want = to + * cache the folio, we set the group to NETFS_FOLIO_COPY_TO_CACHE, mark it + * dirty and let writeback handle it. 
+ */ +static void netfs_unlock_read_folio(struct netfs_io_subrequest *subreq, + struct netfs_io_request *rreq, + struct folio *folio) +{ + struct netfs_folio *finfo; + + flush_dcache_folio(folio); + folio_mark_uptodate(folio); + + if (!test_bit(NETFS_RREQ_USE_PGPRIV2, &rreq->flags)) { + finfo =3D netfs_folio_info(folio); + if (finfo) { + trace_netfs_folio(folio, netfs_folio_trace_filled_gaps); + if (finfo->netfs_group) + folio_change_private(folio, finfo->netfs_group); + else + folio_detach_private(folio); + kfree(finfo); + } + + if (test_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags)) { + if (!WARN_ON_ONCE(folio_get_private(folio) !=3D NULL)) { + trace_netfs_folio(folio, netfs_folio_trace_copy_to_cache); + folio_attach_private(folio, NETFS_FOLIO_COPY_TO_CACHE); + folio_mark_dirty(folio); + } + } else { + trace_netfs_folio(folio, netfs_folio_trace_read_done); + } + } else { + // TODO: Use of PG_private_2 is deprecated. + if (test_bit(NETFS_SREQ_COPY_TO_CACHE, &subreq->flags)) { + trace_netfs_folio(folio, netfs_folio_trace_copy_to_cache); + folio_start_private_2(folio); + } + } + + if (!test_bit(NETFS_RREQ_DONT_UNLOCK_FOLIOS, &rreq->flags)) { + if (folio->index =3D=3D rreq->no_unlock_folio && + test_bit(NETFS_RREQ_NO_UNLOCK_FOLIO, &rreq->flags)) + _debug("no unlock"); + else + folio_unlock(folio); + } +} + +/* + * Unlock any folios that are now completely read. Returns true if the + * subrequest is removed from the list. + */ +static bool netfs_consume_read_data(struct netfs_io_subrequest *subreq, bo= ol was_async) +{ + struct netfs_io_subrequest *prev, *next; + struct netfs_io_request *rreq =3D subreq->rreq; + struct folio_queue *folioq =3D subreq->curr_folioq; + size_t avail, prev_donated, next_donated, fsize, part, excess; + loff_t fpos, start; + loff_t fend; + int slot =3D subreq->curr_folioq_slot; + + if (WARN(subreq->transferred > subreq->len, + "Subreq overread: R%x[%x] %zu > %zu", + rreq->debug_id, subreq->debug_index, + subreq->transferred, subreq->len)) + subreq->transferred =3D subreq->len; + +next_folio: + fsize =3D PAGE_SIZE << subreq->curr_folio_order; + fpos =3D round_down(subreq->start + subreq->consumed, fsize); + fend =3D fpos + fsize; + + if (WARN_ON_ONCE(!folioq) || + WARN_ON_ONCE(!folioq_folio(folioq, slot)) || + WARN_ON_ONCE(folioq_folio(folioq, slot)->index !=3D fpos / PAGE_SIZE)= ) { + pr_err("R=3D%08x[%x] s=3D%llx-%llx ctl=3D%zx/%zx/%zx sl=3D%u\n", + rreq->debug_id, subreq->debug_index, + subreq->start, subreq->start + subreq->transferred - 1, + subreq->consumed, subreq->transferred, subreq->len, + slot); + if (folioq) { + struct folio *folio =3D folioq_folio(folioq, slot); + + pr_err("folioq: %02x%02x%02x%02x\n", + folioq->orders[0], folioq->orders[1], + folioq->orders[2], folioq->orders[3]); + if (folio) + pr_err("folio: %llx-%llx ix=3D%llx o=3D%u qo=3D%u\n", + fpos, fend - 1, folio_pos(folio), folio_order(folio), + folioq_folio_order(folioq, slot)); + } + } + +donation_changed: + /* Try to consume the current folio if we've hit or passed the end of + * it. There's a possibility that this subreq doesn't start at the + * beginning of the folio, in which case we need to donate to/from the + * preceding subreq. + * + * We also need to include any potential donation back from the + * following subreq. 
+ */ + prev_donated =3D READ_ONCE(subreq->prev_donated); + next_donated =3D READ_ONCE(subreq->next_donated); + if (prev_donated || next_donated) { + spin_lock_bh(&rreq->lock); + prev_donated =3D subreq->prev_donated; + next_donated =3D subreq->next_donated; + subreq->start -=3D prev_donated; + subreq->len +=3D prev_donated; + subreq->transferred +=3D prev_donated; + prev_donated =3D subreq->prev_donated =3D 0; + if (subreq->transferred =3D=3D subreq->len) { + subreq->len +=3D next_donated; + subreq->transferred +=3D next_donated; + next_donated =3D subreq->next_donated =3D 0; + } + trace_netfs_sreq(subreq, netfs_sreq_trace_add_donations); + spin_unlock_bh(&rreq->lock); + } + + avail =3D subreq->transferred; + if (avail =3D=3D subreq->len) + avail +=3D next_donated; + start =3D subreq->start; + if (subreq->consumed =3D=3D 0) { + start -=3D prev_donated; + avail +=3D prev_donated; + } else { + start +=3D subreq->consumed; + avail -=3D subreq->consumed; + } + part =3D umin(avail, fsize); + + trace_netfs_progress(subreq, start, avail, part); + + if (start + avail >=3D fend) { + if (fpos =3D=3D start) { + /* Flush, unlock and mark for caching any folio we've just read. */ + subreq->consumed =3D fend - subreq->start; + netfs_unlock_read_folio(subreq, rreq, folioq_folio(folioq, slot)); + folioq_mark2(folioq, slot); + if (subreq->consumed >=3D subreq->len) + goto remove_subreq; + } else if (fpos < start) { + excess =3D fend - subreq->start; + + spin_lock_bh(&rreq->lock); + /* If we complete first on a folio split with the + * preceding subreq, donate to that subreq - otherwise + * we get the responsibility. + */ + if (subreq->prev_donated !=3D prev_donated) { + spin_unlock_bh(&rreq->lock); + goto donation_changed; + } + + if (list_is_first(&subreq->rreq_link, &rreq->subrequests)) { + spin_unlock_bh(&rreq->lock); + pr_err("Can't donate prior to front\n"); + goto bad; + } + + prev =3D list_prev_entry(subreq, rreq_link); + WRITE_ONCE(prev->next_donated, prev->next_donated + excess); + subreq->start +=3D excess; + subreq->len -=3D excess; + subreq->transferred -=3D excess; + trace_netfs_donate(rreq, subreq, prev, excess, + netfs_trace_donate_tail_to_prev); + trace_netfs_sreq(subreq, netfs_sreq_trace_donate_to_prev); + + if (subreq->consumed >=3D subreq->len) + goto remove_subreq_locked; + spin_unlock_bh(&rreq->lock); + } else { + pr_err("fpos > start\n"); + goto bad; + } + + /* Advance the rolling buffer to the next folio. */ + slot++; + if (slot >=3D folioq_nr_slots(folioq)) { + slot =3D 0; + folioq =3D folioq->next; + subreq->curr_folioq =3D folioq; + } + subreq->curr_folioq_slot =3D slot; + if (folioq && folioq_folio(folioq, slot)) + subreq->curr_folio_order =3D folioq->orders[slot]; + if (!was_async) + cond_resched(); + goto next_folio; + } + + /* Deal with partial progress. */ + if (subreq->transferred < subreq->len) + return false; + + /* Donate the remaining downloaded data to one of the neighbouring + * subrequests. Note that we may race with them doing the same thing. + */ + spin_lock_bh(&rreq->lock); + + if (subreq->prev_donated !=3D prev_donated || + subreq->next_donated !=3D next_donated) { + spin_unlock_bh(&rreq->lock); + cond_resched(); + goto donation_changed; + } + + /* Deal with the trickiest case: that this subreq is in the middle of a + * folio, not touching either edge, but finishes first. In such a + * case, we donate to the previous subreq, if there is one, so that the + * donation is only handled when that completes - and remove this + * subreq from the list. 
+	 *
+	 * If the previous subreq finished first, we will have acquired their
+	 * donation and should be able to unlock folios and/or donate nextwards.
+	 */
+	if (!subreq->consumed &&
+	    !prev_donated &&
+	    !list_is_first(&subreq->rreq_link, &rreq->subrequests)) {
+		prev = list_prev_entry(subreq, rreq_link);
+		WRITE_ONCE(prev->next_donated, prev->next_donated + subreq->len);
+		trace_netfs_donate(rreq, subreq, prev, subreq->len,
+				   netfs_trace_donate_to_prev);
+		trace_netfs_sreq(subreq, netfs_sreq_trace_donate_to_prev);
+		subreq->start += subreq->len;
+		subreq->len = 0;
+		subreq->transferred = 0;
+		goto remove_subreq_locked;
+	}
+
+	/* If we can't donate down the chain, donate up the chain instead. */
+	excess = subreq->len - subreq->consumed + next_donated;
+
+	if (!subreq->consumed)
+		excess += prev_donated;
+
+	if (list_is_last(&subreq->rreq_link, &rreq->subrequests)) {
+		rreq->prev_donated = excess;
+		trace_netfs_donate(rreq, subreq, NULL, excess,
+				   netfs_trace_donate_to_deferred_next);
+	} else {
+		next = list_next_entry(subreq, rreq_link);
+		WRITE_ONCE(next->prev_donated, excess);
+		trace_netfs_donate(rreq, subreq, next, excess,
+				   netfs_trace_donate_to_next);
+	}
+	trace_netfs_sreq(subreq, netfs_sreq_trace_donate_to_next);
+	subreq->len = subreq->consumed;
+	subreq->transferred = subreq->consumed;
+	goto remove_subreq_locked;
+
+remove_subreq:
+	spin_lock_bh(&rreq->lock);
+remove_subreq_locked:
+	subreq->consumed = subreq->len;
+	list_del(&subreq->rreq_link);
+	spin_unlock_bh(&rreq->lock);
+	netfs_put_subrequest(subreq, false, netfs_sreq_trace_put_consumed);
+	return true;
+
+bad:
+	/* Errr... prev and next both donated to us, but insufficient to finish
+	 * the folio.
+	 */
+	printk("R=%08x[%x] s=%llx-%llx %zx/%zx/%zx\n",
+	       rreq->debug_id, subreq->debug_index,
+	       subreq->start, subreq->start + subreq->transferred - 1,
+	       subreq->consumed, subreq->transferred, subreq->len);
+	printk("folio: %llx-%llx\n", fpos, fend - 1);
+	printk("donated: prev=%zx next=%zx\n", prev_donated, next_donated);
+	printk("s=%llx av=%zx part=%zx\n", start, avail, part);
+	BUG();
+}
+
+/*
+ * Do page flushing and suchlike after DIO.
+ */
+static void netfs_rreq_assess_dio(struct netfs_io_request *rreq)
+{
+	struct netfs_io_subrequest *subreq;
+	unsigned int i;
+
+	/* Collect unbuffered reads and direct reads, adding up the transfer
+	 * sizes until we find the first short or failed subrequest.
+	 */
+	list_for_each_entry(subreq, &rreq->subrequests, rreq_link) {
+		rreq->transferred += subreq->transferred;
+
+		if (subreq->transferred < subreq->len ||
+		    test_bit(NETFS_SREQ_FAILED, &subreq->flags)) {
+			rreq->error = subreq->error;
+			break;
+		}
+	}
+
+	if (rreq->origin == NETFS_DIO_READ) {
+		for (i = 0; i < rreq->direct_bv_count; i++) {
+			flush_dcache_page(rreq->direct_bv[i].bv_page);
+			// TODO: cifs marks pages in the destination buffer
+			// dirty under some circumstances after a read.  Do we
+			// need to do that too?
+			set_page_dirty(rreq->direct_bv[i].bv_page);
+		}
+	}
+
+	if (rreq->iocb) {
+		rreq->iocb->ki_pos += rreq->transferred;
+		if (rreq->iocb->ki_complete)
+			rreq->iocb->ki_complete(
+				rreq->iocb, rreq->error ? rreq->error : rreq->transferred);
+	}
+	if (rreq->netfs_ops->done)
+		rreq->netfs_ops->done(rreq);
+	if (rreq->origin == NETFS_DIO_READ)
+		inode_dio_end(rreq->inode);
+}
+
+/*
+ * Assess the state of a read request and decide what to do next.
+ *
+ * Note that we could be in an ordinary kernel thread, on a workqueue or in
+ * softirq context at this point.  We inherit a ref from the caller.
+ */
+static void netfs_rreq_assess(struct netfs_io_request *rreq)
+{
+	trace_netfs_rreq(rreq, netfs_rreq_trace_assess);
+
+	//netfs_rreq_is_still_valid(rreq);
+
+	if (test_and_clear_bit(NETFS_RREQ_NEED_RETRY, &rreq->flags)) {
+		netfs_retry_reads(rreq);
+		return;
+	}
+
+	if (rreq->origin == NETFS_DIO_READ ||
+	    rreq->origin == NETFS_READ_GAPS)
+		netfs_rreq_assess_dio(rreq);
+	task_io_account_read(rreq->transferred);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_wake_ip);
+	clear_bit_unlock(NETFS_RREQ_IN_PROGRESS, &rreq->flags);
+	wake_up_bit(&rreq->flags, NETFS_RREQ_IN_PROGRESS);
+
+	trace_netfs_rreq(rreq, netfs_rreq_trace_done);
+	netfs_clear_subrequests(rreq, false);
+	netfs_unlock_abandoned_read_pages(rreq);
+}
+
+void netfs_read_termination_worker(struct work_struct *work)
+{
+	struct netfs_io_request *rreq =
+		container_of(work, struct netfs_io_request, work);
+	netfs_see_request(rreq, netfs_rreq_trace_see_work);
+	netfs_rreq_assess(rreq);
+	netfs_put_request(rreq, false, netfs_rreq_trace_put_work_complete);
+}
+
+/*
+ * Handle the completion of all outstanding I/O operations on a read request.
+ * We inherit a ref from the caller.
+ */
+void netfs_rreq_terminated(struct netfs_io_request *rreq, bool was_async)
+{
+	if (!was_async)
+		return netfs_rreq_assess(rreq);
+	if (!work_pending(&rreq->work)) {
+		netfs_get_request(rreq, netfs_rreq_trace_get_work);
+		if (!queue_work(system_unbound_wq, &rreq->work))
+			netfs_put_request(rreq, was_async, netfs_rreq_trace_put_work_nq);
+	}
+}
+
+/**
+ * netfs_read_subreq_progress - Note progress of a read operation.
+ * @subreq: The read subrequest that has made progress.
+ * @was_async: True if we're in an asynchronous context.
+ *
+ * This tells the read side of netfs lib that a contributory I/O operation has
+ * made some progress and that it may be possible to unlock some folios.
+ *
+ * Before calling, the filesystem should update subreq->transferred to track
+ * the amount of data copied into the output buffer.
+ *
+ * If @was_async is true, the caller might be running in softirq or interrupt
+ * context and we can't sleep.
+ */
+void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq,
+				bool was_async)
+{
+	struct netfs_io_request *rreq = subreq->rreq;
+
+	trace_netfs_sreq(subreq, netfs_sreq_trace_progress);
+
+	if (subreq->transferred > subreq->consumed &&
+	    (rreq->origin == NETFS_READAHEAD ||
+	     rreq->origin == NETFS_READPAGE ||
+	     rreq->origin == NETFS_READ_FOR_WRITE)) {
+		netfs_consume_read_data(subreq, was_async);
+		__clear_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags);
+	}
+}
+EXPORT_SYMBOL(netfs_read_subreq_progress);
+
+/**
+ * netfs_read_subreq_terminated - Note the termination of an I/O operation.
+ * @subreq: The I/O request that has terminated.
+ * @error: Error code indicating type of completion.
+ * @was_async: The termination was asynchronous
+ *
+ * This tells the read helper that a contributory I/O operation has terminated,
+ * one way or another, and that it should integrate the results.
+ *
+ * The caller indicates the outcome of the operation through @error, supplying
+ * 0 to indicate a successful or retryable transfer (if NETFS_SREQ_NEED_RETRY
+ * is set) or a negative error code.  The helper will look after reissuing I/O
+ * operations as appropriate and writing downloaded data to the cache.
+ *
+ * Before calling, the filesystem should update subreq->transferred to track
+ * the amount of data copied into the output buffer.
+ *
+ * If @was_async is true, the caller might be running in softirq or interrupt
+ * context and we can't sleep.
+ */
+void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq,
+				  int error, bool was_async)
+{
+	struct netfs_io_request *rreq = subreq->rreq;
+
+	switch (subreq->source) {
+	case NETFS_READ_FROM_CACHE:
+		netfs_stat(&netfs_n_rh_read_done);
+		break;
+	case NETFS_DOWNLOAD_FROM_SERVER:
+		netfs_stat(&netfs_n_rh_download_done);
+		break;
+	default:
+		break;
+	}
+
+	if (rreq->origin != NETFS_DIO_READ) {
+		/* Collect buffered reads.
+		 *
+		 * If the read completed validly short, then we can clear the
+		 * tail before going on to unlock the folios.
+		 */
+		if (error == 0 && subreq->transferred < subreq->len &&
+		    (test_bit(NETFS_SREQ_HIT_EOF, &subreq->flags) ||
+		     test_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags))) {
+			netfs_clear_unread(subreq);
+			subreq->transferred = subreq->len;
+			trace_netfs_sreq(subreq, netfs_sreq_trace_clear);
+		}
+		if (subreq->transferred > subreq->consumed &&
+		    (rreq->origin == NETFS_READAHEAD ||
+		     rreq->origin == NETFS_READPAGE ||
+		     rreq->origin == NETFS_READ_FOR_WRITE)) {
+			netfs_consume_read_data(subreq, was_async);
+			__clear_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags);
+		}
+		rreq->transferred += subreq->transferred;
+	}
+
+	/* Deal with retry requests, short reads and errors.  If we retry
+	 * but don't make progress, we abandon the attempt.
+	 */
+	if (!error && subreq->transferred < subreq->len) {
+		if (test_bit(NETFS_SREQ_HIT_EOF, &subreq->flags)) {
+			trace_netfs_sreq(subreq, netfs_sreq_trace_hit_eof);
+		} else {
+			trace_netfs_sreq(subreq, netfs_sreq_trace_short);
+			if (subreq->transferred > subreq->consumed) {
+				__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+				__clear_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags);
+				set_bit(NETFS_RREQ_NEED_RETRY, &rreq->flags);
+			} else if (!__test_and_set_bit(NETFS_SREQ_NO_PROGRESS, &subreq->flags)) {
+				__set_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+				set_bit(NETFS_RREQ_NEED_RETRY, &rreq->flags);
+			} else {
+				__set_bit(NETFS_SREQ_FAILED, &subreq->flags);
+				error = -ENODATA;
+			}
+		}
+	}
+
+	subreq->error = error;
+	trace_netfs_sreq(subreq, netfs_sreq_trace_terminated);
+
+	if (unlikely(error < 0)) {
+		trace_netfs_failure(rreq, subreq, error, netfs_fail_read);
+		if (subreq->source == NETFS_READ_FROM_CACHE) {
+			netfs_stat(&netfs_n_rh_read_failed);
+		} else {
+			netfs_stat(&netfs_n_rh_download_failed);
+			set_bit(NETFS_RREQ_FAILED, &rreq->flags);
+			rreq->error = subreq->error;
+		}
+	}
+
+	if (atomic_dec_and_test(&rreq->nr_outstanding))
+		netfs_rreq_terminated(rreq, was_async);
+
+	netfs_put_subrequest(subreq, was_async, netfs_sreq_trace_put_terminated);
+}
+EXPORT_SYMBOL(netfs_read_subreq_terminated);
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
new file mode 100644
index 000000000000..0350592ea804
--- /dev/null
+++ b/fs/netfs/read_retry.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Network filesystem read subrequest retrying.
+ *
+ * Copyright (C) 2024 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com) + */ + +#include +#include +#include "internal.h" + +static void netfs_reissue_read(struct netfs_io_request *rreq, + struct netfs_io_subrequest *subreq) +{ + struct iov_iter *io_iter =3D &subreq->io_iter; + + if (iov_iter_is_folioq(io_iter)) { + subreq->curr_folioq =3D (struct folio_queue *)io_iter->folioq; + subreq->curr_folioq_slot =3D io_iter->folioq_slot; + subreq->curr_folio_order =3D subreq->curr_folioq->orders[subreq->curr_fo= lioq_slot]; + } + + atomic_inc(&rreq->nr_outstanding); + __set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags); + netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit); + subreq->rreq->netfs_ops->issue_read(subreq); +} + +/* + * Go through the list of failed/short reads, retrying all retryable ones.= We + * need to switch failed cache reads to network downloads. + */ +static void netfs_retry_read_subrequests(struct netfs_io_request *rreq) +{ + struct netfs_io_subrequest *subreq; + struct netfs_io_stream *stream0 =3D &rreq->io_streams[0]; + LIST_HEAD(sublist); + LIST_HEAD(queue); + + _enter("R=3D%x", rreq->debug_id); + + if (list_empty(&rreq->subrequests)) + return; + + if (rreq->netfs_ops->retry_request) + rreq->netfs_ops->retry_request(rreq, NULL); + + /* If there's no renegotiation to do, just resend each retryable subreq + * up to the first permanently failed one. + */ + if (!rreq->netfs_ops->prepare_read && + !test_bit(NETFS_RREQ_COPY_TO_CACHE, &rreq->flags)) { + struct netfs_io_subrequest *subreq; + + list_for_each_entry(subreq, &rreq->subrequests, rreq_link) { + if (test_bit(NETFS_SREQ_FAILED, &subreq->flags)) + break; + if (__test_and_clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) { + netfs_reset_iter(subreq); + netfs_reissue_read(rreq, subreq); + } + } + return; + } + + /* Okay, we need to renegotiate all the download requests and flip any + * failed cache reads over to being download requests and negotiate + * those also. All fully successful subreqs have been removed from the + * list and any spare data from those has been donated. + * + * What we do is decant the list and rebuild it one subreq at a time so + * that we don't end up with donations jumping over a gap we're busy + * populating with smaller subrequests. In the event that the subreq + * we just launched finishes before we insert the next subreq, it'll + * fill in rreq->prev_donated instead. + + * Note: Alternatively, we could split the tail subrequest right before + * we reissue it and fix up the donations under lock. + */ + list_splice_init(&rreq->subrequests, &queue); + + do { + struct netfs_io_subrequest *from; + struct iov_iter source; + unsigned long long start, len; + size_t part, deferred_next_donated =3D 0; + bool boundary =3D false; + + /* Go through the subreqs and find the next span of contiguous + * buffer that we then rejig (cifs, for example, needs the + * rsize renegotiating) and reissue. 
+ */ + from =3D list_first_entry(&queue, struct netfs_io_subrequest, rreq_link); + list_move_tail(&from->rreq_link, &sublist); + start =3D from->start + from->transferred; + len =3D from->len - from->transferred; + + _debug("from R=3D%08x[%x] s=3D%llx ctl=3D%zx/%zx/%zx", + rreq->debug_id, from->debug_index, + from->start, from->consumed, from->transferred, from->len); + + if (test_bit(NETFS_SREQ_FAILED, &from->flags) || + !test_bit(NETFS_SREQ_NEED_RETRY, &from->flags)) + goto abandon; + + deferred_next_donated =3D from->next_donated; + while ((subreq =3D list_first_entry_or_null( + &queue, struct netfs_io_subrequest, rreq_link))) { + if (subreq->start !=3D start + len || + subreq->transferred > 0 || + !test_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags)) + break; + list_move_tail(&subreq->rreq_link, &sublist); + len +=3D subreq->len; + deferred_next_donated =3D subreq->next_donated; + if (test_bit(NETFS_SREQ_BOUNDARY, &subreq->flags)) + break; + } + + _debug(" - range: %llx-%llx %llx", start, start + len - 1, len); + + /* Determine the set of buffers we're going to use. Each + * subreq gets a subset of a single overall contiguous buffer. + */ + netfs_reset_iter(from); + source =3D from->io_iter; + source.count =3D len; + + /* Work through the sublist. */ + while ((subreq =3D list_first_entry_or_null( + &sublist, struct netfs_io_subrequest, rreq_link))) { + list_del(&subreq->rreq_link); + + subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; + subreq->start =3D start - subreq->transferred; + subreq->len =3D len + subreq->transferred; + stream0->sreq_max_len =3D subreq->len; + + __clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags); + __set_bit(NETFS_SREQ_RETRYING, &subreq->flags); + + spin_lock_bh(&rreq->lock); + list_add_tail(&subreq->rreq_link, &rreq->subrequests); + subreq->prev_donated +=3D rreq->prev_donated; + rreq->prev_donated =3D 0; + trace_netfs_sreq(subreq, netfs_sreq_trace_retry); + spin_unlock_bh(&rreq->lock); + + BUG_ON(!len); + + /* Renegotiate max_len (rsize) */ + if (rreq->netfs_ops->prepare_read(subreq) < 0) { + trace_netfs_sreq(subreq, netfs_sreq_trace_reprep_failed); + __set_bit(NETFS_SREQ_FAILED, &subreq->flags); + } + + part =3D umin(len, stream0->sreq_max_len); + if (unlikely(rreq->io_streams[0].sreq_max_segs)) + part =3D netfs_limit_iter(&source, 0, part, stream0->sreq_max_segs); + subreq->len =3D subreq->transferred + part; + subreq->io_iter =3D source; + iov_iter_truncate(&subreq->io_iter, part); + iov_iter_advance(&source, part); + len -=3D part; + start +=3D part; + if (!len) { + if (boundary) + __set_bit(NETFS_SREQ_BOUNDARY, &subreq->flags); + subreq->next_donated =3D deferred_next_donated; + } else { + __clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags); + subreq->next_donated =3D 0; + } + + netfs_reissue_read(rreq, subreq); + if (!len) + break; + + /* If we ran out of subrequests, allocate another. */ + if (list_empty(&sublist)) { + subreq =3D netfs_alloc_subrequest(rreq); + if (!subreq) + goto abandon; + subreq->source =3D NETFS_DOWNLOAD_FROM_SERVER; + subreq->start =3D start; + + /* We get two refs, but need just one. */ + netfs_put_subrequest(subreq, false, netfs_sreq_trace_new); + trace_netfs_sreq(subreq, netfs_sreq_trace_split); + list_add_tail(&subreq->rreq_link, &sublist); + } + } + + /* If we managed to use fewer subreqs, we can discard the + * excess. 
+	 */
+		while ((subreq = list_first_entry_or_null(
+				&sublist, struct netfs_io_subrequest, rreq_link))) {
+			trace_netfs_sreq(subreq, netfs_sreq_trace_discard);
+			list_del(&subreq->rreq_link);
+			netfs_put_subrequest(subreq, false, netfs_sreq_trace_put_done);
+		}
+
+	} while (!list_empty(&queue));
+
+	return;
+
+	/* If we hit ENOMEM, fail all remaining subrequests */
+abandon:
+	list_splice_init(&sublist, &queue);
+	list_for_each_entry(subreq, &queue, rreq_link) {
+		if (!subreq->error)
+			subreq->error = -ENOMEM;
+		__clear_bit(NETFS_SREQ_FAILED, &subreq->flags);
+		__clear_bit(NETFS_SREQ_NEED_RETRY, &subreq->flags);
+		__clear_bit(NETFS_SREQ_RETRYING, &subreq->flags);
+	}
+	spin_lock_bh(&rreq->lock);
+	list_splice_tail_init(&queue, &rreq->subrequests);
+	spin_unlock_bh(&rreq->lock);
+}
+
+/*
+ * Retry reads.
+ */
+void netfs_retry_reads(struct netfs_io_request *rreq)
+{
+	trace_netfs_rreq(rreq, netfs_rreq_trace_resubmit);
+
+	atomic_inc(&rreq->nr_outstanding);
+
+	netfs_retry_read_subrequests(rreq);
+
+	if (atomic_dec_and_test(&rreq->nr_outstanding))
+		netfs_rreq_terminated(rreq, false);
+}
+
+/*
+ * Unlock any folios that haven't been unlocked yet due to abandoned
+ * subrequests.
+ */
+void netfs_unlock_abandoned_read_pages(struct netfs_io_request *rreq)
+{
+	struct folio_queue *p;
+
+	for (p = rreq->buffer; p; p = p->next) {
+		for (int slot = 0; slot < folioq_count(p); slot++) {
+			struct folio *folio = folioq_folio(p, slot);
+
+			if (folio && !folioq_is_marked2(p, slot)) {
+				trace_netfs_folio(folio, netfs_folio_trace_abandon);
+				folio_unlock(folio);
+			}
+		}
+	}
+}
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 10ff2bd290cb..43cec03c6514 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -159,10 +159,6 @@ static void netfs_prepare_write(struct netfs_io_request *wreq,
 
 	_enter("R=%x[%x]", wreq->debug_id, subreq->debug_index);
 
-	trace_netfs_sreq_ref(wreq->debug_id, subreq->debug_index,
-			     refcount_read(&subreq->ref),
-			     netfs_sreq_trace_new);
-
 	trace_netfs_sreq(subreq, netfs_sreq_trace_prepare);
 
 	stream->sreq_max_len = UINT_MAX;
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 7202ce84d0eb..cc93cfb16f9d 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -265,6 +265,7 @@ static int nfs_netfs_init_request(struct netfs_io_request *rreq, struct file *fi
 {
 	rreq->netfs_priv = get_nfs_open_context(nfs_file_open_context(file));
 	rreq->debug_id = atomic_inc_return(&nfs_netfs_debug_id);
+	rreq->io_streams[0].sreq_max_len = NFS_SB(rreq->inode->i_sb)->rsize;
 
 	return 0;
 }
@@ -286,14 +287,6 @@ static struct nfs_netfs_io_data *nfs_netfs_alloc(struct netfs_io_subrequest *sre
 	return netfs;
 }
 
-static bool nfs_netfs_clamp_length(struct netfs_io_subrequest *sreq)
-{
-	size_t rsize = NFS_SB(sreq->rreq->inode->i_sb)->rsize;
-
-	sreq->len = min(sreq->len, rsize);
-	return true;
-}
-
 static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 {
 	struct nfs_netfs_io_data *netfs;
@@ -302,17 +295,18 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 	struct nfs_open_context *ctx = sreq->rreq->netfs_priv;
 	struct page *page;
 	unsigned long idx;
+	pgoff_t start, last;
 	int err;
-	pgoff_t start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
-	pgoff_t last = ((sreq->start + sreq->len -
-			 sreq->transferred - 1) >> PAGE_SHIFT);
+
+	start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
+	last = ((sreq->start + sreq->len - sreq->transferred - 1) >> PAGE_SHIFT);
 
 	nfs_pageio_init_read(&pgio,
inode, false, &nfs_async_read_completion_ops); =20 netfs =3D nfs_netfs_alloc(sreq); if (!netfs) - return netfs_subreq_terminated(sreq, -ENOMEM, false); + return netfs_read_subreq_terminated(sreq, -ENOMEM, false); =20 pgio.pg_netfs =3D netfs; /* used in completion */ =20 @@ -377,5 +371,4 @@ const struct netfs_request_ops nfs_netfs_ops =3D { .init_request =3D nfs_netfs_init_request, .free_request =3D nfs_netfs_free_request, .issue_read =3D nfs_netfs_issue_read, - .clamp_length =3D nfs_netfs_clamp_length }; diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h index fbed0027996f..5edebcff8ddf 100644 --- a/fs/nfs/fscache.h +++ b/fs/nfs/fscache.h @@ -60,8 +60,6 @@ static inline void nfs_netfs_get(struct nfs_netfs_io_data= *netfs) =20 static inline void nfs_netfs_put(struct nfs_netfs_io_data *netfs) { - ssize_t final_len; - /* Only the last RPC completion should call netfs_subreq_terminated() */ if (!refcount_dec_and_test(&netfs->refcount)) return; @@ -74,8 +72,9 @@ static inline void nfs_netfs_put(struct nfs_netfs_io_data= *netfs) * Correct the final length here to be no larger than the netfs subrequest * length, and thus avoid netfs's "Subreq overread" warning message. */ - final_len =3D min_t(s64, netfs->sreq->len, atomic64_read(&netfs->transfer= red)); - netfs_subreq_terminated(netfs->sreq, netfs->error ?: final_len, false); + netfs->sreq->transferred =3D min_t(s64, netfs->sreq->len, + atomic64_read(&netfs->transferred)); + netfs_read_subreq_terminated(netfs->sreq, netfs->error, false); kfree(netfs); } static inline void nfs_netfs_inode_init(struct nfs_inode *nfsi) diff --git a/fs/smb/client/cifssmb.c b/fs/smb/client/cifssmb.c index 595c4b673707..d5e1bbefd5e8 100644 --- a/fs/smb/client/cifssmb.c +++ b/fs/smb/client/cifssmb.c @@ -1309,10 +1309,8 @@ cifs_readv_callback(struct mid_q_entry *mid) if (rdata->result =3D=3D 0 || rdata->result =3D=3D -EAGAIN) iov_iter_advance(&rdata->subreq.io_iter, rdata->got_bytes); rdata->credits.value =3D 0; - netfs_subreq_terminated(&rdata->subreq, - (rdata->result =3D=3D 0 || rdata->result =3D=3D -EAGAIN) ? - rdata->got_bytes : rdata->result, - false); + rdata->subreq.transferred +=3D rdata->got_bytes; + netfs_read_subreq_terminated(&rdata->subreq, rdata->result, false); release_mid(mid); add_credits(server, &credits, 0); } diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c index c1d8c5d2ac71..569c17b41b8e 100644 --- a/fs/smb/client/file.c +++ b/fs/smb/client/file.c @@ -140,25 +140,22 @@ static void cifs_netfs_invalidate_cache(struct netfs_= io_request *wreq) } =20 /* - * Split the read up according to how many credits we can get for each pie= ce. - * It's okay to sleep here if we need to wait for more credit to become - * available. - * - * We also choose the server and allocate an operation ID to be cleaned up - * later. + * Negotiate the size of a read operation on behalf of the netfs library. 
*/ -static bool cifs_clamp_length(struct netfs_io_subrequest *subreq) +static int cifs_prepare_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; - struct netfs_io_stream *stream =3D &rreq->io_streams[subreq->stream_nr]; struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_= subrequest, subreq); struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io= _request, rreq); struct TCP_Server_Info *server =3D req->server; struct cifs_sb_info *cifs_sb =3D CIFS_SB(rreq->inode->i_sb); - int rc; + size_t size; + int rc =3D 0; =20 - rdata->xid =3D get_xid(); - rdata->have_xid =3D true; + if (!rdata->have_xid) { + rdata->xid =3D get_xid(); + rdata->have_xid =3D true; + } rdata->server =3D server; =20 if (cifs_sb->ctx->rsize =3D=3D 0) @@ -166,13 +163,12 @@ static bool cifs_clamp_length(struct netfs_io_subrequ= est *subreq) server->ops->negotiate_rsize(tlink_tcon(req->cfile->tlink), cifs_sb->ctx); =20 - rc =3D server->ops->wait_mtu_credits(server, cifs_sb->ctx->rsize, - &stream->sreq_max_len, &rdata->credits); - if (rc) { - subreq->error =3D rc; - return false; - } + &size, &rdata->credits); + if (rc) + return rc; + + rreq->io_streams[0].sreq_max_len =3D size; =20 rdata->credits.in_flight_check =3D 1; rdata->credits.rreq_debug_id =3D rreq->debug_id; @@ -184,13 +180,11 @@ static bool cifs_clamp_length(struct netfs_io_subrequ= est *subreq) server->credits, server->in_flight, 0, cifs_trace_rw_credits_read_submit); =20 - subreq->len =3D umin(subreq->len, stream->sreq_max_len); - #ifdef CONFIG_CIFS_SMB_DIRECT if (server->smbd_conn) - stream->sreq_max_segs =3D server->smbd_conn->max_frmr_depth; + rreq->io_streams[0].sreq_max_segs =3D server->smbd_conn->max_frmr_depth; #endif - return true; + return 0; } =20 /* @@ -199,31 +193,40 @@ static bool cifs_clamp_length(struct netfs_io_subrequ= est *subreq) * to only read a portion of that, but as long as we read something, the n= etfs * helper will call us again so that we can issue another read. 
*/ -static void cifs_req_issue_read(struct netfs_io_subrequest *subreq) +static void cifs_issue_read(struct netfs_io_subrequest *subreq) { struct netfs_io_request *rreq =3D subreq->rreq; struct cifs_io_subrequest *rdata =3D container_of(subreq, struct cifs_io_= subrequest, subreq); struct cifs_io_request *req =3D container_of(subreq->rreq, struct cifs_io= _request, rreq); + struct TCP_Server_Info *server =3D req->server; int rc =3D 0; =20 cifs_dbg(FYI, "%s: op=3D%08x[%x] mapping=3D%p len=3D%zu/%zu\n", __func__, rreq->debug_id, subreq->debug_index, rreq->mapping, subreq->transferred, subreq->len); =20 + rc =3D adjust_credits(server, rdata, cifs_trace_rw_credits_issue_read_adj= ust); + if (rc) + goto failed; + if (req->cfile->invalidHandle) { do { rc =3D cifs_reopen_file(req->cfile, true); } while (rc =3D=3D -EAGAIN); if (rc) - goto out; + goto failed; } =20 __set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags); =20 + trace_netfs_sreq(subreq, netfs_sreq_trace_submit); rc =3D rdata->server->ops->async_readv(rdata); -out: if (rc) - netfs_subreq_terminated(subreq, rc, false); + goto failed; + return; + +failed: + netfs_read_subreq_terminated(subreq, rc, false); } =20 /* @@ -330,8 +333,8 @@ const struct netfs_request_ops cifs_req_ops =3D { .init_request =3D cifs_init_request, .free_request =3D cifs_free_request, .free_subrequest =3D cifs_free_subrequest, - .clamp_length =3D cifs_clamp_length, - .issue_read =3D cifs_req_issue_read, + .prepare_read =3D cifs_prepare_read, + .issue_read =3D cifs_issue_read, .done =3D cifs_rreq_done, .begin_writeback =3D cifs_begin_writeback, .prepare_write =3D cifs_prepare_write, diff --git a/fs/smb/client/smb2pdu.c b/fs/smb/client/smb2pdu.c index 9a06b5594669..dbc9e69ad223 100644 --- a/fs/smb/client/smb2pdu.c +++ b/fs/smb/client/smb2pdu.c @@ -4495,9 +4495,7 @@ static void smb2_readv_worker(struct work_struct *wor= k) struct cifs_io_subrequest *rdata =3D container_of(work, struct cifs_io_subrequest, subreq.work); =20 - netfs_subreq_terminated(&rdata->subreq, - (rdata->result =3D=3D 0 || rdata->result =3D=3D -EAGAIN) ? 
- rdata->got_bytes : rdata->result, true); + netfs_read_subreq_terminated(&rdata->subreq, rdata->result, false); } =20 static void @@ -4551,6 +4549,7 @@ smb2_readv_callback(struct mid_q_entry *mid) break; case MID_REQUEST_SUBMITTED: case MID_RETRY_NEEDED: + __set_bit(NETFS_SREQ_NEED_RETRY, &rdata->subreq.flags); rdata->result =3D -EAGAIN; if (server->sign && rdata->got_bytes) /* reset bytes number since we can not check a sign */ @@ -4604,6 +4603,10 @@ smb2_readv_callback(struct mid_q_entry *mid) server->credits, server->in_flight, 0, cifs_trace_rw_credits_read_response_clear); rdata->credits.value =3D 0; + rdata->subreq.transferred +=3D rdata->got_bytes; + if (rdata->subreq.start + rdata->subreq.transferred >=3D rdata->subreq.rr= eq->i_size) + __set_bit(NETFS_SREQ_HIT_EOF, &rdata->subreq.flags); + trace_netfs_sreq(&rdata->subreq, netfs_sreq_trace_io_progress); INIT_WORK(&rdata->subreq.work, smb2_readv_worker); queue_work(cifsiod_wq, &rdata->subreq.work); release_mid(mid); @@ -4867,6 +4870,7 @@ smb2_writev_callback(struct mid_q_entry *mid) server->credits, server->in_flight, 0, cifs_trace_rw_credits_write_response_clear); wdata->credits.value =3D 0; + trace_netfs_sreq(&wdata->subreq, netfs_sreq_trace_io_progress); cifs_write_subrequest_terminated(wdata, result ?: written, true); release_mid(mid); trace_smb3_rw_credits(rreq_debug_id, subreq_debug_index, 0, diff --git a/include/linux/netfs.h b/include/linux/netfs.h index c47753a24623..be1686f0fe34 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -180,20 +180,26 @@ struct netfs_io_subrequest { unsigned long long start; /* Where to start the I/O */ size_t len; /* Size of the I/O */ size_t transferred; /* Amount of data transferred */ + size_t consumed; /* Amount of read data consumed */ + size_t prev_donated; /* Amount of data donated from previous subreq */ + size_t next_donated; /* Amount of data donated from next subreq */ refcount_t ref; short error; /* 0 or error that occurred */ unsigned short debug_index; /* Index in list (for debugging output) */ unsigned int nr_segs; /* Number of segs in io_iter */ enum netfs_io_source source; /* Where to read from/write to */ unsigned char stream_nr; /* I/O stream this belongs to */ + unsigned char curr_folioq_slot; /* Folio currently being read */ + unsigned char curr_folio_order; /* Order of folio */ + struct folio_queue *curr_folioq; /* Queue segment in which current folio = resides */ unsigned long flags; #define NETFS_SREQ_COPY_TO_CACHE 0 /* Set if should copy the data to the c= ache */ #define NETFS_SREQ_CLEAR_TAIL 1 /* Set if the rest of the read should be = cleared */ -#define NETFS_SREQ_SHORT_IO 2 /* Set if the I/O was short */ #define NETFS_SREQ_SEEK_DATA_READ 3 /* Set if ->read() should SEEK_DATA fi= rst */ #define NETFS_SREQ_NO_PROGRESS 4 /* Set if we didn't manage to read any d= ata */ #define NETFS_SREQ_ONDEMAND 5 /* Set if it's from on-demand read mode */ #define NETFS_SREQ_BOUNDARY 6 /* Set if ends on hard boundary (eg. 
ceph o= bject) */ +#define NETFS_SREQ_HIT_EOF 7 /* Set if short due to EOF */ #define NETFS_SREQ_IN_PROGRESS 8 /* Unlocked when the subrequest complete= s */ #define NETFS_SREQ_NEED_RETRY 9 /* Set if the filesystem requests a retry= */ #define NETFS_SREQ_RETRYING 10 /* Set if we're retrying */ @@ -203,6 +209,7 @@ struct netfs_io_subrequest { enum netfs_io_origin { NETFS_READAHEAD, /* This read was triggered by readahead */ NETFS_READPAGE, /* This read is a synchronous read */ + NETFS_READ_GAPS, /* This read is a synchronous read to fill gaps */ NETFS_READ_FOR_WRITE, /* This read is to prepare a write */ NETFS_DIO_READ, /* This is a direct I/O read */ NETFS_WRITEBACK, /* This write was triggered by writepages */ @@ -225,6 +232,7 @@ struct netfs_io_request { struct address_space *mapping; /* The mapping being accessed */ struct kiocb *iocb; /* AIO completion vector */ struct netfs_cache_resources cache_resources; + struct readahead_control *ractl; /* Readahead descriptor */ struct list_head proc_link; /* Link in netfs_iorequests */ struct list_head subrequests; /* Contributory I/O operations */ struct netfs_io_stream io_streams[2]; /* Streams of parallel I/O operatio= ns */ @@ -245,12 +253,10 @@ struct netfs_io_request { unsigned int nr_group_rel; /* Number of refs to release on ->group */ spinlock_t lock; /* Lock for queuing subreqs */ atomic_t nr_outstanding; /* Number of ops in progress */ - atomic_t nr_copy_ops; /* Number of copy-to-cache ops in progress */ - size_t upper_len; /* Length can be extended to here */ unsigned long long submitted; /* Amount submitted for I/O so far */ unsigned long long len; /* Length of the request */ size_t transferred; /* Amount to be indicated as transferred */ - short error; /* 0 or error that occurred */ + long error; /* 0 or error that occurred */ enum netfs_io_origin origin; /* Origin of the request */ bool direct_bv_unpin; /* T if direct_bv[] must be unpinned */ u8 buffer_head_slot; /* First slot in ->buffer */ @@ -261,9 +267,9 @@ struct netfs_io_request { unsigned long long collected_to; /* Point we've collected to */ unsigned long long cleaned_to; /* Position we've cleaned folios to */ pgoff_t no_unlock_folio; /* Don't unlock this folio after read */ + size_t prev_donated; /* Fallback for subreq->prev_donated */ refcount_t ref; unsigned long flags; -#define NETFS_RREQ_INCOMPLETE_IO 0 /* Some ioreqs terminated short or with= error */ #define NETFS_RREQ_COPY_TO_CACHE 1 /* Need to write to the cache */ #define NETFS_RREQ_NO_UNLOCK_FOLIO 2 /* Don't unlock no_unlock_folio on co= mpletion */ #define NETFS_RREQ_DONT_UNLOCK_FOLIOS 3 /* Don't unlock the folios on comp= letion */ @@ -276,6 +282,7 @@ struct netfs_io_request { #define NETFS_RREQ_PAUSE 11 /* Pause subrequest generation */ #define NETFS_RREQ_USE_IO_ITER 12 /* Use ->io_iter rather than ->i_pages = */ #define NETFS_RREQ_ALL_QUEUED 13 /* All subreqs are now queued */ +#define NETFS_RREQ_NEED_RETRY 14 /* Need to try retrying */ #define NETFS_RREQ_USE_PGPRIV2 31 /* [DEPRECATED] Use PG_private_2 to mark * write to cache on read */ const struct netfs_request_ops *netfs_ops; @@ -294,7 +301,7 @@ struct netfs_request_ops { =20 /* Read request handling */ void (*expand_readahead)(struct netfs_io_request *rreq); - bool (*clamp_length)(struct netfs_io_subrequest *subreq); + int (*prepare_read)(struct netfs_io_subrequest *subreq); void (*issue_read)(struct netfs_io_subrequest *subreq); bool (*is_still_valid)(struct netfs_io_request *rreq); int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, 
@@ -424,7 +431,10 @@ bool netfs_release_folio(struct folio *folio, gfp_t gf= p); vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *ne= tfs_group); =20 /* (Sub)request management API. */ -void netfs_subreq_terminated(struct netfs_io_subrequest *, ssize_t, bool); +void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq, + bool was_async); +void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq, + int error, bool was_async); void netfs_get_subrequest(struct netfs_io_subrequest *subreq, enum netfs_sreq_ref_trace what); void netfs_put_subrequest(struct netfs_io_subrequest *subreq, diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h index 065fa168f964..4ac3b5d56ebd 100644 --- a/include/trace/events/netfs.h +++ b/include/trace/events/netfs.h @@ -20,6 +20,7 @@ EM(netfs_read_trace_expanded, "EXPANDED ") \ EM(netfs_read_trace_readahead, "READAHEAD") \ EM(netfs_read_trace_readpage, "READPAGE ") \ + EM(netfs_read_trace_read_gaps, "READ-GAPS") \ EM(netfs_read_trace_prefetch_for_write, "PREFETCHW") \ E_(netfs_read_trace_write_begin, "WRITEBEGN") =20 @@ -33,6 +34,7 @@ #define netfs_rreq_origins \ EM(NETFS_READAHEAD, "RA") \ EM(NETFS_READPAGE, "RP") \ + EM(NETFS_READ_GAPS, "RG") \ EM(NETFS_READ_FOR_WRITE, "RW") \ EM(NETFS_DIO_READ, "DR") \ EM(NETFS_WRITEBACK, "WB") \ @@ -68,15 +70,25 @@ E_(NETFS_INVALID_WRITE, "INVL") =20 #define netfs_sreq_traces \ + EM(netfs_sreq_trace_add_donations, "+DON ") \ + EM(netfs_sreq_trace_added, "ADD ") \ + EM(netfs_sreq_trace_clear, "CLEAR") \ EM(netfs_sreq_trace_discard, "DSCRD") \ + EM(netfs_sreq_trace_donate_to_prev, "DON-P") \ + EM(netfs_sreq_trace_donate_to_next, "DON-N") \ EM(netfs_sreq_trace_download_instead, "RDOWN") \ EM(netfs_sreq_trace_fail, "FAIL ") \ EM(netfs_sreq_trace_free, "FREE ") \ + EM(netfs_sreq_trace_hit_eof, "EOF ") \ + EM(netfs_sreq_trace_io_progress, "IO ") \ EM(netfs_sreq_trace_limited, "LIMIT") \ EM(netfs_sreq_trace_prepare, "PREP ") \ EM(netfs_sreq_trace_prep_failed, "PRPFL") \ - EM(netfs_sreq_trace_resubmit_short, "SHORT") \ + EM(netfs_sreq_trace_progress, "PRGRS") \ + EM(netfs_sreq_trace_reprep_failed, "REPFL") \ EM(netfs_sreq_trace_retry, "RETRY") \ + EM(netfs_sreq_trace_short, "SHORT") \ + EM(netfs_sreq_trace_split, "SPLIT") \ EM(netfs_sreq_trace_submit, "SUBMT") \ EM(netfs_sreq_trace_terminated, "TERM ") \ EM(netfs_sreq_trace_write, "WRITE") \ @@ -117,7 +129,7 @@ EM(netfs_sreq_trace_new, "NEW ") \ EM(netfs_sreq_trace_put_cancel, "PUT CANCEL ") \ EM(netfs_sreq_trace_put_clear, "PUT CLEAR ") \ - EM(netfs_sreq_trace_put_discard, "PUT DISCARD") \ + EM(netfs_sreq_trace_put_consumed, "PUT CONSUME") \ EM(netfs_sreq_trace_put_done, "PUT DONE ") \ EM(netfs_sreq_trace_put_failed, "PUT FAILED ") \ EM(netfs_sreq_trace_put_merged, "PUT MERGED ") \ @@ -137,6 +149,7 @@ EM(netfs_flush_content, "flush") \ EM(netfs_streaming_filled_page, "mod-streamw-f") \ EM(netfs_streaming_cont_filled_page, "mod-streamw-f+") \ + EM(netfs_folio_trace_abandon, "abandon") \ EM(netfs_folio_trace_cancel_copy, "cancel-copy") \ EM(netfs_folio_trace_clear, "clear") \ EM(netfs_folio_trace_clear_cc, "clear-cc") \ @@ -152,7 +165,10 @@ EM(netfs_folio_trace_mkwrite_plus, "mkwrite+") \ EM(netfs_folio_trace_not_under_wback, "!wback") \ EM(netfs_folio_trace_put, "put") \ + EM(netfs_folio_trace_read, "read") \ + EM(netfs_folio_trace_read_done, "read-done") \ EM(netfs_folio_trace_read_gaps, "read-gaps") \ + EM(netfs_folio_trace_read_put, "read-put") \ EM(netfs_folio_trace_redirtied, "redirtied") \ EM(netfs_folio_trace_store, 
"store") \ EM(netfs_folio_trace_store_copy, "store-copy") \ @@ -165,6 +181,12 @@ EM(netfs_contig_trace_jump, "-->JUMP-->") \ E_(netfs_contig_trace_unlock, "Unlock") =20 +#define netfs_donate_traces \ + EM(netfs_trace_donate_tail_to_prev, "tail-to-prev") \ + EM(netfs_trace_donate_to_prev, "to-prev") \ + EM(netfs_trace_donate_to_next, "to-next") \ + E_(netfs_trace_donate_to_deferred_next, "defer-next") + #ifndef __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY #define __NETFS_DECLARE_TRACE_ENUMS_ONCE_ONLY =20 @@ -182,6 +204,7 @@ enum netfs_rreq_ref_trace { netfs_rreq_ref_traces } __m= ode(byte); enum netfs_sreq_ref_trace { netfs_sreq_ref_traces } __mode(byte); enum netfs_folio_trace { netfs_folio_traces } __mode(byte); enum netfs_collect_contig_trace { netfs_collect_contig_traces } __mode(byt= e); +enum netfs_donate_trace { netfs_donate_traces } __mode(byte); =20 #endif =20 @@ -204,6 +227,7 @@ netfs_rreq_ref_traces; netfs_sreq_ref_traces; netfs_folio_traces; netfs_collect_contig_traces; +netfs_donate_traces; =20 /* * Now redefine the EM() and E_() macros to map the enums to the strings t= hat @@ -224,6 +248,7 @@ TRACE_EVENT(netfs_read, TP_STRUCT__entry( __field(unsigned int, rreq ) __field(unsigned int, cookie ) + __field(loff_t, i_size ) __field(loff_t, start ) __field(size_t, len ) __field(enum netfs_read_trace, what ) @@ -233,18 +258,19 @@ TRACE_EVENT(netfs_read, TP_fast_assign( __entry->rreq =3D rreq->debug_id; __entry->cookie =3D rreq->cache_resources.debug_id; + __entry->i_size =3D rreq->i_size; __entry->start =3D start; __entry->len =3D len; __entry->what =3D what; __entry->netfs_inode =3D rreq->inode->i_ino; ), =20 - TP_printk("R=3D%08x %s c=3D%08x ni=3D%x s=3D%llx %zx", + TP_printk("R=3D%08x %s c=3D%08x ni=3D%x s=3D%llx l=3D%zx sz=3D%llx", __entry->rreq, __print_symbolic(__entry->what, netfs_read_traces), __entry->cookie, __entry->netfs_inode, - __entry->start, __entry->len) + __entry->start, __entry->len, __entry->i_size) ); =20 TRACE_EVENT(netfs_rreq, @@ -649,6 +675,71 @@ TRACE_EVENT(netfs_collect_stream, __entry->collected_to, __entry->front) ); =20 +TRACE_EVENT(netfs_progress, + TP_PROTO(const struct netfs_io_subrequest *subreq, + unsigned long long start, size_t avail, size_t part), + + TP_ARGS(subreq, start, avail, part), + + TP_STRUCT__entry( + __field(unsigned int, rreq) + __field(unsigned int, subreq) + __field(unsigned int, consumed) + __field(unsigned int, transferred) + __field(unsigned long long, f_start) + __field(unsigned int, f_avail) + __field(unsigned int, f_part) + __field(unsigned char, slot) + ), + + TP_fast_assign( + __entry->rreq =3D subreq->rreq->debug_id; + __entry->subreq =3D subreq->debug_index; + __entry->consumed =3D subreq->consumed; + __entry->transferred =3D subreq->transferred; + __entry->f_start =3D start; + __entry->f_avail =3D avail; + __entry->f_part =3D part; + __entry->slot =3D subreq->curr_folioq_slot; + ), + + TP_printk("R=3D%08x[%02x] s=3D%llx ct=3D%x/%x pa=3D%x/%x sl=3D%x", + __entry->rreq, __entry->subreq, __entry->f_start, + __entry->consumed, __entry->transferred, + __entry->f_part, __entry->f_avail, __entry->slot) + ); + +TRACE_EVENT(netfs_donate, + TP_PROTO(const struct netfs_io_request *rreq, + const struct netfs_io_subrequest *from, + const struct netfs_io_subrequest *to, + size_t amount, + enum netfs_donate_trace trace), + + TP_ARGS(rreq, from, to, amount, trace), + + TP_STRUCT__entry( + __field(unsigned int, rreq) + __field(unsigned int, from) + __field(unsigned int, to) + __field(unsigned int, amount) + __field(enum netfs_donate_trace, 
trace) + ), + + TP_fast_assign( + __entry->rreq =3D rreq->debug_id; + __entry->from =3D from->debug_index; + __entry->to =3D to ? to->debug_index : -1; + __entry->amount =3D amount; + __entry->trace =3D trace; + ), + + TP_printk("R=3D%08x[%02x] -> [%02x] %s am=3D%x", + __entry->rreq, __entry->from, __entry->to, + __print_symbolic(__entry->trace, netfs_donate_traces), + __entry->amount) + ); + #undef EM #undef E_ #endif /* _TRACE_NETFS_H */
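As an illustrative sketch of the filesystem-facing API change (the function names myfs_read_done_old/myfs_read_done_new and their arguments are hypothetical, made up for this example; only the two netfs_* calls come from the patch), the conversion pattern followed by the cifs and nfs hunks above looks roughly like this:

#include <linux/netfs.h>

/* Before this patch: the byte count and the error status were folded
 * into one argument of the completion call.
 */
static void myfs_read_done_old(struct netfs_io_subrequest *subreq,
			       ssize_t bytes, int error)
{
	netfs_subreq_terminated(subreq, error ?: bytes, false);
}

/* After this patch: the filesystem accounts the bytes it copied in
 * subreq->transferred itself and reports only the error status; folio
 * unlocking and result collection are then handled by the netfs work
 * queue rather than in the I/O completion path.
 */
static void myfs_read_done_new(struct netfs_io_subrequest *subreq,
			       ssize_t bytes, int error)
{
	if (!error)
		subreq->transferred += bytes;
	netfs_read_subreq_terminated(subreq, error, false);
}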