From: David Howells
To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
    Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
    Dominique Martinet, Ilya Dryomov, Trond Myklebust,
    netfs@lists.linux.dev, linux-afs@lists.infradead.org,
    linux-cifs@vger.kernel.org, linux-nfs@vger.kernel.org,
    ceph-devel@vger.kernel.org, v9fs@lists.linux.dev,
    linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paulo Alcantara
Subject: [PATCH 13/26] netfs: Add some tools for managing bvecq chains
Date: Thu, 26 Mar 2026 10:45:28 +0000
Message-ID:
<20260326104544.509518-14-dhowells@redhat.com>
In-Reply-To: <20260326104544.509518-1-dhowells@redhat.com>
References: <20260326104544.509518-1-dhowells@redhat.com>

Provide a selection of tools for managing bvec queue chains.  This
includes:

 (1) Allocation, prepopulation, expansion, shortening and refcounting of
     bvecqs and bvecq chains.  This can be used to do things like creating
     an encryption buffer in cifs or a directory content buffer in afs.
     The memory segments will be appropriately disposed of according to
     the flags on the bvecq.

 (2) Management of a bvecq chain as a rolling buffer and the management of
     positions within it.

 (3) Loading folios, slicing chains and clearing content.

Signed-off-by: David Howells
cc: Paulo Alcantara
cc: Matthew Wilcox
cc: Christoph Hellwig
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/Makefile            |   1 +
 fs/netfs/bvecq.c             | 706 +++++++++++++++++++++++++++++++++++
 fs/netfs/internal.h          |   1 +
 fs/netfs/stats.c             |   4 +-
 include/linux/bvecq.h        | 165 +++++++-
 include/linux/iov_iter.h     |   4 +-
 include/linux/netfs.h        |   1 +
 include/trace/events/netfs.h |  24 ++
 lib/iov_iter.c               |  16 +-
 lib/scatterlist.c            |   4 +-
 lib/tests/kunit_iov_iter.c   |  18 +-
 11 files changed, 919 insertions(+), 25 deletions(-)
 create mode 100644 fs/netfs/bvecq.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index b43188d64bd8..e1f12ecb5abf 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@ netfs-y := \
 	buffered_read.o \
 	buffered_write.o \
+	bvecq.o \
 	direct_read.o \
 	direct_write.o \
 	iterator.o \
diff --git a/fs/netfs/bvecq.c b/fs/netfs/bvecq.c
new file mode 100644
index 000000000000..c71646ea5243
--- /dev/null
+++ b/fs/netfs/bvecq.c
@@ -0,0 +1,706 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Buffering helpers for bvec queues
+ *
+ * Copyright (C) 2025 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include "internal.h"
+
+void bvecq_dump(const struct bvecq *bq)
+{
+	int b = 0;
+
+	for (; bq; bq = bq->next, b++) {
+		int skipz = 0;
+
+		pr_notice("BQ[%u] %u/%u fp=%llx\n", b, bq->nr_slots, bq->max_slots, bq->fpos);
+		for (int s = 0; s < bq->nr_slots; s++) {
+			const struct bio_vec *bv = &bq->bv[s];
+
+			if (!bv->bv_page && !bv->bv_len && skipz < 2) {
+				skipz = 1;
+				continue;
+			}
+			if (skipz == 1)
+				pr_notice("BQ[%u:00-%02u] ...\n", b, s - 1);
+			skipz = 2;
+			pr_notice("BQ[%u:%02u] %10lx %04x %04x %u\n",
+				  b, s,
+				  bv->bv_page ? page_to_pfn(bv->bv_page) : 0,
+				  bv->bv_offset, bv->bv_len,
+				  bv->bv_page ? page_count(bv->bv_page) : 0);
+		}
+	}
+}
+EXPORT_SYMBOL(bvecq_dump);
+
+/**
+ * bvecq_alloc_one - Allocate a single bvecq node with unpopulated slots
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a single bvecq node and initialise the header.  A number of inline
+ * slots are also allocated, rounded up to fit after the header in a power-of-2
+ * slab object of up to 512 bytes (up to 29 slots on a 64-bit cpu).  The slot
+ * array is not initialised.
+ *
+ * Return: The node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp)
+{
+	struct bvecq *bq;
+	const size_t max_size = 512;
+	const size_t max_slots = (max_size - sizeof(*bq)) / sizeof(bq->__bv[0]);
+	size_t part = umin(nr_slots, max_slots);
+	size_t size = roundup_pow_of_two(struct_size(bq, __bv, part));
+
+	bq = kmalloc(size, gfp);
+	if (bq) {
+		*bq = (struct bvecq) {
+			.ref		= REFCOUNT_INIT(1),
+			.bv		= bq->__bv,
+			.inline_bv	= true,
+			.max_slots	= (size - sizeof(*bq)) / sizeof(bq->__bv[0]),
+		};
+		netfs_stat(&netfs_n_bvecq);
+	}
+	return bq;
+}
+EXPORT_SYMBOL(bvecq_alloc_one);
+
+/**
+ * bvecq_alloc_chain - Allocate an unpopulated bvecq chain
+ * @nr_slots: Number of slots to allocate
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a chain of bvecq nodes providing at least the requested cumulative
+ * number of slots.
+ *
+ * Return: The first node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp)
+{
+	struct bvecq *head = NULL, *tail = NULL;
+
+	_enter("%zu", nr_slots);
+
+	for (;;) {
+		struct bvecq *bq;
+
+		bq = bvecq_alloc_one(nr_slots, gfp);
+		if (!bq)
+			goto oom;
+
+		if (tail) {
+			tail->next = bq;
+			bq->prev = tail;
+		} else {
+			head = bq;
+		}
+		tail = bq;
+		if (tail->max_slots >= nr_slots)
+			break;
+		nr_slots -= tail->max_slots;
+	}
+
+	return head;
+oom:
+	bvecq_put(head);
+	return NULL;
+}
+EXPORT_SYMBOL(bvecq_alloc_chain);
+
+/**
+ * bvecq_alloc_buffer - Allocate a bvecq chain and populate with buffers
+ * @size: Target size of the buffer (can be 0 for an empty buffer)
+ * @pre_slots: Number of preamble slots to set aside
+ * @gfp: The allocation constraints.
+ *
+ * Allocate a chain of bvecq nodes and populate the slots with sufficient pages
+ * to provide at least the requested amount of space, leaving the first
+ * @pre_slots slots unset.  The pages allocated may be compound pages larger
+ * than PAGE_SIZE and thus occupy fewer slots.  The pages have their refcounts
+ * set to 1 and can be passed to MSG_SPLICE_PAGES.
+ *
+ * Return: The first node pointer or NULL on allocation failure.
+ */
+struct bvecq *bvecq_alloc_buffer(size_t size, unsigned int pre_slots, gfp_t gfp)
+{
+	struct bvecq *head = NULL, *tail = NULL, *p = NULL;
+	size_t count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+	_enter("%zx,%zx,%u", size, count, pre_slots);
+
+	do {
+		struct page **pages;
+		int want, got;
+
+		p = bvecq_alloc_one(umin(count, 32 - 3), gfp);
+		if (!p)
+			goto oom;
+
+		p->free = true;
+
+		if (tail) {
+			tail->next = p;
+			p->prev = tail;
+		} else {
+			head = p;
+		}
+		tail = p;
+		if (!count)
+			break;
+
+		pages = (struct page **)&p->bv[p->max_slots];
+		pages -= p->max_slots - pre_slots;
+		memset(pages, 0, (p->max_slots - pre_slots) * sizeof(pages[0]));
+
+		want = umin(count, p->max_slots - pre_slots);
+		got = alloc_pages_bulk(gfp, want, pages);
+		if (got < want) {
+			for (int i = 0; i < got; i++)
+				__free_page(pages[i]);
+			goto oom;
+		}
+
+		tail->nr_slots = pre_slots + got;
+		for (int i = 0; i < got; i++) {
+			int j = pre_slots + i;
+
+			set_page_count(pages[i], 1);
+			bvec_set_page(&tail->bv[j], pages[i], PAGE_SIZE, 0);
+		}
+
+		count -= got;
+		pre_slots = 0;
+	} while (count > 0);
+
+	return head;
+oom:
+	bvecq_put(head);
+	return NULL;
+}
+EXPORT_SYMBOL(bvecq_alloc_buffer);
+
+/*
+ * Free the page pointed to by a segment as necessary.
+ */
+static void bvecq_free_seg(struct bvecq *bq, unsigned int seg)
+{
+	if (!bq->free ||
+	    !bq->bv[seg].bv_page)
+		return;
+
+	if (bq->unpin)
+		unpin_user_page(bq->bv[seg].bv_page);
+	else
+		__free_page(bq->bv[seg].bv_page);
+}
+
+/**
+ * bvecq_put - Put a ref on a bvec queue
+ * @bq: The start of the bvec queue to free
+ *
+ * Put the ref(s) on the nodes in a bvec queue, freeing up the node and the
+ * page fragments it points to as the refcounts become zero.
+ */
+void bvecq_put(struct bvecq *bq)
+{
+	struct bvecq *next;
+
+	for (; bq; bq = next) {
+		if (!refcount_dec_and_test(&bq->ref))
+			break;
+		for (int seg = 0; seg < bq->nr_slots; seg++)
+			bvecq_free_seg(bq, seg);
+		next = bq->next;
+		netfs_stat_d(&netfs_n_bvecq);
+		kfree(bq);
+	}
+}
+EXPORT_SYMBOL(bvecq_put);
+
+/**
+ * bvecq_expand_buffer - Allocate buffer space into a bvec queue
+ * @_buffer: Pointer to the bvecq chain to expand (may point to a NULL; updated).
+ * @_cur_size: Current size of the buffer (updated).
+ * @size: Target size of the buffer.
+ * @gfp: The allocation constraints.
+ */
+int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t size, gfp_t gfp)
+{
+	struct bvecq *tail = *_buffer;
+	const size_t max_slots = 32;
+
+	size = round_up(size, PAGE_SIZE);
+	if (*_cur_size >= size)
+		return 0;
+
+	if (tail)
+		while (tail->next)
+			tail = tail->next;
+
+	do {
+		struct page *page;
+		int order = 0;
+
+		if (!tail || bvecq_is_full(tail)) {
+			struct bvecq *p;
+
+			p = bvecq_alloc_one(max_slots, gfp);
+			if (!p)
+				return -ENOMEM;
+			if (tail) {
+				tail->next = p;
+				p->prev = tail;
+			} else {
+				*_buffer = p;
+			}
+			tail = p;
+		}
+
+		if (size - *_cur_size > PAGE_SIZE)
+			order = umin(ilog2(size - *_cur_size) - PAGE_SHIFT,
+				     MAX_PAGECACHE_ORDER);
+
+		page = alloc_pages(gfp | __GFP_COMP, order);
+		if (!page && order > 0)
+			page = alloc_pages(gfp | __GFP_COMP, 0);
+		if (!page)
+			return -ENOMEM;
+
+		bvec_set_page(&tail->bv[tail->nr_slots++], page, PAGE_SIZE << order, 0);
+		*_cur_size += PAGE_SIZE << order;
+	} while (*_cur_size < size);
+
+	return 0;
+}
+EXPORT_SYMBOL(bvecq_expand_buffer);
+
+/**
+ * bvecq_shorten_buffer - Shorten a bvec queue buffer
+ * @bq: The start of the buffer to shorten
+ * @slot: The slot to start from
+ * @size: The size to retain
+ *
+ * Shorten the content of a bvec queue down to the minimum number of segments,
+ * starting at the specified segment, to retain the specified size.
+ *
+ * Return: 0 if successful; -EMSGSIZE if there is insufficient content.
+ */
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size)
+{
+	ssize_t retain = size;
+
+	/* Skip through the segments we want to keep. */
+	for (; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			retain -= bq->bv[slot].bv_len;
+			if (retain < 0)
+				goto found;
+		}
+		slot = 0;
+	}
+	if (WARN_ON_ONCE(retain > 0))
+		return -EMSGSIZE;
+	return 0;
+
+found:
+	/* Shorten the entry to be retained and clean the rest of this bvecq. */
+	bq->bv[slot].bv_len += retain;
+	slot++;
+	for (int i = slot; i < bq->nr_slots; i++)
+		bvecq_free_seg(bq, i);
+	bq->nr_slots = slot;
+
+	/* Free the queue tail. */
+	bvecq_put(bq->next);
+	bq->next = NULL;
+	return 0;
+}
+EXPORT_SYMBOL(bvecq_shorten_buffer);
+
+/**
+ * bvecq_buffer_init - Initialise a buffer and set position
+ * @pos: The position to point at the new buffer.
+ * @gfp: The allocation constraints.
+ *
+ * Initialise a rolling buffer.  We allocate an unpopulated bvecq node so that
+ * the pointers can be independently driven by the producer and the consumer.
+ *
+ * Return: 0 if successful; -ENOMEM on allocation failure.
+ */
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp)
+{
+	struct bvecq *bq;
+
+	bq = bvecq_alloc_one(13, gfp);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq = bq;	/* Comes with a ref. */
+	pos->slot = 0;
+	pos->offset = 0;
+	return 0;
+}
+
+/**
+ * bvecq_buffer_make_space - Start a new bvecq node in a buffer
+ * @pos: The position of the last node.
+ * @gfp: The allocation constraints.
+ *
+ * Add a new node on to the buffer chain at the specified position, either
+ * because the previous one is full or because we have a discontiguity to
+ * contend with, and update @pos to point to it.
+ *
+ * Return: 0 if successful; -ENOMEM on allocation failure.
+ */
+int bvecq_buffer_make_space(struct bvecq_pos *pos, gfp_t gfp)
+{
+	struct bvecq *bq, *head = pos->bvecq;
+
+	bq = bvecq_alloc_one(14, gfp);
+	if (!bq)
+		return -ENOMEM;
+	bq->prev = head;
+
+	pos->bvecq = bvecq_get(bq);
+	pos->slot = 0;
+	pos->offset = 0;
+
+	/* Make sure the initialisation is stored before the next pointer.
+	 *
+	 * [!] NOTE: After we set head->next, the consumer is at liberty to
+	 * immediately delete the old head.
+	 */
+	smp_store_release(&head->next, bq);
+	bvecq_put(head);
+	return 0;
+}
+
+/**
+ * bvecq_pos_advance - Advance a bvecq position
+ * @pos: The position to advance.
+ * @amount: The number of bytes to advance by.
+ *
+ * Advance the specified bvecq position by @amount bytes.  @pos is updated and
+ * bvecq ref counts may have been manipulated.  If the position hits the end of
+ * the queue, then it is left pointing beyond the last slot of the last bvecq
+ * so that it doesn't break the chain.
+ */
+void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount)
+{
+	struct bvecq *bq = pos->bvecq;
+	unsigned int slot = pos->slot;
+	size_t offset = pos->offset;
+
+	if (slot >= bq->nr_slots) {
+		bq = bq->next;
+		slot = 0;
+	}
+
+	while (amount) {
+		const struct bio_vec *bv = &bq->bv[slot];
+		size_t part = umin(bv->bv_len - offset, amount);
+
+		if (likely(part < bv->bv_len)) {
+			offset += part;
+			break;
+		}
+		amount -= part;
+		offset = 0;
+		slot++;
+		if (slot >= bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq = bq->next;
+			slot = 0;
+		}
+	}
+
+	pos->slot = slot;
+	pos->offset = offset;
+	bvecq_pos_move(pos, bq);
+}
+
+/**
+ * bvecq_zero - Clear memory starting at the bvecq position.
+ * @pos: The position in the bvecq chain to start clearing.
+ * @amount: The number of bytes to clear.
+ *
+ * Clear memory fragments pointed to by a bvec queue.  @pos is updated and
+ * bvecq ref counts may have been manipulated.  If the position hits the end of
+ * the queue, then it is left pointing beyond the last slot of the last bvecq
+ * so that it doesn't break the chain.
+ *
+ * Return: The number of bytes cleared.
+ */
+ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount)
+{
+	struct bvecq *bq = pos->bvecq;
+	unsigned int slot = pos->slot;
+	ssize_t cleared = 0;
+	size_t offset = pos->offset;
+
+	if (WARN_ON_ONCE(!bq))
+		return 0;
+
+	if (slot >= bq->nr_slots) {
+		bq = bq->next;
+		if (WARN_ON_ONCE(!bq))
+			return 0;
+		slot = 0;
+	}
+
+	do {
+		const struct bio_vec *bv = &bq->bv[slot];
+
+		if (offset < bv->bv_len) {
+			size_t part = umin(amount - cleared, bv->bv_len - offset);
+
+			memzero_page(bv->bv_page, bv->bv_offset + offset, part);
+
+			offset += part;
+			cleared += part;
+		}
+
+		if (offset >= bv->bv_len) {
+			offset = 0;
+			slot++;
+			if (slot >= bq->nr_slots) {
+				if (!bq->next)
+					break;
+				bq = bq->next;
+				slot = 0;
+			}
+		}
+	} while (cleared < amount);
+
+	bvecq_pos_move(pos, bq);
+	pos->slot = slot;
+	pos->offset = offset;
+	return cleared;
+}
+
+/**
+ * bvecq_slice - Find a slice of a bvecq queue
+ * @pos: The position to start at.
+ * @max_size: The maximum size of the slice (or ULONG_MAX).
+ * @max_segs: The maximum number of segments in the slice (or INT_MAX).
+ * @_nr_segs: Where to put the number of segments (updated).
+ *
+ * Determine the size and number of segments that can be obtained for the next
+ * slice of the bvec queue, up to the maximum size and segment count specified.
+ * The slice is also limited if a discontiguity is found.
+ *
+ * @pos is updated to the end of the slice.  If the position hits the end of
+ * the queue, then it is left pointing beyond the last slot of the last bvecq
+ * so that it doesn't break the chain.
+ *
+ * Return: The number of bytes in the slice.
+ */
+size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size,
+		   unsigned int max_segs, unsigned int *_nr_segs)
+{
+	struct bvecq *bq;
+	unsigned int slot = pos->slot, nsegs = 0;
+	size_t size = 0;
+	size_t offset = pos->offset;
+
+	bq = pos->bvecq;
+	for (;;) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+
+			if (offset < bvec->bv_len && bvec->bv_page) {
+				size_t part = umin(bvec->bv_len - offset, max_size);
+
+				size += part;
+				offset += part;
+				max_size -= part;
+				nsegs++;
+				if (!max_size || nsegs >= max_segs)
+					goto out;
+			}
+			offset = 0;
+		}
+
+		/* pos->bvecq isn't allowed to go NULL as the queue may get
+		 * extended and we would lose our place.
+		 */
+		if (!bq->next)
+			break;
+		slot = 0;
+		bq = bq->next;
+		if (bq->discontig && size > 0)
+			break;
+	}
+
+out:
+	*_nr_segs = nsegs;
+	if (slot == bq->nr_slots && bq->next) {
+		bq = bq->next;
+		slot = 0;
+		offset = 0;
+	}
+	bvecq_pos_move(pos, bq);
+	pos->slot = slot;
+	pos->offset = offset;
+	return size;
+}
+
+/**
+ * bvecq_extract - Extract a slice of a bvecq queue into a new bvecq queue
+ * @pos: The position to start at.
+ * @max_size: The maximum size of the slice (or ULONG_MAX).
+ * @max_segs: The maximum number of segments in the slice (or INT_MAX).
+ * @to: Where to put the extraction bvecq chain head (updated).
+ *
+ * Allocate a new bvecq and extract into it memory fragments from a slice of
+ * the bvec queue, starting at @pos.  The slice is also limited if a
+ * discontiguity is found.  No refs are taken on the pages.
+ *
+ * @pos is updated to the end of the slice.  If the position hits the end of
+ * the queue, then it is left pointing beyond the last slot of the last bvecq
+ * so that it doesn't break the chain.
+ *
+ * If successful, *@to is set to point to the head of the newly allocated chain
+ * and the caller inherits a ref to it.
+ *
+ * Return: The number of bytes extracted; -ENOMEM on allocation failure or -EIO
+ * if no segments were available to extract.
+ */
+ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size,
+		      unsigned int max_segs, struct bvecq **to)
+{
+	struct bvecq_pos tmp_pos;
+	struct bvecq *src, *dst = NULL;
+	unsigned int slot = pos->slot, nsegs;
+	ssize_t extracted = 0;
+	size_t offset = pos->offset, amount;
+
+	*to = NULL;
+	if (WARN_ON_ONCE(!max_segs))
+		max_segs = INT_MAX;
+
+	bvecq_pos_set(&tmp_pos, pos);
+	amount = bvecq_slice(&tmp_pos, max_size, max_segs, &nsegs);
+	bvecq_pos_unset(&tmp_pos);
+	if (nsegs == 0)
+		return -EIO;
+
+	dst = bvecq_alloc_chain(nsegs, GFP_KERNEL);
+	if (!dst)
+		return -ENOMEM;
+	*to = dst;
+	max_segs = nsegs;
+	nsegs = 0;
+
+	/* Transcribe the segments */
+	src = pos->bvecq;
+	for (;;) {
+		for (; slot < src->nr_slots; slot++) {
+			const struct bio_vec *sv = &src->bv[slot];
+			struct bio_vec *dv = &dst->bv[dst->nr_slots];
+
+			_debug("EXTR BQ=%x[%x] off=%zx am=%zx p=%lx",
+			       src->priv, slot, offset, amount, page_to_pfn(sv->bv_page));
+
+			if (offset < sv->bv_len && sv->bv_page) {
+				size_t part = umin(sv->bv_len - offset, amount);
+
+				bvec_set_page(dv, sv->bv_page, part,
+					      sv->bv_offset + offset);
+				extracted += part;
+				amount -= part;
+				offset += part;
+				trace_netfs_bv_slot(dst, dst->nr_slots);
+				dst->nr_slots++;
+				nsegs++;
+				if (bvecq_is_full(dst))
+					dst = dst->next;
+				if (nsegs >= max_segs)
+					goto out;
+				if (amount == 0)
+					goto out;
+			}
+			offset = 0;
+		}
+
+		/* pos->bvecq isn't allowed to go NULL as the queue may get
+		 * extended and we would lose our place.
+		 */
+		if (!src->next)
+			break;
+		slot = 0;
+		src = src->next;
+		if (src->discontig && extracted > 0)
+			break;
+	}
+
+out:
+	if (slot == src->nr_slots && src->next) {
+		src = src->next;
+		slot = 0;
+		offset = 0;
+	}
+	bvecq_pos_move(pos, src);
+	pos->slot = slot;
+	pos->offset = offset;
+	return extracted;
+}
+
+/**
+ * bvecq_load_from_ra - Allocate a bvecq chain and load from readahead
+ * @pos: Blank position object to attach the new chain to.
+ * @ractl: The readahead control context.
+ *
+ * Decant the set of folios to be read from the readahead context into a bvecq
+ * chain.  Each folio occupies one bio_vec element.
+ *
+ * Return: Amount of data loaded or -ENOMEM on allocation failure.
+ */
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control *ractl)
+{
+	XA_STATE(xas, &ractl->mapping->i_pages, ractl->_index);
+	struct folio *folio;
+	struct bvecq *bq;
+	size_t loaded = 0;
+
+	bq = bvecq_alloc_chain(ractl->_nr_folios, GFP_NOFS);
+	if (!bq)
+		return -ENOMEM;
+
+	pos->bvecq = bq;
+	pos->slot = 0;
+	pos->offset = 0;
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ractl->_index + ractl->_nr_pages - 1) {
+		size_t len;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+		len = folio_size(folio);
+		bvec_set_folio(&bq->bv[bq->nr_slots++], folio, len, 0);
+		loaded += len;
+		trace_netfs_folio(folio, netfs_folio_trace_read);
+
+		if (bq->nr_slots >= bq->max_slots) {
+			bq = bq->next;
+			if (!bq)
+				break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	ractl->_index += ractl->_nr_pages;
+	ractl->_nr_pages = 0;
+	return loaded;
+}
diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index 2fcf31de5f2c..ad47bcc1947b 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -168,6 +168,7 @@ extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
 extern atomic_t netfs_n_folioq;
+extern atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v);
 
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index ab6b916addc4..84c2a4bcc762 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -48,6 +48,7 @@ atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
 atomic_t netfs_n_folioq;
+atomic_t netfs_n_bvecq;
 
 int netfs_stats_show(struct seq_file *m, void *v)
 {
@@ -90,9 +91,10 @@ int netfs_stats_show(struct seq_file *m, void *v)
 		   atomic_read(&netfs_n_rh_retry_read_subreq),
 		   atomic_read(&netfs_n_wh_retry_write_req),
 		   atomic_read(&netfs_n_wh_retry_write_subreq));
-	seq_printf(m, "Objs : rr=%u sr=%u foq=%u wsc=%u\n",
+	seq_printf(m, "Objs : rr=%u sr=%u bq=%u foq=%u wsc=%u\n",
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq),
+		   atomic_read(&netfs_n_bvecq),
 		   atomic_read(&netfs_n_folioq),
 		   atomic_read(&netfs_n_wh_wstream_conflict));
 	seq_printf(m, "WbLock : skip=%u wait=%u\n",
diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
index 462125af1cc7..6c58a7fb6472 100644
--- a/include/linux/bvecq.h
+++ b/include/linux/bvecq.h
@@ -17,7 +17,7 @@
  * iterated over with an ITER_BVECQ iterator.  The list is non-circular; next
  * and prev are NULL at the ends.
  *
- * The bv pointer points to the segment array; this may be __bv if allocated
+ * The bv pointer points to the bio_vec array; this may be __bv if allocated
  * together.  The caller is responsible for determining whether or not this is
  * the case as the array pointed to by bv may be follow on directly from the
  * bvecq by accident of allocation (ie. ->bv == ->__bv is *not* sufficient to
@@ -33,8 +33,8 @@ struct bvecq {
 	unsigned long long	fpos;		/* File position */
 	refcount_t		ref;
 	u32			priv;		/* Private data */
-	u16			nr_segs;	/* Number of elements in bv[] used */
-	u16			max_segs;	/* Number of elements allocated in bv[] */
+	u16			nr_slots;	/* Number of elements in bv[] used */
+	u16			max_slots;	/* Number of elements allocated in bv[] */
 	bool			inline_bv:1;	/* T if __bv[] is being used */
 	bool			free:1;		/* T if the pages need freeing */
 	bool			unpin:1;	/* T if the pages need unpinning, not freeing */
@@ -43,4 +43,163 @@ struct bvecq {
 	struct bio_vec		__bv[];		/* Default array (if ->inline_bv) */
 };
 
+/*
+ * Position in a bio_vec queue.  The position holds a ref on the queue
+ * segment it points to.
+ */
+struct bvecq_pos {
+	struct bvecq	*bvecq;		/* The first bvecq */
+	unsigned int	offset;		/* The offset within the starting slot */
+	u16		slot;		/* The starting slot */
+};
+
+void bvecq_dump(const struct bvecq *bq);
+struct bvecq *bvecq_alloc_one(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_chain(size_t nr_slots, gfp_t gfp);
+struct bvecq *bvecq_alloc_buffer(size_t size, unsigned int pre_slots, gfp_t gfp);
+void bvecq_put(struct bvecq *bq);
+int bvecq_expand_buffer(struct bvecq **_buffer, size_t *_cur_size, ssize_t size, gfp_t gfp);
+int bvecq_shorten_buffer(struct bvecq *bq, unsigned int slot, size_t size);
+int bvecq_buffer_init(struct bvecq_pos *pos, gfp_t gfp);
+int bvecq_buffer_make_space(struct bvecq_pos *pos, gfp_t gfp);
+void bvecq_pos_advance(struct bvecq_pos *pos, size_t amount);
+ssize_t bvecq_zero(struct bvecq_pos *pos, size_t amount);
+size_t bvecq_slice(struct bvecq_pos *pos, size_t max_size,
+		   unsigned int max_segs, unsigned int *_nr_segs);
+ssize_t bvecq_extract(struct bvecq_pos *pos, size_t max_size,
+		      unsigned int max_segs, struct bvecq **to);
+ssize_t bvecq_load_from_ra(struct bvecq_pos *pos, struct readahead_control *ractl);
+
+/**
+ * bvecq_get - Get a ref on a bvecq
+ * @bq: The bvecq to get a ref on
+ */
+static inline struct bvecq *bvecq_get(struct bvecq *bq)
+{
+	refcount_inc(&bq->ref);
+	return bq;
+}
+
+/**
+ * bvecq_is_full - Determine if a bvecq is full
+ * @bvecq: The object to query
+ *
+ * Return: true if full; false if not.
+ */
+static inline bool bvecq_is_full(const struct bvecq *bvecq)
+{
+	return bvecq->nr_slots >= bvecq->max_slots;
+}
+
+/**
+ * bvecq_pos_set - Set one position to be the same as another
+ * @pos: The position object to set
+ * @at: The source position.
+ *
+ * Set @pos to have the same position as @at.  This may take a ref on the
+ * bvecq pointed to.
+ */
+static inline void bvecq_pos_set(struct bvecq_pos *pos, const struct bvecq_pos *at)
+{
+	*pos = *at;
+	bvecq_get(pos->bvecq);
+}
+
+/**
+ * bvecq_pos_unset - Unset a position
+ * @pos: The position object to unset
+ *
+ * Unset @pos.  This does any needed ref cleanup.
+ */
+static inline void bvecq_pos_unset(struct bvecq_pos *pos)
+{
+	bvecq_put(pos->bvecq);
+	pos->bvecq = NULL;
+	pos->slot = 0;
+	pos->offset = 0;
+}
+
+/**
+ * bvecq_pos_transfer - Transfer one position to another, clearing the first
+ * @pos: The position object to set
+ * @from: The source position to clear.
+ *
+ * Set @pos to have the same position as @from and then clear @from.  This may
+ * transfer a ref on the bvecq pointed to.
+ */
+static inline void bvecq_pos_transfer(struct bvecq_pos *pos, struct bvecq_pos *from)
+{
+	*pos = *from;
+	from->bvecq = NULL;
+	from->slot = 0;
+	from->offset = 0;
+}
+
+/**
+ * bvecq_pos_move - Update a position to a new bvecq
+ * @pos: The position object to update.
+ * @to: The new bvecq to point at.
+ *
+ * Update @pos to point to @to if it doesn't already do so.  This may
+ * manipulate refs on the bvecqs pointed to.
+ */
+static inline void bvecq_pos_move(struct bvecq_pos *pos, struct bvecq *to)
+{
+	struct bvecq *old = pos->bvecq;
+
+	if (old != to) {
+		pos->bvecq = bvecq_get(to);
+		bvecq_put(old);
+	}
+}
+
+/**
+ * bvecq_pos_step - Step a position to the next slot if possible
+ * @pos: The position object to step.
+ *
+ * Update @pos to point to the next slot in the queue if not at the end.  This
+ * may manipulate refs on the bvecqs pointed to.
+ *
+ * Return: true if successful, false if it was at the end.
+ */
+static inline bool bvecq_pos_step(struct bvecq_pos *pos)
+{
+	struct bvecq *bq = pos->bvecq;
+
+	pos->slot++;
+	pos->offset = 0;
+	if (pos->slot <= bq->nr_slots)
+		return true;
+	if (!bq->next)
+		return false;
+	bvecq_pos_move(pos, bq->next);
+	return true;
+}
+
+/**
+ * bvecq_delete_spent - Delete the bvecq at the front if possible
+ * @pos: The position object to update.
+ *
+ * Delete the used-up bvecq at the front of the queue that @pos points to if it
+ * is not the last node in the queue; if it is the last node in the queue, it
+ * is kept so that the queue doesn't become detached from the other end.  This
+ * may manipulate refs on the bvecqs pointed to.
+ */
+static inline struct bvecq *bvecq_delete_spent(struct bvecq_pos *pos)
+{
+	struct bvecq *spent = pos->bvecq;
+	/* Read the contents of the queue node after the pointer to it. */
+	struct bvecq *next = smp_load_acquire(&spent->next);
+
+	if (!next)
+		return NULL;
+	next->prev = NULL;
+	spent->next = NULL;
+	bvecq_put(spent);
+	pos->bvecq = next;	/* We take spent's ref */
+	pos->slot = 0;
+	pos->offset = 0;
+	return next;
+}
+
 #endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index 999607ece481..309642b3901f 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -152,7 +152,7 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 	unsigned int slot = iter->bvecq_slot;
 	size_t progress = 0, skip = iter->iov_offset;
 
-	if (slot == bq->nr_segs) {
+	if (slot == bq->nr_slots) {
 		/* The iterator may have been extended. */
 		bq = bq->next;
 		slot = 0;
@@ -176,7 +176,7 @@ size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 		if (skip >= bvec->bv_len) {
 			skip = 0;
 			slot++;
-			if (slot >= bq->nr_segs) {
+			if (slot >= bq->nr_slots) {
 				if (!bq->next)
 					break;
 				bq = bq->next;
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index cc56b6512769..5bc48aacf7f6 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
diff --git a/include/trace/events/netfs.h b/include/trace/events/netfs.h
index b8236f9e940e..fbb094231659 100644
--- a/include/trace/events/netfs.h
+++ b/include/trace/events/netfs.h
@@ -779,6 +779,30 @@ TRACE_EVENT(netfs_folioq,
 		      __print_symbolic(__entry->trace, netfs_folioq_traces))
 	);
 
+TRACE_EVENT(netfs_bv_slot,
+	    TP_PROTO(const struct bvecq *bq, int slot),
+
+	    TP_ARGS(bq, slot),
+
+	    TP_STRUCT__entry(
+		    __field(unsigned long, pfn)
+		    __field(unsigned int, offset)
+		    __field(unsigned int, len)
+		    __field(unsigned int, slot)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->slot	= slot;
+		    __entry->pfn	= page_to_pfn(bq->bv[slot].bv_page);
+		    __entry->offset	= bq->bv[slot].bv_offset;
+		    __entry->len	= bq->bv[slot].bv_len;
+		    ),
+
+	    TP_printk("bq[%x] p=%lx %x-%x",
+		      __entry->slot,
+		      __entry->pfn, __entry->offset, __entry->offset + __entry->len)
+	    );
+
 #undef EM
 #undef E_
 #endif /* _TRACE_NETFS_H */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index df8d037894b1..4f091e6d4a22 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -580,7 +580,7 @@ static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
 		return;
 	i->count -= by;
 
-	if (slot >= bq->nr_segs) {
+	if (slot >= bq->nr_slots) {
 		bq = bq->next;
 		slot = 0;
 	}
@@ -593,7 +593,7 @@ static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
 			break;
 		by -= len;
 		slot++;
-		if (slot >= bq->nr_segs && bq->next) {
+		if (slot >= bq->nr_slots && bq->next) {
 			bq = bq->next;
 			slot = 0;
 		}
@@ -662,7 +662,7 @@ static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
 
 		if (slot == 0) {
 			bq = bq->prev;
-			slot = bq->nr_segs;
+			slot = bq->nr_slots;
 		}
 		slot--;
 
@@ -947,7 +947,7 @@ static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
 		return res;
 
 	for (bq = iter->bvecq; bq; bq = bq->next) {
-		for (; slot < bq->nr_segs; slot++) {
+		for (; slot < bq->nr_slots; slot++) {
 			const struct bio_vec *bvec = &bq->bv[slot];
 			size_t part = umin(bvec->bv_len - skip, size);
 
@@ -1331,7 +1331,7 @@ static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpages)
 	size_t size = iter->count;
 
 	for (bq = iter->bvecq; bq; bq = bq->next) {
-		for (; slot < bq->nr_segs; slot++) {
+		for (; slot < bq->nr_slots; slot++) {
 			const struct bio_vec *bvec = &bq->bv[slot];
 			size_t offs = (bvec->bv_offset + skip) % PAGE_SIZE;
 			size_t part = umin(bvec->bv_len - skip, size);
@@ -1731,7 +1731,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 	unsigned int seg = iter->bvecq_slot, count = 0, nr = 0;
 	size_t extracted = 0, offset = iter->iov_offset;
 
-	if (seg >= bvecq->nr_segs) {
+	if (seg >= bvecq->nr_slots) {
 		bvecq = bvecq->next;
 		if (WARN_ON_ONCE(!bvecq))
 			return 0;
@@ -1763,7 +1763,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(extracted < iter->count);
 					break;
@@ -1816,7 +1816,7 @@ static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(extracted < iter->count);
 					break;
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 03e3883a1a2d..93a3d194a914 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1345,7 +1345,7 @@ static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
 	ssize_t ret = 0;
 	size_t offset = iter->iov_offset;
 
-	if (seg >= bvecq->nr_segs) {
+	if (seg >= bvecq->nr_slots) {
 		bvecq = bvecq->next;
 		if (WARN_ON_ONCE(!bvecq))
 			return 0;
@@ -1373,7 +1373,7 @@ static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
 		if (offset >= blen) {
 			offset = 0;
 			seg++;
-			if (seg >= bvecq->nr_segs) {
+			if (seg >= bvecq->nr_slots) {
 				if (!bvecq->next) {
 					WARN_ON_ONCE(ret < iter->count);
 					break;
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index 5bc941f64343..ff0621636ff1 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -543,28 +543,28 @@ static void iov_kunit_destroy_bvecq(void *data)
 
 	for (bq = data; bq; bq = next) {
 		next = bq->next;
-		for (int i = 0; i < bq->nr_segs; i++)
+		for (int i = 0; i < bq->nr_slots; i++)
 			if (bq->bv[i].bv_page)
 				put_page(bq->bv[i].bv_page);
 		kfree(bq);
 	}
 }
 
-static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_segs)
+static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_slots)
 {
 	struct bvecq *bq;
 
-	bq = kzalloc(struct_size(bq, __bv, max_segs), GFP_KERNEL);
+	bq =
kzalloc(struct_size(bq, __bv, max_slots), GFP_KERNEL); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq); - bq->max_segs =3D max_segs; + bq->max_slots =3D max_slots; return bq; } =20 -static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned i= nt max_segs) +static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned i= nt max_slots) { struct bvecq *bq; =20 - bq =3D iov_kunit_alloc_bvecq(test, max_segs); + bq =3D iov_kunit_alloc_bvecq(test, max_slots); kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq); return bq; } @@ -578,13 +578,13 @@ static void __init iov_kunit_load_bvecq(struct kunit = *test, size_t size =3D 0; =20 for (int i =3D 0; i < npages; i++) { - if (bq->nr_segs >=3D bq->max_segs) { + if (bq->nr_slots >=3D bq->max_slots) { bq->next =3D iov_kunit_alloc_bvecq(test, 8); bq->next->prev =3D bq; bq =3D bq->next; } - bvec_set_page(&bq->bv[bq->nr_segs], pages[i], PAGE_SIZE, 0); - bq->nr_segs++; + bvec_set_page(&bq->bv[bq->nr_slots], pages[i], PAGE_SIZE, 0); + bq->nr_slots++; size +=3D PAGE_SIZE; } iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size);