include/linux/iov_iter.h | 77 +---- include/linux/uio.h | 37 +++ lib/iov_iter.c | 675 ++++++++++++++++++++++++++++++++++++++- 3 files changed, 710 insertions(+), 79 deletions(-)
Hi Leon,
Here are some patches that illustrate some of what I'm thinking of doing to
iov iterators. Note that they are incomplete as I won't have time to
finish or test them before LSF, but I thought I'd post them for use as a
discussion point.
So the first thing I want to do is to move certain iterators out of line
from the main inline iteration multiplexor. This code gets relentlessly
duplicated and adding further iterator types expands a whole load of
places. So the DISCARD iterator (which is just a simple short circuit) and
the XARRAY iterator (which is obsolete) move out of line.
Then I want to add three more types, for now:
(1) ITER_ITERLIST. A compound iterator that takes an array of iterators
of disparate types. The aim here is to make it possible to fabricate
an network message in one go (say an RPC call) and pass it to a socket
without the need for corking.
(2) ITER_SCATTERLIST. An iterator that takes a scatterlist. This can be
used to act as a bridge in converting interfaces that currently take a
scatterlist (e.g. crypto). It requires extra fields adding to the
iov_iter struct because chained scatterlists do not have have a rewind
capability and so iov_iter_revert() must go back to the beginning and
fast-forward.
(3) ITER_SKBUFF. An iterator that takes a network buffer. The aim here
is to render skb_to_sgvec() unnecessary for doing crypto operations.
The patches can also be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
David
David Howells (4):
iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line
iov_iter: Add an iterator-of-iterators
iov_iter: Add a scatterlist iterator type
iov_iter: Add a scatterlist iterator type [INCOMPLETE]
include/linux/iov_iter.h | 77 +----
include/linux/uio.h | 37 +++
lib/iov_iter.c | 675 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 710 insertions(+), 79 deletions(-)
This is going entirely in the wrong direction. We don't need more iter types but less. The reason why we have to many is because the underlying representation of the ranges is a mess which goes deeper than just the iterator, because it also means we have to convert between the underlying representations all the time. E.g. the socket code should have (and either has for a while or at least there were patches) been using bio_vecs instead of reinventing them as sk fragment. The crypto code should not be using scatterlists, which are a horrible data structure because they mix up the physical memory description and the dma mapping information which isn't even used for most uses, etc. So instead of more iters let's convert everyone to a common scatter/gather memory definition, which simplifies the iters. For now that is the bio_vec, which really should be converted from storing a struct page to a phys_addr_t (and maybe renamed if that helps adoption). That allows to trivially kill the kvec for example. As for the head/tail - that seems to be a odd NFS/sunrpc fetish. I've actually started a little project to just convert the sunrpc code to use bio_vecs, which massively simplifies the code, and allows directly passing it to the iters in the socket API. It doesn't quite work yet but shows how all these custom (and in this case rather ad-hoc) memory fragment representation cause a huge mess. I don't think the iterlist can work in practice, but it would be nice to have for a few use cases. If it worked it should hopefully allow to kill off the odd xarray iterator.
On Sat, Mar 22, 2025 at 11:21:28PM -0700, Christoph Hellwig wrote: > This is going entirely in the wrong direction. We don't need more iter > types but less. The reason why we have to many is because the underlying > representation of the ranges is a mess which goes deeper than just the > iterator, because it also means we have to convert between the > underlying representations all the time. > > E.g. the socket code should have (and either has for a while or at least > there were patches) been using bio_vecs instead of reinventing them as sk > fragment. The crypto code should not be using scatterlists, which are a I did this work six years ago -- see 8842d285bafa Unfortunately, networking is full of inconsiderate arseholes who backed it out without even talking to me in 21d2e6737c97
On 3/23/25 07:21, Christoph Hellwig wrote: > This is going entirely in the wrong direction. We don't need more iter > types but less. The reason why we have to many is because the underlying > representation of the ranges is a mess which goes deeper than just the > iterator, because it also means we have to convert between the > underlying representations all the time. > > E.g. the socket code should have (and either has for a while or at least > there were patches) been using bio_vecs instead of reinventing them as sk > fragment. The crypto code should not be using scatterlists, which are a > horrible data structure because they mix up the physical memory > description and the dma mapping information which isn't even used for > most uses, etc. > > So instead of more iters let's convert everyone to a common > scatter/gather memory definition, which simplifies the iters. For now > that is the bio_vec, which really should be converted from storing a > struct page to a phys_addr_t (and maybe renamed if that helps adoption). > That allows to trivially kill the kvec for example. > > As for the head/tail - that seems to be a odd NFS/sunrpc fetish. I've > actually started a little project to just convert the sunrpc code to > use bio_vecs, which massively simplifies the code, and allows directly > passing it to the iters in the socket API. It doesn't quite work yet > but shows how all these custom (and in this case rather ad-hoc) memory > fragment representation cause a huge mess. > > I don't think the iterlist can work in practice, but it would be nice > to have for a few use cases. If it worked it should hopefully allow > to kill off the odd xarray iterator. > Can we have a session around this? IE define how iterators should be used, and what the iterator elements should be. If we do it properly this will also fix the frozen page discussion we're having; if we define iterators whose data elements are _not_ pages then clearly one cannot take a reference to them. But in either case, we should define the long-term goal such that people can start converting stuff. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
© 2016 - 2025 Red Hat, Inc.