[v5] iov_iter: Convert the iterator macros into inline funcs

[PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Posted by David Howells 2 years, 4 months ago

Hi Jens,

Could you take these patches into the block tree?  Note that it's on top of
the kunit patchset though it could be separate.  The patches convert the
iov_iter iteration macros to be inline functions.

 (1) Convert iter->user_backed to user_backed_iter() in the sound PCM
     driver.

 (2) Convert iter->user_backed to user_backed_iter() in a couple of
     infiniband drivers.

 (3) Renumber the type enum so that the ITER_* constants match the order in
     iterate_and_advance*().

 (4) Since (3) puts UBUF and IOVEC at 0 and 1, change user_backed_iter() to
     just use the type value and get rid of the extra flag.

 (5) Convert the iov_iter iteration macros to always-inline functions to
     make the code easier to follow.  It uses function pointers, but they
     get optimised away.  The priv2 argument likewise gets optimised away
     if unused.

     The functions are placed into a header file so that bespoke iterators
     can be created elsewhere.  For instance, rbd has an optimisation that
     requires it to scan to the buffer it is given to see if it is all
     zeros.  It would be nice if this could use iterate_and_advance() - but
     that's currently buried inside lib/iov_iter.c.

     Whilst we could provide a generic iteration function that takes a pair
     of function pointers instead, it does add around 16% overhead in the
     framework, presumably from the combination of function pointers and
     various mitigations.

     Further, if it is known that just a particular iterator-type is in
     play, just that iteration function can be used.

 (6) Move the check for ->copy_mc to _copy_from_iter() and
     copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
     where it gets repeated for every segment.  Instead, we check once and
     invoke a side function that can use iterate_bvec() rather than
     iterate_and_advance() and supply a different step function.

 (7) Provide a cut-down iteration function that only handles kernel-backed
     iterators (ie. BVEC, KVEC, XARRAY and DISCARD) for situations where we
     know that we won't see UBUF/IOVEC.  This crops up in a number of
     filesystems where we decant a UBUF/IOVEC iter into a series of BVEC
     iters.

 (8) Move the copy-and-csum code to net/ where it can be in proximity with
     the code that uses it.  This eliminates the code if CONFIG_NET=n and
     allows for the slim possibility of it being inlined.

 (9) Fold memcpy_and_csum() in to its two users.

(10) Move csum_and_copy_from_iter_full() out of line and merge in
     csum_and_copy_from_iter() since the former is the only caller of the
     latter.

(11) Move hash_and_copy_to_iter() to net/ where it can be with its only
     caller.

Anyway, the changes in compiled function size either side of patches
(5)+(6) on x86_64 look like:

	__copy_from_iter_mc                      new 0xe8
	_copy_from_iter                          inc 0x360 -> 0x3a7 +0x47
	_copy_from_iter_flushcache               inc 0x34c -> 0x38e +0x42
	_copy_from_iter_nocache                  inc 0x354 -> 0x378 +0x24
	_copy_mc_to_iter                         inc 0x396 -> 0x3f1 +0x5b
	_copy_to_iter                            inc 0x33b -> 0x385 +0x4a
	copy_page_from_iter_atomic.part.0        inc 0x393 -> 0x3e0 +0x4d
	copy_page_to_iter_nofault.part.0         inc 0x3de -> 0x3eb +0xd
	copyin                                   del 0x30
	copyout                                  del 0x2d
	copyout_mc                               del 0x2b
	csum_and_copy_from_iter                  inc 0x3db -> 0x41d +0x42
	csum_and_copy_to_iter                    inc 0x45d -> 0x48d +0x30
	iov_iter_zero                            inc 0x34a -> 0x36a +0x20
	memcpy_from_iter.isra.0                  del 0x1f

Note that there's a noticeable expansion on some of the main functions
because a number of the helpers get inlined instead of being called.

In terms of benchmarking patch (5)+(6), three runs without them:

	iov_kunit_benchmark_ubuf: avg 3955 uS, stddev 169 uS
	iov_kunit_benchmark_ubuf: avg 4122 uS, stddev 1292 uS
	iov_kunit_benchmark_ubuf: avg 4451 uS, stddev 1362 uS
	iov_kunit_benchmark_iovec: avg 6607 uS, stddev 22 uS
	iov_kunit_benchmark_iovec: avg 6608 uS, stddev 19 uS
	iov_kunit_benchmark_iovec: avg 6609 uS, stddev 24 uS
	iov_kunit_benchmark_bvec: avg 3166 uS, stddev 11 uS
	iov_kunit_benchmark_bvec: avg 3167 uS, stddev 13 uS
	iov_kunit_benchmark_bvec: avg 3170 uS, stddev 16 uS
	iov_kunit_benchmark_bvec_split: avg 3394 uS, stddev 12 uS
	iov_kunit_benchmark_bvec_split: avg 3394 uS, stddev 20 uS
	iov_kunit_benchmark_bvec_split: avg 3395 uS, stddev 19 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3719 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3719 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3721 uS, stddev 24 uS

and three runs with them:

	iov_kunit_benchmark_ubuf: avg 4110 uS, stddev 1254 uS
	iov_kunit_benchmark_ubuf: avg 4141 uS, stddev 1411 uS
	iov_kunit_benchmark_ubuf: avg 4572 uS, stddev 1889 uS
	iov_kunit_benchmark_iovec: avg 6582 uS, stddev 27 uS
	iov_kunit_benchmark_iovec: avg 6585 uS, stddev 25 uS
	iov_kunit_benchmark_iovec: avg 6586 uS, stddev 48 uS
	iov_kunit_benchmark_bvec: avg 3175 uS, stddev 13 uS
	iov_kunit_benchmark_bvec: avg 3177 uS, stddev 12 uS
	iov_kunit_benchmark_bvec: avg 3178 uS, stddev 12 uS
	iov_kunit_benchmark_bvec_split: avg 3380 uS, stddev 20 uS
	iov_kunit_benchmark_bvec_split: avg 3384 uS, stddev 15 uS
	iov_kunit_benchmark_bvec_split: avg 3386 uS, stddev 25 uS
	iov_kunit_benchmark_kvec: avg 2671 uS, stddev 11 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2677 uS, stddev 20 uS
	iov_kunit_benchmark_xarray: avg 3599 uS, stddev 20 uS
	iov_kunit_benchmark_xarray: avg 3603 uS, stddev 8 uS
	iov_kunit_benchmark_xarray: avg 3610 uS, stddev 16 uS

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-cleanup

David

Changes
=======
ver #5)
 - Moved kunit test patches out to a separate set.
 - Moved "iter->count - progress" into individual iteration subfunctions.
 - Fix iterate_iovec() and to update iter->__iovec after subtracting it to
   calculate iter->nr_segs.
 - Merged the function inlining patch and the move-to-header patch.
 - Rearranged the patch order slightly to put the patches to move
   networking stuff to net/ last.
 - Went back to a simpler patch to special-case ->copy_mc in
   _copy_from_iter() and copy_page_from_iter_atomic() before we get to call
   iterate_and_advance() so as to reduce the number of checks for this.

ver #4)
 - Fix iterate_bvec() and iterate_kvec() to update iter->bvec and
   iter->kvec after subtracting it to calculate iter->nr_segs.
 - Change iterate_xarray() to use start+progress rather than increasing
   start to reduce code size.
 - Added patches to move some iteration functions over to net/ as the files
   there can #include the iterator framework.
 - Added a patch to benchmark the iteration.
 - Tried an experimental patch to make copy_from_iter() and similar always
   catch MCE.

ver #3)
 - Use min_t(size_t,) not min() to avoid a warning on Hexagon.
 - Inline all the step functions.
 - Added a patch to better handle copy_mc.

ver #2)
 - Rebased on top of Willy's changes in linux-next.
 - Change the checksum argument to the iteration functions to be a general
   void* and use it to pass iter->copy_mc flag to memcpy_from_iter_mc() to
   avoid using a function pointer.
 - Arrange the end of the iterate_*() functions to look the same to give
   the optimiser the best chance.
 - Make iterate_and_advance() a wrapper around iterate_and_advance2().
 - Adjust iterate_and_advance2() to use if-else-if-else-if-else rather than
   switch(), to put ITER_BVEC before KVEC and to mark UBUF and IOVEC as
   likely().
 - Move "iter->count += progress" into iterate_and_advance2() from the
   iterate functions.
 - Mark a number of the iterator helpers with __always_inline.
 - Fix _copy_from_iter_flushcache() to use memcpy_from_iter_flushcache()
   not memcpy_from_iter().

Link: https://lore.kernel.org/r/3710261.1691764329@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/855.1692047347@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@redhat.com/ # v3
Link: https://lore.kernel.org/r/20230913165648.2570623-1-dhowells@redhat.com/ # v4

David Howells (11):
  sound: Fix snd_pcm_readv()/writev() to use iov access functions
  infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
  iov_iter: Renumber ITER_* constants
  iov_iter: Derive user-backedness from the iterator type
  iov_iter: Convert iterate*() to inline funcs
  iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
  iov_iter: Add a kernel-type iterator-only iteration function
  iov_iter, net: Move csum_and_copy_to/from_iter() to net/
  iov_iter, net: Fold in csum_and_memcpy()
  iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  iov_iter, net: Move hash_and_copy_to_iter() to net/

 drivers/infiniband/hw/hfi1/file_ops.c    |   2 +-
 drivers/infiniband/hw/qib/qib_file_ops.c |   2 +-
 include/linux/iov_iter.h                 | 305 ++++++++++++++++
 include/linux/skbuff.h                   |   3 +
 include/linux/uio.h                      |  29 +-
 lib/iov_iter.c                           | 441 +++++++----------------
 net/core/datagram.c                      |  75 +++-
 net/core/skbuff.c                        |  40 ++
 sound/core/pcm_native.c                  |   4 +-
 9 files changed, 569 insertions(+), 332 deletions(-)
 create mode 100644 include/linux/iov_iter.h

RE: [PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Posted by David Laight 2 years, 4 months ago

From: David Howells
> Sent: 20 September 2023 23:22
...
>  (8) Move the copy-and-csum code to net/ where it can be in proximity with
>      the code that uses it.  This eliminates the code if CONFIG_NET=n and
>      allows for the slim possibility of it being inlined.
> 
>  (9) Fold memcpy_and_csum() in to its two users.
> 
> (10) Move csum_and_copy_from_iter_full() out of line and merge in
>      csum_and_copy_from_iter() since the former is the only caller of the
>      latter.

I thought that the real idea behind these was to do the checksum
at the same time as the copy to avoid loading the data into the L1
data-cache twice - especially for long buffers.
I wonder how often there are multiple iov[] that actually make
it better than just check summing the linear buffer?

I had a feeling that check summing of udp data was done during
copy_to/from_user, but the code can't be the copy-and-csum here
for that because it is missing support form odd-length buffers.

Intel x86 desktop chips can easily checksum at 8 bytes/clock
(But probably not with the current code!).
(I've got ~12 bytes/clock using adox and adcx but that loop
is entirely horrid and it would need run-time patching.
Especially since I think some AMD cpu execute them very slowly.)

OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
destination is 32 byte aligned - it pretty much won't be).

So you'd need a csum-and-copy loop that did 16 bytes every
three clocks to get the same throughput for long buffers.
In principle splitting the 'adc memory' into two instructions
is the same number of u-ops - but I'm sure I've tried to do
that and failed and the extra memory write can happen in
parallel with everything else.
So I don't think you'll get 16 bytes in two clocks - but you
might get it is three.

OTOH for a cpu where memcpy is code loop summing the data in
the copy loop is likely to be a gain.

But I suspect doing the checksum and copy at the same time
got 'all to complicated' to actually implement fully.
With most modern ethernet chips checksumming receive pacakets
does it really get used enough for the additional complexity?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Re: [PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Posted by David Howells 2 years, 4 months ago

David Laight <David.Laight@ACULAB.COM> wrote:

> >  (8) Move the copy-and-csum code to net/ where it can be in proximity with
> >      the code that uses it.  This eliminates the code if CONFIG_NET=n and
> >      allows for the slim possibility of it being inlined.
> > 
> >  (9) Fold memcpy_and_csum() in to its two users.
> > 
> > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> >      csum_and_copy_from_iter() since the former is the only caller of the
> >      latter.
> 
> I thought that the real idea behind these was to do the checksum
> at the same time as the copy to avoid loading the data into the L1
> data-cache twice - especially for long buffers.
> I wonder how often there are multiple iov[] that actually make
> it better than just check summing the linear buffer?

It also reduces the overhead for finding the data to checksum in the case the
packet gets split since we're doing the checksumming as we copy - but with a
linear buffer, that's negligible.

> I had a feeling that check summing of udp data was done during
> copy_to/from_user, but the code can't be the copy-and-csum here
> for that because it is missing support form odd-length buffers.

Is there a bug there?

> Intel x86 desktop chips can easily checksum at 8 bytes/clock
> (But probably not with the current code!).
> (I've got ~12 bytes/clock using adox and adcx but that loop
> is entirely horrid and it would need run-time patching.
> Especially since I think some AMD cpu execute them very slowly.)
> 
> OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> destination is 32 byte aligned - it pretty much won't be).
> 
> So you'd need a csum-and-copy loop that did 16 bytes every
> three clocks to get the same throughput for long buffers.
> In principle splitting the 'adc memory' into two instructions
> is the same number of u-ops - but I'm sure I've tried to do
> that and failed and the extra memory write can happen in
> parallel with everything else.
> So I don't think you'll get 16 bytes in two clocks - but you
> might get it is three.
> 
> OTOH for a cpu where memcpy is code loop summing the data in
> the copy loop is likely to be a gain.
> 
> But I suspect doing the checksum and copy at the same time
> got 'all to complicated' to actually implement fully.
> With most modern ethernet chips checksumming receive pacakets
> does it really get used enough for the additional complexity?

You may be right.  That's more a question for the networking folks than for
me.  It's entirely possible that the checksumming code is just not used on
modern systems these days.

Maybe Willem can comment since he's the UDP maintainer?

David

Re: [PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Posted by Willem de Bruijn 2 years, 4 months ago

On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@redhat.com> wrote:
>
> David Laight <David.Laight@ACULAB.COM> wrote:
>
> > >  (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > >      the code that uses it.  This eliminates the code if CONFIG_NET=n and
> > >      allows for the slim possibility of it being inlined.
> > >
> > >  (9) Fold memcpy_and_csum() in to its two users.
> > >
> > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > >      csum_and_copy_from_iter() since the former is the only caller of the
> > >      latter.
> >
> > I thought that the real idea behind these was to do the checksum
> > at the same time as the copy to avoid loading the data into the L1
> > data-cache twice - especially for long buffers.
> > I wonder how often there are multiple iov[] that actually make
> > it better than just check summing the linear buffer?
>
> It also reduces the overhead for finding the data to checksum in the case the
> packet gets split since we're doing the checksumming as we copy - but with a
> linear buffer, that's negligible.
>
> > I had a feeling that check summing of udp data was done during
> > copy_to/from_user, but the code can't be the copy-and-csum here
> > for that because it is missing support form odd-length buffers.
>
> Is there a bug there?
>
> > Intel x86 desktop chips can easily checksum at 8 bytes/clock
> > (But probably not with the current code!).
> > (I've got ~12 bytes/clock using adox and adcx but that loop
> > is entirely horrid and it would need run-time patching.
> > Especially since I think some AMD cpu execute them very slowly.)
> >
> > OTOH 'rep movs[bq]' copy will copy 16 bytes/clock (32 if the
> > destination is 32 byte aligned - it pretty much won't be).
> >
> > So you'd need a csum-and-copy loop that did 16 bytes every
> > three clocks to get the same throughput for long buffers.
> > In principle splitting the 'adc memory' into two instructions
> > is the same number of u-ops - but I'm sure I've tried to do
> > that and failed and the extra memory write can happen in
> > parallel with everything else.
> > So I don't think you'll get 16 bytes in two clocks - but you
> > might get it is three.
> >
> > OTOH for a cpu where memcpy is code loop summing the data in
> > the copy loop is likely to be a gain.
> >
> > But I suspect doing the checksum and copy at the same time
> > got 'all to complicated' to actually implement fully.
> > With most modern ethernet chips checksumming receive pacakets
> > does it really get used enough for the additional complexity?
>
> You may be right.  That's more a question for the networking folks than for
> me.  It's entirely possible that the checksumming code is just not used on
> modern systems these days.
>
> Maybe Willem can comment since he's the UDP maintainer?

Perhaps these days it is more relevant to embedded systems than high
end servers.

RE: [PATCH v5 00/11] iov_iter: Convert the iterator macros into inline funcs

Posted by David Laight 2 years, 4 months ago

From: Willem de Bruijn
> Sent: 23 September 2023 07:59
> 
> On Fri, Sep 22, 2023 at 2:01 PM David Howells <dhowells@redhat.com> wrote:
> >
> > David Laight <David.Laight@ACULAB.COM> wrote:
> >
> > > >  (8) Move the copy-and-csum code to net/ where it can be in proximity with
> > > >      the code that uses it.  This eliminates the code if CONFIG_NET=n and
> > > >      allows for the slim possibility of it being inlined.
> > > >
> > > >  (9) Fold memcpy_and_csum() in to its two users.
> > > >
> > > > (10) Move csum_and_copy_from_iter_full() out of line and merge in
> > > >      csum_and_copy_from_iter() since the former is the only caller of the
> > > >      latter.
> > >
> > > I thought that the real idea behind these was to do the checksum
> > > at the same time as the copy to avoid loading the data into the L1
> > > data-cache twice - especially for long buffers.
> > > I wonder how often there are multiple iov[] that actually make
> > > it better than just check summing the linear buffer?
> >
> > It also reduces the overhead for finding the data to checksum in the case the
> > packet gets split since we're doing the checksumming as we copy - but with a
> > linear buffer, that's negligible.
> >
> > > I had a feeling that check summing of udp data was done during
> > > copy_to/from_user, but the code can't be the copy-and-csum here
> > > for that because it is missing support form odd-length buffers.
> >
> > Is there a bug there?

No, I misread the code - i shouldn't scan patches when I'd
got a viral head code...

...
> > You may be right.  That's more a question for the networking folks than for
> > me.  It's entirely possible that the checksumming code is just not used on
> > modern systems these days.
> >
> > Maybe Willem can comment since he's the UDP maintainer?
> 
> Perhaps these days it is more relevant to embedded systems than high
> end servers.

The checksum and copy are done together.
I probably missed it because the function isn't passed the
old checksum (which it can pretty much process for free).
Instead the caller is adding it afterwards - which involves
and extra explicit csum_add().

The x86-x84 ip checksum loops are all horrid though.
The unrolling in them is so 1990's.
With the out-of-order pipeline the memory accesses tend
to take care of themselves.
Not to mention that a whole raft of (now oldish) cpu take two
clocks to execute 'adc'.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)