[PATCH net-next v5 00/27] io_uring zerocopy send

Pavel Begunkov posted 27 patches 3 years, 9 months ago
include/linux/io_uring_types.h                |  37 ++
include/linux/skbuff.h                        |  66 +-
include/linux/socket.h                        |   5 +
include/uapi/linux/io_uring.h                 |  45 +-
io_uring/Makefile                             |   2 +-
io_uring/io_uring.c                           |  42 +-
io_uring/io_uring.h                           |  22 +
io_uring/net.c                                | 187 ++++++
io_uring/net.h                                |   4 +
io_uring/notif.c                              | 215 +++++++
io_uring/notif.h                              |  87 +++
io_uring/opdef.c                              |  24 +-
io_uring/rsrc.c                               |  55 +-
io_uring/rsrc.h                               |  16 +-
io_uring/tctx.h                               |  26 -
net/compat.c                                  |   1 +
net/core/datagram.c                           |  14 +-
net/core/skbuff.c                             |  37 +-
net/ipv4/ip_output.c                          |  50 +-
net/ipv4/tcp.c                                |  32 +-
net/ipv6/ip6_output.c                         |  49 +-
net/socket.c                                  |   3 +
tools/testing/selftests/net/Makefile          |   1 +
.../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
25 files changed, 1628 insertions(+), 128 deletions(-)
create mode 100644 io_uring/notif.c
create mode 100644 io_uring/notif.h
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
[PATCH net-next v5 00/27] io_uring zerocopy send
Posted by Pavel Begunkov 3 years, 9 months ago
NOTE: Not to be picked directly. After getting necessary acks, I'll be
      working out merging with Jakub and Jens.

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing is allowed but not recommended. Apart from the usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
userspace when buffers are freed and can be reused (see API design below).
These notifications are delivered into io_uring's Completion Queue. The
"buffer-free" notifications are not necessarily per request: userspace has
control over it and should explicitly attach a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers, like removing page referencing.

From the kernel networking perspective there are two main changes. The first
is passing ubuf_info into the network layer from io_uring (inside an in-kernel
struct msghdr). This allows extra optimisations, e.g. ubuf_info caching on the
io_uring side, and also helps to avoid cross-referencing and synchronisation
problems. The second is an optional optimisation removing page referencing for
requests with registered buffers.

The UDP benchmarks below use an optimised version of the selftest (see [1]),
which sends a bunch of requests, waits for completions and repeats. The
"zc + flush" column posts one additional "buffer-free" notification per
request, while plain "zc" doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
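
For reference, the percentages in the tables are the relative change in
requests per second over the non-zc column. A minimal sketch of that
arithmetic (the helper name rel_change_pct is made up for illustration and
is not part of the series or the benchmark tool):

```c
#include <assert.h>

/*
 * Relative change of a zerocopy result over the non-zc baseline, in percent.
 * Both arguments are requests per second taken from the same table row.
 */
double rel_change_pct(double non_zc, double zc)
{
	assert(non_zc > 0.0);	/* a baseline of zero makes the ratio meaningless */
	return (zc - non_zc) / non_zc * 100.0;
}
```

For example, rel_change_pct(495134, 606420) comes out at about +22%, which
is the "zc" entry of the first NIC row.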

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. A further
bunch of refcounting optimisations was omitted from the series for simplicity,
as they don't change the picture drastically. They will be sent as a follow-up,
together with flushing optimisations closing the performance gap between the
last two columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

With a real NIC and 1200-byte sends, zc is ~5-10% worse than non-zc. The
omitted optimisations may help somewhat, and it should look better for 4000
bytes, but I couldn't test that properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches based:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces an io_uring concept of notifiers. From the userspace
  perspective a notifier is an entity to which it can bind one or more requests
  and then request a flush. Flushing a notifier makes it impossible to attach
  new requests to it, and instructs the notifier to post a completion once all
  requests attached to it are completed and the kernel doesn't need the buffers
  anymore.

  Notifiers are stored in notification slots, which should be registered as
  an array in io_uring. Each slot stores only one notifier at any particular
  moment. Flushing removes it from the slot and the slot automatically replaces
  it with a new notifier. All operations with notifiers are done by specifying
  the index of the slot it's currently in.

  When registering notification slots, userspace specifies a u64 tag for each
  slot, which will be copied into notification completion entries as
  cqe::user_data. cqe::res is 0 and cqe::flags holds a wrapping u32 sequence
  number counting the notifiers of a slot.
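
  The slot semantics can be illustrated with a toy userspace model. This is
  only a sketch of the behaviour described above, not the kernel code and not
  the uapi: every name in it (struct slot, slot_attach(), slot_flush(),
  notifier_put(), ...) is hypothetical, and it ignores locking, caching and
  task_work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The completion fields a notification fills in, as described above. */
struct notif_cqe {
	uint64_t user_data;	/* tag given when the slot was registered */
	int32_t  res;		/* always 0 for notifications */
	uint32_t flags;		/* wrapping per-slot sequence number */
};

struct notifier {
	uint64_t tag;
	uint32_t seq;		/* position of this notifier within its slot */
	unsigned inflight;	/* attached requests still using their buffers */
	bool     flushed;	/* set once detached from the slot */
};

struct slot {
	uint64_t tag;
	uint32_t next_seq;	/* u32, so it wraps around naturally */
	struct notifier cur;	/* a slot holds exactly one notifier at a time */
};

void slot_init(struct slot *s, uint64_t tag)
{
	s->tag = tag;
	s->next_seq = 0;
	s->cur = (struct notifier){ .tag = tag, .seq = s->next_seq++ };
}

/* Bind one more request to the notifier currently sitting in the slot. */
void slot_attach(struct slot *s)
{
	s->cur.inflight++;
}

/*
 * Flush: detach the current notifier (nothing can attach to it anymore)
 * and let the slot replace it with a fresh one.
 */
struct notifier slot_flush(struct slot *s)
{
	struct notifier old = s->cur;

	old.flushed = true;
	s->cur = (struct notifier){ .tag = s->tag, .seq = s->next_seq++ };
	return old;
}

/* A flushed notifier posts its completion once no request holds a buffer. */
bool notifier_try_complete(const struct notifier *n, struct notif_cqe *cqe)
{
	if (!n->flushed || n->inflight)
		return false;
	*cqe = (struct notif_cqe){ .user_data = n->tag, .res = 0, .flags = n->seq };
	return true;
}

/* One attached request no longer needs its buffer. */
bool notifier_put(struct notifier *n, struct notif_cqe *cqe)
{
	assert(n->inflight > 0);
	n->inflight--;
	return notifier_try_complete(n, cqe);
}
```

  In this model, attaching two requests, flushing, and then calling
  notifier_put() twice on the detached notifier yields a completion only on
  the second call, carrying the slot's tag and sequence number 0. The next
  notifier flushed from the same slot reports sequence number 1.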

Changelog:

  v4 -> v5
    remove ubuf_info checks from custom iov_iter callbacks to
    avoid disabling the page refs optimisations for TCP

  v3 -> v4
    custom iov_iter handling

  RFC v2 -> v3:
    mem accounting for non-registered buffers
    allow mixing registered and normal requests per notifier
    notification flushing via IORING_OP_RSRC_UPDATE
    TCP support
    fix buffer indexing
    fix io-wq ->uring_lock locking
    fix bugs when mixing with MSG_ZEROCOPY
    fix managed refs bugs in skbuff.c

  RFC -> RFC v2:
    remove additional overhead for non-zc from skb_release_data()
    avoid msg propagation, hide extra bits of non-zc overhead
    task_work based "buffer free" notifications
    improve io_uring's notification refcounting
    added 5/19, (no pfmemalloc tracking)
    added 8/19 and 9/19 preventing small copies with zc
    misc small changes

David Ahern (1):
  net: Allow custom iter handler in msghdr

Pavel Begunkov (26):
  ipv4: avoid partial copy for zc
  ipv6: avoid partial copy for zc
  skbuff: don't mix ubuf_info from different sources
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: carry external ubuf_info in msghdr
  net: introduce managed frags infrastructure
  net: introduce __skb_fill_page_desc_noacc
  ipv4/udp: support externally provided ubufs
  ipv6/udp: support externally provided ubufs
  tcp: support externally provided ubufs
  io_uring: initialise msghdr::msg_ubuf
  io_uring: export io_put_task()
  io_uring: add zc notification infrastructure
  io_uring: cache struct io_notif
  io_uring: complete notifiers in tw
  io_uring: add rsrc referencing for notifiers
  io_uring: add notification slot registration
  io_uring: wire send zc request type
  io_uring: account locked pages for non-fixed zc
  io_uring: allow to pass addr into sendzc
  io_uring: sendzc with fixed buffers
  io_uring: flush notifiers after sendzc
  io_uring: rename IORING_OP_FILES_UPDATE
  io_uring: add zc notification flush requests
  io_uring: enable managed frags with register buffers
  selftests/io_uring: test zerocopy send


-- 
2.37.0
Re: [PATCH net-next v5 00/27] io_uring zerocopy send
Posted by Jinjie Ruan 1 year, 1 month ago

On 2022/7/13 4:52, Pavel Begunkov wrote:
> NOTE: Not to be picked directly. After getting necessary acks, I'll be
>       working out merging with Jakub and Jens.
> 
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
> 
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside of an
> in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
> 
> Benchmarking UDP with an optimised version of the selftest (see [1]), which

Hi Pavel, I'm interested in io_uring zerocopy send, but I can't reproduce
its performance using the zerocopy send selftest, e.g.
"bash io_uring_zerocopy_tx.sh 6 udp -m 0/1/2/3 -n 64". The non-zerocopy
baseline even seems to perform best.

               MB/s
NONZC         8379
ZC            5910
ZC_FIXED      6294
MIXED         6350

Also, the zerocopy example in [1] does not seem to work, because the
kernel was changed by the following series:

https://lore.kernel.org/all/cover.1662027856.git.asml.silence@gmail.com/

Could you help me reproduce the benchmark results? Do I need different
parameters to reproduce them?


Re: [PATCH net-next v5 00/27] io_uring zerocopy send
Posted by Pavel Begunkov 1 year, 1 month ago
On 2/18/25 01:47, Jinjie Ruan wrote:
> On 2022/7/13 4:52, Pavel Begunkov wrote:
>> NOTE: Not to be picked directly. After getting necessary acks, I'll be
>>        working out merging with Jakub and Jens.
>>
>> The patchset implements io_uring zerocopy send. It works with both registered
>> and normal buffers, mixing is allowed but not recommended. Apart from usual
>> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
>> the userspace when buffers are freed and can be reused (see API design below),
>> which is delivered into io_uring's Completion Queue. Those "buffer-free"
>> notifications are not necessarily per request, but the userspace has control
>> over it and should explicitly attach a number of requests to a single
>> notification. The series also adds some internal optimisations when used with
>> registered buffers like removing page referencing.
>>
>> From the kernel networking perspective there are two main changes. The first
>> one is passing ubuf_info into the network layer from io_uring (inside of an
>> in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
>> caching on the io_uring side, but also helps to avoid cross-referencing
>> and synchronisation problems. The second part is an optional optimisation
>> removing page referencing for requests with registered buffers.
>>
>> Benchmarking UDP with an optimised version of the selftest (see [1]), which
> 
> Hi Pavel, I'm interested in io_uring zerocopy send, but I can't reproduce
> its performance using the zerocopy send selftest, e.g.
> "bash io_uring_zerocopy_tx.sh 6 udp -m 0/1/2/3 -n 64". The non-zerocopy
> baseline even seems to perform best.
> 
>                 MB/s
> NONZC         8379
> ZC            5910
> ZC_FIXED      6294
> MIXED         6350

It's using veth, and zerocopy is effectively disabled for most virtual
devices, or to be specific "for paths that may loop packets to receive
sockets".

https://lore.kernel.org/netdev/20170803202945.70750-6-willemdebruijn.kernel@gmail.com/

So that's the worst of both: it copies the data but also incurs the
overhead of notifications. You can use a dummy device as a sink with
no receiver, but you'll get more realistic numbers if you use a real
device (one that supports the features required for zerocopy).
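
One way to check from userspace whether sends are being copied is to look
at the notification CQE. A minimal sketch, assuming the merged upstream
uapi rather than the v5 slot API from this series: when a request sets
IORING_SEND_ZC_REPORT_USAGE, the notification CQE's res should carry
IORING_NOTIF_USAGE_ZC_COPIED if the data was copied. The constants are
redefined locally under SKETCH_ names with the values I'd expect from
<linux/io_uring.h>, so verify them against your headers; the helper name is
made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror IORING_CQE_F_NOTIF in <linux/io_uring.h>. */
#define SKETCH_CQE_F_NOTIF		(1U << 3)
/* Assumed to mirror IORING_NOTIF_USAGE_ZC_COPIED in <linux/io_uring.h>. */
#define SKETCH_NOTIF_USAGE_ZC_COPIED	(1U << 31)

/*
 * Returns true if this CQE is a zerocopy notification that reports the
 * payload was copied instead of being sent zerocopy. Only meaningful when
 * the request asked for usage reporting.
 */
bool notif_reports_copy(uint32_t cqe_flags, int32_t cqe_res)
{
	if (!(cqe_flags & SKETCH_CQE_F_NOTIF))
		return false;	/* a regular request completion, not a notification */
	return ((uint32_t)cqe_res & SKETCH_NOTIF_USAGE_ZC_COPIED) != 0;
}
```

On a veth setup like the selftest's, such a check would be expected to
report a copy for every notification.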

> Also, the zerocopy example in [1] does not seem to work, because the
> kernel was changed by the following series:
> 
> https://lore.kernel.org/all/cover.1662027856.git.asml.silence@gmail.com/

The right version was merged long ago and sits in

liburing/examples/send-zerocopy.c

It's brushed up more than the selftest version, so I'd suggest using
that one. Arguments are a bit different, but it prints help.

./send-zerocopy -6 udp -D <ip> -t 10 -n 1 -l0 -b1 -d -z1

-- 
Pavel Begunkov
Re: (subset) [PATCH net-next v5 00/27] io_uring zerocopy send
Posted by Jens Axboe 3 years, 9 months ago
On Tue, 12 Jul 2022 21:52:24 +0100, Pavel Begunkov wrote:
> NOTE: Not to be picked directly. After getting necessary acks, I'll be
>       working out merging with Jakub and Jens.
> 
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
> 
> [...]

Applied, thanks!

[12/27] io_uring: initialise msghdr::msg_ubuf
        commit: 06f241e2bf4ba2a3e77269be25d21c0196a57a4f
[13/27] io_uring: export io_put_task()
        commit: ba64c07a6ef9a05ca9eb09e13b70df7500e78cf8
[14/27] io_uring: add zc notification infrastructure
        commit: 6f322c753daee4b9d4ad494d4e8b05da610d804c
[15/27] io_uring: cache struct io_notif
        commit: cf49e2d47c49e547d4bc370efe73785fc82354e5
[16/27] io_uring: complete notifiers in tw
        commit: 9cc16ae447db07d210175d2ad2419784dd20f784
[17/27] io_uring: add rsrc referencing for notifiers
        commit: e133e289093ea35c1f7f940fe4c0ceb62037dc59
[18/27] io_uring: add notification slot registration
        commit: f20b817fd29b64ef6de24b83ef23e1f3fb273967
[19/27] io_uring: wire send zc request type
        commit: 480ec5ff9a5a75d68423c0bd02e57a9ee6325320
[20/27] io_uring: account locked pages for non-fixed zc
        commit: fcb98e61d0232cff7dd14ae85ad1c88d68f98273
[21/27] io_uring: allow to pass addr into sendzc
        commit: 7ab12997edc9aa3e2be4169f929c50a1fcd41004
[22/27] io_uring: sendzc with fixed buffers
        commit: bb4019de9ea11d21137b4a8ff01d9e338071d633
[23/27] io_uring: flush notifiers after sendzc
        commit: 95a70c191696da64a6ae235d52132a5c17866dae
[24/27] io_uring: rename IORING_OP_FILES_UPDATE
        commit: d488e605a45192f9f60c7624d46ba0b8c4d93aab
[25/27] io_uring: add zc notification flush requests
        commit: cb155defb9bf20a647c8825a085695f3f94fdb60
[26/27] io_uring: enable managed frags with register buffers
        commit: 04ae3dbe8a027cf10ab759456ffc4fb119486f74
[27/27] selftests/io_uring: test zerocopy send
        commit: 0c450de20ce7d6bc8a2f97c98387baf910454477

Best regards,
-- 
Jens Axboe