[RFC v2 00/24] Per queue configs and large rx buffer support for zcrx
Posted by Pavel Begunkov 1 month, 3 weeks ago
This series implements large rx buffer support for io_uring/zcrx on
top of Jakub's queue configuration changes, but it can also be used
by other memory providers. Large rx buffers can be drastically
beneficial with high-end HW-GRO enabled cards that can coalesce traffic
into larger pages, reducing the number of frags traversing the network
stack and resulting in larger contiguous chunks of data for
userspace. Benchmarks showed up to ~30% improvement in CPU utilization.

For example, for 200Gbit broadcom NIC, 4K vs 32K buffers, and napi and
userspace pinned to the same CPU:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

And for napi and userspace on different CPUs:

packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.10    0.00    0.50    0.00    0.50   74.50    24.40
  1    4.51    0.00   44.33   47.22    2.08    1.85    0.00
packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.10    0.00    0.70    0.00    1.00   43.78   54.42
  1    1.09    0.00   31.95   62.91    1.42    2.63    0.00

Patch 22 allows a memory provider to pass its queue config. Most
of the necessary zcrx changes are already queued in a separate branch,
so the zcrx changes here are contained in Patch 24 and are fairly
simple. The uAPI is simple and imperative: the buffer length is
passed in the zcrx registration structure.
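
To illustrate the shape of that uAPI, here is a minimal userspace sketch.
It assumes liburing's io_uring_register_ifq() helper and an rx_buf_len
field in struct io_uring_zcrx_ifq_reg; the actual field name and layout
are whatever Patch 24 defines, so treat the names below as placeholders:

/* Sketch only: rx_buf_len is an assumed field name, not necessarily the
 * one added by the series. Refill-ring setup and mmap are omitted. */
#include <stdint.h>
#include <liburing.h>

static int zcrx_register(struct io_uring *ring, unsigned ifindex,
                         unsigned rxq, void *area, size_t area_len,
                         struct io_uring_region_desc *rq_region,
                         unsigned buf_len)
{
        struct io_uring_zcrx_area_reg area_reg = {
                .addr = (uint64_t)(uintptr_t)area,
                .len  = area_len,
        };
        struct io_uring_zcrx_ifq_reg reg = {
                .if_idx     = ifindex,
                .if_rxq     = rxq,
                .rq_entries = 4096,
                .area_ptr   = (uint64_t)(uintptr_t)&area_reg,
                .region_ptr = (uint64_t)(uintptr_t)rq_region,
                .rx_buf_len = buf_len,  /* e.g. 32768 for 32K buffers */
        };

        return io_uring_register_ifq(ring, &reg);
}

This is the same value the examples/zcrx tool mentioned later in the
thread takes via its -B option.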

Patches 2-21 are taken from Jakub's series with per queue
configuration [1]. Quoting Jakub:

"... The direct motivation for the series is that zero-copy Rx queues would
like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
and can coalesce payloads into pages much larger than the MTU.
Enabling larger buffers globally is a bit precarious as it exposes us
to potentially very inefficient memory use. Also allocating large
buffers may not be easy or cheap under load. Zero-copy queues service
only select traffic and have pre-allocated memory so the concerns don't
apply as much.

The per-queue config has to address 3 problems:
- user API
- driver API
- memory provider API

For user API the main question is whether we expose the config via
ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
than extending the ethtool RINGS_GET API. I worry slightly that queue
GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
setting we have in ethtool which is not going via RINGS_SET is
IRQ coalescing.

My goal for the driver API was to avoid complexity in the drivers.
The queue management API has gained two ops, responsible for preparing
configuration for a given queue, and validating whether the config
is supported. The validating is used both for NIC-wide and per-queue
changes. Queue alloc/start ops have a new "config" argument which
contains the current config for a given queue (we use queue restart
to apply per-queue settings). Outside of queue reset paths drivers
can call netdev_queue_config() which returns the config for an arbitrary
queue. Long story short I anticipate it to be used during ndo_open.

In the core I extended struct netdev_config with per queue settings.
All in all this isn't too far from what was there in my "queue API
prototype" a few years ago ..."

Kernel branch with all dependencies: 
git: https://github.com/isilence/linux.git zcrx/large-buffers-v2
url: https://github.com/isilence/linux/tree/zcrx/large-buffers-v2

Per queue configuration series:
[1] https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/

v2: - Add MAX_PAGE_ORDER check on pp init (Patch 1)
    - Applied comments rewording (Patch 2)
    - Adjust pp.max_len based on order (Patch 8)
    - Patch up mlx5 queue callbacks after rebase (Patch 12)
    - Minor ->queue_mgmt_ops refactoring (Patch 15)
    - Rebased to account for both fill level and agg_size_fac (Patch 17)
    - Pass the provider's buf length in struct pp_memory_provider_params
      and apply it in __netdev_queue_config(); see the sketch after this
      list. (Patch 22)
    - Use ->supported_ring_params to validate driver support of the
      requested qcfg parameters. (Patch 23)
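
As a rough picture of the Patch 22 flow: the provider asks for its buffer
length through pp_memory_provider_params, and the core applies it when the
queue is restarted. The rx_buf_len field name below is an assumption, and
the net_mp_open_rxq() argument list is abbreviated; only mp_priv/mp_ops and
io_uring_pp_zc_ops are taken from the existing zcrx binding code:

/* Sketch only, placed conceptually inside io_uring/zcrx.c */
static int zcrx_bind_queue(struct io_zcrx_ifq *ifq, struct net_device *dev,
                           unsigned int rxq, unsigned int buf_len)
{
        struct pp_memory_provider_params p = {
                .mp_priv    = ifq,
                .mp_ops     = &io_uring_pp_zc_ops,
                .rx_buf_len = buf_len,  /* assumed field name */
        };

        return net_mp_open_rxq(dev, rxq, &p);
}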

Jakub Kicinski (20):
  docs: ethtool: document that rx_buf_len must control payload lengths
  net: ethtool: report max value for rx-buf-len
  net: use zero value to restore rx_buf_len to default
  net: clarify the meaning of netdev_config members
  net: add rx_buf_len to netdev config
  eth: bnxt: read the page size from the adapter struct
  eth: bnxt: set page pool page order based on rx_page_size
  eth: bnxt: support setting size of agg buffers via ethtool
  net: move netdev_config manipulation to dedicated helpers
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: allocate per-queue config structs and pass them thru the queue
    API
  net: pass extack to netdev_rx_queue_restart()
  net: add queue config validation callback
  eth: bnxt: always set the queue mgmt ops
  eth: bnxt: store the rx buf size per queue
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  netdev: add support for setting rx-buf-len per queue
  net: wipe the setting of deactived queues
  eth: bnxt: use queue op config validate
  eth: bnxt: support per queue configuration of rx-buf-len

Pavel Begunkov (4):
  net: page_pool: sanitise allocation order
  net: let pp memory provider to specify rx buf len
  net: validate driver supports passed qcfg params
  io_uring/zcrx: implement large rx buffer support

 Documentation/netlink/specs/ethtool.yaml      |   4 +
 Documentation/netlink/specs/netdev.yaml       |  15 ++
 Documentation/networking/ethtool-netlink.rst  |   7 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 142 +++++++++++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |   2 +-
 drivers/net/ethernet/google/gve/gve_main.c    |   9 +-
 .../marvell/octeontx2/nic/otx2_ethtool.c      |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   9 +-
 drivers/net/netdevsim/netdev.c                |   8 +-
 include/linux/ethtool.h                       |   3 +
 include/net/netdev_queues.h                   |  84 ++++++--
 include/net/netdev_rx_queue.h                 |   3 +-
 include/net/netlink.h                         |  19 ++
 include/net/page_pool/types.h                 |   1 +
 .../uapi/linux/ethtool_netlink_generated.h    |   1 +
 include/uapi/linux/io_uring.h                 |   2 +-
 include/uapi/linux/netdev.h                   |   2 +
 io_uring/zcrx.c                               |  36 +++-
 net/core/Makefile                             |   2 +-
 net/core/dev.c                                |  12 +-
 net/core/dev.h                                |  15 ++
 net/core/netdev-genl-gen.c                    |  15 ++
 net/core/netdev-genl-gen.h                    |   1 +
 net/core/netdev-genl.c                        |  92 +++++++++
 net/core/netdev_config.c                      | 183 ++++++++++++++++++
 net/core/netdev_rx_queue.c                    |  22 ++-
 net/core/page_pool.c                          |   3 +
 net/ethtool/common.c                          |   4 +-
 net/ethtool/netlink.c                         |  14 +-
 net/ethtool/rings.c                           |  14 +-
 tools/include/uapi/linux/netdev.h             |   2 +
 34 files changed, 662 insertions(+), 90 deletions(-)
 create mode 100644 net/core/netdev_config.c

-- 
2.49.0
Re: [RFC v2 00/24] Per queue configs and large rx buffer support for zcrx
Posted by Dragos Tatulea 1 month, 3 weeks ago
Hi Pavel,

On Fri, Aug 08, 2025 at 03:54:23PM +0100, Pavel Begunkov wrote:
> [...] 
> For example, for 200Gbit broadcom NIC, 4K vs 32K buffers, and napi and
> userspace pinned to the same CPU:
> 
> packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>   0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
> packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>   0    0.69    0.00    8.26   31.65    1.83   57.00    0.57
> 
> And for napi and userspace on different CPUs:
> 
> packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>   0    0.10    0.00    0.50    0.00    0.50   74.50    24.40
>   1    4.51    0.00   44.33   47.22    2.08    1.85    0.00
> packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>   0    0.10    0.00    0.70    0.00    1.00   43.78   54.42
>   1    1.09    0.00   31.95   62.91    1.42    2.63    0.00
>
What did you use for this benchmark, send-zerocopy? Could you share a
branch and how you ran it please?

I have added some initial support to mlx5 for rx-buf-len and would like
to benchmark it and compare it to what you posted.

Thanks,
Dragos
Re: [RFC v2 00/24] Per queue configs and large rx buffer support for zcrx
Posted by Pavel Begunkov 1 month, 3 weeks ago
On 8/13/25 16:39, Dragos Tatulea wrote:
> Hi Pavel,
> 
> On Fri, Aug 08, 2025 at 03:54:23PM +0100, Pavel Begunkov wrote:
>> [...]
>> For example, for 200Gbit broadcom NIC, 4K vs 32K buffers, and napi and
>> userspace pinned to the same CPU:
>>
>> packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
>> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>>    0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
>> packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
>> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>>    0    0.69    0.00    8.26   31.65    1.83   57.00    0.57
>>
>> And for napi and userspace on different CPUs:
>>
>> packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
>> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>>    0    0.10    0.00    0.50    0.00    0.50   74.50    24.40
>>    1    4.51    0.00   44.33   47.22    2.08    1.85    0.00
>> packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
>> CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
>>    0    0.10    0.00    0.70    0.00    1.00   43.78   54.42
>>    1    1.09    0.00   31.95   62.91    1.42    2.63    0.00
>>
> What did you use for this benchmark, send-zerocopy? Could you share a
> branch and how you ran it please?
> 
> I have added some initial support to mlx5 for rx-buf-len and would like
> to benchmark it and compare it to what you posted.

You can use this branch:
https://github.com/isilence/liburing.git zcrx/rx-buf-len

# server
examples/zcrx -p <port> -q <queue_idx> -i <interface_name> -A1 \
              -B <rx_buf_len> -S <area size / memory provided>

"-A1" here is for using huge pages, so don't forget to configure
/proc/sys/vm/nr_hugepages.

# client
examples/send-zerocopy -6 tcp -D <ip addr> -p <port> \
                       -t <runtime secs> \
                       -l -b1 -n1 -z1 -d -s<send size>

I had to play with the client a bit for it to keep up with
the server: "-l" enables huge pages, and I had to bump up the
send size. You can also add -v to both for basic payload
verification.

-- 
Pavel Begunkov
Re: [RFC v2 00/24] Per queue configs and large rx buffer support for zcrx
Posted by Dragos Tatulea 1 month, 2 weeks ago
On Thu, Aug 14, 2025 at 11:46:35AM +0100, Pavel Begunkov wrote:
> On 8/13/25 16:39, Dragos Tatulea wrote:
> > Hi Pavel,
> > 
> > On Fri, Aug 08, 2025 at 03:54:23PM +0100, Pavel Begunkov wrote:
> > > [...]
> > > For example, for 200Gbit broadcom NIC, 4K vs 32K buffers, and napi and
> > > userspace pinned to the same CPU:
> > > 
> > > packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
> > > CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
> > >    0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
> > > packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
> > > CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
> > >    0    0.69    0.00    8.26   31.65    1.83   57.00    0.57
> > > 
> > > And for napi and userspace on different CPUs:
> > > 
> > > packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
> > > CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
> > >    0    0.10    0.00    0.50    0.00    0.50   74.50    24.40
> > >    1    4.51    0.00   44.33   47.22    2.08    1.85    0.00
> > > packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
> > > CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
> > >    0    0.10    0.00    0.70    0.00    1.00   43.78   54.42
> > >    1    1.09    0.00   31.95   62.91    1.42    2.63    0.00
> > > 
I forgot to ask: what is the MTU here?

> > What did you use for this benchmark, send-zerocopy? Could you share a
> > branch and how you ran it please?
> > 
> > I have added some initial support to mlx5 for rx-buf-len and would like
> > to benchmark it and compare it to what you posted.
> 
> You can use this branch:
> https://github.com/isilence/liburing.git zcrx/rx-buf-len
> 
> # server
> examples/zcrx -p <port> -q <queue_idx> -i <interface_name> -A1 \
>              -B <rx_buf_len> -S <area size / memory provided>
>
> "-A1" here is for using huge pages, so don't forget to configure
> /proc/sys/vm/nr_hugepages.
> 
> # client
> examples/send-zerocopy -6 tcp -D <ip addr> -p <port>
>                        -t <runtime secs>
>                        -l -b1 -n1 -z1 -d -s<send size>
>
Thanks a lot for the branch and the instructions Pavel! I am playing
with them now and seeing some preliminary good results. Will post
them once we share the patches.

> I had to play with the client a bit for it to keep up with
> the server. "-l" enables huge pages, and had to bump up the
> send size. You can also add -v to both for a basic payload
> verification.
>
I see what you mean. I also had to make the rx memory larger once
rx-buf-len >= 32K, otherwise the traffic was hanging after a second or
so. This is probably related to the known issue where mlx5 hangs on the
first refill if the page_pool ends up too small due to incorrect buffer
sizing. That still needs fixing.

Thanks,
Dragos
Re: [RFC v2 00/24] Per queue configs and large rx buffer support for zcrx
Posted by Pavel Begunkov 1 month, 2 weeks ago
On 8/15/25 17:44, Dragos Tatulea wrote:
> On Thu, Aug 14, 2025 at 11:46:35AM +0100, Pavel Begunkov wrote:
...>> "-A1" here is for using huge pages, so don't forget to configure
>> /proc/sys/vm/nr_hugepages.
>>
>> # client
>> examples/send-zerocopy -6 tcp -D <ip addr> -p <port>
>>                         -t <runtime secs>
>>                         -l -b1 -n1 -z1 -d -s<send size>
>>
> Thanks a lot for the branch and the instructions Pavel! I am playing
> with them now and seeing some preliminary good results. Will post
> them once we share the patches.

Sounds good

>> I had to play with the client a bit for it to keep up with
>> the server. "-l" enables huge pages, and had to bump up the
>> send size. You can also add -v to both for a basic payload
>> verification.
>>
> I see what you mean. I also had to make the rx memory larger once

Forgot to mention that the tool doesn't consider CPU affinities,
so I had to configure task and irq affinities by hand.

> rx-buf-len >= 32K. Otherwise the traffic was hanging after a second or
> so. This is probably related to the currently known issue where if a
> page_pool is too small due to incorrect sizing of the buffer, mlx5 hangs
> on first refill. That still needs fixing.

-- 
Pavel Begunkov