[PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers

Pavel Begunkov posted 24 patches 2 months ago
Posted by Pavel Begunkov 2 months ago
Add support for per-queue rx buffer length configuration based on [2]
and basic infrastructure for using it in memory providers like
io_uring/zcrx. Note, it only includes net/ patches and leaves out
zcrx to be merged separately. Large rx buffers can be beneficial with
hw-gro enabled cards that can coalesce traffic, which reduces the
number of frags traversing the network stack and results in larger
contiguous chunks of data given to userspace.
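
As a rough illustration of the frag reduction (a sketch, not code from the
series): the number of rx buffers a coalesced payload occupies is its length
divided by the buffer size, rounded up. The 64 KiB payload is an arbitrary
figure assumed for the example, and frags_needed() is a hypothetical helper.

```python
def frags_needed(payload_len: int, rx_buf_len: int) -> int:
    """Number of rx buffers needed to hold payload_len bytes (ceiling division)."""
    if payload_len < 0 or rx_buf_len <= 0:
        raise ValueError("payload_len must be >= 0 and rx_buf_len must be > 0")
    return -(-payload_len // rx_buf_len)


# Assumed coalesced payload size, for illustration only.
PAYLOAD = 64 * 1024

for buf_len in (4 * 1024, 32 * 1024):
    print(f"{buf_len // 1024}K buffers: {frags_needed(PAYLOAD, buf_len)} frags")
```

Under that assumption the same payload takes 16 frags with 4K buffers and
2 frags with 32K buffers.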

Benchmarks with zcrx [1+3] show up to ~30% improvement in CPU util.
E.g. comparison for 4K vs 32K buffers with a 200Gbit NIC, napi and
userspace pinned to the same CPU:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57
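
One way to arrive at a figure close to ~30% from the two samples above is
sketched below. It rests on two assumptions that the output itself does not
state: that the first sample is the 4K run and the second the 32K run, and
that "CPU util" means the sum of %usr, %nice, %sys, %irq and %soft (that is,
%iowait and %idle are not counted as busy time).

```python
COLUMNS = ("usr", "nice", "sys", "iowait", "irq", "soft", "idle")

# Values copied from the two samples above; the 4K/32K labels are assumed.
RUN_4K = dict(zip(COLUMNS, (1.53, 0.00, 27.78, 2.72, 1.31, 66.45, 0.22)))
RUN_32K = dict(zip(COLUMNS, (0.69, 0.00, 8.26, 31.65, 1.83, 57.00, 0.57)))

# Assumption: iowait and idle are treated as non-busy time.
BUSY_COLUMNS = ("usr", "nice", "sys", "irq", "soft")


def busy_percent(row: dict) -> float:
    """Sum of the columns counted as busy CPU time."""
    return sum(row[col] for col in BUSY_COLUMNS)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from before to after."""
    return (before - after) / before


reduction = relative_reduction(busy_percent(RUN_4K), busy_percent(RUN_32K))
print(f"busy: {busy_percent(RUN_4K):.2f}% -> {busy_percent(RUN_32K):.2f}%")
print(f"relative reduction: {reduction:.1%}")
```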

netdev + zcrx changes:
[1] https://github.com/isilence/linux.git zcrx/large-buffers-v4

Per queue configuration series:
[2] https://lore.kernel.org/all/20250421222827.283737-1-kuba@kernel.org/

Liburing example:
[3] https://github.com/isilence/liburing.git zcrx/rx-buf-len

---
The following changes since commit 3a8660878839faadb4f1a6dd72c3179c1df56787:

  Linux 6.18-rc1 (2025-10-12 13:42:36 -0700)

are available in the Git repository at:

  https://github.com/isilence/linux.git tags/net-for-6.19-queue-rx-buf-len

for you to fetch changes up to bc5737ba2a1e5586408cd0398b2db0f218ed3e89:

  net: validate driver supports passed qcfg params (2025-10-13 10:04:05 +0100)


v4: - Update fbnic qops
    - Propagate max buf len for hns3
    - Use configured buf size in __bnxt_alloc_rx_netmem
    - Minor stylistic changes
v3: https://lore.kernel.org/all/cover.1755499375.git.asml.silence@gmail.com/
    - Rebased, excluded zcrx specific patches
    - Set agg_size_fac to 1 on warning
v2: https://lore.kernel.org/all/cover.1754657711.git.asml.silence@gmail.com/
    - Add MAX_PAGE_ORDER check on pp init
    - Applied comments rewording
    - Adjust pp.max_len based on order
    - Patch up mlx5 queue callbacks after rebase
    - Minor ->queue_mgmt_ops refactoring
    - Rebased to account for both fill level and agg_size_fac
    - Pass the provider's buf length in struct pp_memory_provider_params and
      apply it in __netdev_queue_config().
    - Use ->supported_ring_params to validate the driver's support of set
      qcfg parameters.
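
The "MAX_PAGE_ORDER check on pp init" and "Adjust pp.max_len based on order"
items can be modelled roughly as follows. This is a userspace sketch of the
arithmetic only, not the kernel code: PAGE_SIZE = 4096 and MAX_PAGE_ORDER = 10
are assumed values (both depend on architecture and configuration), and the
helper names are hypothetical.

```python
PAGE_SIZE = 4096      # assumed; architecture dependent
MAX_PAGE_ORDER = 10   # assumed; depends on kernel configuration


def buf_len_to_order(rx_buf_len: int) -> int:
    """Smallest order such that PAGE_SIZE << order covers rx_buf_len."""
    if rx_buf_len <= 0:
        raise ValueError("rx_buf_len must be positive")
    pages = -(-rx_buf_len // PAGE_SIZE)      # round up to whole pages
    order = (pages - 1).bit_length()         # round up to a power of two
    if order > MAX_PAGE_ORDER:
        raise ValueError(f"order {order} exceeds MAX_PAGE_ORDER")
    return order


def max_len_for_order(order: int) -> int:
    """Largest buffer length a page of the given order can hold."""
    return PAGE_SIZE << order


order = buf_len_to_order(32 * 1024)
print(order, max_len_for_order(order))
```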

Jakub Kicinski (20):
  docs: ethtool: document that rx_buf_len must control payload lengths
  net: ethtool: report max value for rx-buf-len
  net: use zero value to restore rx_buf_len to default
  net: clarify the meaning of netdev_config members
  net: add rx_buf_len to netdev config
  eth: bnxt: read the page size from the adapter struct
  eth: bnxt: set page pool page order based on rx_page_size
  eth: bnxt: support setting size of agg buffers via ethtool
  net: move netdev_config manipulation to dedicated helpers
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: allocate per-queue config structs and pass them thru the queue
    API
  net: pass extack to netdev_rx_queue_restart()
  net: add queue config validation callback
  eth: bnxt: always set the queue mgmt ops
  eth: bnxt: store the rx buf size per queue
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  netdev: add support for setting rx-buf-len per queue
  net: wipe the setting of deactived queues
  eth: bnxt: use queue op config validate
  eth: bnxt: support per queue configuration of rx-buf-len

Pavel Begunkov (4):
  net: page_pool: sanitise allocation order
  net: hns3: net: use zero to restore rx_buf_len to default
  net: let pp memory provider to specify rx buf len
  net: validate driver supports passed qcfg params

 Documentation/netlink/specs/ethtool.yaml      |   4 +
 Documentation/netlink/specs/netdev.yaml       |  15 ++
 Documentation/networking/ethtool-netlink.rst  |   7 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 148 +++++++++++---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |   2 +-
 drivers/net/ethernet/google/gve/gve_main.c    |   9 +-
 .../ethernet/hisilicon/hns3/hns3_ethtool.c    |  10 +-
 .../marvell/octeontx2/nic/otx2_ethtool.c      |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  10 +-
 drivers/net/ethernet/meta/fbnic/fbnic_txrx.c  |   8 +-
 drivers/net/netdevsim/netdev.c                |   8 +-
 include/linux/ethtool.h                       |   3 +
 include/net/netdev_queues.h                   |  88 +++++++--
 include/net/netdev_rx_queue.h                 |   3 +-
 include/net/netlink.h                         |  19 ++
 include/net/page_pool/types.h                 |   1 +
 .../uapi/linux/ethtool_netlink_generated.h    |   1 +
 include/uapi/linux/netdev.h                   |   2 +
 net/core/Makefile                             |   1 +
 net/core/dev.c                                |  12 +-
 net/core/dev.h                                |  15 ++
 net/core/netdev-genl-gen.c                    |  15 ++
 net/core/netdev-genl-gen.h                    |   1 +
 net/core/netdev-genl.c                        |  92 +++++++++
 net/core/netdev_config.c                      | 183 ++++++++++++++++++
 net/core/netdev_rx_queue.c                    |  22 ++-
 net/core/page_pool.c                          |   3 +
 net/ethtool/common.c                          |   4 +-
 net/ethtool/netlink.c                         |  14 +-
 net/ethtool/rings.c                           |  14 +-
 tools/include/uapi/linux/netdev.h             |   2 +
 34 files changed, 650 insertions(+), 92 deletions(-)
 create mode 100644 net/core/netdev_config.c

-- 
2.49.0
Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Posted by Pavel Begunkov 2 months ago
On 10/13/25 15:54, Pavel Begunkov wrote:

Forgot to CC io_uring

> Add support for per-queue rx buffer length configuration based on [2]
> and basic infrastructure for using it in memory providers like
> io_uring/zcrx. Note, it only includes net/ patches and leaves out
> zcrx to be merged separately. Large rx buffers can be beneficial with
> hw-gro enabled cards that can coalesce traffic, which reduces the
> number of frags traversing the network stack and results in larger
> contiguous chunks of data given to userspace.

Same note as last time: it's not great that it's over the 15-patch limit,
but I don't see a good way to shrink it considering that the original
series [2] is 22 patches long, and I'll somehow need to pull it
into the io_uring tree after. Please let me know if there is a strong
feeling about that, and/or what the preferred way would be.

-- 
Pavel Begunkov
Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Posted by Jakub Kicinski 2 months ago
On Mon, 13 Oct 2025 15:54:02 +0100 Pavel Begunkov wrote:
> Jakub Kicinski (20):
>   docs: ethtool: document that rx_buf_len must control payload lengths
>   net: ethtool: report max value for rx-buf-len
>   net: use zero value to restore rx_buf_len to default
>   net: clarify the meaning of netdev_config members
>   net: add rx_buf_len to netdev config
>   eth: bnxt: read the page size from the adapter struct
>   eth: bnxt: set page pool page order based on rx_page_size
>   eth: bnxt: support setting size of agg buffers via ethtool
>   net: move netdev_config manipulation to dedicated helpers
>   net: reduce indent of struct netdev_queue_mgmt_ops members
>   net: allocate per-queue config structs and pass them thru the queue
>     API
>   net: pass extack to netdev_rx_queue_restart()
>   net: add queue config validation callback
>   eth: bnxt: always set the queue mgmt ops
>   eth: bnxt: store the rx buf size per queue
>   eth: bnxt: adjust the fill level of agg queues with larger buffers
>   netdev: add support for setting rx-buf-len per queue
>   net: wipe the setting of deactived queues
>   eth: bnxt: use queue op config validate
>   eth: bnxt: support per queue configuration of rx-buf-len

I'd like to rework these a little bit.
On reflection I don't like the single size control.
Please hold off.

Also what's the resolution for the maintainers entry / cross posting?
Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Posted by Pavel Begunkov 2 months ago
On 10/13/25 18:54, Jakub Kicinski wrote:
> On Mon, 13 Oct 2025 15:54:02 +0100 Pavel Begunkov wrote:
>> Jakub Kicinski (20):
>>    docs: ethtool: document that rx_buf_len must control payload lengths
>>    net: ethtool: report max value for rx-buf-len
>>    net: use zero value to restore rx_buf_len to default
>>    net: clarify the meaning of netdev_config members
>>    net: add rx_buf_len to netdev config
>>    eth: bnxt: read the page size from the adapter struct
>>    eth: bnxt: set page pool page order based on rx_page_size
>>    eth: bnxt: support setting size of agg buffers via ethtool
>>    net: move netdev_config manipulation to dedicated helpers
>>    net: reduce indent of struct netdev_queue_mgmt_ops members
>>    net: allocate per-queue config structs and pass them thru the queue
>>      API
>>    net: pass extack to netdev_rx_queue_restart()
>>    net: add queue config validation callback
>>    eth: bnxt: always set the queue mgmt ops
>>    eth: bnxt: store the rx buf size per queue
>>    eth: bnxt: adjust the fill level of agg queues with larger buffers
>>    netdev: add support for setting rx-buf-len per queue
>>    net: wipe the setting of deactived queues
>>    eth: bnxt: use queue op config validate
>>    eth: bnxt: support per queue configuration of rx-buf-len
> 
> I'd like to rework these a little bit.
> On reflection I don't like the single size control.
> Please hold off.

I think that would be quite unproductive considering that this series
has been around for 3 months already with no forward progress, and the
API was posted 6 months ago. I have a better idea: I'll shrink it down
by removing all unnecessary parts, which makes it much simpler and
should detangle the effort from the ethtool bits, like Stan once suggested.
I've also been bothered for some time by it growing to 24 patches, so it'll
help with that as well. And it'll be a good base to put all the netlink
configuration bits on top of if necessary.

> Also what's the resolution for the maintainers entry / cross posting?

I'm pretty much interested as well :) I've been CC'ing netdev as a
gesture of goodwill, despite you blocking an unrelated series
because of a rule you made up and retrospectively applied, and belittling
my work after. It doesn't seem that you're content with it either,
as is evident from you blocking it again. I'm very curious what that's all
about. And since you're unwilling to deal with the series, maybe you'll
let other maintainers handle it?

-- 
Pavel Begunkov
Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Posted by Mina Almasry 2 months ago
On Mon, Oct 13, 2025 at 10:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 13 Oct 2025 15:54:02 +0100 Pavel Begunkov wrote:
> > Jakub Kicinski (20):
> >   docs: ethtool: document that rx_buf_len must control payload lengths
> >   net: ethtool: report max value for rx-buf-len
> >   net: use zero value to restore rx_buf_len to default
> >   net: clarify the meaning of netdev_config members
> >   net: add rx_buf_len to netdev config
> >   eth: bnxt: read the page size from the adapter struct
> >   eth: bnxt: set page pool page order based on rx_page_size
> >   eth: bnxt: support setting size of agg buffers via ethtool
> >   net: move netdev_config manipulation to dedicated helpers
> >   net: reduce indent of struct netdev_queue_mgmt_ops members
> >   net: allocate per-queue config structs and pass them thru the queue
> >     API
> >   net: pass extack to netdev_rx_queue_restart()
> >   net: add queue config validation callback
> >   eth: bnxt: always set the queue mgmt ops
> >   eth: bnxt: store the rx buf size per queue
> >   eth: bnxt: adjust the fill level of agg queues with larger buffers
> >   netdev: add support for setting rx-buf-len per queue
> >   net: wipe the setting of deactived queues
> >   eth: bnxt: use queue op config validate
> >   eth: bnxt: support per queue configuration of rx-buf-len
>
> I'd like to rework these a little bit.
> On reflection I don't like the single size control.
> Please hold off.
>

FWIW when I last looked at this I didn't like that the size control
seemed to control the size of the allocations made from the pp, but
not the size actually posted to the NIC.

I.e. in the scenario where the driver fragments each pp buffer into 2,
and the user asks for 8K rx-buf-len, the size actually posted to the
NIC would have actually been 4K (8K / 2 for 2 fragments).
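
The arithmetic of that scenario, as a minimal sketch (posted_buf_len() is a
hypothetical helper, and an even split of each page pool buffer is assumed):

```python
def posted_buf_len(rx_buf_len: int, frags_per_pp_buf: int) -> int:
    """Size of each buffer posted to the NIC when the driver splits one
    page pool allocation of rx_buf_len bytes into equal fragments."""
    if frags_per_pp_buf <= 0:
        raise ValueError("frags_per_pp_buf must be positive")
    if rx_buf_len % frags_per_pp_buf:
        raise ValueError("rx_buf_len is not evenly divisible")
    return rx_buf_len // frags_per_pp_buf


# User asks for 8K rx-buf-len, driver fragments each pp buffer into 2.
print(posted_buf_len(8 * 1024, 2))
```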

Not sure how much of a concern this really is. I thought it would be
great if somehow rx-buf-len controlled the buffer sizes actually
posted to the NIC, because that's what ultimately matters, no (it ends
up being the size of the incoming frags)? Or does that not matter for
some reason I'm missing?

-- 
Thanks,
Mina
Re: [PATCH net-next v4 00/24][pull request] Queue configs and large buffer providers
Posted by Pavel Begunkov 2 months ago
On 10/14/25 05:41, Mina Almasry wrote:
> On Mon, Oct 13, 2025 at 10:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Mon, 13 Oct 2025 15:54:02 +0100 Pavel Begunkov wrote:
>>> Jakub Kicinski (20):
>>>    docs: ethtool: document that rx_buf_len must control payload lengths
>>>    net: ethtool: report max value for rx-buf-len
>>>    net: use zero value to restore rx_buf_len to default
>>>    net: clarify the meaning of netdev_config members
>>>    net: add rx_buf_len to netdev config
>>>    eth: bnxt: read the page size from the adapter struct
>>>    eth: bnxt: set page pool page order based on rx_page_size
>>>    eth: bnxt: support setting size of agg buffers via ethtool
>>>    net: move netdev_config manipulation to dedicated helpers
>>>    net: reduce indent of struct netdev_queue_mgmt_ops members
>>>    net: allocate per-queue config structs and pass them thru the queue
>>>      API
>>>    net: pass extack to netdev_rx_queue_restart()
>>>    net: add queue config validation callback
>>>    eth: bnxt: always set the queue mgmt ops
>>>    eth: bnxt: store the rx buf size per queue
>>>    eth: bnxt: adjust the fill level of agg queues with larger buffers
>>>    netdev: add support for setting rx-buf-len per queue
>>>    net: wipe the setting of deactived queues
>>>    eth: bnxt: use queue op config validate
>>>    eth: bnxt: support per queue configuration of rx-buf-len
>>
>> I'd like to rework these a little bit.
>> On reflection I don't like the single size control.
>> Please hold off.
>>
> 
> FWIW when I last looked at this I didn't like that the size control
> seemed to control the size of the allocations made from the pp, but
> not the size actually posted to the NIC.
> 
> I.e. in the scenario where the driver fragments each pp buffer into 2,
> and the user asks for 8K rx-buf-len, the size actually posted to the
> NIC would have actually been 4K (8K / 2 for 2 fragments).
> 
> Not sure how much of a concern this really is. I thought it would be
> great if somehow rx-buf-len controlled the buffer sizes actually
> posted to the NIC, because that's what ultimately matters, no (it ends
> up being the size of the incoming frags)? Or does that not matter for
> some reason I'm missing?

Maybe we should just make a rule that if the hardware doesn't support
the given size, qops should fail, but ultimately userspace
should be able to handle it either way, as GRO packing is not
100% reliable.
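
A minimal model of such a rule, for illustration only. The names below are
hypothetical and this is not the queue ops API; the "zero restores the
default" behaviour follows the patch of that name in the series, and the set
of supported sizes is an assumed input.

```python
class QueueConfigError(Exception):
    """Raised when the requested rx buffer size cannot be honoured."""


def validate_rx_buf_len(requested: int, supported: set, default: int) -> int:
    """Return the effective rx buffer size, or fail instead of silently
    falling back to a size the hardware does support."""
    if requested == 0:
        # Zero means "restore the default".
        return default
    if requested not in supported:
        raise QueueConfigError(f"rx-buf-len {requested} not supported")
    return requested


SUPPORTED = {4096, 8192, 16384, 32768}   # assumed hardware capabilities
print(validate_rx_buf_len(32768, SUPPORTED, default=4096))
```

Failing early keeps the configured value and the size posted to the NIC in
agreement, while userspace still has to cope with frags smaller than the
configured size.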

-- 
Pavel Begunkov