[PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy

Tariq Toukan posted 11 patches 6 months, 3 weeks ago
There is a newer version of this series
[PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Tariq Toukan 6 months, 3 weeks ago
This series from the team adds support for zero-copy TCP RX with devmem and
io_uring for ConnectX7 NICs and above. For performance and simplicity, HW-GRO
is also enabled when header-data split mode is on.

Find more details below.

Regards,
Tariq

Performance
===========

Test setup:

* CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (single NUMA)
* NIC: ConnectX7
* Benchmarking tool: kperf [1]
* Single TCP flow
* Test duration: 60s

With application thread and interrupts pinned to the *same* core:

|------+-----------+----------|
| MTU  | epoll     | io_uring |
|------+-----------+----------|
| 1500 | 61.6 Gbps | 114 Gbps |
| 4096 | 69.3 Gbps | 151 Gbps |
| 9000 | 67.8 Gbps | 187 Gbps |
|------+-----------+----------|

The CPU usage for io_uring is 95%.

Reproduction steps for io_uring:

server --no-daemon -a 2001:db8::1 --no-memcmp --iou --iou_sendzc \
        --iou_zcrx --iou_dev_name eth2 --iou_zcrx_queue_id 2

server --no-daemon -a 2001:db8::2 --no-memcmp --iou --iou_sendzc

client --src 2001:db8::2 --dst 2001:db8::1 \
        --msg-zerocopy -t 60 --cpu-min=2 --cpu-max=2

Patch overview
==============

First, a netmem-aware variant of skb_can_coalesce() is added to the core so
that skb fragment coalescing can also be done on netmems.
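
For reference, a minimal sketch of what such a helper could look like, assuming
it simply mirrors the existing page-based skb_can_coalesce() with netmem_ref in
place of struct page (see the patch itself for the final form):

static inline bool skb_can_coalesce_netmem(struct sk_buff *skb, int i,
                                           netmem_ref netmem, int off)
{
        if (skb_zcopy(skb))
                return false;
        if (i) {
                const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];

                /* Coalesce only if the new chunk continues the last frag
                 * within the same netmem at the matching offset.
                 */
                return netmem == skb_frag_netmem(frag) &&
                       off == skb_frag_off(frag) + skb_frag_size(frag);
        }
        return false;
}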

The next patches introduce cleanups in the internal SHAMPO code and improve
the HW-GRO capability checks against FW capabilities.

A separate page_pool is introduced for headers. Ethtool stats are added
as well.

Then the driver is converted to the netmem APIs and extended to support
unreadable netmem page pools.
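
As an illustration (names and values below are illustrative, not the exact
driver code): the RX data page_pool opts in to unreadable netmem by setting
PP_FLAG_ALLOW_UNREADABLE_NETMEM and binding the pool to a specific queue, while
the header page_pool keeps using regular, readable pages:

        struct page_pool_params pp_params = {
                .flags          = PP_FLAG_DMA_MAP |
                                  PP_FLAG_ALLOW_UNREADABLE_NETMEM,
                .pool_size      = pool_size,       /* illustrative */
                .nid            = node,
                .dev            = dev,
                .napi           = napi,
                .dma_dir        = DMA_FROM_DEVICE,
                .netdev         = netdev,
                .queue_idx      = rq_index,        /* binds the pool to this RX queue */
        };
        struct page_pool *pool = page_pool_create(&pp_params);

        /* Data buffers are then handled as netmem_ref (e.g. allocated via
         * page_pool_alloc_netmems()) instead of struct page, so the driver
         * never touches the payload of unreadable (devmem/io_uring) memory.
         */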

The queue management ops are implemented.
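
A rough sketch of the shape of this (the mlx5e_* callback names and the
per-queue memory size are placeholders, not necessarily what the patch uses):

        static const struct netdev_queue_mgmt_ops mlx5e_queue_mgmt_ops = {
                .ndo_queue_mem_size     = sizeof(struct mlx5e_channel *), /* placeholder */
                .ndo_queue_mem_alloc    = mlx5e_queue_mem_alloc,
                .ndo_queue_mem_free     = mlx5e_queue_mem_free,
                .ndo_queue_start        = mlx5e_queue_start,
                .ndo_queue_stop         = mlx5e_queue_stop,
        };

        /* during netdev setup */
        netdev->queue_mgmt_ops = &mlx5e_queue_mgmt_ops;

This lets the core stop, re-allocate and restart a single channel, which is
what devmem/io_uring binding needs when attaching memory to one RX queue.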

Finally, the tcp-data-split ring parameter is exposed.
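
On the ethtool side this boils down to advertising ETHTOOL_RING_USE_TCP_DATA_SPLIT
in supported_ring_params and reporting/accepting the state via the ringparam ops.
A hedged sketch (mlx5e_rx_is_hd_split() is a hypothetical helper standing in for
however the driver tracks header-data split state):

        static void mlx5e_get_ringparam(struct net_device *dev,
                                        struct ethtool_ringparam *param,
                                        struct kernel_ethtool_ringparam *kparam,
                                        struct netlink_ext_ack *extack)
        {
                struct mlx5e_priv *priv = netdev_priv(dev);

                /* ring sizes are filled in as before (omitted here) */
                kparam->tcp_data_split = mlx5e_rx_is_hd_split(priv) ?
                        ETHTOOL_TCP_DATA_SPLIT_ENABLED :
                        ETHTOOL_TCP_DATA_SPLIT_DISABLED;
        }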

Changelog
=========

Changes from v1 [0]:
- Added support for skb_can_coalesce_netmem().
- Avoid netmem_to_page() casts in the driver.
- Fixed code to abide by the 80-character limit, with some exceptions to
  avoid code churn.

References
==========

[0] v1: https://lore.kernel.org/all/20250116215530.158886-1-saeed@kernel.org/
[1] kperf: git://git.kernel.dk/kperf.git


Dragos Tatulea (1):
  net: Add skb_can_coalesce for netmem

Saeed Mahameed (10):
  net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR
  net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc
  net/mlx5e: SHAMPO: Remove redundant params
  net/mlx5e: SHAMPO: Improve hw gro capability checking
  net/mlx5e: SHAMPO: Separate pool for headers
  net/mlx5e: SHAMPO: Headers page pool stats
  net/mlx5e: Convert over to netmem
  net/mlx5e: Add support for UNREADABLE netmem page pools
  net/mlx5e: Implement queue mgmt ops and single channel swap
  net/mlx5e: Support ethtool tcp-data-split settings

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  11 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |  36 ++-
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  50 ++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 281 +++++++++++++-----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 136 +++++----
 .../ethernet/mellanox/mlx5/core/en_stats.c    |  53 ++++
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  24 ++
 include/linux/skbuff.h                        |  12 +
 net/Kconfig                                   |   2 +-
 9 files changed, 445 insertions(+), 160 deletions(-)


base-commit: 33e1b1b3991ba8c0d02b2324a582e084272205d6
-- 
2.31.1
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Stanislav Fomichev 6 months, 2 weeks ago
On 05/23, Tariq Toukan wrote:
> [...]

Since there is gonna be 2-3 weeks of closed net-next, can you
also add a patch for the tx side? It should be trivial (skip dma unmap
for niovs in tx completions plus netdev->netmem_tx=1).
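
Something along these lines, I think (illustrative only, not actual mlx5e code):

        /* at probe time: declare netmem TX support */
        netdev->netmem_tx = true;

        /* in the TX completion path: niov frags come from a dma-buf whose
         * mapping is owned by the core, so the driver must not unmap them
         */
        if (!netmem_is_net_iov(skb_frag_netmem(frag)))
                dma_unmap_page(dma_dev, dma_addr, skb_frag_size(frag),
                               DMA_TO_DEVICE);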

And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
in this series?

0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Dragos Tatulea 6 months, 2 weeks ago
On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> On 05/23, Tariq Toukan wrote:
> > [...]
> 
> Since there is gonna be 2-3 weeks of closed net-next, can you
> also add a patch for the tx side? It should be trivial (skip dma unmap
> for niovs in tx completions plus netdev->netmem_tx=1).
>
Seems indeed trivial. We will add it.

> And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> in this series?
> 
> 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
We wanted to fix this afterwards as it needs to change a more subtle
part in the code that replenishes pages. This needs more thinking and
testing.

Thanks,
Dragos
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Stanislav Fomichev 6 months, 2 weeks ago
On 05/28, Dragos Tatulea wrote:
> On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > On 05/23, Tariq Toukan wrote:
> > > [...]
> > 
> > Since there is gonna be 2-3 weeks of closed net-next, can you
> > also add a patch for the tx side? It should be trivial (skip dma unmap
> > for niovs in tx completions plus netdev->netmem_tx=1).
> >
> Seems indeed trivial. We will add it.
> 
> > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > in this series?
> > 
> > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> We wanted to fix this afterwards as it needs to change a more subtle
> part in the code that replenishes pages. This needs more thinking and
> testing.

Thanks! For my understanding: does the issue occur only during initial
queue refill? Or the same problem will happen any time there is a burst
of traffic that might exhaust all rx descriptors?
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Mina Almasry 6 months, 2 weeks ago
On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
>
> On 05/28, Dragos Tatulea wrote:
> > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > On 05/23, Tariq Toukan wrote:
> > > > [...]
> > >
> > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > for niovs in tx completions plus netdev->netmem_tx=1).
> > >
> > Seems indeed trivial. We will add it.
> >
> > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > in this series?
> > >
> > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > We wanted to fix this afterwards as it needs to change a more subtle
> > part in the code that replenishes pages. This needs more thinking and
> > testing.
>
> Thanks! For my understanding: does the issue occur only during initial
> queue refill? Or the same problem will happen any time there is a burst
> of traffic that might exhaust all rx descriptors?
>

Minor: a burst in traffic likely won't reproduce this case, I'm sure
mlx5 can drive the hardware to line rate consistently. It's more if
the machine is under extreme memory pressure, I think,
page_pool_alloc_pages and friends may return ENOMEM, which reproduces
the same edge case as the dma-buf being extremely small which also
makes page_pool_alloc_netmems return -ENOMEM.

-- 
Thanks,
Mina
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Stanislav Fomichev 6 months, 2 weeks ago
On 05/28, Mina Almasry wrote:
> On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> >
> > On 05/28, Dragos Tatulea wrote:
> > > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > > On 05/23, Tariq Toukan wrote:
> > > > > [...]
> > > >
> > > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > > for niovs in tx completions plus netdev->netmem_tx=1).
> > > >
> > > Seems indeed trivial. We will add it.
> > >
> > > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > > in this series?
> > > >
> > > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > > We wanted to fix this afterwards as it needs to change a more subtle
> > > part in the code that replenishes pages. This needs more thinking and
> > > testing.
> >
> > Thanks! For my understanding: does the issue occur only during initial
> > queue refill? Or the same problem will happen any time there is a burst
> > of traffic that might exhaust all rx descriptors?
> >
> 
> Minor: a burst in traffic likely won't reproduce this case, I'm sure
> mlx5 can drive the hardware to line rate consistently. It's more if
> the machine is under extreme memory pressure, I think,
> page_pool_alloc_pages and friends may return ENOMEM, which reproduces
> the same edge case as the dma-buf being extremely small which also
> makes page_pool_alloc_netmems return -ENOMEM.

What I want to understand is whether the kernel/driver will oops when dmabuf
runs out of buffers after initial setup. Either traffic burst and/or userspace
being slow on refill - doesn't matter.
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Dragos Tatulea 6 months, 2 weeks ago
On Wed, May 28, 2025 at 04:04:18PM -0700, Stanislav Fomichev wrote:
> On 05/28, Mina Almasry wrote:
> > On Wed, May 28, 2025 at 8:45 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:
> > >
> > > On 05/28, Dragos Tatulea wrote:
> > > > On Tue, May 27, 2025 at 09:05:49AM -0700, Stanislav Fomichev wrote:
> > > > > On 05/23, Tariq Toukan wrote:
> > > > > > [...]
> > > > >
> > > > > Since there is gonna be 2-3 weeks of closed net-next, can you
> > > > > also add a patch for the tx side? It should be trivial (skip dma unmap
> > > > > for niovs in tx completions plus netdev->netmem_tx=1).
> > > > >
> > > > Seems indeed trivial. We will add it.
> > > >
> > > > > And, btw, what about the issue that Cosmin raised in [0]? Is it addressed
> > > > > in this series?
> > > > >
> > > > > 0: https://lore.kernel.org/netdev/9322c3c4826ed1072ddc9a2103cc641060665864.camel@nvidia.com/
> > > > We wanted to fix this afterwards as it needs to change a more subtle
> > > > part in the code that replenishes pages. This needs more thinking and
> > > > testing.
> > >
> > > Thanks! For my understanding: does the issue occur only during initial
> > > queue refill? Or the same problem will happen any time there is a burst
> > > of traffic that might exhaust all rx descriptors?
> > >
> > 
> > Minor: a burst in traffic likely won't reproduce this case, I'm sure
> > mlx5 can drive the hardware to line rate consistently. It's more if
> > the machine is under extreme memory pressure, I think,
> > page_pool_alloc_pages and friends may return ENOMEM, which reproduces
> > the same edge case as the dma-buf being extremely small which also
> > makes page_pool_alloc_netmems return -ENOMEM.
> 
> What I want to understand is whether the kernel/driver will oops when dmabuf
> runs out of buffers after initial setup. Either traffic burst and/or userspace
> being slow on refill - doesn't matter.
There is no OOPS but the queue can't handle more traffic because it
can't allocate more buffers and it can't release old buffers either.

AFAIU from Cosmin the condition happened on initial queue fill when
there are no buffers to be released for the current WQE.

Thanks,
Dragos
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Cosmin Ratiu 6 months, 1 week ago
On Thu, 2025-05-29 at 11:11 +0000, Dragos Tatulea wrote:
> 
> AFAIU from Cosmin the condition happened on initial queue fill when
> there are no buffers to be released for the current WQE.

The issue happens when there isn't enough memory in the pool to
completely fill the rx ring with descriptors, and then rx eventually
fully stops once the posted descriptors get exhausted, because the ring
refill logic will actually only release ring memory back to the pool
from ring_tail, when ring_head == ring_tail (for cache efficiency).
This means if the ring cannot be completely filled, memory never gets
released because ring_head != ring_tail.

The easy workaround is to have a pool with enough memory to let the rx
ring completely fill up. I suspect in real life this is easily the
case, but in the contrived ncdevmem test with udmabuf memory of 128 MB
and artificially high ring size & MTU this corner case was hit.

As Dragos said, we will look into this after the code has been posted.

Cosmin.
Re: [PATCH net-next V2 00/11] net/mlx5e: Add support for devmem and io_uring TCP zero-copy
Posted by Jakub Kicinski 6 months, 2 weeks ago
On Fri, 23 May 2025 00:41:15 +0300 Tariq Toukan wrote:
>   net: Kconfig NET_DEVMEM selects GENERIC_ALLOCATOR

I'll apply this one already, seems like a good cleanup.