[PATCH net 4/5] net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ

Tariq Toukan posted 5 patches 1 month ago
From: Dragos Tatulea <dtatulea@nvidia.com>

XDP multi-buf programs can modify the layout of the XDP buffer when the
program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The
commit referenced in the Fixes tag corrected the mlx5 driver's
assumption that the XDP buffer layout doesn't change during program
execution. However, that fix introduced another issue: the dropped
fragments still need to be counted on the driver side to avoid page
fragment reference counting issues.

The issue was discovered by the drivers/net/xdp.py selftest,
more specifically the test_xdp_native_tx_mb:
- The mlx5 driver allocates a page_pool page and initializes it with
  a frag counter of 64 (pp_ref_count=64); the driver's internal frag
  counter starts at 0.
- The test sends one packet with no payload.
- On RX (mlx5e_skb_from_cqe_mpwrq_nonlinear()), mlx5 configures the XDP
  buffer with the packet data starting in the first fragment which is the
  page mentioned above.
- The XDP program runs and calls bpf_xdp_pull_data() which moves the
  header into the linear part of the XDP buffer. As the packet doesn't
  contain more data, the program drops the tail fragment since it no
  longer contains any payload (pp_ref_count=63).
- The mlx5 driver skips counting this fragment, so its internal frag
  counter remains 0.
- The driver then releases all 64 fragments of the page, but the page's
  pp_ref_count is only 63 => negative reference counting error.

Resulting splat during the test:

  WARNING: CPU: 0 PID: 188225 at ./include/net/page_pool/helpers.h:297 mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core]
  Modules linked in: [...]
  CPU: 0 UID: 0 PID: 188225 Comm: ip Not tainted 6.18.0-rc7_for_upstream_min_debug_2025_12_08_11_44 #1 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  RIP: 0010:mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core]
  [...]
  Call Trace:
   <TASK>
   mlx5e_free_rx_mpwqe+0x20a/0x250 [mlx5_core]
   mlx5e_dealloc_rx_mpwqe+0x37/0xb0 [mlx5_core]
   mlx5e_free_rx_descs+0x11a/0x170 [mlx5_core]
   mlx5e_close_rq+0x78/0xa0 [mlx5_core]
   mlx5e_close_queues+0x46/0x2a0 [mlx5_core]
   mlx5e_close_channel+0x24/0x90 [mlx5_core]
   mlx5e_close_channels+0x5d/0xf0 [mlx5_core]
   mlx5e_safe_switch_params+0x2ec/0x380 [mlx5_core]
   mlx5e_change_mtu+0x11d/0x490 [mlx5_core]
   mlx5e_change_nic_mtu+0x19/0x30 [mlx5_core]
   netif_set_mtu_ext+0xfc/0x240
   do_setlink.isra.0+0x226/0x1100
   rtnl_newlink+0x7a9/0xba0
   rtnetlink_rcv_msg+0x220/0x3c0
   netlink_rcv_skb+0x4b/0xf0
   netlink_unicast+0x255/0x380
   netlink_sendmsg+0x1f3/0x420
   __sock_sendmsg+0x38/0x60
   ____sys_sendmsg+0x1e8/0x240
   ___sys_sendmsg+0x7c/0xb0
   [...]
   __sys_sendmsg+0x5f/0xb0
   do_syscall_64+0x55/0xc70

The problem applies to XDP_PASS as well, which is handled in a different
code path in the driver.

This patch fixes the issue by doing page frag counting on all the
original XDP buffer fragments for all relevant XDP actions (XDP_TX,
XDP_REDIRECT and XDP_PASS). This essentially reverts to the original
counting behavior from before the commit in the Fixes tag.

As frag_page still points to the original tail, the nr_frags parameter
to xdp_update_skb_frags_info() needs to be calculated differently to
reflect the new fragment count.

Fixes: 87bcef158ac1 ("net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for striding RQ")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index efcfcddab376..40e53a612989 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1957,14 +1957,13 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 	if (prog) {
 		u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+		u8 new_nr_frags;
 		u32 len;
 
 		if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 				struct mlx5e_frag_page *pfp;
 
-				frag_page -= old_nr_frags - sinfo->nr_frags;
-
 				for (pfp = head_page; pfp < frag_page; pfp++)
 					pfp->frags++;
 
@@ -1975,13 +1974,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		nr_frags_free = old_nr_frags - sinfo->nr_frags;
-		if (unlikely(nr_frags_free)) {
-			frag_page -= nr_frags_free;
+		new_nr_frags = sinfo->nr_frags;
+		nr_frags_free = old_nr_frags - new_nr_frags;
+		if (unlikely(nr_frags_free))
 			truesize -= (nr_frags_free - 1) * PAGE_SIZE +
 				ALIGN(pg_consumed_bytes,
 				      BIT(rq->mpwqe.log_stride_sz));
-		}
 
 		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
 
@@ -2003,7 +2001,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 			struct mlx5e_frag_page *pagep;
 
 			/* sinfo->nr_frags is reset by build_skb, calculate again. */
-			xdp_update_skb_frags_info(skb, frag_page - head_page,
+			xdp_update_skb_frags_info(skb, new_nr_frags,
 						  sinfo->xdp_frags_size,
 						  truesize,
 						  xdp_buff_get_skb_flags(&mxbuf->xdp));
-- 
2.44.0
Re: [PATCH net 4/5] net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ: manual merge
Posted by Matthieu Baerts 1 month ago
Hi Tariq, Dragos,

+cc: linux-next

On 05/03/2026 15:26, Tariq Toukan wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> 
> XDP multi-buf programs can modify the layout of the XDP buffer when the
> program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The
> referenced commit in the fixes tag corrected the assumption in the mlx5
> driver that the XDP buffer layout doesn't change during a program
> execution. However, this fix introduced another issue: the dropped
> fragments still need to be counted on the driver side to avoid page
> fragment reference counting issues.

FYI, we got a small conflict when merging 'net' in 'net-next' in the
MPTCP tree due to this patch applied in 'net':

  db25c42c2e1f ("net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ")

and this one from 'net-next':

  dff1c3164a69 ("net/mlx5e: SHAMPO, Always calculate page size")

----- Generic Message -----
The best is to avoid conflicts between 'net' and 'net-next' trees but if
they cannot be avoided when preparing patches, a note about how to fix
them is much appreciated.

The conflict has been resolved on our side [1] and the resolution we
suggest is attached to this email. Please report any issues linked to
this conflict resolution as it might be used by others. If you worked on
the mentioned patches, don't hesitate to ACK this conflict resolution.
---------------------------

Rerere cache is available in [2].

[1] https://github.com/multipath-tcp/mptcp_net-next/commit/9cbb5f8a4a18
[2] https://github.com/multipath-tcp/mptcp-upstream-rr-cache/commit/0bbafdd

(...)

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index efcfcddab376..40e53a612989 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

(...)

> @@ -1975,13 +1974,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>  			return NULL; /* page/packet was consumed by XDP */
>  		}
>  
> -		nr_frags_free = old_nr_frags - sinfo->nr_frags;
> -		if (unlikely(nr_frags_free)) {
> -			frag_page -= nr_frags_free;
> +		new_nr_frags = sinfo->nr_frags;
> +		nr_frags_free = old_nr_frags - new_nr_frags;
> +		if (unlikely(nr_frags_free))
>  			truesize -= (nr_frags_free - 1) * PAGE_SIZE +

The conflict is in the context: this line above has been modified on
'net-next' (s/PAGE_SIZE/page_size/), while the ones above it have been
modified on 'net'.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

---
diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8fb57a4f36dd,268e20884757..f5c0e2a0ada9
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@@ -1977,13 -1971,12 +1973,12 @@@ mlx5e_skb_from_cqe_mpwrq_nonlinear(stru
  			return NULL; /* page/packet was consumed by XDP */
  		}
  
- 		nr_frags_free = old_nr_frags - sinfo->nr_frags;
- 		if (unlikely(nr_frags_free)) {
- 			frag_page -= nr_frags_free;
+ 		new_nr_frags = sinfo->nr_frags;
+ 		nr_frags_free = old_nr_frags - new_nr_frags;
+ 		if (unlikely(nr_frags_free))
 -			truesize -= (nr_frags_free - 1) * PAGE_SIZE +
 +			truesize -= (nr_frags_free - 1) * page_size +
  				ALIGN(pg_consumed_bytes,
  				      BIT(rq->mpwqe.log_stride_sz));
- 		}
  
  		len = mxbuf->xdp.data_end - mxbuf->xdp.data;
Re: [PATCH net 4/5] net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ: manual merge
Posted by Dragos Tatulea 1 month ago
Hi Matthieu,

On 09.03.26 13:52, Matthieu Baerts wrote:
> Hi Tariq, Dragos,
> 
> +cc: linux-next
> 
> On 05/03/2026 15:26, Tariq Toukan wrote:
>> From: Dragos Tatulea <dtatulea@nvidia.com>
>>
>> XDP multi-buf programs can modify the layout of the XDP buffer when the
>> program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The
>> referenced commit in the fixes tag corrected the assumption in the mlx5
>> driver that the XDP buffer layout doesn't change during a program
>> execution. However, this fix introduced another issue: the dropped
>> fragments still need to be counted on the driver side to avoid page
>> fragment reference counting issues.
> 
> FYI, we got a small conflict when merging 'net' in 'net-next' in the
> MPTCP tree due to this patch applied in 'net':
> 
>   db25c42c2e1f ("net/mlx5e: RX, Fix XDP multi-buf frag counting for striding RQ")
> 
> and this one from 'net-next':
> 
>   dff1c3164a69 ("net/mlx5e: SHAMPO, Always calculate page size")
> 
> ----- Generic Message -----
> The best is to avoid conflicts between 'net' and 'net-next' trees but if
> they cannot be avoided when preparing patches, a note about how to fix
> them is much appreciated.
> 
Apologies for this. Will take note next time.

> The conflict has been resolved on our side [1] and the resolution we
> suggest is attached to this email. Please report any issues linked to
> this conflict resolution as it might be used by others. If you worked on
> the mentioned patches, don't hesitate to ACK this conflict resolution.
> ---------------------------
> 
> Rerere cache is available in [2].
> 
> [1] https://github.com/multipath-tcp/mptcp_net-next/commit/9cbb5f8a4a18
Conflict resolution from [1] seems good.

Acked-by: Dragos Tatulea <dtatulea@nvidia.com>

Thanks,
Dragos