After resume, dev_watchdog() reports:
"NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
The triggering scenario is as follows:
The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
because last_segment is not cleared in stmmac_free_tx_buffer after
resume, restarting TSO transmission may incorrectly reuse
tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
When the tx queue timed out, the emac TX descriptors were as follows:
eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
bit 28 (last descriptor) are both set, but they must never both be 1
for a TSO header descriptor. Since descriptor 222 is the actual last
descriptor, failing to mark it as the last one causes the EMAC DMA to
stop and hang.
To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
tx_q->tx_skbuff_dma[i].last_segment = false;
In stmmac_tso_xmit, set last_segment to false explicitly instead of
relying on the stale value: tx_q->tx_skbuff_dma[first_entry].last_segment = false;
This will prevent similar issues from occurring in the future.
Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b3730312aeed..d786ac3c78f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1653,6 +1653,7 @@ static void stmmac_free_tx_buffer(struct stmmac_priv *priv,
tx_q->tx_skbuff_dma[i].buf = 0;
tx_q->tx_skbuff_dma[i].map_as_page = false;
+ tx_q->tx_skbuff_dma[i].last_segment = false;
}
/**
@@ -4448,6 +4449,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
+ tx_q->tx_skbuff_dma[first_entry].last_segment = false;
stmmac_set_desc_addr(priv, first, des);
stmmac_tso_allocator(priv, des + proto_hdr_len, pay_len,
(nfrags == 0), queue);
--
2.34.1
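A quick way to sanity-check the descriptor analysis above: the FD/LD
bits can be decoded from the dumped tdes3 word with a few lines of C.
This is a minimal stand-alone sketch; the bit positions (29 for FD, 28
for LD) are taken from the commit message itself, not from independent
documentation.

#include <stdio.h>
#include <stdint.h>

#define TDES3_FD	(1u << 29)	/* first descriptor */
#define TDES3_LD	(1u << 28)	/* last descriptor */

int main(void)
{
	uint32_t tdes3 = 0xb04416a0;	/* descriptor 221 from the dump */

	/* Prints FD=1 LD=1: both bits set on the TSO header descriptor,
	 * which is the invalid combination the commit message describes.
	 */
	printf("FD=%d LD=%d\n", !!(tdes3 & TDES3_FD), !!(tdes3 & TDES3_LD));
	return 0;
}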
On Fri, 9 Jan 2026 15:02:11 +0800 Tao Wang wrote:
> After resume, dev_watchdog() reports:
> "NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
>
> The triggering scenario is as follows:
>
> The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
>
> because last_segment is not cleared in stmmac_free_tx_buffer after
>
> resume, restarting TSO transmission may incorrectly reuse
>
> tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
>
> When the tx queue timed out, the emac TX descriptors were as follows:
> eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
> eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
>
> Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
>
> In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
>
> bit 28 (last descriptor) are both set, but they must never both be 1
>
> for a TSO header descriptor. Since descriptor 222 is the actual last
>
> descriptor, failing to mark it as the last one causes the EMAC DMA to
>
> stop and hang.

For some reason the reposted version of the patch has unnecessary empty
lines separating each line of this paragraph.

> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false;
> In stmmac_tso_xmit, set last_segment to false explicitly instead of
> relying on the stale value: tx_q->tx_skbuff_dma[first_entry].last_segment = false;
> This will prevent similar issues from occurring in the future.

Please add a suitable Fixes tag, pointing at the commit which
introduced this incorrect behavior (either the commit which broke it or
the commit which added this code if it was always broken).
--
pw-bot: cr
After resume, dev_watchdog() reports:
"NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
The triggering scenario is as follows:
The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
because last_segment is not cleared in stmmac_free_tx_buffer after
resume, restarting TSO transmission may incorrectly reuse
tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
When the tx queue timed out, the emac TX descriptors were as follows:
eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
bit 28 (last descriptor) are both set, but they must never both be 1
for a TSO header descriptor. Since descriptor 222 is the actual last
descriptor, failing to mark it as the last one causes the EMAC DMA to
stop and hang.
To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
last_segment to false explicitly instead of relying on the stale value.
This will prevent similar issues from occurring in the future.
Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
changelog:
v1 -> v2:
- Modify commit message, del empty line, add fixed commit
information.
Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b3730312aeed..d786ac3c78f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1653,6 +1653,7 @@ static void stmmac_free_tx_buffer(struct stmmac_priv *priv,
tx_q->tx_skbuff_dma[i].buf = 0;
tx_q->tx_skbuff_dma[i].map_as_page = false;
+ tx_q->tx_skbuff_dma[i].last_segment = false;
}
/**
@@ -4448,6 +4449,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
+ tx_q->tx_skbuff_dma[first_entry].last_segment = false;
stmmac_set_desc_addr(priv, first, des);
stmmac_tso_allocator(priv, des + proto_hdr_len, pay_len,
(nfrags == 0), queue);
--
2.34.1
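For context, the tx_skbuff_dma entries that both hunks touch are
per-descriptor TX bookkeeping records. The sketch below is
reconstructed from the fields used throughout this thread; the driver
names this struct stmmac_tx_info in stmmac.h, but the exact member
order here is an assumption, and the types come from kernel headers.

struct stmmac_tx_info {
	dma_addr_t buf;		/* DMA address of the mapped buffer */
	bool map_as_page;	/* unmap with dma_unmap_page()? */
	unsigned len;		/* mapped length */
	bool last_segment;	/* this descriptor carries the LD bit */
	bool is_jumbo;		/* jumbo handling in stmmac_clean_desc3() */
	enum stmmac_txbuf_type buf_type;	/* SKB/XDP_TX/XDP_NDO/XSK_TX */
	struct xsk_tx_metadata_compl xsk_meta;	/* XSK completion metadata */
};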
On Wed, 14 Jan 2026 19:00:31 +0800 Tao Wang wrote:
> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> last_segment to false explicitly instead of relying on the stale value.
> This will prevent similar issues from occurring in the future.
>
> Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
>
> changelog:
> v1 -> v2:
> - Modify commit message, del empty line, add fixed commit
> information.
>
> Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
When you repost to address Russell's feedback in the commit
message please:
- follow the recommended format (changelog placement and no empty
lines between Fixes and SoB):
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#changes-requested
- do not send new version in reply to the old one, start a new thread
--
pw-bot: cr
> > To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> > tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> > last_segment to false explicitly instead of relying on the stale value.
> > This will prevent similar issues from occurring in the future.
> >
> > Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
> >
> > changelog:
> > v1 -> v2:
> > - Modify commit message, del empty line, add fixed commit
> > information.
> >
> > Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
>
> When you repost to address Russell's feedback in the commit
> message please:
> - follow the recommended format (changelog placement and no empty
> lines between Fixes and SoB):
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#changes-requested
> - do not send new version in reply to the old one, start a new thread
Understood, I will correct the commit message format and post the next
version of the patch as a new thread.
Thanks
Tao Wang
On Wed, Jan 14, 2026 at 07:00:31PM +0800, Tao Wang wrote:
> After resume, dev_watchdog() reports:
> "NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
>
> The triggering scenario is as follows:
> The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
> because last_segment is not cleared in stmmac_free_tx_buffer after
> resume, restarting TSO transmission may incorrectly reuse
> tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
>
> When the tx queue timed out, the emac TX descriptors were as follows:
> eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
> eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
>
> Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
> In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
> bit 28 (last descriptor) are both set, but they must never both be 1
> for a TSO header descriptor. Since descriptor 222 is the actual last
> descriptor, failing to mark it as the last one causes the EMAC DMA to
> stop and hang.
>
> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> last_segment to false explicitly instead of relying on the stale value.
> This will prevent similar issues from occurring in the future.

While I agree with the change for stmmac_tso_xmit(), please explain why
the change in stmmac_free_tx_buffer() is necessary.

It seems to me that if this is missing in stmmac_free_tx_buffer(), the
driver should have more problems than just TSO.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> While I agree with the change for stmmac_tso_xmit(), please explain why
> the change in stmmac_free_tx_buffer() is necessary.
>
> It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> driver should have more problems than just TSO.

The change in stmmac_free_tx_buffer() is intended to be generic for all
users of last_segment, not only for the TSO path. So far, I have not
observed any issues with stmmac_xmit() or stmmac_xdp_xmit_xdpf(), but
this change ensures consistent and correct handling of last_segment
across all relevant transmit paths.

Thanks
Tao Wang
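To make the failure mode concrete: the point of contention is that the
free path clears only some fields of a ring entry, so a flag set by an
earlier packet can leak into the next one. A stand-alone model of that
in plain C (not driver code; the names are illustrative only):

#include <stdbool.h>
#include <stdio.h>

struct entry {
	bool map_as_page;
	bool last_segment;
};

/* Mirrors stmmac_free_tx_buffer before the fix: map_as_page is reset,
 * last_segment is not.
 */
static void free_tx_buffer(struct entry *e)
{
	e->map_as_page = false;
}

int main(void)
{
	struct entry ring[4] = { 0 };

	ring[0].last_segment = true;	/* set by a previous TSO transmit */
	free_tx_buffer(&ring[0]);	/* ring torn down, e.g. for resume */

	/* Prints 1: the new packet's first descriptor slot still claims
	 * to be a last segment - the stale state the patch clears.
	 */
	printf("first_entry.last_segment=%d\n", ring[0].last_segment);
	return 0;
}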
On Thu, Jan 15, 2026 at 03:08:53PM +0800, Tao Wang wrote:
> > While I agree with the change for stmmac_tso_xmit(), please explain why
> > the change in stmmac_free_tx_buffer() is necessary.
> >
> > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > driver should have more problems than just TSO.
>
> The change in stmmac_free_tx_buffer() is intended to be generic for all
> users of last_segment, not only for the TSO path.
However, transmit is a hotpath, so work needs to be minimised for good
performance. We don't want anything that is unnecessary in these paths.
If we always explicitly set .last_segment when adding any packet to the
ring, then there is absolutely no need to also do so when freeing them.
Also, I think there's a similar issue with .is_jumbo.
So, I think it would make more sense to have some helpers for setting
up the tx_skbuff_dma entry. Maybe something like the below? I'll see
if I can measure the performance impact of this later today, but I
can't guarantee I'll get to that.
The idea here is to ensure that all members with the exception of
xsk_meta are fully initialised when an entry is populated.
I haven't removed anything in the tx_q->tx_skbuff_dma entry release
path yet, but with this in place, we should be able to eliminate the
clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
Note that the driver assumes setting .buf to zero means the entry is
cleared. dma_addr_t is a cookie which is device specific, and zero
may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
value can be assumed to hold any meaning in driver code. So that needs
fixing as well.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index a8a78fe7d01f..0e605d0f6a94 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1874,6 +1874,34 @@ static int init_dma_rx_desc_rings(struct net_device *dev,
return ret;
}
+static void stmmac_set_tx_dma_entry(struct stmmac_tx_queue *tx_q,
+ unsigned int entry,
+ enum stmmac_txbuf_type type,
+ dma_addr_t addr, size_t len,
+ bool map_as_page)
+{
+ tx_q->tx_skbuff_dma[entry].buf = addr;
+ tx_q->tx_skbuff_dma[entry].len = len;
+ tx_q->tx_skbuff_dma[entry].buf_type = type;
+ tx_q->tx_skbuff_dma[entry].map_as_page = map_as_page;
+ tx_q->tx_skbuff_dma[entry].last_segment = false;
+ tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+}
+
+static void stmmac_set_tx_skb_dma_entry(struct stmmac_tx_queue *tx_q,
+ unsigned int entry, dma_addr_t addr,
+ size_t len, bool map_as_page)
+{
+ stmmac_set_tx_dma_entry(tx_q, entry, STMMAC_TXBUF_T_SKB, addr, len,
+ map_as_page);
+}
+
+static void stmmac_set_tx_dma_last_segment(struct stmmac_tx_queue *tx_q,
+ unsigned int entry)
+{
+ tx_q->tx_skbuff_dma[entry].last_segment = true;
+}
+
/**
* __init_dma_tx_desc_rings - init the TX descriptor ring (per queue)
* @priv: driver private structure
@@ -1919,11 +1947,8 @@ static int __init_dma_tx_desc_rings(struct stmmac_priv *priv,
p = tx_q->dma_tx + i;
stmmac_clear_desc(priv, p);
+ stmmac_set_tx_skb_dma_entry(tx_q, i, 0, 0, false);
- tx_q->tx_skbuff_dma[i].buf = 0;
- tx_q->tx_skbuff_dma[i].map_as_page = false;
- tx_q->tx_skbuff_dma[i].len = 0;
- tx_q->tx_skbuff_dma[i].last_segment = false;
tx_q->tx_skbuff[i] = NULL;
}
@@ -2649,19 +2674,15 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
-
/* To return XDP buffer to XSK pool, we simple call
* xsk_tx_completed(), so we don't need to fill up
* 'buf' and 'xdpf'.
*/
- tx_q->tx_skbuff_dma[entry].buf = 0;
- tx_q->xdpf[entry] = NULL;
+ stmmac_set_tx_dma_entry(tx_q, entry, STMMAC_TXBUF_T_XSK_TX,
+ 0, xdp_desc.len, false);
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
- tx_q->tx_skbuff_dma[entry].map_as_page = false;
- tx_q->tx_skbuff_dma[entry].len = xdp_desc.len;
- tx_q->tx_skbuff_dma[entry].last_segment = true;
- tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+ tx_q->xdpf[entry] = NULL;
stmmac_set_desc_addr(priv, tx_desc, dma_addr);
@@ -2836,6 +2857,9 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
tx_q->tx_skbuff_dma[entry].map_as_page = false;
}
+ /* This looks at tx_q->tx_skbuff_dma[tx_q->dirty_tx].is_jumbo
+ * and tx_q->tx_skbuff_dma[tx_q->dirty_tx].last_segment
+ */
stmmac_clean_desc3(priv, tx_q, p);
tx_q->tx_skbuff_dma[entry].last_segment = false;
@@ -4494,10 +4518,8 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
* this DMA buffer right after the DMA engine completely finishes the
* full buffer transmission.
*/
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_headlen(skb);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = false;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
+ stmmac_set_tx_skb_dma_entry(tx_q, tx_q->cur_tx, des, skb_headlen(skb),
+ false);
/* Prepare fragments */
for (i = 0; i < nfrags; i++) {
@@ -4512,17 +4534,14 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
stmmac_tso_allocator(priv, des, skb_frag_size(frag),
(i == nfrags - 1), queue);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_frag_size(frag);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = true;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
+ stmmac_set_tx_skb_dma_entry(tx_q, tx_q->cur_tx, des,
+ skb_frag_size(frag), true);
}
- tx_q->tx_skbuff_dma[tx_q->cur_tx].last_segment = true;
+ stmmac_set_tx_dma_last_segment(tx_q, tx_q->cur_tx);
/* Only the last descriptor gets to point to the skb. */
tx_q->tx_skbuff[tx_q->cur_tx] = skb;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
/* Manage tx mitigation */
tx_packets = (tx_q->cur_tx + 1) - first_tx;
@@ -4774,23 +4793,18 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err; /* should reuse desc w/o issues */
- tx_q->tx_skbuff_dma[entry].buf = des;
-
+ stmmac_set_tx_skb_dma_entry(tx_q, entry, des, len, true);
stmmac_set_desc_addr(priv, desc, des);
- tx_q->tx_skbuff_dma[entry].map_as_page = true;
- tx_q->tx_skbuff_dma[entry].len = len;
- tx_q->tx_skbuff_dma[entry].last_segment = last_segment;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_SKB;
-
/* Prepare the descriptor and set the own bit too */
stmmac_prepare_tx_desc(priv, desc, 0, len, csum_insertion,
priv->mode, 1, last_segment, skb->len);
}
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
+
/* Only the last descriptor gets to point to the skb. */
tx_q->tx_skbuff[entry] = skb;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_SKB;
/* According to the coalesce parameter the IC bit for the latest
* segment is reset and the timer re-started to clean the tx status.
@@ -4869,14 +4883,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
- tx_q->tx_skbuff_dma[first_entry].buf = des;
- tx_q->tx_skbuff_dma[first_entry].buf_type = STMMAC_TXBUF_T_SKB;
- tx_q->tx_skbuff_dma[first_entry].map_as_page = false;
+ stmmac_set_tx_skb_dma_entry(tx_q, first_entry, des, nopaged_len,
+ false);
stmmac_set_desc_addr(priv, first, des);
- tx_q->tx_skbuff_dma[first_entry].len = nopaged_len;
- tx_q->tx_skbuff_dma[first_entry].last_segment = last_segment;
+ if (last_segment)
+ stmmac_set_tx_dma_last_segment(tx_q, first_entry);
if (unlikely((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
priv->hwts_tx_en)) {
@@ -5064,6 +5077,7 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
bool csum = !priv->plat->tx_queues_cfg[queue].coe_unsupported;
unsigned int entry = tx_q->cur_tx;
+ enum stmmac_txbuf_type buf_type;
struct dma_desc *tx_desc;
dma_addr_t dma_addr;
bool set_ic;
@@ -5091,7 +5105,7 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
if (dma_mapping_error(priv->device, dma_addr))
return STMMAC_XDP_CONSUMED;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XDP_NDO;
+ buf_type = STMMAC_TXBUF_T_XDP_NDO;
} else {
struct page *page = virt_to_page(xdpf->data);
@@ -5100,14 +5114,12 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
dma_sync_single_for_device(priv->device, dma_addr,
xdpf->len, DMA_BIDIRECTIONAL);
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XDP_TX;
+ buf_type = STMMAC_TXBUF_T_XDP_TX;
}
- tx_q->tx_skbuff_dma[entry].buf = dma_addr;
- tx_q->tx_skbuff_dma[entry].map_as_page = false;
- tx_q->tx_skbuff_dma[entry].len = xdpf->len;
- tx_q->tx_skbuff_dma[entry].last_segment = true;
- tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+ stmmac_set_tx_dma_entry(tx_q, entry, buf_type, dma_addr, xdpf->len,
+ false);
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
tx_q->xdpf[entry] = xdpf;
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
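On the final point about .buf == 0: a sketch of what using
DMA_MAPPING_ERROR as the "entry unused" sentinel could look like. The
helper names here are invented for illustration and are not part of
the diff above; the firm facts are only that zero can be a valid DMA
cookie while DMA_MAPPING_ERROR is guaranteed invalid.

#include <linux/dma-mapping.h>

/* Hypothetical helpers: mark an entry unused with DMA_MAPPING_ERROR
 * instead of 0, since 0 may be a perfectly valid cookie on some
 * platforms. struct stmmac_tx_info is the entry type used for
 * tx_q->tx_skbuff_dma[] in this driver.
 */
static inline void stmmac_tx_entry_clear(struct stmmac_tx_info *info)
{
	info->buf = DMA_MAPPING_ERROR;
}

static inline bool stmmac_tx_entry_mapped(const struct stmmac_tx_info *info)
{
	return info->buf != DMA_MAPPING_ERROR;
}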
> > > While I agree with the change for stmmac_tso_xmit(), please explain why
> > > the change in stmmac_free_tx_buffer() is necessary.
> > >
> > > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > > driver should have more problems than just TSO.
> >
> > The change in stmmac_free_tx_buffer() is intended to be generic for all
> > users of last_segment, not only for the TSO path.
>
> However, transmit is a hotpath, so work needs to be minimised for good
> performance. We don't want anything that is unnecessary in these paths.
>
> If we always explicitly set .last_segment when adding any packet to the
> ring, then there is absolutely no need to also do so when freeing them.
>
> Also, I think there's a similar issue with .is_jumbo.
>
> So, I think it would make more sense to have some helpers for setting
> up the tx_skbuff_dma entry. Maybe something like the below? I'll see
> if I can measure the performance impact of this later today, but I
> can't guarantee I'll get to that.
>
> The idea here is to ensure that all members with the exception of
> xsk_meta are fully initialised when an entry is populated.
>
> I haven't removed anything in the tx_q->tx_skbuff_dma entry release
> path yet, but with this in place, we should be able to eliminate the
> clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
>
> Note that the driver assumes setting .buf to zero means the entry is
> cleared. dma_addr_t is a cookie which is device specific, and zero
> may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
> value can be assumed to hold any meaning in driver code. So that needs
> fixing as well.
>
> [... proposed diff snipped; quoted in full in the previous message ...]
Since the changes are relatively large, I suggest splitting them into a
separate optimization patch. As I cannot validate the is_jumbo scenario,
I have dropped the changes to stmmac_free_tx_buffer. I will submit a
separate patch focusing only on fixing the TSO case.
On Thu, Jan 15, 2026 at 12:09:18PM +0000, Russell King (Oracle) wrote:
> On Thu, Jan 15, 2026 at 03:08:53PM +0800, Tao Wang wrote:
> > > While I agree with the change for stmmac_tso_xmit(), please explain why
> > > the change in stmmac_free_tx_buffer() is necessary.
> > >
> > > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > > driver should have more problems than just TSO.
> >
> > The change in stmmac_free_tx_buffer() is intended to be generic for all
> > users of last_segment, not only for the TSO path.
>
> However, transmit is a hotpath, so work needs to be minimised for good
> performance. We don't want anything that is unnecessary in these paths.
>
> If we always explicitly set .last_segment when adding any packet to the
> ring, then there is absolutely no need to also do so when freeing them.
>
> Also, I think there's a similar issue with .is_jumbo.
>
> So, I think it would make more sense to have some helpers for setting
> up the tx_skbuff_dma entry. Maybe something like the below? I'll see
> if I can measure the performance impact of this later today, but I
> can't guarantee I'll get to that.
>
> The idea here is to ensure that all members with the exception of
> xsk_meta are fully initialised when an entry is populated.
>
> I haven't removed anything in the tx_q->tx_skbuff_dma entry release
> path yet, but with this in place, we should be able to eliminate the
> clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
>
> Note that the driver assumes setting .buf to zero means the entry is
> cleared. dma_addr_t is a cookie which is device specific, and zero
> may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
> value can be assumed to hold any meaning in driver code. So that needs
> fixing as well.

I've just run iperf3 in both directions with the kernel I had on the
board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
particularly great - by that I mean, iperf3 *failed* spectacularly.

First, running in normal mode (stmmac transmitting, x86 receiving)
it's only capable of 210Mbps, which is nowhere near line rate.

However, when running iperf3 in reverse mode, it filled the stmmac's
receive queue, which then started spewing PAUSE frames at a rate of
knots, flooding the network, and causing the entire network to stop.
It never recovered without rebooting.

Trying again on 6.19.0-rc4-net-next+, stmmac transmitting shows the
same dire performance:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  24.2 MBytes   203 Mbits/sec    0    230 KBytes
[  5]   1.00-2.00   sec  25.5 MBytes   214 Mbits/sec    0    230 KBytes
[  5]   2.00-3.00   sec  25.0 MBytes   210 Mbits/sec    0    230 KBytes
[  5]   3.00-4.00   sec  25.5 MBytes   214 Mbits/sec    0    230 KBytes
[  5]   4.00-5.00   sec  25.1 MBytes   211 Mbits/sec    0    230 KBytes
[  5]   5.00-6.00   sec  25.1 MBytes   211 Mbits/sec    0    230 KBytes
[  5]   6.00-7.00   sec  25.7 MBytes   215 Mbits/sec    0    230 KBytes
[  5]   7.00-8.00   sec  25.2 MBytes   212 Mbits/sec    0    230 KBytes
[  5]   8.00-9.00   sec  25.3 MBytes   212 Mbits/sec    0    346 KBytes
[  5]   9.00-10.00  sec  25.4 MBytes   213 Mbits/sec    0    346 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   252 MBytes   211 Mbits/sec    0   sender
[  5]   0.00-10.02  sec   250 MBytes   210 Mbits/sec        receiver

stmmac receiving shows the same problem:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  64.1 MBytes   537 Mbits/sec
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   2.00-3.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   3.00-4.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   4.00-5.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   5.00-6.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   6.00-7.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   7.00-8.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   8.00-9.00   sec  0.00 Bytes    0.00 bits/sec
^C[  5]   9.00-9.43   sec  0.00 Bytes    0.00 bits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-9.43   sec  0.00 Bytes    0.00 bits/sec    sender
[  5]   0.00-9.43   sec  64.1 MBytes   57.0 Mbits/sec   receiver
iperf3: interrupt - the client has terminated

and it's now spewing PAUSE frames again.

The RXQ 0 debug register shows:

Value at address 0x02490d38: 0x002b0020

bits 29:16 (PRXQ = 43) is the number of packets in the RX queue
bits 5:4 (RXQSTS = 10) shows that the internal RX queue is above the
flow control activate threshold.

The RXQ 0 operating mode register shows:

Value at address 0x02490d30: 0x0ff1c4e0

bits 29:20 (RQS = 255) indicates that the receive queue size is
(255 + 1) * 256 = 65536 bytes (which is what hw feature 1 reports)
bits 16:14 (RFD = 7) indicates the threshold for deactivating flow control
bits 10:8 (RFA = 4) indicates the threshold for activating flow control

Disabling EHFC (bit 7, enable hardware flow control) stops the flood.

Looking at the receive descriptor ring, all the entries are marked with
RDES3_OWN | RDES3_BUFFER1_VALID_ADDR - so there are free ring entries,
but the hardware is not transferring the queued packets.

Looking at the channel 0 status register, it's indicating RBU (receive
buffer unavailable.)

This gets more weird.

Channel 0 Rx descriptor tail pointer register:
Value at address 0x02491128: 0xffffee30

Channel 0 current application receive descriptor register:
Value at address 0x0249114c: 0xffffee30

Receive queue descriptor:
227 [0x0000007fffffee30]: 0xfee00040 0x7f 0x0 0x81000000

I've tried writing to the tail pointer register (both the current value
and the next descriptor value), this doesn't seem to change anything.
I've tried clearing SR in DMA_CHAN_RX_CONTROL() and setting it, again
no change.

So, it looks like the receive hardware has permanently stalled, needing
at minimum a soft reset of the entire stmmac core to recover it.

I think I'm going to have to declare stmmac receive on dwmac4 to be
buggy at the moment, as I can't get to the bottom of what's causing
this.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
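The register decoding above is mechanical enough to script. Below is a
small stand-alone C sketch that reproduces it, using only the field
positions quoted in the email (RQS bits 29:20, RFD 16:14, RFA 10:8,
EHFC bit 7; PRXQ bits 29:16, RXQSTS bits 5:4); treat those positions as
assumptions from the message rather than databook-verified facts.

#include <stdio.h>
#include <stdint.h>

/* Extract bits hi..lo from a 32-bit register value */
static uint32_t field(uint32_t v, int hi, int lo)
{
	return (v >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
	uint32_t rxq_op_mode = 0x0ff1c4e0;	/* RXQ0 operating mode */
	uint32_t rxq_debug = 0x002b0020;	/* RXQ0 debug */

	printf("RQS=%u -> queue size %u bytes\n",
	       field(rxq_op_mode, 29, 20),
	       (field(rxq_op_mode, 29, 20) + 1) * 256);
	printf("RFD=%u RFA=%u EHFC=%u\n",
	       field(rxq_op_mode, 16, 14), field(rxq_op_mode, 10, 8),
	       field(rxq_op_mode, 7, 7));
	printf("PRXQ=%u RXQSTS=%u\n",
	       field(rxq_debug, 29, 16), field(rxq_debug, 5, 4));
	return 0;
}

The output matches the analysis above: RQS=255 (65536 bytes), EHFC=1,
43 packets queued, and RXQSTS=2 (binary 10, above the flow-control
activate threshold).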
Hi,

> I've just run iperf3 in both directions with the kernel I had on the
> board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
> particularly great - by that I mean, iperf3 *failed* spectacularly.
>
> First, running in normal mode (stmmac transmitting, x86 receiving)
> it's only capable of 210Mbps, which is nowhere near line rate.
>
> However, when running iperf3 in reverse mode, it filled the stmmac's
> receive queue, which then started spewing PAUSE frames at a rate of
> knots, flooding the network, and causing the entire network to stop.
> It never recovered without rebooting.

Heh, I was able to reproduce something similar on imx8mp, that has an
imx-dwmac (dwmac 4/5 according to dmesg):

DUT to x86:

Connecting to host 192.168.2.1, port 5201
[  5] local 192.168.2.13 port 54744 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  0.00 Bytes    0.00 bits/sec    2   1.41 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec    1   1.41 KBytes

x86 to DUT:

Reverse mode, remote host 192.168.2.1 is sending
[  5] local 192.168.2.13 port 47050 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   112 MBytes   935 Mbits/sec
[  5]   1.00-2.00   sec   112 MBytes   936 Mbits/sec
[  5]   2.00-3.00   sec   112 MBytes   936 Mbits/sec

Nothing as bad as what you face, but there's definitely something going
on there. The "good" news is that it worked in v6.19-rc1, I have a
bisect ongoing.

I'll update once I have homed-in on something.

Maxime
Hi again,
On 15/01/2026 22:04, Maxime Chevallier wrote:
> Hi,
>
>>
>> I've just run iperf3 in both directions with the kernel I had on the
>> board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
>> particularly great - by that I mean, iperf3 *failed* spectacularly.
>>
>> First, running in normal mode (stmmac transmitting, x86 receiving)
>> it's only capable of 210Mbps, which is nowhere near line rate.
>>
>> However, when running iperf3 in reverse mode, it filled the stmmac's
>> receive queue, which then started spewing PAUSE frames at a rate of
>> knots, flooding the network, and causing the entire network to stop.
>> It never recovered without rebooting.
[...]
> Heh, I was able to reproduce something similar on imx8mp, that has an
> imx-dwmac (dwmac 4/5 according to dmesg) :
>
> DUT to x86
>
> Connecting to host 192.168.2.1, port 5201
> [ 5] local 192.168.2.13 port 54744 connected to 192.168.2.1 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 0.00 Bytes 0.00 bits/sec 2 1.41 KBytes
> [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes
>
> x86 to DUT :
>
> Reverse mode, remote host 192.168.2.1 is sending
> [ 5] local 192.168.2.13 port 47050 connected to 192.168.2.1 port 5201
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 112 MBytes 935 Mbits/sec
> [ 5] 1.00-2.00 sec 112 MBytes 936 Mbits/sec
> [ 5] 2.00-3.00 sec 112 MBytes 936 Mbits/sec
>
> Nothing as bad as what you face, but there's definitely something going
> on there. The "good" news is that it worked in v6.19-rc1, I have a
> bisect ongoing.
>
> I'll update once I have homed-in on something.
>
> Maxime
So the bisect results are in, at least for the problem I noticed. It's
not certain yet that this is the same problem as Russell's, and maybe
not the same as Tao Wang's either...
The culprit commit is :
commit 8409495bf6c907a5bc9632464dbdd8fb619f9ceb (HEAD)
Author: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Date: Thu Jan 8 17:36:40 2026 +0000
net: stmmac: cores: remove many xxx_SHIFT definitions
We have many xxx_SHIFT definitions along side their corresponding
xxx_MASK definitions for the various cores. Manually using the
shift and mask can be error prone, as shown with the dwmac4 RXFSTS
fix patch.
Convert sites that use xxx_SHIFT and xxx_MASK directly to use
FIELD_GET(), FIELD_PREP(), and u32_replace_bits() as appropriate.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vdtw8-00000002Gtu-0Hyu@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lore link :
https://lore.kernel.org/netdev/E1vdtw8-00000002Gtu-0Hyu@rmk-PC.armlinux.org.uk/
I confirm that iperf3 works perfectly in both directions before this commit,
and I get 0 bits/s when running "iperf3 -c my_host" on the DUT that has stmmac.
Looks like something happened while cleaning-up the macros for the various
definitions.
Unfortunately it's getting late here, I'm not going to dig any further
tonight :(
Thanks,
Maxime
On Thu, Jan 15, 2026 at 10:35:26PM +0100, Maxime Chevallier wrote:
> Hi again,
>
> [...]
>
> So the bisect results are in, at least for the problem I noticed. It's
> not certain yet that this is the same problem as Russell's, and maybe
> not the same as Tao Wang's either...
>
> The culprit commit is :
>
> commit 8409495bf6c907a5bc9632464dbdd8fb619f9ceb (HEAD)
> Author: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> Date:   Thu Jan 8 17:36:40 2026 +0000
>
>     net: stmmac: cores: remove many xxx_SHIFT definitions
>
> [...]
>
> I confirm that iperf3 works perfectly in both directions before this
> commit, and I get 0 bits/s when running "iperf3 -c my_host" on the DUT
> that has stmmac.
>
> Looks like something happened while cleaning-up the macros for the
> various definitions.

Thanks for finding the blame.

A few other interesting things... I have an old 6.14 kernel on the
platform, and that gives what I deem to be good transmit performance.
Receive performance is low, but it doesn't fail.

I wrote a shell script to use devmem2 to dump all the stmmac registers.
These seem more significant on the face of it... but I'm working it out
as I write this email:

-Value at address 0x02490010: 0x00010008
+Value at address 0x02490010: 0x00080008
-Value at address 0x02490014: 0x20020008
+Value at address 0x02490014: 0x20000008
-Value at address 0x02490018: 0x00000001
+Value at address 0x02490018: 0x04000001

These are GMAC_HASH_TAB()

-Value at address 0x02490060: 0x001a0000
+Value at address 0x02490060: 0x00120000

VLAN_ONCL, bit is VLAN_CSVL, changed in commit:
c657f86106c8 net: stmmac: vlan: Disable 802.1AD tag insertion offload.

-Value at address 0x024900c0: 0x01000000
+Value at address 0x024900c0: 0x05000000

GMAC_PMT - bit 26, part of the RWKPTR[4:0] bitfield, read-only.

-Value at address 0x02490d30: 0x0ff1c4a0
+Value at address 0x02490d30: 0x0ff1c4e0

MTL_CHAN_RX_OP_MODE(0) - bit 6 is different, MTL_OP_MODE_DIS_TCP_EF.
This is a change from:
fe4042797651 net: stmmac: dwmac4: stop hardware from dropping
checksum-error packets

-Value at address 0x02491104: 0x00101011
+Value at address 0x02491104: 0x00001011

DMA_CHAN_TX_CONTROL(0) - but this is significant. In
dwmac4_dma_init_tx_chan(), we have:

-	value = value | (txpbl << DMA_BUS_MODE_PBL_SHIFT);
+	value = value | FIELD_PREP(DMA_BUS_MODE_PBL, txpbl);

and the corresponding change in the header file:

 /* DMA SYS Bus Mode bitmap */
 #define DMA_BUS_MODE_SPH		BIT(24)
 #define DMA_BUS_MODE_PBL		BIT(16)
-#define DMA_BUS_MODE_PBL_SHIFT		16
-#define DMA_BUS_MODE_RPBL_SHIFT		16
+#define DMA_BUS_MODE_RPBL_MASK		GENMASK(21, 16)
 #define DMA_BUS_MODE_MB			BIT(14)
 #define DMA_BUS_MODE_FB			BIT(0)

The combination of DMA_BUS_MODE_PBL and DMA_BUS_MODE_PBL_SHIFT leads
one to believe that this is a single bit field, whereas there is
another overlapping field called RPBL that is wider. RPBL gets used
for DMA_CHAN_RX_CONTROL, whereas PBL gets used for DMA_CHAN_TX_CONTROL.

txpbl for the Jetson Xavier NX board (tegra194) is 16:

arch/arm64/boot/dts/nvidia/tegra194.dtsi:	snps,txpbl = <16>;

16 doesn't fit into a single bit. The header file was wrong. According
to non-Tegra documentation (the closest I have for dwmac4 is
stm32mp151), this field is called TXPBL[5:0] covering bits 21:16 of
this register, and is the transmit burst length.

However, while this may explain the transmit slowdown because it's on
the transmit side, it doesn't explain the receive problem.

-Value at address 0x0249113c: 0x000d07c0
+Value at address 0x0249113c: 0x000507c0

DMA_CHAN_SLOT_CTRL_STATUS(0) - bit 19, RSN[3:0] bit 3, read-only.

With the TXPBL thing fixed, for transmit I now get:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1003 MBytes   841 Mbits/sec    0   sender
[  5]   0.00-10.01  sec  1002 MBytes   839 Mbits/sec        receiver

which is way better, but receive still fails, with a storm of PAUSE,
with RBU set.

Transmit fix (eventually):
https://lore.kernel.org/r/E1vgY1k-00000003vOC-0Z1H@rmk-PC.armlinux.org.uk

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
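The TXPBL regression is easy to demonstrate in isolation. The sketch
below uses plain C reimplementations of BIT/GENMASK and the FIELD_PREP
arithmetic so it compiles outside the kernel, and shows how a one-bit
mask silently truncates txpbl = 16, matching the 0x00101011 ->
0x00001011 register diff above.

#include <stdio.h>
#include <stdint.h>

#define BIT(n)		(1u << (n))
#define GENMASK(h, l)	((~0u >> (31 - (h))) & ~((1u << (l)) - 1u))

/* Same arithmetic as the kernel's FIELD_PREP: shift the value to the
 * mask's lowest set bit, then mask. For a runtime value there is no
 * compile-time overflow check, so excess bits are silently dropped.
 */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	return (val << __builtin_ctz(mask)) & mask;
}

int main(void)
{
	uint32_t txpbl = 16;

	/* Broken: DMA_BUS_MODE_PBL is BIT(16), a one-bit mask, so the
	 * whole burst length is lost (prints 0x00000000).
	 */
	printf("BIT(16):        0x%08x\n", field_prep(BIT(16), txpbl));

	/* Correct: TXPBL occupies bits 21:16 (prints 0x00100000, the
	 * bits present in the old 0x00101011 register value).
	 */
	printf("GENMASK(21,16): 0x%08x\n", field_prep(GENMASK(21, 16), txpbl));
	return 0;
}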
On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> However, while this may explain the transmit slowdown because it's
> on the transmit side, it doesn't explain the receive problem.
I'm bisecting to find the cause of the receive issue, but it's going to
take a long time (in the mean time, I can't do any mainline work.)
So far, the range of good/bad has been narrowed down to 6.14 is good,
1b98f357dadd ("Merge tag 'net-next-6.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
14 more iterations to go. Might be complete by Sunday. (Slowness in
building the more fully featured net-next I use primarily for build
testing, the slowness of the platform to reboot, and the need to
manually test each build.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> > However, while this may explain the transmit slowdown because it's
> > on the transmit side, it doesn't explain the receive problem.
>
> I'm bisecting to find the cause of the receive issue, but it's going to
> take a long time (in the mean time, I can't do any mainline work.)
>
> So far, the range of good/bad has been narrowed down to 6.14 is good,
> 1b98f357dadd ("Merge tag 'net-next-6.16' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
>
> 14 more iterations to go. Might be complete by Sunday. (Slowness in
> building the more fully featured net-next I use primarily for build
> testing, the slowness of the platform to reboot, and the need to
> manually test each build.)
Well, that's been a waste of time today. While the next iteration was
building, because it's been suspicious that each and every bisect
point has failed so far, I decided to re-check 6.14, and that fails.
So, it looks like this problem has existed for some considerable
time. I don't have the compute power locally to bisect over a massive
range of kernels, so I'm afraid stmmac receive is going to have to
stay broken unless someone else can bisect (and find a "good" point
in the git history.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Hi,
On 16/01/2026 19:08, Russell King (Oracle) wrote:
> On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
>> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
>>> However, while this may explain the transmit slowdown because it's
>>> on the transmit side, it doesn't explain the receive problem.
>>
>> I'm bisecting to find the cause of the receive issue, but it's going to
>> take a long time (in the mean time, I can't do any mainline work.)
>>
>> So far, the range of good/bad has been narrowed down to 6.14 is good,
>> 1b98f357dadd ("Merge tag 'net-next-6.16' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
>>
>> 14 more iterations to go. Might be complete by Sunday. (Slowness in
>> building the more fully featured net-next I use primarily for build
>> testing, the slowness of the platform to reboot, and the need to
>> manually test each build.)
>
> Well, that's been a waste of time today. While the next iteration was
> building, because it's been suspicious that each and every bisect
> point has failed so far, I decided to re-check 6.14, and that fails.
> So, it looks like this problem has existed for some considerable
> time. I don't have the compute power locally to bisect over a massive
> range of kernels, so I'm afraid stmmac receive is going to have to
> stay broken unless someone else can bisect (and find a "good" point
> in the git history.)
>
To me RX looks OK, at least on the various devices I have that use
stmmac. It's fine on Cyclone V socfpga, and imx8mp. Maybe that's Jetson
specific ?
I've got pretty much line rate with a basic 'iperf3 -c XX' and the same
with 'iperf3 -c XX -R'. What commands are you running to check the issue ?
Are you still seeing the pause frames flood ?
Maxime
On Fri, Jan 16, 2026 at 07:27:16PM +0100, Maxime Chevallier wrote:
> Hi,
>
> On 16/01/2026 19:08, Russell King (Oracle) wrote:
> > On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
> >> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> >>> However, while this may explain the transmit slowdown because it's
> >>> on the transmit side, it doesn't explain the receive problem.
> >>
> >> I'm bisecting to find the cause of the receive issue, but it's going to
> >> take a long time (in the mean time, I can't do any mainline work.)
> >>
> >> So far, the range of good/bad has been narrowed down to 6.14 is good,
> >> 1b98f357dadd ("Merge tag 'net-next-6.16' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
> >>
> >> 14 more iterations to go. Might be complete by Sunday. (Slowness in
> >> building the more fully featured net-next I use primarily for build
> >> testing, the slowness of the platform to reboot, and the need to
> >> manually test each build.)
> >
> > Well, that's been a waste of time today. While the next iteration was
> > building, because it's been suspicious that each and every bisect
> > point has failed so far, I decided to re-check 6.14, and that fails.
> > So, it looks like this problem has existed for some considerable
> > time. I don't have the compute power locally to bisect over a massive
> > range of kernels, so I'm afraid stmmac receive is going to have to
> > stay broken unless someone else can bisect (and find a "good" point
> > in the git history.)
> >
>
> To me RX looks OK, at least on the various devices I have that use
> stmmac. It's fine on Cyclone V socfpga, and imx8mp. Maybe that's Jetson
> specific ?
Maybe - it could be something to do with MMUs slowing down the packet
rate, or it could be uncovering a bug in stmmac's handling of dwmac4
when it runs out of descriptors in the ring.
The problem I'm seeing is that RBU ends up being set in the channel 0
control register (there's only a single channel) which means that the
hardware moved on to the next receive descriptor, and found that it
didn't own it.
It _should_ be counted by this statistic:
rx_buf_unav_irq: 0
but clearly, this doesn't work, because here is the channel 0 status
register:
Value at address 0x02491160: 0x00000484
which has:
#define DMA_CHAN_STATUS_RBU BIT(7)
set. The documentation I have (sadly not for Xavier but for stm32mp151)
states that when this occurs, a "Receive Poll Demand" command needs to
be issued, but fails to explain how to do that. Older cores (such as
dwmac1000) had a "received poll demand" register to write to for this.
> I've got pretty much line rate with a basic 'iperf3 -c XX' and the same
> with 'iperf3 -c XX -R'. What commands are you running to check the issue ?
Merely iperf3 -R -c XX, it's enough to make it fall over normally
within the first second.
> Are you still seeing the pause frames flood ?
Yes, because the receive DMA has stopped, which makes the FIFO between
the MAC and MTL fill above the threshold for sending pause frames.
In order to stop the disruption to my network (because it basically
causes *everything* to clog up) I've had to turn off pause autoneg,
but that doesn't affect whether or not this happens.
It _may_ be worth testing whether adding a ndelay(500) into the
receive processing path, thereby making it intentionally slow,
allows you to reproduce the problem. If it does, then that confirms
that we're missing something in the dwmac4 handling for RBU.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
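The quoted status value can be checked directly: 0x00000484 has bit 7
set, which the driver names DMA_CHAN_STATUS_RBU. A trivial stand-alone
check, with the bit position taken from the email above:

#include <stdio.h>
#include <stdint.h>

#define DMA_CHAN_STATUS_RBU	(1u << 7)	/* receive buffer unavailable */

int main(void)
{
	uint32_t status = 0x00000484;	/* channel 0 status from the email */

	/* Prints RBU=1: the DMA hit a descriptor it does not own and
	 * has suspended reception, as described above.
	 */
	printf("RBU=%d\n", !!(status & DMA_CHAN_STATUS_RBU));
	return 0;
}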
On Fri, Jan 16, 2026 at 07:22:39PM +0000, Russell King (Oracle) wrote:
> Yes, because the receive DMA has stopped, which makes the FIFO between
> the MAC and MTL fill above the threshold for sending pause frames.
>
> In order to stop the disruption to my network (because it basically
> causes *everything* to clog up) I've had to turn off pause autoneg,
> but that doesn't affect whether or not this happens.
>
> It _may_ be worth testing whether adding a ndelay(500) into the
> receive processing path, thereby making it intentionally slow,
> allows you to reproduce the problem. If it does, then that confirms
> that we're missing something in the dwmac4 handling for RBU.

I notice that the iMX8MP TRM says similar about the RBU bit (see
11.7.6.1.482.3 bit 7). However, it does say that in ring mode, merely
advancing the tail pointer should be sufficient.

I can write the tail pointer register using devmem2, but the hardware
never wakes up. E.g.:

Channel 0 Current Application Receive Descriptor:
Value at address 0x0249114c: 0xfffff910

Channel 0 Rx Descriptor Tail Pointer:
Value at address 0x02491128: 0xfffff910
Value at address 0x02491128: 0xfffff910
Written 0xfffff940; readback 0xfffff940
Value at address 0x02491128: 0xfffff940
Written 0xfffff980; readback 0xfffff980

Value at address 0x0249114c: 0xfffff910

So, the hardware hasn't advanced. Here's the ring state:

                          RDES0      RDES1 RDES2 RDES3
401 [0x0000007ffffff910]: 0xffd63040 0x7f  0x0   0x81000000
402 [0x0000007ffffff920]: 0xffd64040 0x7f  0x0   0x81000000
403 [0x0000007ffffff930]: 0xffd3f040 0x7f  0x0   0x81000000
404 [0x0000007ffffff940]: 0xffeed040 0x7f  0x0   0x81000000
405 [0x0000007ffffff950]: 0xfff2f040 0x7f  0x0   0x81000000
406 [0x0000007ffffff960]: 0xffbee040 0x7f  0x0   0x81000000
407 [0x0000007ffffff970]: 0xffbef040 0x7f  0x0   0x81000000
408 [0x0000007ffffff980]: 0xffbf0040 0x7f  0x0   0x81000000

bit 31 of RDES3 is RDES3_OWN, which when set, means the dwmac core has
ownership of the buffer. Bit 24 means buffer 1 address valid (stored in
RDES0).

So, if the iMX8MP information is correct, then advancing 0x02491128 to
point at the following descriptors should "wake" the receive side, but
it does not.

Other registers:

Queue 0 Receive Debug:
Value at address 0x02490d38: 0x002a0020

bit 0 = 0 (MTL Rx Queue Write Controller Active Status not detected)
bit 2:1 = 0 (Read controller Idle state)
bits 5:4 = 2 (Rx Queue fill-level above flow-control activate threshold)
bits 29:16 = 0x2a - 42 packets in receive queue

Because the internal queue is above the flow-control activate
threshold, that causes the stmmac hardware to constantly spew pause
frames, and, as the stmmac receive side is essentially stuck and won't
make progress even when there are free buffers, the only way to release
this state is via a software reset of the entire core.

Why don't pause frames save us? Well, pause frames will only be sent
when the receive queue fills to the activate threshold, which can only
happen _after_ packets stop being transferred to the descriptor rings.
In other words, it can only happen when a RBU event has been detected,
which suspends the receiver - and it seems when that happens, it is
irrecoverable without soft-reset on Xavier.

Right now, I'm not sure what to think about this - I don't know whether
it's the hardware that's at fault, or whether there's an issue in the
driver. What I know for certain is what I've stated above, and the fact
that iperf3 -R has *extremely* detrimental effects on my *entire*
network.

The reason is... you connect two Netgear switches together, they use
flow control, and you have no way to turn that off... So, once stmmac
starts sending pause frames, the switch's queue for that port fills,
and when further frames come in for that port, the switch sends pause
frames to the next switch behind, which stops all traffic flow between
the two switches, severing the network. All the time that stmmac keeps
that up, so does the switch it is connected to.

If another machine happens to send a packet that needs to be queued on
the port that stmmac is connected to (e.g. broadcast or multicast)
then... that port starts sending pause frames back to that machine,
severing its network connection permanently while stmmac is spewing
pause frames. Thus, the entire network goes down, on account of _one_
machine repeatedly sending pause frames, preventing packet delivery.

While the idea of a lossless network _seems_ like a good idea, in
reality it gives an attacker who can get on a platform and take control
of the ethernet NIC the ability to completely screw an entire network
if flow control is enabled everywhere.

I'm thinking at this point... just say no to flow control, disable it
everywhere one can. Ethernet was designed to lose packets when it needs
to, to ensure fairness. Flow control destroys that fairness and results
in networks being severed.

"attacker" is maybe too strong - consider what happens if the kernel
crashes on a stmmac platform, so it can't receive packets anymore, and
the ring fills up, causing it to start spewing pause frames. It's
goodbye network!

I'm just rambling, but I think that point is justified.

Thoughts - should the kernel default to having flow control enabled or
disabled in light of this? Should this feature require explicit
administrative configuration given the severity of network disruption?

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Fri, 16 Jan 2026 20:57:05 +0000 Russell King (Oracle) wrote:
> Thoughts - should the kernel default to having flow control enabled or
> disabled in light of this? Should this feature require explicit
> administrative configuration given the severity of network disruption?

FWIW in DC historically we have seen a few NICs which have tiny buffers
so back-pressuring up to the top of rack switch is helpful. Switches
have more reasonable buffers. That's just NIC Tx pause, switch Rx pause
(from downlink ports, not fabric!). Letting switches generate pause is
a recipe for.. not having a network. We'd need to figure out why Netgear
does what it does in your case, IMHO.
On Sat, Jan 17, 2026 at 09:06:34AM -0800, Jakub Kicinski wrote:
> Letting switches generate pause is a recipe for.. not having a
> network. We'd need to figure out why Netgear does what it does in your
> case, IMHO.

... because they're dumb consumer switches.

Also, the correct term is "a notwork" :D

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!