After resume, dev_watchdog() reports:
"NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
The triggering scenario is as follows:
The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
because last_segment is not cleared in stmmac_free_tx_buffer after
resume, restarting TSO transmission may incorrectly reuse
tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
When the tx queue timed out, the emac TX descriptors were as follows:
eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
bit 28 (last descriptor) are both set, but they must never both be 1
for a TSO header descriptor. Since descriptor 222 is the actual last
descriptor, failing to mark it as the last one causes the EMAC DMA to
stop and hang.
To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
tx_q->tx_skbuff_dma[i].last_segment = false;
In stmmac_tso_xmit, set last_segment to false explicitly instead of
relying on the stale value: tx_q->tx_skbuff_dma[first_entry].last_segment = false;
This will prevent similar issues from occurring in the future.
Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b3730312aeed..d786ac3c78f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1653,6 +1653,7 @@ static void stmmac_free_tx_buffer(struct stmmac_priv *priv,
tx_q->tx_skbuff_dma[i].buf = 0;
tx_q->tx_skbuff_dma[i].map_as_page = false;
+ tx_q->tx_skbuff_dma[i].last_segment = false;
}
/**
@@ -4448,6 +4449,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
+ tx_q->tx_skbuff_dma[first_entry].last_segment = false;
stmmac_set_desc_addr(priv, first, des);
stmmac_tso_allocator(priv, des + proto_hdr_len, pay_len,
(nfrags == 0), queue);
--
2.34.1
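A quick way to sanity-check the descriptor analysis above: the FD/LD
bits can be decoded from the dumped tdes3 word with a few lines of C.
This is a minimal stand-alone sketch; the bit positions (29 for FD, 28
for LD) are taken from the commit message itself, not from independent
documentation.

#include <stdio.h>
#include <stdint.h>

#define TDES3_FD	(1u << 29)	/* first descriptor */
#define TDES3_LD	(1u << 28)	/* last descriptor */

int main(void)
{
	uint32_t tdes3 = 0xb04416a0;	/* descriptor 221 from the dump */

	/* Prints FD=1 LD=1: both bits set on the TSO header descriptor,
	 * which is the invalid combination the commit message describes.
	 */
	printf("FD=%d LD=%d\n", !!(tdes3 & TDES3_FD), !!(tdes3 & TDES3_LD));
	return 0;
}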
On Fri, 9 Jan 2026 15:02:11 +0800 Tao Wang wrote:
> After resume, dev_watchdog() reports:
> "NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
>
> The triggering scenario is as follows:
>
> The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
>
> because last_segment is not cleared in stmmac_free_tx_buffer after
>
> resume, restarting TSO transmission may incorrectly reuse
>
> tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
>
> When the tx queue timed out, the emac TX descriptors were as follows:
> eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
> eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
>
> Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
>
> In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
>
> bit 28 (last descriptor) are both set, but they must never both be 1
>
> for a TSO header descriptor. Since descriptor 222 is the actual last
>
> descriptor, failing to mark it as the last one causes the EMAC DMA to
>
> stop and hang.

For some reason the reposted version of the patch has unnecessary empty
lines separating each line of this paragraph.

> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false;
> In stmmac_tso_xmit, set last_segment to false explicitly instead of
> relying on the stale value: tx_q->tx_skbuff_dma[first_entry].last_segment = false;
> This will prevent similar issues from occurring in the future.

Please add a suitable Fixes tag, pointing at the commit which
introduced this incorrect behavior (either the commit which broke it or
the commit which added this code if it was always broken).
--
pw-bot: cr
After resume, dev_watchdog() reports:
"NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
The triggering scenario is as follows:
The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
because last_segment is not cleared in stmmac_free_tx_buffer after
resume, restarting TSO transmission may incorrectly reuse
tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
When the tx queue timed out, the emac TX descriptors were as follows:
eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
bit 28 (last descriptor) are both set, but they must never both be 1
for a TSO header descriptor. Since descriptor 222 is the actual last
descriptor, failing to mark it as the last one causes the EMAC DMA to
stop and hang.
To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
last_segment to false explicitly instead of relying on the stale value.
This will prevent similar issues from occurring in the future.
Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
changelog:
v1 -> v2:
- Modify commit message, del empty line, add fixed commit
information.
Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index b3730312aeed..d786ac3c78f7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1653,6 +1653,7 @@ static void stmmac_free_tx_buffer(struct stmmac_priv *priv,
tx_q->tx_skbuff_dma[i].buf = 0;
tx_q->tx_skbuff_dma[i].map_as_page = false;
+ tx_q->tx_skbuff_dma[i].last_segment = false;
}
/**
@@ -4448,6 +4449,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
+ tx_q->tx_skbuff_dma[first_entry].last_segment = false;
stmmac_set_desc_addr(priv, first, des);
stmmac_tso_allocator(priv, des + proto_hdr_len, pay_len,
(nfrags == 0), queue);
--
2.34.1
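For context, the tx_skbuff_dma entries that both hunks touch are
per-descriptor TX bookkeeping records. The sketch below is
reconstructed from the fields used throughout this thread; the driver
names this struct stmmac_tx_info in stmmac.h, but the exact member
order here is an assumption, and the types come from kernel headers.

struct stmmac_tx_info {
	dma_addr_t buf;		/* DMA address of the mapped buffer */
	bool map_as_page;	/* unmap with dma_unmap_page()? */
	unsigned len;		/* mapped length */
	bool last_segment;	/* this descriptor carries the LD bit */
	bool is_jumbo;		/* jumbo handling in stmmac_clean_desc3() */
	enum stmmac_txbuf_type buf_type;	/* SKB/XDP_TX/XDP_NDO/XSK_TX */
	struct xsk_tx_metadata_compl xsk_meta;	/* XSK completion metadata */
};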
On Wed, 14 Jan 2026 19:00:31 +0800 Tao Wang wrote:
> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> last_segment to false explicitly instead of relying on the stale value.
> This will prevent similar issues from occurring in the future.
>
> Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
>
> changelog:
> v1 -> v2:
> - Modify commit message, del empty line, add fixed commit
> information.
>
> Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
When you repost to address Russell's feedback in the commit
message please:
- follow the recommended format (changelog placement and no empty
lines between Fixes and SoB):
https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#changes-requested
- do not send new version in reply to the old one, start a new thread
--
pw-bot: cr
> > To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> > tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> > last_segment to false explicitly instead of relying on the stale value.
> > This will prevent similar issues from occurring in the future.
> >
> > Fixes: c2837423cb54 ("net: stmmac: Rework TX Coalesce logic")
> >
> > changelog:
> > v1 -> v2:
> > - Modify commit message, del empty line, add fixed commit
> > information.
> >
> > Signed-off-by: Tao Wang <tao03.wang@horizon.auto>
>
> When you repost to address Russell's feedback in the commit
> message please:
> - follow the recommended format (changelog placement and no empty
> lines between Fixes and SoB):
> https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#changes-requested
> - do not send new version in reply to the old one, start a new thread
Understood, I will correct the commit message format and post the next
version of the patch as a new thread.
Thanks
Tao Wang
On Wed, Jan 14, 2026 at 07:00:31PM +0800, Tao Wang wrote:
> After resume, dev_watchdog() reports:
> "NETDEV WATCHDOG: CPU: x: transmit queue x timed out xx ms"
>
> The triggering scenario is as follows:
> The TSO path sets tx_skbuff_dma[tx_q->cur_tx].last_segment = true, and
> because last_segment is not cleared in stmmac_free_tx_buffer after
> resume, restarting TSO transmission may incorrectly reuse
> tx_q->tx_skbuff_dma[first_entry].last_segment = true for a new TSO packet.
>
> When the tx queue timed out, the emac TX descriptors were as follows:
> eth0: 221 [0x0000000876d10dd0]: 0x73660cbe 0x8 0x42 0xb04416a0
> eth0: 222 [0x0000000876d10de0]: 0x77731d40 0x8 0x16a0 0x90000000
>
> Descriptor 221 is the TSO header, and descriptor 222 is the TSO payload.
> In tdes3 (0xb04416a0) of descriptor 221, bit 29 (first descriptor) and
> bit 28 (last descriptor) are both set, but they must never both be 1
> for a TSO header descriptor. Since descriptor 222 is the actual last
> descriptor, failing to mark it as the last one causes the EMAC DMA to
> stop and hang.
>
> To solve the issue, set last_segment to false in stmmac_free_tx_buffer:
> tx_q->tx_skbuff_dma[i].last_segment = false. In stmmac_tso_xmit, set
> last_segment to false explicitly instead of relying on the stale value.
> This will prevent similar issues from occurring in the future.

While I agree with the change for stmmac_tso_xmit(), please explain why
the change in stmmac_free_tx_buffer() is necessary.

It seems to me that if this is missing in stmmac_free_tx_buffer(), the
driver should have more problems than just TSO.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> While I agree with the change for stmmac_tso_xmit(), please explain why
> the change in stmmac_free_tx_buffer() is necessary.
>
> It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> driver should have more problems than just TSO.

The change in stmmac_free_tx_buffer() is intended to be generic for all
users of last_segment, not only for the TSO path. So far, I have not
observed any issues with stmmac_xmit() or stmmac_xdp_xmit_xdpf(), but
this change ensures consistent and correct handling of last_segment
across all relevant transmit paths.

Thanks
Tao Wang
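To make the failure mode concrete: the point of contention is that the
free path clears only some fields of a ring entry, so a flag set by an
earlier packet can leak into the next one. A stand-alone model of that
in plain C (not driver code; the names are illustrative only):

#include <stdbool.h>
#include <stdio.h>

struct entry {
	bool map_as_page;
	bool last_segment;
};

/* Mirrors stmmac_free_tx_buffer before the fix: map_as_page is reset,
 * last_segment is not.
 */
static void free_tx_buffer(struct entry *e)
{
	e->map_as_page = false;
}

int main(void)
{
	struct entry ring[4] = { 0 };

	ring[0].last_segment = true;	/* set by a previous TSO transmit */
	free_tx_buffer(&ring[0]);	/* ring torn down, e.g. for resume */

	/* Prints 1: the new packet's first descriptor slot still claims
	 * to be a last segment - the stale state the patch clears.
	 */
	printf("first_entry.last_segment=%d\n", ring[0].last_segment);
	return 0;
}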
On Thu, Jan 15, 2026 at 03:08:53PM +0800, Tao Wang wrote:
> > While I agree with the change for stmmac_tso_xmit(), please explain why
> > the change in stmmac_free_tx_buffer() is necessary.
> >
> > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > driver should have more problems than just TSO.
>
> The change in stmmac_free_tx_buffer() is intended to be generic for all
> users of last_segment, not only for the TSO path.
However, transmit is a hotpath, so work needs to be minimised for good
performance. We don't want anything that is unnecessary in these paths.
If we always explicitly set .last_segment when adding any packet to the
ring, then there is absolutely no need to also do so when freeing them.
Also, I think there's a similar issue with .is_jumbo.
So, I think it would make more sense to have some helpers for setting
up the tx_skbuff_dma entry. Maybe something like the below? I'll see
if I can measure the performance impact of this later today, but I
can't guarantee I'll get to that.
The idea here is to ensure that all members with the exception of
xsk_meta are fully initialised when an entry is populated.
I haven't removed anything in the tx_q->tx_skbuff_dma entry release
path yet, but with this in place, we should be able to eliminate the
clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
Note that the driver assumes setting .buf to zero means the entry is
cleared. dma_addr_t is a cookie which is device specific, and zero
may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
value can be assumed to hold any meaning in driver code. So that needs
fixing as well.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index a8a78fe7d01f..0e605d0f6a94 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1874,6 +1874,34 @@ static int init_dma_rx_desc_rings(struct net_device *dev,
return ret;
}
+static void stmmac_set_tx_dma_entry(struct stmmac_tx_queue *tx_q,
+ unsigned int entry,
+ enum stmmac_txbuf_type type,
+ dma_addr_t addr, size_t len,
+ bool map_as_page)
+{
+ tx_q->tx_skbuff_dma[entry].buf = addr;
+ tx_q->tx_skbuff_dma[entry].len = len;
+ tx_q->tx_skbuff_dma[entry].buf_type = type;
+ tx_q->tx_skbuff_dma[entry].map_as_page = map_as_page;
+ tx_q->tx_skbuff_dma[entry].last_segment = false;
+ tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+}
+
+static void stmmac_set_tx_skb_dma_entry(struct stmmac_tx_queue *tx_q,
+ unsigned int entry, dma_addr_t addr,
+ size_t len, bool map_as_page)
+{
+ stmmac_set_tx_dma_entry(tx_q, entry, STMMAC_TXBUF_T_SKB, addr, len,
+ map_as_page);
+}
+
+static void stmmac_set_tx_dma_last_segment(struct stmmac_tx_queue *tx_q,
+ unsigned int entry)
+{
+ tx_q->tx_skbuff_dma[entry].last_segment = true;
+}
+
/**
* __init_dma_tx_desc_rings - init the TX descriptor ring (per queue)
* @priv: driver private structure
@@ -1919,11 +1947,8 @@ static int __init_dma_tx_desc_rings(struct stmmac_priv *priv,
p = tx_q->dma_tx + i;
stmmac_clear_desc(priv, p);
+ stmmac_set_tx_skb_dma_entry(tx_q, i, 0, 0, false);
- tx_q->tx_skbuff_dma[i].buf = 0;
- tx_q->tx_skbuff_dma[i].map_as_page = false;
- tx_q->tx_skbuff_dma[i].len = 0;
- tx_q->tx_skbuff_dma[i].last_segment = false;
tx_q->tx_skbuff[i] = NULL;
}
@@ -2649,19 +2674,15 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
-
/* To return XDP buffer to XSK pool, we simple call
* xsk_tx_completed(), so we don't need to fill up
* 'buf' and 'xdpf'.
*/
- tx_q->tx_skbuff_dma[entry].buf = 0;
- tx_q->xdpf[entry] = NULL;
+ stmmac_set_tx_dma_entry(tx_q, entry, STMMAC_TXBUF_T_XSK_TX,
+ 0, xdp_desc.len, false);
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
- tx_q->tx_skbuff_dma[entry].map_as_page = false;
- tx_q->tx_skbuff_dma[entry].len = xdp_desc.len;
- tx_q->tx_skbuff_dma[entry].last_segment = true;
- tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+ tx_q->xdpf[entry] = NULL;
stmmac_set_desc_addr(priv, tx_desc, dma_addr);
@@ -2836,6 +2857,9 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
tx_q->tx_skbuff_dma[entry].map_as_page = false;
}
+ /* This looks at tx_q->tx_skbuff_dma[tx_q->dirty_tx].is_jumbo
+ * and tx_q->tx_skbuff_dma[tx_q->dirty_tx].last_segment
+ */
stmmac_clean_desc3(priv, tx_q, p);
tx_q->tx_skbuff_dma[entry].last_segment = false;
@@ -4494,10 +4518,8 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
* this DMA buffer right after the DMA engine completely finishes the
* full buffer transmission.
*/
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_headlen(skb);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = false;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
+ stmmac_set_tx_skb_dma_entry(tx_q, tx_q->cur_tx, des, skb_headlen(skb),
+ false);
/* Prepare fragments */
for (i = 0; i < nfrags; i++) {
@@ -4512,17 +4534,14 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
stmmac_tso_allocator(priv, des, skb_frag_size(frag),
(i == nfrags - 1), queue);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_frag_size(frag);
- tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = true;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
+ stmmac_set_tx_skb_dma_entry(tx_q, tx_q->cur_tx, des,
+ skb_frag_size(frag), true);
}
- tx_q->tx_skbuff_dma[tx_q->cur_tx].last_segment = true;
+ stmmac_set_tx_dma_last_segment(tx_q, tx_q->cur_tx);
/* Only the last descriptor gets to point to the skb. */
tx_q->tx_skbuff[tx_q->cur_tx] = skb;
- tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
/* Manage tx mitigation */
tx_packets = (tx_q->cur_tx + 1) - first_tx;
@@ -4774,23 +4793,18 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err; /* should reuse desc w/o issues */
- tx_q->tx_skbuff_dma[entry].buf = des;
-
+ stmmac_set_tx_skb_dma_entry(tx_q, entry, des, len, true);
stmmac_set_desc_addr(priv, desc, des);
- tx_q->tx_skbuff_dma[entry].map_as_page = true;
- tx_q->tx_skbuff_dma[entry].len = len;
- tx_q->tx_skbuff_dma[entry].last_segment = last_segment;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_SKB;
-
/* Prepare the descriptor and set the own bit too */
stmmac_prepare_tx_desc(priv, desc, 0, len, csum_insertion,
priv->mode, 1, last_segment, skb->len);
}
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
+
/* Only the last descriptor gets to point to the skb. */
tx_q->tx_skbuff[entry] = skb;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_SKB;
/* According to the coalesce parameter the IC bit for the latest
* segment is reset and the timer re-started to clean the tx status.
@@ -4869,14 +4883,13 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
if (dma_mapping_error(priv->device, des))
goto dma_map_err;
- tx_q->tx_skbuff_dma[first_entry].buf = des;
- tx_q->tx_skbuff_dma[first_entry].buf_type = STMMAC_TXBUF_T_SKB;
- tx_q->tx_skbuff_dma[first_entry].map_as_page = false;
+ stmmac_set_tx_skb_dma_entry(tx_q, first_entry, des, nopaged_len,
+ false);
stmmac_set_desc_addr(priv, first, des);
- tx_q->tx_skbuff_dma[first_entry].len = nopaged_len;
- tx_q->tx_skbuff_dma[first_entry].last_segment = last_segment;
+ if (last_segment)
+ stmmac_set_tx_dma_last_segment(tx_q, first_entry);
if (unlikely((skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
priv->hwts_tx_en)) {
@@ -5064,6 +5077,7 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
bool csum = !priv->plat->tx_queues_cfg[queue].coe_unsupported;
unsigned int entry = tx_q->cur_tx;
+ enum stmmac_txbuf_type buf_type;
struct dma_desc *tx_desc;
dma_addr_t dma_addr;
bool set_ic;
@@ -5091,7 +5105,7 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
if (dma_mapping_error(priv->device, dma_addr))
return STMMAC_XDP_CONSUMED;
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XDP_NDO;
+ buf_type = STMMAC_TXBUF_T_XDP_NDO;
} else {
struct page *page = virt_to_page(xdpf->data);
@@ -5100,14 +5114,12 @@ static int stmmac_xdp_xmit_xdpf(struct stmmac_priv *priv, int queue,
dma_sync_single_for_device(priv->device, dma_addr,
xdpf->len, DMA_BIDIRECTIONAL);
- tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XDP_TX;
+ buf_type = STMMAC_TXBUF_T_XDP_TX;
}
- tx_q->tx_skbuff_dma[entry].buf = dma_addr;
- tx_q->tx_skbuff_dma[entry].map_as_page = false;
- tx_q->tx_skbuff_dma[entry].len = xdpf->len;
- tx_q->tx_skbuff_dma[entry].last_segment = true;
- tx_q->tx_skbuff_dma[entry].is_jumbo = false;
+ stmmac_set_tx_dma_entry(tx_q, entry, buf_type, dma_addr, xdpf->len,
+ false);
+ stmmac_set_tx_dma_last_segment(tx_q, entry);
tx_q->xdpf[entry] = xdpf;
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
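On the final point about .buf == 0: a sketch of what using
DMA_MAPPING_ERROR as the "entry unused" sentinel could look like. The
helper names here are invented for illustration and are not part of
the diff above; the firm facts are only that zero can be a valid DMA
cookie while DMA_MAPPING_ERROR is guaranteed invalid.

#include <linux/dma-mapping.h>

/* Hypothetical helpers: mark an entry unused with DMA_MAPPING_ERROR
 * instead of 0, since 0 may be a perfectly valid cookie on some
 * platforms. struct stmmac_tx_info is the entry type used for
 * tx_q->tx_skbuff_dma[] in this driver.
 */
static inline void stmmac_tx_entry_clear(struct stmmac_tx_info *info)
{
	info->buf = DMA_MAPPING_ERROR;
}

static inline bool stmmac_tx_entry_mapped(const struct stmmac_tx_info *info)
{
	return info->buf != DMA_MAPPING_ERROR;
}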
> > > While I agree with the change for stmmac_tso_xmit(), please explain why
> > > the change in stmmac_free_tx_buffer() is necessary.
> > >
> > > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > > driver should have more problems than just TSO.
> >
> > The change in stmmac_free_tx_buffer() is intended to be generic for all
> > users of last_segment, not only for the TSO path.
>
> However, transmit is a hotpath, so work needs to be minimised for good
> performance. We don't want anything that is unnecessary in these paths.
>
> If we always explicitly set .last_segment when adding any packet to the
> ring, then there is absolutely no need to also do so when freeing them.
>
> Also, I think there's a similar issue with .is_jumbo.
>
> So, I think it would make more sense to have some helpers for setting
> up the tx_skbuff_dma entry. Maybe something like the below? I'll see
> if I can measure the performance impact of this later today, but I
> can't guarantee I'll get to that.
>
> The idea here is to ensure that all members with the exception of
> xsk_meta are fully initialised when an entry is populated.
>
> I haven't removed anything in the tx_q->tx_skbuff_dma entry release
> path yet, but with this in place, we should be able to eliminate the
> clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
>
> Note that the driver assumes setting .buf to zero means the entry is
> cleared. dma_addr_t is a cookie which is device specific, and zero
> may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
> value can be assumed to hold any meaning in driver code. So that needs
> fixing as well.
>
> [... proposed diff snipped; quoted in full in the previous message ...]
Since the changes are relatively large, I suggest splitting them into a
separate optimization patch. As I cannot validate the is_jumbo scenario,
I have dropped the changes to stmmac_free_tx_buffer. I will submit a
separate patch focusing only on fixing the TSO case.
On Thu, Jan 15, 2026 at 12:09:18PM +0000, Russell King (Oracle) wrote:
> On Thu, Jan 15, 2026 at 03:08:53PM +0800, Tao Wang wrote:
> > > While I agree with the change for stmmac_tso_xmit(), please explain why
> > > the change in stmmac_free_tx_buffer() is necessary.
> > >
> > > It seems to me that if this is missing in stmmac_free_tx_buffer(), the
> > > driver should have more problems than just TSO.
> >
> > The change in stmmac_free_tx_buffer() is intended to be generic for all
> > users of last_segment, not only for the TSO path.
>
> However, transmit is a hotpath, so work needs to be minimised for good
> performance. We don't want anything that is unnecessary in these paths.
>
> If we always explicitly set .last_segment when adding any packet to the
> ring, then there is absolutely no need to also do so when freeing them.
>
> Also, I think there's a similar issue with .is_jumbo.
>
> So, I think it would make more sense to have some helpers for setting
> up the tx_skbuff_dma entry. Maybe something like the below? I'll see
> if I can measure the performance impact of this later today, but I
> can't guarantee I'll get to that.
>
> The idea here is to ensure that all members with the exception of
> xsk_meta are fully initialised when an entry is populated.
>
> I haven't removed anything in the tx_q->tx_skbuff_dma entry release
> path yet, but with this in place, we should be able to eliminate the
> clearance of these in stmmac_tx_clean() and stmmac_free_tx_buffer().
>
> Note that the driver assumes setting .buf to zero means the entry is
> cleared. dma_addr_t is a cookie which is device specific, and zero
> may be a valid DMA cookie. Only DMA_MAPPING_ERROR is invalid; no other
> value can be assumed to hold any meaning in driver code. So that needs
> fixing as well.

I've just run iperf3 in both directions with the kernel I had on the
board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
particularly great - by that I mean, iperf3 *failed* spectacularly.

First, running in normal mode (stmmac transmitting, x86 receiving)
it's only capable of 210Mbps, which is nowhere near line rate.

However, when running iperf3 in reverse mode, it filled the stmmac's
receive queue, which then started spewing PAUSE frames at a rate of
knots, flooding the network, and causing the entire network to stop.
It never recovered without rebooting.

Trying again on 6.19.0-rc4-net-next+, stmmac transmitting shows the
same dire performance:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  24.2 MBytes   203 Mbits/sec    0    230 KBytes
[  5]   1.00-2.00   sec  25.5 MBytes   214 Mbits/sec    0    230 KBytes
[  5]   2.00-3.00   sec  25.0 MBytes   210 Mbits/sec    0    230 KBytes
[  5]   3.00-4.00   sec  25.5 MBytes   214 Mbits/sec    0    230 KBytes
[  5]   4.00-5.00   sec  25.1 MBytes   211 Mbits/sec    0    230 KBytes
[  5]   5.00-6.00   sec  25.1 MBytes   211 Mbits/sec    0    230 KBytes
[  5]   6.00-7.00   sec  25.7 MBytes   215 Mbits/sec    0    230 KBytes
[  5]   7.00-8.00   sec  25.2 MBytes   212 Mbits/sec    0    230 KBytes
[  5]   8.00-9.00   sec  25.3 MBytes   212 Mbits/sec    0    346 KBytes
[  5]   9.00-10.00  sec  25.4 MBytes   213 Mbits/sec    0    346 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   252 MBytes   211 Mbits/sec    0   sender
[  5]   0.00-10.02  sec   250 MBytes   210 Mbits/sec        receiver

stmmac receiving shows the same problem:

[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  64.1 MBytes   537 Mbits/sec
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   2.00-3.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   3.00-4.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   4.00-5.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   5.00-6.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   6.00-7.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   7.00-8.00   sec  0.00 Bytes    0.00 bits/sec
[  5]   8.00-9.00   sec  0.00 Bytes    0.00 bits/sec
^C[  5]   9.00-9.43   sec  0.00 Bytes    0.00 bits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-9.43   sec  0.00 Bytes    0.00 bits/sec    sender
[  5]   0.00-9.43   sec  64.1 MBytes   57.0 Mbits/sec   receiver
iperf3: interrupt - the client has terminated

and it's now spewing PAUSE frames again.

The RXQ 0 debug register shows:

Value at address 0x02490d38: 0x002b0020

bits 29:16 (PRXQ = 43) is the number of packets in the RX queue
bits 5:4 (RXQSTS = 10) shows that the internal RX queue is above the
flow control activate threshold.

The RXQ 0 operating mode register shows:

Value at address 0x02490d30: 0x0ff1c4e0

bits 29:20 (RQS = 255) indicates that the receive queue size is
(255 + 1) * 256 = 65536 bytes (which is what hw feature 1 reports)
bits 16:14 (RFD = 7) indicates the threshold for deactivating flow control
bits 10:8 (RFA = 4) indicates the threshold for activating flow control

Disabling EHFC (bit 7, enable hardware flow control) stops the flood.

Looking at the receive descriptor ring, all the entries are marked with
RDES3_OWN | RDES3_BUFFER1_VALID_ADDR - so there are free ring entries,
but the hardware is not transferring the queued packets.

Looking at the channel 0 status register, it's indicating RBU (receive
buffer unavailable.)

This gets more weird.

Channel 0 Rx descriptor tail pointer register:
Value at address 0x02491128: 0xffffee30

Channel 0 current application receive descriptor register:
Value at address 0x0249114c: 0xffffee30

Receive queue descriptor:
227 [0x0000007fffffee30]: 0xfee00040 0x7f 0x0 0x81000000

I've tried writing to the tail pointer register (both the current value
and the next descriptor value), this doesn't seem to change anything.
I've tried clearing SR in DMA_CHAN_RX_CONTROL() and setting it, again
no change.

So, it looks like the receive hardware has permanently stalled, needing
at minimum a soft reset of the entire stmmac core to recover it.

I think I'm going to have to declare stmmac receive on dwmac4 to be
buggy at the moment, as I can't get to the bottom of what's causing
this.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
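The register decoding above is mechanical enough to script. Below is a
small stand-alone C sketch that reproduces it, using only the field
positions quoted in the email (RQS bits 29:20, RFD 16:14, RFA 10:8,
EHFC bit 7; PRXQ bits 29:16, RXQSTS bits 5:4); treat those positions as
assumptions from the message rather than databook-verified facts.

#include <stdio.h>
#include <stdint.h>

/* Extract bits hi..lo from a 32-bit register value */
static uint32_t field(uint32_t v, int hi, int lo)
{
	return (v >> lo) & ((1u << (hi - lo + 1)) - 1);
}

int main(void)
{
	uint32_t rxq_op_mode = 0x0ff1c4e0;	/* RXQ0 operating mode */
	uint32_t rxq_debug = 0x002b0020;	/* RXQ0 debug */

	printf("RQS=%u -> queue size %u bytes\n",
	       field(rxq_op_mode, 29, 20),
	       (field(rxq_op_mode, 29, 20) + 1) * 256);
	printf("RFD=%u RFA=%u EHFC=%u\n",
	       field(rxq_op_mode, 16, 14), field(rxq_op_mode, 10, 8),
	       field(rxq_op_mode, 7, 7));
	printf("PRXQ=%u RXQSTS=%u\n",
	       field(rxq_debug, 29, 16), field(rxq_debug, 5, 4));
	return 0;
}

The output matches the analysis above: RQS=255 (65536 bytes), EHFC=1,
43 packets queued, and RXQSTS=2 (binary 10, above the flow-control
activate threshold).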
Hi,

> I've just run iperf3 in both directions with the kernel I had on the
> board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
> particularly great - by that I mean, iperf3 *failed* spectacularly.
>
> First, running in normal mode (stmmac transmitting, x86 receiving)
> it's only capable of 210Mbps, which is nowhere near line rate.
>
> However, when running iperf3 in reverse mode, it filled the stmmac's
> receive queue, which then started spewing PAUSE frames at a rate of
> knots, flooding the network, and causing the entire network to stop.
> It never recovered without rebooting.

Heh, I was able to reproduce something similar on imx8mp, that has an
imx-dwmac (dwmac 4/5 according to dmesg):

DUT to x86:

Connecting to host 192.168.2.1, port 5201
[  5] local 192.168.2.13 port 54744 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  0.00 Bytes    0.00 bits/sec    2   1.41 KBytes
[  5]   1.00-2.00   sec  0.00 Bytes    0.00 bits/sec    1   1.41 KBytes

x86 to DUT:

Reverse mode, remote host 192.168.2.1 is sending
[  5] local 192.168.2.13 port 47050 connected to 192.168.2.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   112 MBytes   935 Mbits/sec
[  5]   1.00-2.00   sec   112 MBytes   936 Mbits/sec
[  5]   2.00-3.00   sec   112 MBytes   936 Mbits/sec

Nothing as bad as what you face, but there's definitely something going
on there. The "good" news is that it worked in v6.19-rc1, I have a
bisect ongoing.

I'll update once I have homed-in on something.

Maxime
Hi again,
On 15/01/2026 22:04, Maxime Chevallier wrote:
> Hi,
>
>>
>> I've just run iperf3 in both directions with the kernel I had on the
>> board (based on 6.18.0-rc7-net-next+), and stmmac really isn't looking
>> particularly great - by that I mean, iperf3 *failed* spectacularly.
>>
>> First, running in normal mode (stmmac transmitting, x86 receiving)
>> it's only capable of 210Mbps, which is nowhere near line rate.
>>
>> However, when running iperf3 in reverse mode, it filled the stmmac's
>> receive queue, which then started spewing PAUSE frames at a rate of
>> knots, flooding the network, and causing the entire network to stop.
>> It never recovered without rebooting.
[...]
> Heh, I was able to reproduce something similar on imx8mp, that has an
> imx-dwmac (dwmac 4/5 according to dmesg) :
>
> DUT to x86
>
> Connecting to host 192.168.2.1, port 5201
> [ 5] local 192.168.2.13 port 54744 connected to 192.168.2.1 port 5201
> [ ID] Interval Transfer Bitrate Retr Cwnd
> [ 5] 0.00-1.00 sec 0.00 Bytes 0.00 bits/sec 2 1.41 KBytes
> [ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 1 1.41 KBytes
>
> x86 to DUT :
>
> Reverse mode, remote host 192.168.2.1 is sending
> [ 5] local 192.168.2.13 port 47050 connected to 192.168.2.1 port 5201
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 112 MBytes 935 Mbits/sec
> [ 5] 1.00-2.00 sec 112 MBytes 936 Mbits/sec
> [ 5] 2.00-3.00 sec 112 MBytes 936 Mbits/sec
>
> Nothing as bad as what you face, but there's definitely something going
> on there. The "good" news is that it worked in v6.19-rc1, I have a
> bisect ongoing.
>
> I'll update once I have homed-in on something.
>
> Maxime
So the bisect results are in, at least for the problem I noticed. It's
not certain yet that this is the same problem as Russell's, and maybe
not the same as Tao Wang's either...
The culprit commit is :
commit 8409495bf6c907a5bc9632464dbdd8fb619f9ceb (HEAD)
Author: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Date: Thu Jan 8 17:36:40 2026 +0000
net: stmmac: cores: remove many xxx_SHIFT definitions
We have many xxx_SHIFT definitions along side their corresponding
xxx_MASK definitions for the various cores. Manually using the
shift and mask can be error prone, as shown with the dwmac4 RXFSTS
fix patch.
Convert sites that use xxx_SHIFT and xxx_MASK directly to use
FIELD_GET(), FIELD_PREP(), and u32_replace_bits() as appropriate.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vdtw8-00000002Gtu-0Hyu@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lore link :
https://lore.kernel.org/netdev/E1vdtw8-00000002Gtu-0Hyu@rmk-PC.armlinux.org.uk/
I confirm that iperf3 works perfectly in both directions before this commit,
and I get 0 bits/s when running "iperf3 -c my_host" on the DUT that has stmmac.
Looks like something happened while cleaning-up the macros for the various
definitions.
Unfortunately it's getting late here, I'm not going to dig any further
tonight :(
Thanks,
Maxime
On Thu, Jan 15, 2026 at 10:35:26PM +0100, Maxime Chevallier wrote:
> Hi again,
>
> [...]
>
> So the bisect results are in, at least for the problem I noticed. It's
> not certain yet that this is the same problem as Russell's, and maybe
> not the same as Tao Wang's either...
>
> The culprit commit is :
>
> commit 8409495bf6c907a5bc9632464dbdd8fb619f9ceb (HEAD)
> Author: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> Date:   Thu Jan 8 17:36:40 2026 +0000
>
>     net: stmmac: cores: remove many xxx_SHIFT definitions
>
> [...]
>
> I confirm that iperf3 works perfectly in both directions before this
> commit, and I get 0 bits/s when running "iperf3 -c my_host" on the DUT
> that has stmmac.
>
> Looks like something happened while cleaning-up the macros for the
> various definitions.

Thanks for finding the blame.

A few other interesting things... I have an old 6.14 kernel on the
platform, and that gives what I deem to be good transmit performance.
Receive performance is low, but it doesn't fail.

I wrote a shell script to use devmem2 to dump all the stmmac registers.
These seem more significant on the face of it... but I'm working it out
as I write this email:

-Value at address 0x02490010: 0x00010008
+Value at address 0x02490010: 0x00080008
-Value at address 0x02490014: 0x20020008
+Value at address 0x02490014: 0x20000008
-Value at address 0x02490018: 0x00000001
+Value at address 0x02490018: 0x04000001

These are GMAC_HASH_TAB()

-Value at address 0x02490060: 0x001a0000
+Value at address 0x02490060: 0x00120000

VLAN_ONCL, bit is VLAN_CSVL, changed in commit:
c657f86106c8 net: stmmac: vlan: Disable 802.1AD tag insertion offload.

-Value at address 0x024900c0: 0x01000000
+Value at address 0x024900c0: 0x05000000

GMAC_PMT - bit 26, part of the RWKPTR[4:0] bitfield, read-only.

-Value at address 0x02490d30: 0x0ff1c4a0
+Value at address 0x02490d30: 0x0ff1c4e0

MTL_CHAN_RX_OP_MODE(0) - bit 6 is different, MTL_OP_MODE_DIS_TCP_EF.
This is a change from:
fe4042797651 net: stmmac: dwmac4: stop hardware from dropping
checksum-error packets

-Value at address 0x02491104: 0x00101011
+Value at address 0x02491104: 0x00001011

DMA_CHAN_TX_CONTROL(0) - but this is significant. In
dwmac4_dma_init_tx_chan(), we have:

-	value = value | (txpbl << DMA_BUS_MODE_PBL_SHIFT);
+	value = value | FIELD_PREP(DMA_BUS_MODE_PBL, txpbl);

and the corresponding change in the header file:

 /* DMA SYS Bus Mode bitmap */
 #define DMA_BUS_MODE_SPH		BIT(24)
 #define DMA_BUS_MODE_PBL		BIT(16)
-#define DMA_BUS_MODE_PBL_SHIFT		16
-#define DMA_BUS_MODE_RPBL_SHIFT		16
+#define DMA_BUS_MODE_RPBL_MASK		GENMASK(21, 16)
 #define DMA_BUS_MODE_MB			BIT(14)
 #define DMA_BUS_MODE_FB			BIT(0)

The combination of DMA_BUS_MODE_PBL and DMA_BUS_MODE_PBL_SHIFT leads
one to believe that this is a single bit field, whereas there is
another overlapping field called RPBL that is wider. RPBL gets used
for DMA_CHAN_RX_CONTROL, whereas PBL gets used for DMA_CHAN_TX_CONTROL.

txpbl for the Jetson Xavier NX board (tegra194) is 16:

arch/arm64/boot/dts/nvidia/tegra194.dtsi:	snps,txpbl = <16>;

16 doesn't fit into a single bit. The header file was wrong. According
to non-Tegra documentation (the closest I have for dwmac4 is
stm32mp151), this field is called TXPBL[5:0] covering bits 21:16 of
this register, and is the transmit burst length.

However, while this may explain the transmit slowdown because it's on
the transmit side, it doesn't explain the receive problem.

-Value at address 0x0249113c: 0x000d07c0
+Value at address 0x0249113c: 0x000507c0

DMA_CHAN_SLOT_CTRL_STATUS(0) - bit 19, RSN[3:0] bit 3, read-only.

With the TXPBL thing fixed, for transmit I now get:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1003 MBytes   841 Mbits/sec    0   sender
[  5]   0.00-10.01  sec  1002 MBytes   839 Mbits/sec        receiver

which is way better, but receive still fails, with a storm of PAUSE,
with RBU set.

Transmit fix (eventually):
https://lore.kernel.org/r/E1vgY1k-00000003vOC-0Z1H@rmk-PC.armlinux.org.uk

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
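The TXPBL regression is easy to demonstrate in isolation. The sketch
below uses plain C reimplementations of BIT/GENMASK and the FIELD_PREP
arithmetic so it compiles outside the kernel, and shows how a one-bit
mask silently truncates txpbl = 16, matching the 0x00101011 ->
0x00001011 register diff above.

#include <stdio.h>
#include <stdint.h>

#define BIT(n)		(1u << (n))
#define GENMASK(h, l)	((~0u >> (31 - (h))) & ~((1u << (l)) - 1u))

/* Same arithmetic as the kernel's FIELD_PREP: shift the value to the
 * mask's lowest set bit, then mask. For a runtime value there is no
 * compile-time overflow check, so excess bits are silently dropped.
 */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	return (val << __builtin_ctz(mask)) & mask;
}

int main(void)
{
	uint32_t txpbl = 16;

	/* Broken: DMA_BUS_MODE_PBL is BIT(16), a one-bit mask, so the
	 * whole burst length is lost (prints 0x00000000).
	 */
	printf("BIT(16):        0x%08x\n", field_prep(BIT(16), txpbl));

	/* Correct: TXPBL occupies bits 21:16 (prints 0x00100000, the
	 * bits present in the old 0x00101011 register value).
	 */
	printf("GENMASK(21,16): 0x%08x\n", field_prep(GENMASK(21, 16), txpbl));
	return 0;
}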
On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> However, while this may explain the transmit slowdown because it's
> on the transmit side, it doesn't explain the receive problem.
I'm bisecting to find the cause of the receive issue, but it's going to
take a long time (in the mean time, I can't do any mainline work.)
So far, the range of good/bad has been narrowed down to 6.14 is good,
1b98f357dadd ("Merge tag 'net-next-6.16' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
14 more iterations to go. Might be complete by Sunday. (Slowness in
building the more fully featured net-next I use primarily for build
testing, the slowness of the platform to reboot, and the need to
manually test each build.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> > However, while this may explain the transmit slowdown because it's
> > on the transmit side, it doesn't explain the receive problem.
>
> I'm bisecting to find the cause of the receive issue, but it's going to
> take a long time (in the mean time, I can't do any mainline work.)
>
> So far, the range of good/bad has been narrowed down to 6.14 is good,
> 1b98f357dadd ("Merge tag 'net-next-6.16' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
>
> 14 more iterations to go. Might be complete by Sunday. (Slowness in
> building the more fully featured net-next I use primarily for build
> testing, the slowness of the platform to reboot, and the need to
> manually test each build.)
Well, that's been a waste of time today. While the next iteration was
building, because it's been suspicious that each and every bisect
point has failed so far, I decided to re-check 6.14, and that fails.
So, it looks like this problem has existed for some considerable
time. I don't have the compute power locally to bisect over a massive
range of kernels, so I'm afraid stmmac receive is going to have to
stay broken unless someone else can bisect (and find a "good" point
in the git history.)
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Hi,
On 16/01/2026 19:08, Russell King (Oracle) wrote:
> On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
>> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
>>> However, while this may explain the transmit slowdown because it's
>>> on the transmit side, it doesn't explain the receive problem.
>>
>> I'm bisecting to find the cause of the receive issue, but it's going to
>> take a long time (in the mean time, I can't do any mainline work.)
>>
>> So far, the range of good/bad has been narrowed down to 6.14 is good,
>> 1b98f357dadd ("Merge tag 'net-next-6.16' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
>>
>> 14 more iterations to go. Might be complete by Sunday. (Slowness in
>> building the more fully featured net-next I use primarily for build
>> testing, the slowness of the platform to reboot, and the need to
>> manually test each build.)
>
> Well, that's been a waste of time today. While the next iteration was
> building, because it's been suspicious that each and every bisect
> point has failed so far, I decided to re-check 6.14, and that fails.
> So, it looks like this problem has existed for some considerable
> time. I don't have the compute power locally to bisect over a massive
> range of kernels, so I'm afraid stmmac receive is going to have to
> stay broken unless someone else can bisect (and find a "good" point
> in the git history.)
>
To me RX looks OK, at least on the various devices I have that use
stmmac. It's fine on Cyclone V socfpga, and imx8mp. Maybe that's Jetson
specific ?
I've got pretty much line rate with a basic 'iperf3 -c XX' and the same
with 'iperf3 -c XX -R'. What commands are you running to check the issue ?
Are you still seeing the pause frames flood ?
Maxime
On Fri, Jan 16, 2026 at 07:27:16PM +0100, Maxime Chevallier wrote:
> Hi,
>
> On 16/01/2026 19:08, Russell King (Oracle) wrote:
> > On Fri, Jan 16, 2026 at 01:37:48PM +0000, Russell King (Oracle) wrote:
> >> On Fri, Jan 16, 2026 at 12:50:35AM +0000, Russell King (Oracle) wrote:
> >>> However, while this may explain the transmit slowdown because it's
> >>> on the transmit side, it doesn't explain the receive problem.
> >>
> >> I'm bisecting to find the cause of the receive issue, but it's going to
> >> take a long time (in the mean time, I can't do any mainline work.)
> >>
> >> So far, the range of good/bad has been narrowed down to 6.14 is good,
> >> 1b98f357dadd ("Merge tag 'net-next-6.16' of
> >> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") is bad.
> >>
> >> 14 more iterations to go. Might be complete by Sunday. (Slowness in
> >> building the more fully featured net-next I use primarily for build
> >> testing, the slowness of the platform to reboot, and the need to
> >> manually test each build.)
> >
> > Well, that's been a waste of time today. While the next iteration was
> > building, because it's been suspicious that each and every bisect
> > point has failed so far, I decided to re-check 6.14, and that fails.
> > So, it looks like this problem has existed for some considerable
> > time. I don't have the compute power locally to bisect over a massive
> > range of kernels, so I'm afraid stmmac receive is going to have to
> > stay broken unless someone else can bisect (and find a "good" point
> > in the git history.)
> >
>
> To me RX looks OK, at least on the various devices I have that use
> stmmac. It's fine on Cyclone V socfpga, and imx8mp. Maybe that's Jetson
> specific ?
Maybe - it could be something to do with MMUs slowing down the packet
rate, or it could be uncovering a bug in stmmac's handling of dwmac4
when it runs out of descriptors in the ring.
The problem I'm seeing is that RBU ends up being set in the channel 0
control register (there's only a single channel) which means that the
hardware moved on to the next receive descriptor, and found that it
didn't own it.
It _should_ be counted by this statistic:
rx_buf_unav_irq: 0
but clearly, this doesn't work, because here is the channel 0 status
register:
Value at address 0x02491160: 0x00000484
which has:
#define DMA_CHAN_STATUS_RBU BIT(7)
set. The documentation I have (sadly not for Xavier but for stm32mp151)
states that when this occurs, a "Receive Poll Demand" command needs to
be issued, but fails to explain how to do that. Older cores (such as
dwmac1000) had a "received poll demand" register to write to for this.
> I've got pretty much line rate with a basic 'iperf3 -c XX' and the same
> with 'iperf3 -c XX -R'. What commands are you running to check the issue ?
Merely iperf3 -R -c XX, it's enough to make it fall over normally
within the first second.
> Are you still seeing the pause frames flood ?
Yes, because the receive DMA has stopped, which makes the FIFO between
the MAC and MTL fill above the threshold for sending pause frames.
In order to stop the disruption to my network (because it basically
causes *everything* to clog up) I've had to turn off pause autoneg,
but that doesn't affect whether or not this happens.
It _may_ be worth testing whether adding a ndelay(500) into the
receive processing path, thereby making it intentionally slow,
allows you to reproduce the problem. If it does, then that confirms
that we're missing something in the dwmac4 handling for RBU.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
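The quoted status value can be checked directly: 0x00000484 has bit 7
set, which the driver names DMA_CHAN_STATUS_RBU. A trivial stand-alone
check, with the bit position taken from the email above:

#include <stdio.h>
#include <stdint.h>

#define DMA_CHAN_STATUS_RBU	(1u << 7)	/* receive buffer unavailable */

int main(void)
{
	uint32_t status = 0x00000484;	/* channel 0 status from the email */

	/* Prints RBU=1: the DMA hit a descriptor it does not own and
	 * has suspended reception, as described above.
	 */
	printf("RBU=%d\n", !!(status & DMA_CHAN_STATUS_RBU));
	return 0;
}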
On Fri, Jan 16, 2026 at 07:22:39PM +0000, Russell King (Oracle) wrote:
> Yes, because the receive DMA has stopped, which makes the FIFO between
> the MAC and MTL fill above the threshold for sending pause frames.
>
> In order to stop the disruption to my network (because it basically
> causes *everything* to clog up) I've had to turn off pause autoneg,
> but that doesn't affect whether or not this happens.
>
> It _may_ be worth testing whether adding a ndelay(500) into the
> receive processing path, thereby making it intentionally slow,
> allows you to reproduce the problem. If it does, then that confirms
> that we're missing something in the dwmac4 handling for RBU.

I notice that the iMX8MP TRM says similar about the RBU bit (see
11.7.6.1.482.3 bit 7). However, it does say that in ring mode, merely
advancing the tail pointer should be sufficient.

I can write the tail pointer register using devmem2, but the hardware
never wakes up. E.g.:

Channel 0 Current Application Receive Descriptor:
Value at address 0x0249114c: 0xfffff910

Channel 0 Rx Descriptor Tail Pointer:
Value at address 0x02491128: 0xfffff910
Value at address 0x02491128: 0xfffff910
Written 0xfffff940; readback 0xfffff940
Value at address 0x02491128: 0xfffff940
Written 0xfffff980; readback 0xfffff980

Value at address 0x0249114c: 0xfffff910

So, the hardware hasn't advanced. Here's the ring state:

                          RDES0      RDES1 RDES2 RDES3
401 [0x0000007ffffff910]: 0xffd63040 0x7f  0x0   0x81000000
402 [0x0000007ffffff920]: 0xffd64040 0x7f  0x0   0x81000000
403 [0x0000007ffffff930]: 0xffd3f040 0x7f  0x0   0x81000000
404 [0x0000007ffffff940]: 0xffeed040 0x7f  0x0   0x81000000
405 [0x0000007ffffff950]: 0xfff2f040 0x7f  0x0   0x81000000
406 [0x0000007ffffff960]: 0xffbee040 0x7f  0x0   0x81000000
407 [0x0000007ffffff970]: 0xffbef040 0x7f  0x0   0x81000000
408 [0x0000007ffffff980]: 0xffbf0040 0x7f  0x0   0x81000000

bit 31 of RDES3 is RDES3_OWN, which when set, means the dwmac core has
ownership of the buffer. Bit 24 means buffer 1 address valid (stored in
RDES0).

So, if the iMX8MP information is correct, then advancing 0x02491128 to
point at the following descriptors should "wake" the receive side, but
it does not.

Other registers:

Queue 0 Receive Debug:
Value at address 0x02490d38: 0x002a0020

bit 0 = 0 (MTL Rx Queue Write Controller Active Status not detected)
bit 2:1 = 0 (Read controller Idle state)
bits 5:4 = 2 (Rx Queue fill-level above flow-control activate threshold)
bits 29:16 = 0x2a - 42 packets in receive queue

Because the internal queue is above the flow-control activate
threshold, that causes the stmmac hardware to constantly spew pause
frames, and, as the stmmac receive side is essentially stuck and won't
make progress even when there are free buffers, the only way to release
this state is via a software reset of the entire core.

Why don't pause frames save us? Well, pause frames will only be sent
when the receive queue fills to the activate threshold, which can only
happen _after_ packets stop being transferred to the descriptor rings.
In other words, it can only happen when a RBU event has been detected,
which suspends the receiver - and it seems when that happens, it is
irrecoverable without soft-reset on Xavier.

Right now, I'm not sure what to think about this - I don't know whether
it's the hardware that's at fault, or whether there's an issue in the
driver. What I know for certain is what I've stated above, and the fact
that iperf3 -R has *extremely* detrimental effects on my *entire*
network.

The reason is... you connect two Netgear switches together, they use
flow control, and you have no way to turn that off... So, once stmmac
starts sending pause frames, the switch's queue for that port fills,
and when further frames come in for that port, the switch sends pause
frames to the next switch behind, which stops all traffic flow between
the two switches, severing the network. All the time that stmmac keeps
that up, so does the switch it is connected to.

If another machine happens to send a packet that needs to be queued on
the port that stmmac is connected to (e.g. broadcast or multicast)
then... that port starts sending pause frames back to that machine,
severing its network connection permanently while stmmac is spewing
pause frames. Thus, the entire network goes down, on account of _one_
machine repeatedly sending pause frames, preventing packet delivery.

While the idea of a lossless network _seems_ like a good idea, in
reality it gives an attacker who can get on a platform and take control
of the ethernet NIC the ability to completely screw an entire network
if flow control is enabled everywhere.

I'm thinking at this point... just say no to flow control, disable it
everywhere one can. Ethernet was designed to lose packets when it needs
to, to ensure fairness. Flow control destroys that fairness and results
in networks being severed.

"attacker" is maybe too strong - consider what happens if the kernel
crashes on a stmmac platform, so it can't receive packets anymore, and
the ring fills up, causing it to start spewing pause frames. It's
goodbye network!

I'm just rambling, but I think that point is justified.

Thoughts - should the kernel default to having flow control enabled or
disabled in light of this? Should this feature require explicit
administrative configuration given the severity of network disruption?

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
On Fri, 16 Jan 2026 20:57:05 +0000 Russell King (Oracle) wrote:
> Thoughts - should the kernel default to having flow control enabled or
> disabled in light of this? Should this feature require explicit
> administrative configuration given the severity of network disruption?

FWIW in DC historically we have seen a few NICs which have tiny buffers
so back-pressuring up to the top of rack switch is helpful. Switches
have more reasonable buffers. That's just NIC Tx pause, switch Rx pause
(from downlink ports, not fabric!). Letting switches generate pause is
a recipe for.. not having a network. We'd need to figure out why Netgear
does what it does in your case, IMHO.
On Sat, Jan 17, 2026 at 09:06:34AM -0800, Jakub Kicinski wrote:
> Letting switches generate pause is a recipe for.. not having a
> network. We'd need to figure out why Netgear does what it does in your
> case, IMHO.

... because they're dumb consumer switches.

Also, the correct term is "a notwork" :D

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!