[PATCH] net: stmmac: fix fatal bus error on resume by reinitializing RX buffers

Ding Hui posted 1 patch 4 weeks ago
There is a newer version of this series
[PATCH] net: stmmac: fix fatal bus error on resume by reinitializing RX buffers
Posted by Ding Hui 4 weeks ago
From: Ding Hui <dinghui@lixiang.com>

On suspend, stmmac_suspend() calls stmmac_disable_all_queues() which
stops the RX NAPI, but the RX DMA engine may still be running for a
short window before stmmac_stop_all_dma() takes effect. During that
window the hardware can write incoming frames into the buffers pointed
to by the RX descriptors and write back the descriptors (clearing the
OWN bit, updating length/status). Because NAPI is already disabled,
the driver never refills these descriptors, so the RX ring is left in
a "consumed but not refilled" state with HW-written content in the
descriptor buffer-address fields.

On resume, stmmac_clear_descriptors() only re-arms the OWN bit (rdes3)
and does not repopulate the RX buffer address fields. As a result the
descriptors still contain whatever the hardware wrote back during the
suspend race. When the DMA is restarted, it dereferences these stale
addresses and triggers a fatal bus error.
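
The failure mode above can be reproduced in a small user-space model.
This is only an illustrative sketch: the two-word descriptor, the
DESC_OWN bit position and every model_* name are assumptions made for
the example and do not match the real stmmac descriptor layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4
#define DESC_OWN  (1u << 31)	/* illustrative flag, not the real bit layout */

/* Simplified RX descriptor: one address word and one control word. */
struct model_desc {
	uint32_t buf_addr;	/* fetched by HW while OWN is set */
	uint32_t ctrl;		/* OWN bit, or length/status after write-back */
};

struct model_ring {
	struct model_desc desc[RING_SIZE];
	uint32_t buf[RING_SIZE];	/* driver-side record of mapped addresses */
};

/* Driver hands every descriptor to the hardware with a valid address. */
static void model_ring_init(struct model_ring *r)
{
	for (size_t i = 0; i < RING_SIZE; i++) {
		r->buf[i] = 0x1000u * (uint32_t)(i + 1);
		r->desc[i].buf_addr = r->buf[i];
		r->desc[i].ctrl = DESC_OWN;
	}
}

/* Hardware write-back: OWN is cleared and the address word is overwritten. */
static void model_hw_writeback(struct model_ring *r, size_t i, uint32_t status)
{
	assert(i < RING_SIZE);
	r->desc[i].buf_addr = status;
	r->desc[i].ctrl = 64;	/* frame length, OWN cleared */
}

/* Re-arm only the OWN bit, leaving the address word as the hardware left it. */
static void model_clear_descriptors(struct model_ring *r)
{
	for (size_t i = 0; i < RING_SIZE; i++)
		r->desc[i].ctrl = DESC_OWN;
}

/* Count HW-owned descriptors whose address the driver never mapped. */
static size_t model_count_stale(const struct model_ring *r)
{
	size_t stale = 0;

	for (size_t i = 0; i < RING_SIZE; i++)
		if ((r->desc[i].ctrl & DESC_OWN) &&
		    r->desc[i].buf_addr != r->buf[i])
			stale++;
	return stale;
}
```

Calling model_hw_writeback() on two entries and then
model_clear_descriptors() leaves model_count_stale() at 2: both
descriptors are handed back to the hardware with an address word the
driver never wrote.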

Fix this by treating the RX ring the same way as on close/open around
a PM transition:

 - In stmmac_suspend(), after stmmac_stop_all_dma(), walk every RX
   queue and free its buffers via dma_free_rx_xskbufs() when an XSK
   pool is attached or dma_free_rx_skbufs() otherwise, then reset
   rx_q->buf_alloc_num and clear rx_q->xsk_pool so the queue state
   matches a freshly closed queue.

 - In stmmac_resume(), call init_dma_rx_desc_rings() before
   stmmac_reset_queues_param() so RX buffers are re-allocated and
   the descriptor buffer-address fields are properly repopulated
   before the DMA is restarted.

After this change, post-resume RX descriptors always reference freshly
allocated, driver-owned buffers, and the bus error no longer occurs.

Signed-off-by: Ding Hui <dinghui@lixiang.com>
---
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 24 +++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3591755ea30b..8ed43187cf20 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -8176,6 +8176,9 @@ int stmmac_suspend(struct device *dev)
 {
 	struct net_device *ndev = dev_get_drvdata(dev);
 	struct stmmac_priv *priv = netdev_priv(ndev);
+	u32 rx_count = priv->plat->rx_queues_to_use;
+	struct stmmac_rx_queue *rx_q;
+	u32 queue;
 	u8 chan;
 
 	if (!ndev || !netif_running(ndev))
@@ -8198,6 +8201,19 @@ int stmmac_suspend(struct device *dev)
 	/* Stop TX/RX DMA */
 	stmmac_stop_all_dma(priv);
 
+	/* Free RX queue resources */
+	for (queue = 0; queue < rx_count; queue++) {
+		rx_q = &priv->dma_conf.rx_queue[queue];
+
+		/* Release the DMA RX socket buffers */
+		if (rx_q->xsk_pool)
+			dma_free_rx_xskbufs(priv, &priv->dma_conf, queue);
+		else
+			dma_free_rx_skbufs(priv, &priv->dma_conf, queue);
+		rx_q->buf_alloc_num = 0;
+		rx_q->xsk_pool = NULL;
+	}
+
 	stmmac_legacy_serdes_power_down(priv);
 
 	/* Enable Power down mode by programming the PMT regs */
@@ -8316,6 +8332,14 @@ int stmmac_resume(struct device *dev)
 
 	mutex_lock(&priv->lock);
 
+	ret = init_dma_rx_desc_rings(ndev, &priv->dma_conf, GFP_KERNEL);
+	if (ret < 0) {
+		netdev_err(priv->dev, "%s: rx dma desc rings init failed\n", __func__);
+		mutex_unlock(&priv->lock);
+		rtnl_unlock();
+		return ret;
+	}
+
 	stmmac_reset_queues_param(priv);
 
 	stmmac_free_tx_skbufs(priv);
-- 
2.34.1
Re: [PATCH] net: stmmac: fix fatal bus error on resume by reinitializing RX buffers
Posted by Andrew Lunn 3 weeks, 6 days ago
> Fix this by treating the RX ring the same way as on close/open around
> a PM transition:
> 
>  - In stmmac_suspend(), after stmmac_stop_all_dma(), walk every RX
>    queue and free its buffers via dma_free_rx_xskbufs() when an XSK
>    pool is attached or dma_free_rx_skbufs() otherwise, then reset
>    rx_q->buf_alloc_num and clear rx_q->xsk_pool so the queue state
>    matches a freshly closed queue.
> 
>  - In stmmac_resume(), call init_dma_rx_desc_rings() before
>    stmmac_reset_queues_param() so RX buffers are re-allocated and
>    the descriptor buffer-address fields are properly repopulated
>    before the DMA is restarted.

The problem with this is, if the system is under memory pressure, it
might not be able to allocate the new RX buffers. So on resume, your
network interface dies.

For configuration changes which require the buffers to be changed, like
ethtool --set-ring and sometimes changing the MTU, you first allocate
the new buffers, and only if that succeeds do you free the old buffers,
so that you can fail gracefully.
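
The allocate-first pattern described above might look roughly like the
following user-space sketch. It uses malloc() in place of DMA buffer
allocation, and struct buf_set and the buf_set_* helpers are
hypothetical names invented for this example, not stmmac functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for one generation of RX buffers. */
struct buf_set {
	void **bufs;
	size_t count;
};

/* Release every buffer in the set and leave it empty. */
static void buf_set_free(struct buf_set *s)
{
	for (size_t i = 0; i < s->count; i++)
		free(s->bufs[i]);
	free(s->bufs);
	s->bufs = NULL;
	s->count = 0;
}

/* Allocate a complete set, or nothing at all on failure. */
static int buf_set_alloc(struct buf_set *s, size_t count, size_t buf_size)
{
	s->bufs = NULL;
	s->count = 0;
	if (count == 0 || buf_size == 0)
		return -EINVAL;

	s->bufs = calloc(count, sizeof(*s->bufs));
	if (!s->bufs)
		return -ENOMEM;

	for (size_t i = 0; i < count; i++) {
		s->bufs[i] = malloc(buf_size);
		if (!s->bufs[i]) {
			buf_set_free(s);	/* undo the partial allocation */
			return -ENOMEM;
		}
		s->count++;
	}
	return 0;
}

/* Swap in a new set only after it has been fully allocated. */
static int buf_set_resize(struct buf_set *cur, size_t count, size_t buf_size)
{
	struct buf_set next;
	int ret;

	assert(cur != NULL);
	ret = buf_set_alloc(&next, count, buf_size);
	if (ret)
		return ret;	/* the old set is untouched and still usable */

	buf_set_free(cur);
	*cur = next;
	return 0;
}
```

If buf_set_resize() fails, the caller still holds the previous buffers
and can keep running with the old configuration.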

That allocate and then release idea does not work for resume.

So, can you live with the buffers you have, and just reset the
descriptors?

	Andrew
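
One way to read the suggestion above is to keep the existing buffers
and rewrite each descriptor from the driver's own bookkeeping on
resume. The sketch below is a user-space model under that assumption:
struct keep_ring, the buf_dma[] array, the KEEP_DESC_OWN bit and
keep_ring_rearm() are all hypothetical, and whether the real driver
can do this for every buffer type (XSK pools included) is not
established in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEEP_RING_SIZE 4
#define KEEP_DESC_OWN  (1u << 31)	/* illustrative flag, not the real layout */

struct keep_desc {
	uint32_t buf_addr;
	uint32_t ctrl;
};

struct keep_ring {
	struct keep_desc desc[KEEP_RING_SIZE];
	uint32_t buf_dma[KEEP_RING_SIZE];	/* driver's record; 0 means no buffer */
	size_t cur_rx;
	size_t dirty_rx;
};

/*
 * Rebuild every descriptor from the driver's records without allocating.
 * Frames written by the hardware during the suspend window are dropped.
 * Returns the number of descriptors handed back to the hardware.
 */
static size_t keep_ring_rearm(struct keep_ring *r)
{
	size_t armed = 0;

	assert(r != NULL);
	for (size_t i = 0; i < KEEP_RING_SIZE; i++) {
		if (!r->buf_dma[i]) {
			/* No buffer behind this slot: keep it driver-owned. */
			r->desc[i].buf_addr = 0;
			r->desc[i].ctrl = 0;
			continue;
		}
		r->desc[i].buf_addr = r->buf_dma[i];	/* overwrite stale word */
		r->desc[i].ctrl = KEEP_DESC_OWN;
		armed++;
	}
	r->cur_rx = 0;
	r->dirty_rx = 0;
	return armed;
}
```

Because nothing is allocated, this path cannot fail under memory
pressure; slots without a buffer simply stay with the driver until a
later refill.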