[REGRESSION] stmmac: Random DMA reset failure on RK3399 since v6.18

Jensen Huang posted 1 patch 1 month, 2 weeks ago
Hi,

I'm reporting a regression on RK3399 (stmmac) observed in v6.18.24.
When a network cable is connected during boot, the DMA reset
occasionally fails with the error message: "Failed to reset the dma".

This appears to be a timing issue related to the EEE RX clock-stop
logic. While investigating with the RTL8211E PHY, I read the PCS
Status 1 register (PS1R; MMD device 3, address 0x01) and observed a
value of 0x0f40. The TX/RX LPI bits (11:8) are set, which indicates
that the PHY is in LPI mode and the RX clock may have already
stopped.
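
For reference, here is a minimal sketch of how 0x0f40 decodes, assuming
the standard IEEE 802.3 Clause 45 layout for register 3.1. The macro and
helper names are illustrative only and are not taken from the kernel
headers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * PCS Status 1 (MMD 3, address 0x01), assuming the IEEE 802.3
 * Clause 45 bit layout. Names below are hypothetical.
 */
#define PS1R_TX_LPI_RECEIVED    (1u << 11)
#define PS1R_RX_LPI_RECEIVED    (1u << 10)
#define PS1R_TX_LPI_INDICATION  (1u << 9)
#define PS1R_RX_LPI_INDICATION  (1u << 8)
#define PS1R_CLOCK_STOP_CAPABLE (1u << 6)

/* All four LPI status bits together (0x0f00). */
#define PS1R_LPI_MASK (PS1R_TX_LPI_RECEIVED | PS1R_RX_LPI_RECEIVED | \
                       PS1R_TX_LPI_INDICATION | PS1R_RX_LPI_INDICATION)

/* True if the PCS reports that it is receiving LPI on the RX path. */
static inline bool ps1r_rx_lpi_indicated(uint16_t val)
{
	return (val & PS1R_RX_LPI_INDICATION) != 0;
}

/* True if the PCS reports that it is receiving LPI on the TX path. */
static inline bool ps1r_tx_lpi_indicated(uint16_t val)
{
	return (val & PS1R_TX_LPI_INDICATION) != 0;
}

/* True if the PHY advertises the clock stop capability bit. */
static inline bool ps1r_clock_stop_capable(uint16_t val)
{
	return (val & PS1R_CLOCK_STOP_CAPABLE) != 0;
}
```

With val = 0x0f40, all three helpers return true: the four LPI bits
(0x0f00) plus the clock stop capable bit (0x0040) account for the whole
value.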

While commit dd557266cf5f ("net: stmmac: block PHY RXC clock-stop")
ensures the clock is running before the DMA reset, my tests suggest
that the phylink_rx_clk_stop_block() call might not provide a
sufficiently stable RX clock in time for the immediate DMA reset that
follows.

Since stmmac already sets mac_requires_rxc = true, I modified
phylink_bringup_phy() to honor this flag. This avoids toggling the
PHY's clk_stop_enable during the initialization sequence, ensuring the
RX clock remains active and stable throughout.

With the change below, I achieved 200/200 successful reboots with the
cable connected (previously ~50% failure rate).

--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -2171,7 +2171,7 @@ static int phylink_bringup_phy(struct phylink *pl, struct phy_device *phy,
     /* Allow the MAC to stop its clock if the PHY has the capability */
     pl->mac_tx_clk_stop = phy_eee_tx_clock_stop_capable(phy) > 0;

-    if (pl->mac_supports_eee_ops) {
+    if (pl->mac_supports_eee_ops && !pl->config->mac_requires_rxc) {
         /* Explicitly configure whether the PHY is allowed to stop it's
          * receive clock.
          */

Any feedback/testing on this would be appreciated.

Best regards,
Jensen Huang