[PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption

Oleksij Rempel posted 4 patches 3 months, 1 week ago
There is a newer version of this series
[PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption
Posted by Oleksij Rempel 3 months, 1 week ago
Disable interrupt handling for the LAN87xx PHY to prevent the network
interface from entering a corrupted state after rapid configuration
changes.

When the link configuration is changed quickly, the PHY can get stuck in
a non-functional state. In this state, 'ethtool' reports that a link is
present, but 'ip link' shows NO-CARRIER, and the interface is unable to
transfer data.

Fixes: 1ce8b37241ed ("usbnet: smsc95xx: Forward PHY interrupts to PHY driver to avoid polling")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Lukas Wunner <lukas@wunner.de>
---
 drivers/net/phy/smsc.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index b6489da5cfcd..dac6bf156d15 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -746,10 +746,6 @@ static struct phy_driver smsc_phy_driver[] = {
 	.soft_reset	= smsc_phy_reset,
 	.config_aneg	= lan87xx_config_aneg,
 
-	/* IRQ related */
-	.config_intr	= smsc_phy_config_intr,
-	.handle_interrupt = smsc_phy_handle_interrupt,
-
 	/* Statistics */
 	.get_sset_count = smsc_get_sset_count,
 	.get_strings	= smsc_get_strings,
-- 
2.39.5
Re: [PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption
Posted by Lukas Wunner 3 months, 1 week ago
On Tue, Jul 01, 2025 at 02:21:46PM +0200, Oleksij Rempel wrote:
> Disable interrupt handling for the LAN87xx PHY to prevent the network
> interface from entering a corrupted state after rapid configuration
> changes.
> 
> When the link configuration is changed quickly, the PHY can get stuck in
> a non-functional state. In this state, 'ethtool' reports that a link is
> present, but 'ip link' shows NO-CARRIER, and the interface is unable to
> transfer data.
[...]
> --- a/drivers/net/phy/smsc.c
> +++ b/drivers/net/phy/smsc.c
> @@ -746,10 +746,6 @@ static struct phy_driver smsc_phy_driver[] = {
>  	.soft_reset	= smsc_phy_reset,
>  	.config_aneg	= lan87xx_config_aneg,
>  
> -	/* IRQ related */
> -	.config_intr	= smsc_phy_config_intr,
> -	.handle_interrupt = smsc_phy_handle_interrupt,
> -

Well, that's not good.  I guess this means that the interrupt is
polled again, so we basically go back to the suboptimal behavior
prior to 1ce8b37241ed?

Without support for interrupt handling, we can't take advantage
of the GPIOs on the chip for interrupt generation.  Nor can we
properly support runtime PM if no cable is attached.

What's the actual root cause?  Is it the issue described in this
paragraph of 1ce8b37241ed's commit message?

    Normally the PHY interrupt should be masked until the PHY driver has
    cleared it.  However masking requires a (sleeping) USB transaction and
    interrupts are received in (non-sleepable) softirq context.  I decided
    not to mask the interrupt at all (by using the dummy_irq_chip's noop
    ->irq_mask() callback):  The USB interrupt endpoint is polled in 1 msec
    intervals and normally that's sufficient to wake the PHY driver's IRQ
    thread and have it clear the interrupt.  If it does take longer, worst
    thing that can happen is the IRQ thread is woken again.  No big deal.

There must be better options than going back to polling.
E.g. inserting delays to avoid the PHY getting wedged.

TBH I did test this thoroughly back in the day and never
witnessed the issue.

Thanks,

Lukas
Re: [PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption
Posted by Paolo Abeni 3 months ago
On 7/1/25 2:58 PM, Lukas Wunner wrote:
> On Tue, Jul 01, 2025 at 02:21:46PM +0200, Oleksij Rempel wrote:
>> Disable interrupt handling for the LAN87xx PHY to prevent the network
>> interface from entering a corrupted state after rapid configuration
>> changes.
>>
>> When the link configuration is changed quickly, the PHY can get stuck in
>> a non-functional state. In this state, 'ethtool' reports that a link is
>> present, but 'ip link' shows NO-CARRIER, and the interface is unable to
>> transfer data.
> [...]
>> --- a/drivers/net/phy/smsc.c
>> +++ b/drivers/net/phy/smsc.c
>> @@ -746,10 +746,6 @@ static struct phy_driver smsc_phy_driver[] = {
>>  	.soft_reset	= smsc_phy_reset,
>>  	.config_aneg	= lan87xx_config_aneg,
>>  
>> -	/* IRQ related */
>> -	.config_intr	= smsc_phy_config_intr,
>> -	.handle_interrupt = smsc_phy_handle_interrupt,
>> -
> 
> Well, that's not good.  I guess this means that the interrupt is
> polled again, so we basically go back to the suboptimal behavior
> prior to 1ce8b37241ed?
> 
> Without support for interrupt handling, we can't take advantage
> of the GPIOs on the chip for interrupt generation.  Nor can we
> properly support runtime PM if no cable is attached.
> 
> What's the actual root cause?  Is it the issue described in this
> paragraph of 1ce8b37241ed's commit message?
> 
>     Normally the PHY interrupt should be masked until the PHY driver has
>     cleared it.  However masking requires a (sleeping) USB transaction and
>     interrupts are received in (non-sleepable) softirq context.  I decided
>     not to mask the interrupt at all (by using the dummy_irq_chip's noop
>     ->irq_mask() callback):  The USB interrupt endpoint is polled in 1 msec
>     intervals and normally that's sufficient to wake the PHY driver's IRQ
>     thread and have it clear the interrupt.  If it does take longer, worst
>     thing that can happen is the IRQ thread is woken again.  No big deal.
> 
> There must be better options than going back to polling.
> E.g. inserting delays to avoid the PHY getting wedged.

I agree the solution proposed by this patch looks too rough. I think
more effort should be invested to at least understand why the phy got stuck.

@Oleksij: possibly you could re-submit patch 1-3 only while you keep
investigating the issue addressed here.

/P
Re: [PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption
Posted by Oleksij Rempel 3 months, 1 week ago
Hi Lukas,

On Tue, Jul 01, 2025 at 02:58:19PM +0200, Lukas Wunner wrote:
> On Tue, Jul 01, 2025 at 02:21:46PM +0200, Oleksij Rempel wrote:
> > Disable interrupt handling for the LAN87xx PHY to prevent the network
> > interface from entering a corrupted state after rapid configuration
> > changes.
> > 
> > When the link configuration is changed quickly, the PHY can get stuck in
> > a non-functional state. In this state, 'ethtool' reports that a link is
> > present, but 'ip link' shows NO-CARRIER, and the interface is unable to
> > transfer data.
> [...]
> > --- a/drivers/net/phy/smsc.c
> > +++ b/drivers/net/phy/smsc.c
> > @@ -746,10 +746,6 @@ static struct phy_driver smsc_phy_driver[] = {
> >  	.soft_reset	= smsc_phy_reset,
> >  	.config_aneg	= lan87xx_config_aneg,
> >  
> > -	/* IRQ related */
> > -	.config_intr	= smsc_phy_config_intr,
> > -	.handle_interrupt = smsc_phy_handle_interrupt,
> > -
> 
> Well, that's not good.  I guess this means that the interrupt is
> polled again, so we basically go back to the suboptimal behavior
> prior to 1ce8b37241ed?

Not fully. It will disable interrupt support only for the embedded PHY,
other types of interrupts should work as expected.

> Without support for interrupt handling, we can't take advantage
> of the GPIOs on the chip for interrupt generation.  Nor can we
> properly support runtime PM if no cable is attached.

Hm... the PHY smsc driver is not using EDPD mode by default if PHY
interrupts are enabled. Or do you mean other kind of PM?

> What's the actual root cause?  Is it the issue described in this
> paragraph of 1ce8b37241ed's commit message?
> 
>     Normally the PHY interrupt should be masked until the PHY driver has
>     cleared it.  However masking requires a (sleeping) USB transaction and
>     interrupts are received in (non-sleepable) softirq context.  I decided
>     not to mask the interrupt at all (by using the dummy_irq_chip's noop
>     ->irq_mask() callback):  The USB interrupt endpoint is polled in 1 msec
>     intervals and normally that's sufficient to wake the PHY driver's IRQ
>     thread and have it clear the interrupt.  If it does take longer, worst
>     thing that can happen is the IRQ thread is woken again.  No big deal.

I'm not sure. It seems to be not the problem.

> There must be better options than going back to polling.
> E.g. inserting delays to avoid the PHY getting wedged.
> 
> TBH I did test this thoroughly back in the day and never
> witnessed the issue.

I did some testing back in time too. It worked and still works normally
in the autoneg mode.

What is not working as expected is the fixed mode, especially 10 mbit
fixed mode.

Here are my current testing results:

# configure 10 mbit forced mode:
ethtool -s eth0 autoneg off speed 10 duplex half
# attach cable (can be done wothout reataching cable)
[10174.585150] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10174.586760] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10174.594636] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[10174.602777] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[10174.841458] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10174.843017] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10174.850619] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[10174.857026] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[10175.425513] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10175.427046] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[10175.434871] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[10175.441332] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off

At this point no more interrupts will come and link up state will not be
detected. Replugging cable will have same result.

The worst part - unplugging the cable may trigger an endless interrupt storm
(which is some times reproducible in the 10Mbit forced mode):
[ 1584.132799] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.134220] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.389134] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.390591] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.644757] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.646177] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.900781] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1584.902305] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 1585.158416] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098

With latest kernel we can use adaptive polling, wich I added now for
testing. Here are the results:

[ 2200.702427] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2200.948552] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2200.949640] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2200.950182] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2200.951374] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2200.953186] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2200.959234] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2201.204270] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.205284] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.207139] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.208825] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.216406] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2201.460548] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.461618] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.462181] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.463273] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.464764] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.471066] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2201.716547] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.717607] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.718235] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.719267] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.721035] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.727488] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2201.972542] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.973614] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.974176] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2201.975321] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.977078] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2201.983500] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2202.228538] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2202.229615] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2202.230174] smsc95xx 1-1.1:1.0 enu1u1: intdata: 0x00008000
[ 2202.231292] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2202.233038] smsc_phy_handle_interrupt: MII_LAN83C185_ISF = 0x0098
[ 2202.238972] lan87xx_read_status: link: no, speed: 10, duplex: half, autoneg: off
[ 2202.239018] smsc_phy_get_next_update: next update in 250 jiffies
[ 2203.258566] lan87xx_read_status: link: yes, speed: 10, duplex: half, autoneg: off
[ 2203.258756] smsc95xx 1-1.1:1.0 enu1u1: Link is Up - 10Mbps/Half - flow control off

With adaptive polling we can use both. Since IRQ down interrupt works as
expected, we can use low frequency polling (one per 30 seconds) in link up
state. On link down state, after last interrupt poll one time per second
for 30 seconds, then switch to low frequency polling (one per 30
seconds).

I need to figure out haw to handle an interrupt storm.

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
Re: [PATCH net v1 4/4] net: phy: smsc: Disable IRQ support to prevent link state corruption
Posted by Lukas Wunner 3 months ago
On Wed, Jul 02, 2025 at 11:46:05AM +0200, Oleksij Rempel wrote:
> On Tue, Jul 01, 2025 at 02:58:19PM +0200, Lukas Wunner wrote:
> > Without support for interrupt handling, we can't take advantage
> > of the GPIOs on the chip for interrupt generation.  Nor can we
> > properly support runtime PM if no cable is attached.
> 
> Hm... the PHY smsc driver is not using EDPD mode by default if PHY
> interrupts are enabled. Or do you mean other kind of PM?

See commit 2642cc6c3bbe ("net: phy: smsc: Disable Energy Detect
Power-Down in interrupt mode"):

The LAN9514 used on older RasPi boards only supports SUSPEND0,
SUSPEND1, SUSPEND2.  Other incarnations of the LAN95xx family
such as LAN9500A additionally support SUSPEND3.  The smsc95xx.c
driver enables suspend only on SUSPEND3-capable chips.

I was planning to add suspend support for LAN9514 but never got
around to finish it.  Enabling SUSPEND1 if no cable is attached
saves quite a bit of power, so RasPis without a network connection
consume less.

Back in the day I did some tests and noticed that while EDPD
wasn't working at all outside of SUSPEND modes, it did work
at least sometimes to wake up from SUSPEND1.  The driver fiddles
with EDPD settings upon entering suspend.  I concluded that more
testing is necessary to enable EDPD before SUSPEND1 and that's
when I had to put this project aside. :(

Thanks,

Lukas