[PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage

Kory Maincent posted 1 patch 11 months ago
drivers/net/phy/phy_device.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
[PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
The phy_detach function can be called with or without the rtnl lock held.
When the rtnl lock is not held, using rtnl_dereference() triggers a
warning due to the lack of lock context.

Add an rcu_read_lock() to ensure the lock is acquired and to maintain
synchronization.

The path reported to not having RTNL lock acquired is the suspend path of
the ravb MAC driver. Without this fix we got this warning:

[   39.032969] =============================
[   39.032983] WARNING: suspicious RCU usage
[   39.033019] -----------------------------
[   39.033033] drivers/net/phy/phy_device.c:2004 suspicious
rcu_dereference_protected() usage!
...
[   39.033597] stack backtrace:
[   39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted
6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7
[   39.033623] Hardware name: Renesas SMARC EVK version 2 based on
r9a08g045s33 (DT)
[   39.033628] Call trace:
[   39.033633]  show_stack+0x14/0x1c (C)
[   39.033652]  dump_stack_lvl+0xb4/0xc4
[   39.033664]  dump_stack+0x14/0x1c
[   39.033671]  lockdep_rcu_suspicious+0x16c/0x22c
[   39.033682]  phy_detach+0x160/0x190
[   39.033694]  phy_disconnect+0x40/0x54
[   39.033703]  ravb_close+0x6c/0x1cc
[   39.033714]  ravb_suspend+0x48/0x120
[   39.033721]  dpm_run_callback+0x4c/0x14c
[   39.033731]  device_suspend+0x11c/0x4dc
[   39.033740]  dpm_suspend+0xdc/0x214
[   39.033748]  dpm_suspend_start+0x48/0x60
[   39.033758]  suspend_devices_and_enter+0x124/0x574
[   39.033769]  pm_suspend+0x1ac/0x274
[   39.033778]  state_store+0x88/0x124
[   39.033788]  kobj_attr_store+0x14/0x24
[   39.033798]  sysfs_kf_write+0x48/0x6c
[   39.033808]  kernfs_fop_write_iter+0x118/0x1a8
[   39.033817]  vfs_write+0x27c/0x378
[   39.033825]  ksys_write+0x64/0xf4
[   39.033833]  __arm64_sys_write+0x18/0x20
[   39.033841]  invoke_syscall+0x44/0x104
[   39.033852]  el0_svc_common.constprop.0+0xb4/0xd4
[   39.033862]  do_el0_svc+0x18/0x20
[   39.033870]  el0_svc+0x3c/0xf0
[   39.033880]  el0t_64_sync_handler+0xc0/0xc4
[   39.033888]  el0t_64_sync+0x154/0x158
[   39.041274] ravb 11c30000.ethernet eth0: Link is Down

Tested-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reported-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Closes: https://lore.kernel.org/netdev/4c6419d8-c06b-495c-b987-d66c2e1ff848@tuxon.dev/
Fixes: 35f7cad1743e ("net: Add the possibility to support a selected hwtstamp in netdevice")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---

Changes in v3:
- Update the commit message with the stack trace.

Changes in v2:
- Add a missing ;
---
 drivers/net/phy/phy_device.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 5b34d39d1d52..3eeee7cba923 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2001,12 +2001,14 @@ void phy_detach(struct phy_device *phydev)
 	if (dev) {
 		struct hwtstamp_provider *hwprov;
 
-		hwprov = rtnl_dereference(dev->hwprov);
+		rcu_read_lock();
+		hwprov = rcu_dereference(dev->hwprov);
 		/* Disable timestamp if it is the one selected */
 		if (hwprov && hwprov->phydev == phydev) {
 			rcu_assign_pointer(dev->hwprov, NULL);
 			kfree_rcu(hwprov, rcu_head);
 		}
+		rcu_read_unlock();
 
 		phydev->attached_dev->phydev = NULL;
 		phydev->attached_dev = NULL;
-- 
2.34.1
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Jakub Kicinski 11 months ago
On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:
> The path reported to not having RTNL lock acquired is the suspend path of
> the ravb MAC driver. Without this fix we got this warning:

I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
from the .suspend callback. We need _some_ write protection here,
the patch as is only silences a legitimate warning.
-- 
pw-bot: cr
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
On Mon, 20 Jan 2025 11:12:28 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:
> > The path reported to not having RTNL lock acquired is the suspend path of
> > the ravb MAC driver. Without this fix we got this warning:  
> 
> I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
> from the .suspend callback. We need _some_ write protection here,
> the patch as is only silences a legitimate warning.

Indeed if the suspend path is buggy we should fix it. Still there is lots of
ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an
error or in the remove path. What should we do about it?

About ravb suspend, I don't have the board, Claudiu could you try this instead
of the current fix:

diff --git a/drivers/net/ethernet/renesas/ravb_main.c
b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371
100644 --- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev)
        if (!netif_running(ndev))
                goto reset_assert;
 
+       rtnl_lock();
        netif_device_detach(ndev);
 
-       if (priv->wol_enabled)
-               return ravb_wol_setup(ndev);
+       if (priv->wol_enabled) {
+               ret = ravb_wol_setup(ndev);
+               rtnl_unlock();
+               return ret;
+       }
 
        ret = ravb_close(ndev);
-       if (ret)
+       if (ret) {
+               rtnl_unlock();
                return ret;
+       }
 
+       rtnl_unlock();
        ret = pm_runtime_force_suspend(&priv->pdev->dev);
        if (ret)
                return ret;

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Paul Barker 11 months ago
On 21/01/2025 09:38, Kory Maincent wrote:
> On Mon, 20 Jan 2025 11:12:28 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
>> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:
>>> The path reported to not having RTNL lock acquired is the suspend path of
>>> the ravb MAC driver. Without this fix we got this warning:  
>>
>> I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
>> from the .suspend callback. We need _some_ write protection here,
>> the patch as is only silences a legitimate warning.
> 
> Indeed if the suspend path is buggy we should fix it. Still there is lots of
> ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an
> error or in the remove path. What should we do about it?
> 
> About ravb suspend, I don't have the board, Claudiu could you try this instead
> of the current fix:
> 
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c
> b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371
> 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev)
>         if (!netif_running(ndev))
>                 goto reset_assert;
>  
> +       rtnl_lock();
>         netif_device_detach(ndev);
>  
> -       if (priv->wol_enabled)
> -               return ravb_wol_setup(ndev);
> +       if (priv->wol_enabled) {
> +               ret = ravb_wol_setup(ndev);
> +               rtnl_unlock();
> +               return ret;
> +       }
>  
>         ret = ravb_close(ndev);
> -       if (ret)
> +       if (ret) {
> +               rtnl_unlock();
>                 return ret;
> +       }
>  
> +       rtnl_unlock();
>         ret = pm_runtime_force_suspend(&priv->pdev->dev);
>         if (ret)
>                 return ret;
> 
> Regards,

(Cc'ing Niklas and Sergey as this relates to the ravb driver)

Why do we need to hold the rtnl mutex across the calls to
netif_device_detach() and ravb_wol_setup()?

My reading of Documentation/networking/netdevices.rst is that the rtnl
mutex is held when the net subsystem calls the driver's ndo_stop method,
which in our case is ravb_close(). So, we should take the rtnl mutex
when we call ravb_close() directly, in both ravb_suspend() and
ravb_wol_restore(). That would ensure that we do not call
phy_disconnect() without holding the rtnl mutex and should fix this
issue.

Commit 35f7cad1743e ("net: Add the possibility to support a selected
hwtstamp in netdevice") may have unearthed the issue, but the fixes tag
should point to the commits adding those unlocked calls to ravb_close().

I am not super familiar with the rtnl lock so let me know if I've missed
something.

Thanks,

-- 
Paul Barker
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
On Tue, 21 Jan 2025 11:34:48 +0000
Paul Barker <paul.barker.ct@bp.renesas.com> wrote:

> On 21/01/2025 09:38, Kory Maincent wrote:
> > On Mon, 20 Jan 2025 11:12:28 -0800
> > Jakub Kicinski <kuba@kernel.org> wrote:
> >   
> >> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:  
>  [...]  
> >>
> >> I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
> >> from the .suspend callback. We need _some_ write protection here,
> >> the patch as is only silences a legitimate warning.  
> > 
> > Indeed if the suspend path is buggy we should fix it. Still there is lots of
> > ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return
> > an error or in the remove path. What should we do about it?
> > 
> > About ravb suspend, I don't have the board, Claudiu could you try this
> > instead of the current fix:
> > 
> > diff --git a/drivers/net/ethernet/renesas/ravb_main.c
> > b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371
> > 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c
> > +++ b/drivers/net/ethernet/renesas/ravb_main.c
> > @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev)
> >         if (!netif_running(ndev))
> >                 goto reset_assert;
> >  
> > +       rtnl_lock();
> >         netif_device_detach(ndev);
> >  
> > -       if (priv->wol_enabled)
> > -               return ravb_wol_setup(ndev);
> > +       if (priv->wol_enabled) {
> > +               ret = ravb_wol_setup(ndev);
> > +               rtnl_unlock();
> > +               return ret;
> > +       }
> >  
> >         ret = ravb_close(ndev);
> > -       if (ret)
> > +       if (ret) {
> > +               rtnl_unlock();
> >                 return ret;
> > +       }
> >  
> > +       rtnl_unlock();
> >         ret = pm_runtime_force_suspend(&priv->pdev->dev);
> >         if (ret)
> >                 return ret;
> > 
> > Regards,  
> 
> (Cc'ing Niklas and Sergey as this relates to the ravb driver)

Yes, thanks.

> Why do we need to hold the rtnl mutex across the calls to
> netif_device_detach() and ravb_wol_setup()?
> 
> My reading of Documentation/networking/netdevices.rst is that the rtnl
> mutex is held when the net subsystem calls the driver's ndo_stop method,
> which in our case is ravb_close(). So, we should take the rtnl mutex
> when we call ravb_close() directly, in both ravb_suspend() and
> ravb_wol_restore(). That would ensure that we do not call
> phy_disconnect() without holding the rtnl mutex and should fix this
> issue.

Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() won't
be protected by the rtnl lock.
I don't know about netif_device_detach(). It doesn't seems to be the case as
there is lots of driver using it without holding rtnl lock.

Indeed we should add the rtnl lock also in the resume path. 

> Commit 35f7cad1743e ("net: Add the possibility to support a selected
> hwtstamp in netdevice") may have unearthed the issue, but the fixes tag
> should point to the commits adding those unlocked calls to ravb_close().

The current patch was on phy_device.c that's why the fixes tag does not point to
a ravb commit, it will change.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Paul Barker 11 months ago
On 21/01/2025 13:01, Kory Maincent wrote:
> On Tue, 21 Jan 2025 11:34:48 +0000
> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> 
>> On 21/01/2025 09:38, Kory Maincent wrote:
>>> On Mon, 20 Jan 2025 11:12:28 -0800
>>> Jakub Kicinski <kuba@kernel.org> wrote:
>>>   
>>>> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:  
>>  [...]  
>>>>
>>>> I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
>>>> from the .suspend callback. We need _some_ write protection here,
>>>> the patch as is only silences a legitimate warning.  
>>>
>>> Indeed if the suspend path is buggy we should fix it. Still there is lots of
>>> ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return
>>> an error or in the remove path. What should we do about it?
>>>
>>> About ravb suspend, I don't have the board, Claudiu could you try this
>>> instead of the current fix:
>>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c
>>> b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371
>>> 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev)
>>>         if (!netif_running(ndev))
>>>                 goto reset_assert;
>>>  
>>> +       rtnl_lock();
>>>         netif_device_detach(ndev);
>>>  
>>> -       if (priv->wol_enabled)
>>> -               return ravb_wol_setup(ndev);
>>> +       if (priv->wol_enabled) {
>>> +               ret = ravb_wol_setup(ndev);
>>> +               rtnl_unlock();
>>> +               return ret;
>>> +       }
>>>  
>>>         ret = ravb_close(ndev);
>>> -       if (ret)
>>> +       if (ret) {
>>> +               rtnl_unlock();
>>>                 return ret;
>>> +       }
>>>  
>>> +       rtnl_unlock();
>>>         ret = pm_runtime_force_suspend(&priv->pdev->dev);
>>>         if (ret)
>>>                 return ret;
>>>
>>> Regards,  
>>
>> (Cc'ing Niklas and Sergey as this relates to the ravb driver)
> 
> Yes, thanks.
> 
>> Why do we need to hold the rtnl mutex across the calls to
>> netif_device_detach() and ravb_wol_setup()?
>>
>> My reading of Documentation/networking/netdevices.rst is that the rtnl
>> mutex is held when the net subsystem calls the driver's ndo_stop method,
>> which in our case is ravb_close(). So, we should take the rtnl mutex
>> when we call ravb_close() directly, in both ravb_suspend() and
>> ravb_wol_restore(). That would ensure that we do not call
>> phy_disconnect() without holding the rtnl mutex and should fix this
>> issue.
> 
> Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup() won't
> be protected by the rtnl lock.

ravb_ptp_stop() modifies a couple of device registers and calls
ptp_clock_unregister(). I don't see anything to suggest that this
requires the rtnl lock to be held, unless I am missing something.

> I don't know about netif_device_detach(). It doesn't seems to be the case as
> there is lots of driver using it without holding rtnl lock.

netif_device_detach() clears the present flag from the link state and
stops all TX queues, so I don't think the rtnl lock needs to be held to
call this.

> 
> Indeed we should add the rtnl lock also in the resume path. 
> 
>> Commit 35f7cad1743e ("net: Add the possibility to support a selected
>> hwtstamp in netdevice") may have unearthed the issue, but the fixes tag
>> should point to the commits adding those unlocked calls to ravb_close().
> 
> The current patch was on phy_device.c that's why the fixes tag does not point to
> a ravb commit, it will change.

Ack. Thanks for digging in to this!

-- 
Paul Barker
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
On Tue, 21 Jan 2025 15:44:34 +0000
Paul Barker <paul.barker.ct@bp.renesas.com> wrote:

> On 21/01/2025 13:01, Kory Maincent wrote:
> > On Tue, 21 Jan 2025 11:34:48 +0000
> > Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> >   
> >> On 21/01/2025 09:38, Kory Maincent wrote:  
>  [...]  
>  [...]  
> >>  [...]    
>  [...]  
>  [...]  
> >>
> >> (Cc'ing Niklas and Sergey as this relates to the ravb driver)  
> > 
> > Yes, thanks.
> >   
> >> Why do we need to hold the rtnl mutex across the calls to
> >> netif_device_detach() and ravb_wol_setup()?
> >>
> >> My reading of Documentation/networking/netdevices.rst is that the rtnl
> >> mutex is held when the net subsystem calls the driver's ndo_stop method,
> >> which in our case is ravb_close(). So, we should take the rtnl mutex
> >> when we call ravb_close() directly, in both ravb_suspend() and
> >> ravb_wol_restore(). That would ensure that we do not call
> >> phy_disconnect() without holding the rtnl mutex and should fix this
> >> issue.  
> > 
> > Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup()
> > won't be protected by the rtnl lock.  
> 
> ravb_ptp_stop() modifies a couple of device registers and calls
> ptp_clock_unregister(). I don't see anything to suggest that this
> requires the rtnl lock to be held, unless I am missing something.

What happens if two ptp_clock_unregister() with the same ptp_clock pointer are 
called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It
may cause some errors.
For example the ptp->kworker pointer could be used after a kfree.
https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Russell King (Oracle) 11 months ago
On Tue, Jan 21, 2025 at 05:11:56PM +0100, Kory Maincent wrote:
> On Tue, 21 Jan 2025 15:44:34 +0000
> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> 
> > On 21/01/2025 13:01, Kory Maincent wrote:
> > > On Tue, 21 Jan 2025 11:34:48 +0000
> > > Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> > >   
> > >> On 21/01/2025 09:38, Kory Maincent wrote:  
> >  [...]  
> >  [...]  
> > >>  [...]    
> >  [...]  
> >  [...]  
> > >>
> > >> (Cc'ing Niklas and Sergey as this relates to the ravb driver)  
> > > 
> > > Yes, thanks.
> > >   
> > >> Why do we need to hold the rtnl mutex across the calls to
> > >> netif_device_detach() and ravb_wol_setup()?
> > >>
> > >> My reading of Documentation/networking/netdevices.rst is that the rtnl
> > >> mutex is held when the net subsystem calls the driver's ndo_stop method,
> > >> which in our case is ravb_close(). So, we should take the rtnl mutex
> > >> when we call ravb_close() directly, in both ravb_suspend() and
> > >> ravb_wol_restore(). That would ensure that we do not call
> > >> phy_disconnect() without holding the rtnl mutex and should fix this
> > >> issue.  
> > > 
> > > Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup()
> > > won't be protected by the rtnl lock.  
> > 
> > ravb_ptp_stop() modifies a couple of device registers and calls
> > ptp_clock_unregister(). I don't see anything to suggest that this
> > requires the rtnl lock to be held, unless I am missing something.
> 
> What happens if two ptp_clock_unregister() with the same ptp_clock pointer are 
> called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It
> may cause some errors.
> For example the ptp->kworker pointer could be used after a kfree.
> https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416

Taking a look at where ravb_ptp_stop() is called from:

1. ravb_set_ringparam(). ethtool operation. RTNL will be held for this.
2. ravb_open() error-cleanup. RTNL will be held for this.
3. ravb_tx_timeout_work(). rtnl_trylock() is called and we will only
   call through to the above function if we grabbed the RTNL.
4. ravb_close(), again RTNL will be held here.
5. ravb_wol_setup(). Another ethtool operation. (1) applies.

Hence, it is not possible for two threads to execute ravb_ptp_stop()
symultaneously. However, if ptp_clock_register() in ravb_ptp_init()
fails, then priv->ptp.clock will be set to an error-pointer, and
subsequently passed to ptp_clock_unregister() which would cause a
kernel oops. No one seems to have thought about that... and that
definitely needs fixing.

However, one wonders why it's necessary to unregister a _user_
_interface_ when responding to a change in WoL, ring parameters, or
merely handling a transmit timeout. It doesn't seem particularly
nice to userspace for a device that its using to suddenly go away
for these reasons. I wonder whether anyone has tested anything
that uses the PTP clock interfaces while changing e.g. the WoL
settings.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Claudiu Beznea 11 months ago
Hi, Kory,

On 21.01.2025 18:11, Kory Maincent wrote:
> On Tue, 21 Jan 2025 15:44:34 +0000
> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> 
>> On 21/01/2025 13:01, Kory Maincent wrote:
>>> On Tue, 21 Jan 2025 11:34:48 +0000
>>> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
>>>   
>>>> On 21/01/2025 09:38, Kory Maincent wrote:  
>>  [...]  
>>  [...]  
>>>>  [...]    
>>  [...]  
>>  [...]  
>>>>
>>>> (Cc'ing Niklas and Sergey as this relates to the ravb driver)  
>>>
>>> Yes, thanks.
>>>   
>>>> Why do we need to hold the rtnl mutex across the calls to
>>>> netif_device_detach() and ravb_wol_setup()?
>>>>
>>>> My reading of Documentation/networking/netdevices.rst is that the rtnl
>>>> mutex is held when the net subsystem calls the driver's ndo_stop method,
>>>> which in our case is ravb_close(). So, we should take the rtnl mutex
>>>> when we call ravb_close() directly, in both ravb_suspend() and
>>>> ravb_wol_restore(). That would ensure that we do not call
>>>> phy_disconnect() without holding the rtnl mutex and should fix this
>>>> issue.  
>>>
>>> Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup()
>>> won't be protected by the rtnl lock.  
>>
>> ravb_ptp_stop() modifies a couple of device registers and calls
>> ptp_clock_unregister(). I don't see anything to suggest that this
>> requires the rtnl lock to be held, unless I am missing something.
> 
> What happens if two ptp_clock_unregister() with the same ptp_clock pointer are 
> called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It
> may cause some errors.

Can this happen? I see ethtool_ops::set_ringparam() references only in
ethtool or ioctl files:

net/ethtool/ioctl.c:2066
net/ethtool/ioctl.c:2081
net/ethtool/rings.c:212
net/ethtool/rings.c:304

At the time the suspend/resume APIs are called the user space threads are
frozen.

Thank you,
Claudiu

> For example the ptp->kworker pointer could be used after a kfree.
> https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416
> 
> Regards,
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
On Thu, 23 Jan 2025 13:25:57 +0200
Claudiu Beznea <claudiu.beznea@tuxon.dev> wrote:

> >> ravb_ptp_stop() modifies a couple of device registers and calls
> >> ptp_clock_unregister(). I don't see anything to suggest that this
> >> requires the rtnl lock to be held, unless I am missing something.  
> > 
> > What happens if two ptp_clock_unregister() with the same ptp_clock pointer
> > are called simultaneously? From ravb_suspend and ravb_set_ringparam for
> > example. It may cause some errors.  
> 
> Can this happen? I see ethtool_ops::set_ringparam() references only in
> ethtool or ioctl files:
> 
> net/ethtool/ioctl.c:2066
> net/ethtool/ioctl.c:2081
> net/ethtool/rings.c:212
> net/ethtool/rings.c:304
> 
> At the time the suspend/resume APIs are called the user space threads are
> frozen.

Maybe, I don't know the suspend path, and what the state of user space threads
at that time. This was an example but Wake on Lan setup could also have some
issue. IMHO I think it is more precautions to have it under rtnl lock.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Paul Barker 11 months ago
On 21/01/2025 16:11, Kory Maincent wrote:
> On Tue, 21 Jan 2025 15:44:34 +0000
> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> 
>> On 21/01/2025 13:01, Kory Maincent wrote:
>> 
>>> On Tue, 21 Jan 2025 11:34:48 +0000
>>> Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
>>> 
>>>> Why do we need to hold the rtnl mutex across the calls to
>>>> netif_device_detach() and ravb_wol_setup()?
>>>>
>>>> My reading of Documentation/networking/netdevices.rst is that the rtnl
>>>> mutex is held when the net subsystem calls the driver's ndo_stop method,
>>>> which in our case is ravb_close(). So, we should take the rtnl mutex
>>>> when we call ravb_close() directly, in both ravb_suspend() and
>>>> ravb_wol_restore(). That would ensure that we do not call
>>>> phy_disconnect() without holding the rtnl mutex and should fix this
>>>> issue.  
>>>
>>> Not sure about it. For example ravb_ptp_stop() called in ravb_wol_setup()
>>> won't be protected by the rtnl lock.  
>>
>> ravb_ptp_stop() modifies a couple of device registers and calls
>> ptp_clock_unregister(). I don't see anything to suggest that this
>> requires the rtnl lock to be held, unless I am missing something.
> 
> What happens if two ptp_clock_unregister() with the same ptp_clock pointer are 
> called simultaneously? From ravb_suspend and ravb_set_ringparam for example. It
> may cause some errors.
> For example the ptp->kworker pointer could be used after a kfree.
> https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416

I've dug into this some more today and looked at the suspend/resume
paths of other Ethernet drivers for comparison.

netif_device_detach() and netif_device_attach() seem to be safe to call
without holding the rtnl lock, e.g. the stmmac driver does this.

In the suspend path, we should hold the rtnl lock across the calls to
ravb_wol_setup() and ravb_close().

In the resume path, we should hold the rtnl lock across the calls to
ravb_wol_restore() and ravb_open(). I don't think there is any harm in
holding the rtnl lock while we call pm_runtime_force_resume(), so we can
take the lock before checking priv->wol_enabled and release after
calling ravb_open().

How does that sound?

Thanks,

-- 
Paul Barker
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Kory Maincent 11 months ago
On Wed, 22 Jan 2025 14:03:37 +0000
Paul Barker <paul.barker.ct@bp.renesas.com> wrote:

> On 21/01/2025 16:11, Kory Maincent wrote:
> > On Tue, 21 Jan 2025 15:44:34 +0000
> > Paul Barker <paul.barker.ct@bp.renesas.com> wrote:
> >   
> >> On 21/01/2025 13:01, Kory Maincent wrote:
> >>   
>  [...]  
>  [...]  
>  [...]  
> >>
> >> ravb_ptp_stop() modifies a couple of device registers and calls
> >> ptp_clock_unregister(). I don't see anything to suggest that this
> >> requires the rtnl lock to be held, unless I am missing something.  
> > 
> > What happens if two ptp_clock_unregister() with the same ptp_clock pointer
> > are called simultaneously? From ravb_suspend and ravb_set_ringparam for
> > example. It may cause some errors.
> > For example the ptp->kworker pointer could be used after a kfree.
> > https://elixir.bootlin.com/linux/v6.12.6/source/drivers/ptp/ptp_clock.c#L416
> >  
> 
> I've dug into this some more today and looked at the suspend/resume
> paths of other Ethernet drivers for comparison.
> 
> netif_device_detach() and netif_device_attach() seem to be safe to call
> without holding the rtnl lock, e.g. the stmmac driver does this.
> 
> In the suspend path, we should hold the rtnl lock across the calls to
> ravb_wol_setup() and ravb_close().
> 
> In the resume path, we should hold the rtnl lock across the calls to
> ravb_wol_restore() and ravb_open(). I don't think there is any harm in
> holding the rtnl lock while we call pm_runtime_force_resume(), so we can
> take the lock before checking priv->wol_enabled and release after
> calling ravb_open().
> 
> How does that sound?

Yes that sound nice, and similar to what I thought.
I will send a patch, I also found the same issue on sh_eth driver.

Regards
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Claudiu Beznea 11 months ago
Hi, Kory,

On 21.01.2025 11:38, Kory Maincent wrote:
> On Mon, 20 Jan 2025 11:12:28 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
>> On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:
>>> The path reported to not having RTNL lock acquired is the suspend path of
>>> the ravb MAC driver. Without this fix we got this warning:  
>>
>> I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
>> from the .suspend callback. We need _some_ write protection here,
>> the patch as is only silences a legitimate warning.
> 
> Indeed if the suspend path is buggy we should fix it. Still there is lots of
> ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an
> error or in the remove path. What should we do about it?
> 
> About ravb suspend, I don't have the board, Claudiu could you try this instead
> of the current fix:
> 
> diff --git a/drivers/net/ethernet/renesas/ravb_main.c
> b/drivers/net/ethernet/renesas/ravb_main.c index bc395294a32d..c9a0d2d6f371
> 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c
> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> @@ -3215,15 +3215,22 @@ static int ravb_suspend(struct device *dev)
>         if (!netif_running(ndev))
>                 goto reset_assert;
>  
> +       rtnl_lock();
>         netif_device_detach(ndev);
>  
> -       if (priv->wol_enabled)
> -               return ravb_wol_setup(ndev);
> +       if (priv->wol_enabled) {
> +               ret = ravb_wol_setup(ndev);
> +               rtnl_unlock();
> +               return ret;
> +       }
>  
>         ret = ravb_close(ndev);
> -       if (ret)
> +       if (ret) {
> +               rtnl_unlock();
>                 return ret;
> +       }
>  
> +       rtnl_unlock();
>         ret = pm_runtime_force_suspend(&priv->pdev->dev);
>         if (ret)
>                 return ret;

The warning is gone with this, as well.

Thank you,
Claudiu


> 
> Regards,
Re: [PATCH net-next v3] net: phy: Fix suspicious rcu_dereference usage
Posted by Russell King (Oracle) 11 months ago
On Tue, Jan 21, 2025 at 10:38:45AM +0100, Kory Maincent wrote:
> On Mon, 20 Jan 2025 11:12:28 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > On Mon, 20 Jan 2025 15:19:25 +0100 Kory Maincent wrote:
> > > The path reported to not having RTNL lock acquired is the suspend path of
> > > the ravb MAC driver. Without this fix we got this warning:  
> > 
> > I maintain that ravb is buggy, plenty of drivers take rtnl_lock 
> > from the .suspend callback. We need _some_ write protection here,
> > the patch as is only silences a legitimate warning.
> 
> Indeed if the suspend path is buggy we should fix it. Still there is lots of
> ethernet drivers calling phy_disconnect without rtnl (IIUC) if probe return an
> error or in the remove path. What should we do about it?

They could trigger the same warning, although I think they would be
relatively safe because register_netdev() hasn't been called, and thus
nothing that the netdev provides should be used. (If it can be used, as
the driver has not completed initialisation, then it's probably racy
anyway.)

I don't think throwing ASSERT_RTNL() into phy_detach() will do anything
to solve this. If the RCU warning doesn't trigger (because phy_detach()
only gets called on error which practically never happens) then
ASSERT_RTNL() isn't going to trigger either. Warnings in functions will
only work when they're called in a context that will trigger the
warning!

So, I think it's something that can only be addressed by reviewing
drivers and patching.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!