[PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe

Corey Leavitt via B4 Relay posted 4 patches 1 month, 3 weeks ago
[PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Corey Leavitt via B4 Relay 1 month, 3 weeks ago
On systems where a PSE controller driver loads as a module and a
device-tree PHY node carries a `pses = <&pse_pi>` reference,
fwnode_mdiobus_register_phy() tries to resolve the PSE handle before
the controller driver has probed. of_pse_control_get() returns
-EPROBE_DEFER, the enclosing MDIO/DSA probe fails, and driver-core
re-queues the work. The retry loop spins until the PSE driver module
loads and its controller registers.

Commit fa2f0454174c ("net: pse-pd: Introduce attached_phydev to pse
control") made each retry expensive. It reordered
fwnode_mdiobus_register_phy() so the PHY is registered before the
PSE lookup. Every deferral now performs a full
phy_device_register() / phy_device_remove() cycle. On a board with a
sufficiently tight watchdog the retry loop can starve the watchdog
kthread. On the reporting hardware (MT7621 + gpio-wdt, 1-second
margin) the retry loop converts a slow probe phase into a reset
before userspace loads.

The affected population today looks small. OpenWrt, where PSE
actually ships, is still on 6.12 (pre-regression), and most
environments with CONFIG_PSE_*=m do not have boards whose DT
references a PSE controller from a PHY. Still, the mechanism is
general. Any modular PSE driver combined with the documented
`pses = <&...>` binding reproduces the retry loop. Whether it
reaches brick-grade or merely slow/flaky boot depends on local
watchdog timing. More exposure is expected as distribution and
embedded kernels move to 6.13 and later.

The narrow fix would be to partially revert the ordering in
fa2f0454174c so each defer is cheap again. That keeps the same
architecture (fwnode_mdio holding PSE knowledge, -EPROBE_DEFER
flowing across the subsystem boundary), and any future reorder
reintroduces the same class of bug. This series takes the larger
fix: decouple PSE controller lookup from MDIO registration entirely.
pse_core now publishes a BLOCKING_NOTIFIER chain with REGISTERED
and UNREGISTERED events. phy_device subscribes, owns phydev->psec
lifetime, and attaches PSE handles in response to controller
lifecycle rather than during probe. fwnode_mdio loses its PSE
awareness, and -EPROBE_DEFER no longer flows out of fwnode_mdio.
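
For readers who want the shape without opening the patches, here is a
minimal userspace model of the publish/subscribe flow described above.
It is a sketch only: every name in it (model_chain_*, model_phy,
MODEL_PSE_*) is invented for illustration and is not the API added by
the series, and it omits the locking that a real blocking notifier
chain provides.

```c
/* Userspace model only; none of these names exist in the kernel or in the series. */
#include <assert.h>
#include <stddef.h>

enum model_pse_event { MODEL_PSE_REGISTERED, MODEL_PSE_UNREGISTERED };

struct model_notifier {
	int (*call)(struct model_notifier *nb, enum model_pse_event ev,
		    const void *pcdev);
	struct model_notifier *next;
};

/* Head of the subscriber list (stands in for the notifier chain head). */
static struct model_notifier *model_chain;

static void model_chain_register(struct model_notifier *nb)
{
	nb->next = model_chain;
	model_chain = nb;
}

/* Publisher side: called from the controller register/unregister paths. */
static void model_chain_call(enum model_pse_event ev, const void *pcdev)
{
	for (struct model_notifier *nb = model_chain; nb; nb = nb->next)
		nb->call(nb, ev, pcdev);
}

/* Subscriber side: a PHY whose DT references one specific controller. */
struct model_phy {
	struct model_notifier nb;
	const void *wanted;	/* controller the "pses" phandle points at */
	const void *psec;	/* NULL until the controller has registered */
};

static int model_phy_pse_event(struct model_notifier *nb,
			       enum model_pse_event ev, const void *pcdev)
{
	/* Recover the enclosing PHY from its embedded notifier block. */
	struct model_phy *phy = (struct model_phy *)
		((char *)nb - offsetof(struct model_phy, nb));

	if (pcdev != phy->wanted)
		return 0;	/* event for some other controller */

	/* Attach on REGISTERED, drop the handle on UNREGISTERED. */
	phy->psec = (ev == MODEL_PSE_REGISTERED) ? pcdev : NULL;
	return 0;
}

static void model_phy_init(struct model_phy *phy, const void *wanted)
{
	phy->nb.call = model_phy_pse_event;
	phy->nb.next = NULL;
	phy->wanted = wanted;
	phy->psec = NULL;
	model_chain_register(&phy->nb);
}
```

The point of the model is that the PHY comes up with psec == NULL and
never returns an error to its bus; the handle appears only when the
controller announces itself, and goes away again when the controller
unregisters.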

Patch breakdown:

  1. Scope the pse_control regulator handle to kref lifetime
     (Fixes: d83e13761d5b). A latent bug that patch 4 makes
     reachable.
  2. Add the notifier chain (enum, head, register/unregister
     helpers). Pure infrastructure. No subscribers yet, no
     observable change.
  3. Fire REGISTERED and UNREGISTERED events from the controller
     register/unregister paths. Still no subscribers, still no
     observable change.
  4. Subscribe from the PHY layer, take ownership of phydev->psec
     via the notifier, and remove fwnode_find_pse_control() from
     fwnode_mdio.

Patch 1 is bundled here per stable-kernel-rules section 4
reachability guidance. On mainline today, with no notifier
subscriber, no caller drives the dangling regulator-handle sequence.
Patches 2 and 3 are deliberately split to separate "add
infrastructure" from "wire it up". Happy to fold them if maintainers
prefer the combined form.
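
As a rough illustration of what "scoped to kref lifetime" means for
patch 1, the userspace model below ties a resource handle to the last
reference drop instead of to some earlier teardown point. The names
(model_psec, model_regulator) are hypothetical and the plain int
counter stands in for a kref; the real patch should be consulted for
the actual fields and release path.

```c
/* Userspace model only; names and fields are invented for illustration. */
#include <assert.h>
#include <stddef.h>

struct model_regulator {
	int users;			/* how many handles are outstanding */
};

struct model_psec {
	int refcount;			/* stands in for a struct kref */
	struct model_regulator *ps;	/* handle owned by this object */
};

/* Creating the control takes the first reference and the handle. */
static void model_psec_init(struct model_psec *psec,
			    struct model_regulator *reg)
{
	psec->refcount = 1;
	psec->ps = reg;
	reg->users++;
}

static void model_psec_get(struct model_psec *psec)
{
	psec->refcount++;
}

/*
 * Drop one reference. The handle is released only when the last
 * reference goes away, so no remaining holder can see it dangling.
 * Returns 1 if this call released the object, 0 otherwise.
 */
static int model_psec_put(struct model_psec *psec)
{
	if (--psec->refcount)
		return 0;

	psec->ps->users--;
	psec->ps = NULL;
	return 1;
}
```

With two holders, the first put leaves the handle valid; only the
second put releases it.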

Validated on a Cudy C200P (MT7621 + IP804AR) running an OpenWrt
build of 6.18.21 with the series applied. A lockdep build
(CONFIG_PROVE_LOCKING + CONFIG_DEBUG_ATOMIC_SLEEP) shows no splats
from the series' code paths during boot, PHY attach, PHY detach, or
a full controller unbind/rebind cycle. ethtool --set-pse drives all
four PoE-capable LAN ports, and a Ruckus H510 class-4 PD plugged
into lan3 negotiates and receives 48 V.

The C200P has no SFP cage, so the SFP path change in sfp.c
(phy_device_register -> phy_device_register_locked) isn't exercised
on the bench. Verified by call-graph audit: every path reaching
sfp_sm_probe_phy() holds rtnl at entry, via sfp_timeout,
sfp_check_state, sfp_probe, sfp_remove, or
sfp_bus_{add,del}_upstream.

Not addressed by this series: ethtool --show-pse returns "No data
available" on DSA netdevs in 6.18, because dev->phydev is NULL for
DSA-frontend netdevs and ethnl_req_get_phydev() therefore returns
NULL. That's a DSA / ethtool integration quirk that predates this
work.

Sending as RFC because this is my first net-next series. I'd
appreciate maintainer guidance on whether patch 1 should go to net
rather than net-next, and whether the patch 2/3 split is preferred
to the combined form.

Signed-off-by: Corey Leavitt <corey@leavitt.info>
---
Corey Leavitt (4):
      net: pse-pd: scope pse_control regulator handle to kref lifetime
      net: pse-pd: add notifier chain for controller lifecycle events
      net: pse-pd: fire lifecycle events on controller register/unregister
      net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook

 drivers/net/mdio/fwnode_mdio.c |  34 ----------
 drivers/net/phy/phy_device.c   | 144 ++++++++++++++++++++++++++++++++++++++---
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  60 ++++++++++++++++-
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |  41 ++++++++++++
 6 files changed, 236 insertions(+), 47 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260422-pse-notifier-decouple-efa80d77f4be

Best regards,
--  
Corey Leavitt <corey@leavitt.info>
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Kory Maincent 1 month, 3 weeks ago
Hello Corey,

On Thu, 23 Apr 2026 01:42:13 -0600
Corey Leavitt via B4 Relay <devnull+corey.leavitt.info@kernel.org> wrote:

> On systems where a PSE controller driver loads as a module and a
> device-tree PHY node carries a `pses = <&pse_pi>` reference,
> fwnode_mdiobus_register_phy() tries to resolve the PSE handle before
> the controller driver has probed. of_pse_control_get() returns
> -EPROBE_DEFER, the enclosing MDIO/DSA probe fails, and driver-core
> re-queues the work. The retry loop spins until the PSE driver module
> loads and its controller registers.

I will take a look at your series, but FYI there was already an RFC series
tackling this issue:
https://lore.kernel.org/lkml/20260330132952.2950531-4-github@szelinsky.de/

It raised a debate, and there is currently no final solution.
 
> Commit fa2f0454174c ("net: pse-pd: Introduce attached_phydev to pse
> control") made each retry expensive. It reordered
> fwnode_mdiobus_register_phy() so the PHY is registered before the
> PSE lookup. Every deferral now performs a full
> phy_device_register() / phy_device_remove() cycle. On a board with a
> sufficiently tight watchdog the retry loop can starve the watchdog
> kthread. On the reporting hardware (MT7621 + gpio-wdt, 1-second
> margin) the retry loop converts a slow probe phase into a reset
> before userspace loads.
> 
> The affected population today looks small. OpenWrt, where PSE
> actually ships, is still on 6.12 (pre-regression), and most
> environments with CONFIG_PSE_*=m do not have boards whose DT
> references a PSE controller from a PHY. Still, the mechanism is
> general. Any modular PSE driver combined with the documented
> `pses = <&...>` binding reproduces the retry loop. Whether it
> reaches brick-grade or merely slow/flaky boot depends on local
> watchdog timing. More exposure is expected as distribution and
> embedded kernels move to 6.13 and later.
> 
> The narrow fix would be to partially revert the ordering in
> fa2f0454174c so each defer is cheap again. That keeps the same
> architecture (fwnode_mdio holding PSE knowledge, -EPROBE_DEFER
> flowing across the subsystem boundary), and any future reorder
> reintroduces the same class of bug. This series takes the larger
> fix: decouple PSE controller lookup from MDIO registration entirely.
> pse_core now publishes a BLOCKING_NOTIFIER chain with REGISTERED
> and UNREGISTERED events. phy_device subscribes, owns phydev->psec
> lifetime, and attaches PSE handles in response to controller
> lifecycle rather than during probe. fwnode_mdio loses its PSE
> awareness, and -EPROBE_DEFER no longer flows out of fwnode_mdio.
> 
> Patch breakdown:
> 
>   1. Scope the pse_control regulator handle to kref lifetime
>      (Fixes: d83e13761d5b). A latent bug that patch 4 makes
>      reachable.
>   2. Add the notifier chain (enum, head, register/unregister
>      helpers). Pure infrastructure. No subscribers yet, no
>      observable change.
>   3. Fire REGISTERED and UNREGISTERED events from the controller
>      register/unregister paths. Still no subscribers, still no
>      observable change.
>   4. Subscribe from the PHY layer, take ownership of phydev->psec
>      via the notifier, and remove fwnode_find_pse_control() from
>      fwnode_mdio.
> 
> Patch 1 is bundled here per stable-kernel-rules section 4
> reachability guidance. On mainline today, with no notifier
> subscriber, no caller drives the dangling regulator-handle sequence.
> Patches 2 and 3 are deliberately split to separate "add
> infrastructure" from "wire it up". Happy to fold them if maintainers
> prefer the combined form.
> 
> Validated on a Cudy C200P (MT7621 + IP804AR) running an OpenWrt
> build of 6.18.21 with the series applied. A lockdep build
> (CONFIG_PROVE_LOCKING + CONFIG_DEBUG_ATOMIC_SLEEP) shows no splats
> from the series' code paths during boot, PHY attach, PHY detach, or
> a full controller unbind/rebind cycle. ethtool --set-pse drives all
> four PoE-capable LAN ports, and a Ruckus H510 class-4 PD plugged
> into lan3 negotiates and receives 48 V.
> 
> The C200P has no SFP cage, so the SFP path change in sfp.c
> (phy_device_register -> phy_device_register_locked) isn't exercised
> on the bench. Verified by call-graph audit: every path reaching
> sfp_sm_probe_phy() holds rtnl at entry, via sfp_timeout,
> sfp_check_state, sfp_probe, sfp_remove, or
> sfp_bus_{add,del}_upstream.
> 
> Not addressed by this series: ethtool --show-pse returns "No data
> available" on DSA netdevs in 6.18, because dev->phydev is NULL for
> DSA-frontend netdevs and ethnl_req_get_phydev() therefore returns
> NULL. That's a DSA / ethtool integration quirk that predates this
> work.
> 
> Sending as RFC because this is my first net-next series. I'd
> appreciate maintainer guidance on whether patch 1 should go to net
> rather than net-next, and whether the patch 2/3 split is preferred
> to the combined form.
> 
> Signed-off-by: Corey Leavitt <corey@leavitt.info>
> ---
> Corey Leavitt (4):
>       net: pse-pd: scope pse_control regulator handle to kref lifetime
>       net: pse-pd: add notifier chain for controller lifecycle events
>       net: pse-pd: fire lifecycle events on controller register/unregister
>       net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook
> 
>  drivers/net/mdio/fwnode_mdio.c |  34 ----------
>  drivers/net/phy/phy_device.c   | 144 ++++++++++++++++++++++++++++++++++++++---
>  drivers/net/phy/sfp.c          |   2 +-
>  drivers/net/pse-pd/pse_core.c  |  60 ++++++++++++++++-
>  include/linux/phy.h            |   2 +
>  include/linux/pse-pd/pse.h     |  41 ++++++++++++
>  6 files changed, 236 insertions(+), 47 deletions(-)
> ---
> base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
> change-id: 20260422-pse-notifier-decouple-efa80d77f4be
> 
> Best regards,
> --  
> Corey Leavitt <corey@leavitt.info>
> 
> 



-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Corey Leavitt 1 month, 3 weeks ago
Hi Kory,

Thanks for the pointer -- I had not seen Carlo's thread; I should have
searched lore before sending and will do so before v2. Adding Carlo on
cc.

Having read it end-to-end, my read of the state as of 2026-04-13 was
that the conversation had narrowed to two open directions: propagate
EPROBE_DEFER further up into phylink/MAC probe (Andrew/Russell), or
resolve psec at PSE controller register time (your msg on 9 Apr,
"save the phandle ... then at PSE register time look for each PHY and
try to resolve every unresolved phandle"). Nothing concrete had been
posted for either.

This RFC implements the second direction. pse_core publishes a
BLOCKING_NOTIFIER chain with REGISTERED / UNREGISTERED events,
phy_device subscribes, and psec ownership moves from fwnode_mdio probe
into the notifier handler. Concretely with respect to points raised in
the earlier thread:

  - fwnode_mdio loses PSE awareness entirely, so the MDIO bus scan no
    longer sees -EPROBE_DEFER from PSE lookup. Consistent with
    Andrew's point that bus and device lifecycles are separate.

  - psec is acquired at PSE register time, before
    regulator_late_cleanup (30s) can run. Carlo's admin_state_synced
    guard (his patch 1) therefore isn't needed in this model. psec
    resolution happens eagerly on the REGISTERED event rather than
    lazily on first ethtool access, so his patch 2 is also not needed.
    And because fwnode_mdio no longer looks up PSE at all, the
    non-fatal EPROBE_DEFER handling there (patch 3) drops out. This
    series is a different architectural shape, not an increment on
    his v2.

  - Oleksij's concern about lazy resolution dropping UAPI
    notifications is addressed: the notifier fires at register time,
    so boot-time observer semantics are preserved.

  - One caveat I already owe a fix for in v2: the attach helper in
    phy_device currently treats every error from of_pse_control_get()
    as retry-on-notifier, including non-transient ones. Carlo's v2
    patch 3 was careful to differentiate -EPROBE_DEFER from bad-DT
    errors at the fwnode_mdio lookup site (which matches his msg 1
    concern about catching broken bindings at boot rather than
    silently later). I need to preserve that discrimination at the
    notifier-handler site -- phydev_warn() on anything other than
    -EPROBE_DEFER. Trivial, but worth flagging.

  - The DSA genphy force-bind sequence Carlo hit
    (phy_attach_direct -> device_bind_driver -> deferred retry
    skipped) does not apply, because psec attachment is not tied to
    phy_probe.

  - Patch 1 of this series scopes the regulator handle held by
    pse_control to its own kref lifetime, fixing a latent
    dangling-handle sequence that the notifier unregister path makes
    reachable. This is a separate regulator-lifetime bug from the one
    Carlo's patch 1 addresses.
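
The error discrimination owed for v2 (the caveat bullet above) could
look roughly like the following. This is a userspace sketch under
stated assumptions: the enum and the function name are invented here,
and MODEL_EPROBE_DEFER mirrors the kernel-internal EPROBE_DEFER value
(517), which userspace <errno.h> does not export. In the kernel, the
"warn" branch would be where phydev_warn() is called.

```c
/* Userspace model only; the enum and function are hypothetical. */
#include <assert.h>
#include <errno.h>

/* Kernel-internal errno, defined in include/linux/errno.h as 517. */
#define MODEL_EPROBE_DEFER 517

enum model_attach_action {
	MODEL_ATTACH_OK,		/* handle resolved, nothing more to do */
	MODEL_ATTACH_RETRY_ON_NOTIFIER,	/* controller not there yet, wait */
	MODEL_ATTACH_WARN,		/* non-transient, e.g. a broken binding */
};

/* Decide what to do with the result of a PSE control lookup. */
static enum model_attach_action model_classify_lookup(int err)
{
	if (err == 0)
		return MODEL_ATTACH_OK;

	/* Only "not probed yet" is worth retrying on a later event. */
	if (err == -MODEL_EPROBE_DEFER)
		return MODEL_ATTACH_RETRY_ON_NOTIFIER;

	/* Everything else is reported instead of silently retried. */
	return MODEL_ATTACH_WARN;
}
```

For example, a lookup failing with -EINVAL would be classified as
MODEL_ATTACH_WARN rather than being retried on the next REGISTERED
event.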

Validated end-to-end on a Cudy C200P (MT7621 DSA + i2c IP804AR as
module), with lockdep active, across the i2c driver unbind/rebind
cycle that triggers UNREGISTERED -> REGISTERED on the live system.
The cover letter has the full evidence.

I would welcome your view on whether this is the shape you had in
mind on 9 Apr, or whether the MDI-based binding you raised with
Maxime is the better endpoint and we should be reshaping around that.
Happy to keep this RFC as the scaffolding either way.

Carlo -- your debugging work on the DSA phy_attach_direct interaction
is what made it clear this kind of approach was needed; thanks for
laying that groundwork. Would value your thoughts on the tradeoffs.

Best regards,
Corey
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Kory Maincent 1 month, 3 weeks ago
On Thu, 23 Apr 2026 09:48:53 +0000
Corey Leavitt <corey@leavitt.info> wrote:

> Hi Kory,
> 
> Thanks for the pointer -- I had not seen Carlo's thread; I should have
> searched lore before sending and will do so before v2. Adding Carlo on
> cc.
> 
> Having read it end-to-end, my read of the state as of 2026-04-13 was
> that the conversation had narrowed to two open directions: propagate
> EPROBE_DEFER further up into phylink/MAC probe (Andrew/Russell), or
> resolve psec at PSE controller register time (your msg on 9 Apr,
> "save the phandle ... then at PSE register time look for each PHY and
> try to resolve every unresolved phandle"). Nothing concrete had been
> posted for either.
> 
> This RFC implements the second direction. pse_core publishes a
> BLOCKING_NOTIFIER chain with REGISTERED / UNREGISTERED events,
> phy_device subscribes, and psec ownership moves from fwnode_mdio probe
> into the notifier handler. Concretely with respect to points raised in
> the earlier thread:
> 
>   - fwnode_mdio loses PSE awareness entirely, so the MDIO bus scan no
>     longer sees -EPROBE_DEFER from PSE lookup. Consistent with
>     Andrew's point that bus and device lifecycles are separate.
> 
>   - psec is acquired at PSE register time, before
>     regulator_late_cleanup (30s) can run. Carlo's admin_state_synced
>     guard (his patch 1) therefore isn't needed in this model. psec
>     resolution happens eagerly on the REGISTERED event rather than
>     lazily on first ethtool access, so his patch 2 is also not needed.
>     And because fwnode_mdio no longer looks up PSE at all, the
>     non-fatal EPROBE_DEFER handling there (patch 3) drops out. This
>     series is a different architectural shape, not an increment on
>     his v2.
> 
>   - Oleksij's concern about lazy resolution dropping UAPI
>     notifications is addressed: the notifier fires at register time,
>     so boot-time observer semantics are preserved.
> 
>   - One caveat I already owe a fix for in v2: the attach helper in
>     phy_device currently treats every error from of_pse_control_get()
>     as retry-on-notifier, including non-transient ones. Carlo's v2
>     patch 3 was careful to differentiate -EPROBE_DEFER from bad-DT
>     errors at the fwnode_mdio lookup site (which matches his msg 1
>     concern about catching broken bindings at boot rather than
>     silently later). I need to preserve that discrimination at the
>     notifier-handler site -- phydev_warn() on anything other than
>     -EPROBE_DEFER. Trivial, but worth flagging.
> 
>   - The DSA genphy force-bind sequence Carlo hit
>     (phy_attach_direct -> device_bind_driver -> deferred retry
>     skipped) does not apply, because psec attachment is not tied to
>     phy_probe.
> 
>   - Patch 1 of this series scopes the regulator handle held by
>     pse_control to its own kref lifetime, fixing a latent
>     dangling-handle sequence that the notifier unregister path makes
>     reachable. This is a separate regulator-lifetime bug from the one
>     Carlo's patch 1 addresses.

This seems to provide a solution to all our problems, and it is well designed.
Carlo, it would be nice if you could test it on your side.

> Validated end-to-end on a Cudy C200P (MT7621 DSA + i2c IP804AR as
> module), with lockdep active, across the i2c driver unbind/rebind
> cycle that triggers UNREGISTERED -> REGISTERED on the live system.
> The cover letter has the full evidence.
> 
> I would welcome your view on whether this is the shape you had in
> mind on 9 Apr, or whether the MDI-based binding you raised with
> Maxime is the better endpoint and we should be reshaping around that.
> Happy to keep this RFC as the scaffolding either way.

On second thought, MDI-based binding is not currently in the pipeline, so we
should solve this without it for now.

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Carlo Szelinsky 1 month, 2 weeks ago
Hi,

I tested the series on a Hasivo S1100WP-8GT-SE with hs104 PSE
chips. Boot is clean, no more probe loop. ethtool --set-pse on/off
works, and a PD plugged into a port gets power. So the fix works
for me.

One thing I noticed: rmmod alone does not work while ports are
attached. Each PHY that holds a psec keeps a ref on the PSE
driver, so the module count stays above 0 and rmmod fails.

Unbinding the i2c devices first works:
  echo 0-000d > /sys/bus/i2c/drivers/hasivo-hs104/unbind
  echo 0-0015 > /sys/bus/i2c/drivers/hasivo-hs104/unbind
  rmmod hasivo-hs104

After unbind your PSE_UNREGISTERED notifier fires, the PHYs drop
the psec, and the module count goes to 0. Then rmmod works.
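
To spell out that sequence as a toy model (userspace C, all names
hypothetical; this is not the kernel's module refcounting code, just
the bookkeeping I observed):

```c
/* Userspace model only; names are invented for illustration. */
#include <assert.h>
#include <stddef.h>

struct model_module {
	int refcnt;			/* stands in for the module use count */
};

struct model_port {
	struct model_module *owner;	/* PSE driver module pinned by this port */
	int has_psec;
};

/* Each PHY that takes a psec pins the PSE driver module once. */
static void model_port_attach(struct model_port *port,
			      struct model_module *mod)
{
	port->owner = mod;
	port->has_psec = 1;
	mod->refcnt++;
}

/* Dropping the psec releases that pin. */
static void model_port_detach(struct model_port *port)
{
	if (!port->has_psec)
		return;
	port->owner->refcnt--;
	port->has_psec = 0;
}

/* Unbind: the UNREGISTERED event makes every port drop its psec. */
static void model_unbind(struct model_port *ports, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_port_detach(&ports[i]);
}

/* rmmod only succeeds once nobody pins the module. */
static int model_can_rmmod(const struct model_module *mod)
{
	return mod->refcnt == 0;
}
```

With two attached ports, model_can_rmmod() is false until
model_unbind() has run, which matches the unbind-then-rmmod order
above.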

Not a big deal, but maybe worth a short note in the cover letter
so people know they need to unbind first. Or if there is a clean
way to skip the per-consumer module_get and just rely on the
unbind path, that would be nice too. What do other folks think?

One thing my test does not cover yet is the SFP path change in
sfp.c, since the S1100WP-8GT-SE is copper only. I do have an
S600WP-5GT-2SX-SE that uses the same hs104 chips plus SFP+ cages,
but it is not on my desk right now. I can run that part on it
later.

Anyway, thanks for the work on this. Happy to test a v2.

Carlo

Tested-by: Carlo Szelinsky <github@szelinsky.de>
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Corey Leavitt 1 month, 1 week ago
Hi Carlo,

Thanks for testing, and for the clear writeup of the rmmod path.
Will add your Tested-by to v2.

On the rmmod / module_get question: I think you are right that the
per-consumer try_module_get is structurally redundant once the
notifier exists, since PSE_UNREGISTERED already gives us a
synchronous invalidation point at unregister time. The reason I
did not strip it in v1 was caution: I want to walk every
pse_control_get / pse_control_put consumer and confirm no path
holds a transient reference outside of rtnl-lock (or some
equivalent serialization) such that an in-flight use could race
the unregister notifier.

That audit is on my plate now. I will share findings on this
thread before v2, and let the result drive whether the removal
lands in v2 or as a separate discussion.

Glad to hear the probe loop is gone on your bench and that ethtool
plus PD power are behaving. I expect to have an SFP-capable board
on my own bench later this week, so I will cover the sfp.c path
change before sending v2; a retest from you on the SFP chassis
whenever it is convenient is still very welcome as independent
confirmation.

Kory, Jakub, Russell, others: input welcome on whether dropping
try_module_get in favor of pure notifier-based invalidation is the
direction you would prefer here, or whether the belt-and-suspenders
refcount is worth keeping.

Thanks,
Corey
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Carlo Szelinsky 1 day, 11 hours ago
Hi Corey,

just checking in on this one. Did you get a chance to continue with the
series, or is there anything I can help with to move it forward? I'm
happy to test a v2, and I can still run the SFP path on the
S600WP-5GT-2SX-SE once it's back on my desk.

Kory, Jakub, Russell :-) it would be great to hear your view on the
approach so Corey can plan the next version. The series fixed the probe
loop in my testing and I'd really like to see it land.

Thanks,
Carlo
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Kory Maincent 12 hours ago
Hello Carlo,

On Mon, 15 Jun 2026 20:08:12 +0200
Carlo Szelinsky <github@szelinsky.de> wrote:

> Hi Corey,
> 
> just checking in on this one. Did you get a chance to continue with the
> series, or is there anything I can help with to move it forward? I'm
> happy to test a v2, and I can still run the SFP path on the
> S600WP-5GT-2SX-SE once it's back on my desk.
> 
> Kory, Jakub, Russell :-) it would be great to hear your view on the
> approach so Corey can plan the next version. The series fixed the probe
> loop in my testing and I'd really like to see it land.

I haven't heard from Corey since this patch series, but I am in favor of this
notifier design.
Corey, do you have time to continue this work? If not, would it be okay for
Carlo to continue it for you?

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Re: [PATCH RFC net-next 0/4] net: pse-pd: decouple controller lookup from MDIO probe
Posted by Carlo Szelinsky 1 month, 3 weeks ago
Hey,
Thanks to everyone for all the help :-) I am currently busy with work, but I will make time for this soon!
Cheers, Carlo