On systems where a PSE controller driver loads as a module and a
device-tree PHY node carries a `pses = <&pse_pi>` reference,
fwnode_mdiobus_register_phy() tries to resolve the PSE handle before
the controller driver has probed. of_pse_control_get() returns
-EPROBE_DEFER, the enclosing MDIO/DSA probe fails, and driver-core
re-queues the work. The retry loop spins until the PSE driver module
loads and its controller registers.

Commit fa2f0454174c ("net: pse-pd: Introduce attached_phydev to pse
control") made each retry expensive. It reordered
fwnode_mdiobus_register_phy() so the PHY is registered before the
PSE lookup. Every deferral now performs a full
phy_device_register() / phy_device_remove() cycle. On a board with a
sufficiently tight watchdog the retry loop can starve the watchdog
kthread. On the reporting hardware (MT7621 + gpio-wdt, 1-second
margin) the retry loop converts a slow probe phase into a reset
before userspace loads.

The affected population today looks small. OpenWrt, where PSE
actually ships, is still on 6.12 (pre-regression), and most
environments with CONFIG_PSE_*=m do not have boards whose DT
references a PSE controller from a PHY. Still, the mechanism is
general. Any modular PSE driver combined with the documented
`pses = <&...>` binding reproduces the retry loop. Whether it
reaches brick-grade or merely slow/flaky boot depends on local
watchdog timing. More exposure is expected as distribution and
embedded kernels move to 6.13 and later.

The narrow fix would be to partially revert the ordering in
fa2f0454174c so each defer is cheap again. That keeps the same
architecture (fwnode_mdio holding PSE knowledge, -EPROBE_DEFER
flowing across the subsystem boundary), and any future reorder
reintroduces the same class of bug. This series takes the larger
fix: decouple PSE controller lookup from MDIO registration entirely.
pse_core now publishes a BLOCKING_NOTIFIER chain with REGISTERED
and UNREGISTERED events. phy_device subscribes, owns phydev->psec
lifetime, and attaches PSE handles in response to controller
lifecycle rather than during probe. fwnode_mdio loses its PSE
awareness, and -EPROBE_DEFER no longer flows out of fwnode_mdio.
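
The decoupling described above can be sketched in plain userspace C. This is only an illustration of the pattern, not the code in the series: the kernel's blocking notifier API is replaced by a hand-rolled singly linked chain, and every name here (demo_phy, pse_chain_subscribe, pse_chain_notify, wanted_pse_id) is hypothetical. The sketch also covers only the case where the PHY registers before the controller.

```c
#include <assert.h>
#include <stddef.h>

/* Event names follow the cover letter; the rest is illustrative. */
enum pse_event { PSE_REGISTERED, PSE_UNREGISTERED };

struct demo_pse_controller { int id; };

struct pse_subscriber {
    void (*call)(struct pse_subscriber *sub, enum pse_event ev,
                 struct demo_pse_controller *pcdev);
    struct pse_subscriber *next;
};

static struct pse_subscriber *pse_chain;    /* head of the chain */

void pse_chain_subscribe(struct pse_subscriber *sub)
{
    sub->next = pse_chain;
    pse_chain = sub;
}

/* Controller register/unregister paths would call this. */
void pse_chain_notify(enum pse_event ev, struct demo_pse_controller *pcdev)
{
    for (struct pse_subscriber *s = pse_chain; s; s = s->next)
        s->call(s, ev, pcdev);
}

struct demo_phy {
    int wanted_pse_id;                  /* stands in for the DT "pses" phandle */
    struct demo_pse_controller *psec;   /* NULL until a controller shows up */
    struct pse_subscriber sub;
};

static void demo_phy_pse_event(struct pse_subscriber *sub, enum pse_event ev,
                               struct demo_pse_controller *pcdev)
{
    /* Recover the enclosing PHY, container_of() style. */
    struct demo_phy *phy =
        (struct demo_phy *)((char *)sub - offsetof(struct demo_phy, sub));

    if (pcdev->id != phy->wanted_pse_id)
        return;
    phy->psec = (ev == PSE_REGISTERED) ? pcdev : NULL;
}

/* PHY registration never waits for, or fails on, the PSE controller. */
int demo_phy_register(struct demo_phy *phy)
{
    phy->psec = NULL;
    phy->sub.call = demo_phy_pse_event;
    pse_chain_subscribe(&phy->sub);
    return 0;
}
```

In this shape the PHY registers successfully with psec left NULL, picks up the handle when pse_chain_notify(PSE_REGISTERED, ...) runs, and drops it again on PSE_UNREGISTERED, so no deferral has to cross the subsystem boundary.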
Patch breakdown:

1. Scope the pse_control regulator handle to kref lifetime
(Fixes: d83e13761d5b). A latent bug that patch 4 makes
reachable.
2. Add the notifier chain (enum, head, register/unregister
helpers). Pure infrastructure. No subscribers yet, no
observable change.
3. Fire REGISTERED and UNREGISTERED events from the controller
register/unregister paths. Still no subscribers, still no
observable change.
4. Subscribe from the PHY layer, take ownership of phydev->psec
via the notifier, and remove fwnode_find_pse_control() from
fwnode_mdio.

Patch 1 is bundled here per stable-kernel-rules section 4
reachability guidance. On mainline today, with no notifier
subscriber, no caller drives the dangling regulator-handle sequence.
Patches 2 and 3 are deliberately split to separate "add
infrastructure" from "wire it up". Happy to fold them if maintainers
prefer the combined form.

Validated on a Cudy C200P (MT7621 + IP804AR) running an OpenWrt
build of 6.18.21 with the series applied. A lockdep build
(CONFIG_PROVE_LOCKING + CONFIG_DEBUG_ATOMIC_SLEEP) shows no splats
from the series' code paths during boot, PHY attach, PHY detach, or
a full controller unbind/rebind cycle. ethtool --set-pse drives all
four PoE-capable LAN ports, and a Ruckus H510 class-4 PD plugged
into lan3 negotiates and receives 48 V.

The C200P has no SFP cage, so the SFP path change in sfp.c
(phy_device_register -> phy_device_register_locked) isn't exercised
on the bench. Verified by call-graph audit: every path reaching
sfp_sm_probe_phy() holds rtnl at entry, via sfp_timeout,
sfp_check_state, sfp_probe, sfp_remove, or
sfp_bus_{add,del}_upstream.

Not addressed by this series: ethtool --show-pse returns "No data
available" on DSA netdevs in 6.18, because dev->phydev is NULL for
DSA-frontend netdevs and ethnl_req_get_phydev() therefore returns
NULL. That's a DSA / ethtool integration quirk that predates this
work.

Sending as RFC because this is my first net-next series. I'd
appreciate maintainer guidance on whether patch 1 should go to net
rather than net-next, and whether the patch 2/3 split is preferred
to the combined form.

Signed-off-by: Corey Leavitt <corey@leavitt.info>
---
Corey Leavitt (4):
      net: pse-pd: scope pse_control regulator handle to kref lifetime
      net: pse-pd: add notifier chain for controller lifecycle events
      net: pse-pd: fire lifecycle events on controller register/unregister
      net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook

 drivers/net/mdio/fwnode_mdio.c |  34 ----------
 drivers/net/phy/phy_device.c   | 144 ++++++++++++++++++++++++++++++++++++++---
 drivers/net/phy/sfp.c          |   2 +-
 drivers/net/pse-pd/pse_core.c  |  60 ++++++++++++++++-
 include/linux/phy.h            |   2 +
 include/linux/pse-pd/pse.h     |  41 ++++++++++++
 6 files changed, 236 insertions(+), 47 deletions(-)
---
base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
change-id: 20260422-pse-notifier-decouple-efa80d77f4be

Best regards,
--
Corey Leavitt <corey@leavitt.info>

Hello Corey,
On Thu, 23 Apr 2026 01:42:13 -0600
Corey Leavitt via B4 Relay <devnull+corey.leavitt.info@kernel.org> wrote:
> On systems where a PSE controller driver loads as a module and a
> device-tree PHY node carries a `pses = <&pse_pi>` reference,
> fwnode_mdiobus_register_phy() tries to resolve the PSE handle before
> the controller driver has probed. of_pse_control_get() returns
> -EPROBE_DEFER, the enclosing MDIO/DSA probe fails, and driver-core
> re-queues the work. The retry loop spins until the PSE driver module
> loads and its controller registers.
I will take a look at your series, but FYI there was already an RFC
series tackling this issue:
https://lore.kernel.org/lkml/20260330132952.2950531-4-github@szelinsky.de/
It raised a debate, and there is currently no final solution.
> Commit fa2f0454174c ("net: pse-pd: Introduce attached_phydev to pse
> control") made each retry expensive. It reordered
> fwnode_mdiobus_register_phy() so the PHY is registered before the
> PSE lookup. Every deferral now performs a full
> phy_device_register() / phy_device_remove() cycle. On a board with a
> sufficiently tight watchdog the retry loop can starve the watchdog
> kthread. On the reporting hardware (MT7621 + gpio-wdt, 1-second
> margin) the retry loop converts a slow probe phase into a reset
> before userspace loads.
>
> The affected population today looks small. OpenWrt, where PSE
> actually ships, is still on 6.12 (pre-regression), and most
> environments with CONFIG_PSE_*=m do not have boards whose DT
> references a PSE controller from a PHY. Still, the mechanism is
> general. Any modular PSE driver combined with the documented
> `pses = <&...>` binding reproduces the retry loop. Whether it
> reaches brick-grade or merely slow/flaky boot depends on local
> watchdog timing. More exposure is expected as distribution and
> embedded kernels move to 6.13 and later.
>
> The narrow fix would be to partially revert the ordering in
> fa2f0454174c so each defer is cheap again. That keeps the same
> architecture (fwnode_mdio holding PSE knowledge, -EPROBE_DEFER
> flowing across the subsystem boundary), and any future reorder
> reintroduces the same class of bug. This series takes the larger
> fix: decouple PSE controller lookup from MDIO registration entirely.
> pse_core now publishes a BLOCKING_NOTIFIER chain with REGISTERED
> and UNREGISTERED events. phy_device subscribes, owns phydev->psec
> lifetime, and attaches PSE handles in response to controller
> lifecycle rather than during probe. fwnode_mdio loses its PSE
> awareness, and -EPROBE_DEFER no longer flows out of fwnode_mdio.
>
> Patch breakdown:
>
> 1. Scope the pse_control regulator handle to kref lifetime
> (Fixes: d83e13761d5b). A latent bug that patch 4 makes
> reachable.
> 2. Add the notifier chain (enum, head, register/unregister
> helpers). Pure infrastructure. No subscribers yet, no
> observable change.
> 3. Fire REGISTERED and UNREGISTERED events from the controller
> register/unregister paths. Still no subscribers, still no
> observable change.
> 4. Subscribe from the PHY layer, take ownership of phydev->psec
> via the notifier, and remove fwnode_find_pse_control() from
> fwnode_mdio.
>
> Patch 1 is bundled here per stable-kernel-rules section 4
> reachability guidance. On mainline today, with no notifier
> subscriber, no caller drives the dangling regulator-handle sequence.
> Patches 2 and 3 are deliberately split to separate "add
> infrastructure" from "wire it up". Happy to fold them if maintainers
> prefer the combined form.
>
> Validated on a Cudy C200P (MT7621 + IP804AR) running an OpenWrt
> build of 6.18.21 with the series applied. A lockdep build
> (CONFIG_PROVE_LOCKING + CONFIG_DEBUG_ATOMIC_SLEEP) shows no splats
> from the series' code paths during boot, PHY attach, PHY detach, or
> a full controller unbind/rebind cycle. ethtool --set-pse drives all
> four PoE-capable LAN ports, and a Ruckus H510 class-4 PD plugged
> into lan3 negotiates and receives 48 V.
>
> The C200P has no SFP cage, so the SFP path change in sfp.c
> (phy_device_register -> phy_device_register_locked) isn't exercised
> on the bench. Verified by call-graph audit: every path reaching
> sfp_sm_probe_phy() holds rtnl at entry, via sfp_timeout,
> sfp_check_state, sfp_probe, sfp_remove, or
> sfp_bus_{add,del}_upstream.
>
> Not addressed by this series: ethtool --show-pse returns "No data
> available" on DSA netdevs in 6.18, because dev->phydev is NULL for
> DSA-frontend netdevs and ethnl_req_get_phydev() therefore returns
> NULL. That's a DSA / ethtool integration quirk that predates this
> work.
>
> Sending as RFC because this is my first net-next series. I'd
> appreciate maintainer guidance on whether patch 1 should go to net
> rather than net-next, and whether the patch 2/3 split is preferred
> to the combined form.
>
> Signed-off-by: Corey Leavitt <corey@leavitt.info>
> ---
> Corey Leavitt (4):
> net: pse-pd: scope pse_control regulator handle to kref lifetime
> net: pse-pd: add notifier chain for controller lifecycle events
> net: pse-pd: fire lifecycle events on controller register/unregister
> net: phy: own phydev->psec via PSE notifier and remove fwnode_mdio hook
>
>  drivers/net/mdio/fwnode_mdio.c |  34 ----------
>  drivers/net/phy/phy_device.c   | 144 ++++++++++++++++++++++++++++++++++++++---
>  drivers/net/phy/sfp.c          |   2 +-
>  drivers/net/pse-pd/pse_core.c  |  60 ++++++++++++++++-
>  include/linux/phy.h            |   2 +
>  include/linux/pse-pd/pse.h     |  41 ++++++++++++
>  6 files changed, 236 insertions(+), 47 deletions(-)
> ---
> base-commit: 1f5ffc672165ff851063a5fd044b727ab2517ae3
> change-id: 20260422-pse-notifier-decouple-efa80d77f4be
>
> Best regards,
> --
> Corey Leavitt <corey@leavitt.info>
>
>
--
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com
Hi Kory,
Thanks for the pointer -- I had not seen Carlo's thread; I should have
searched lore before sending and will do so before v2. Adding Carlo on
cc.
Having read it end-to-end, my read of the state as of 2026-04-13 was
that the conversation had narrowed to two open directions: propagate
EPROBE_DEFER further up into phylink/MAC probe (Andrew/Russell), or
resolve psec at PSE controller register time (your msg on 9 Apr,
"save the phandle ... then at PSE register time look for each PHY and
try to resolve every unresolved phandle"). Nothing concrete had been
posted for either.
This RFC implements the second direction. pse_core publishes a
BLOCKING_NOTIFIER chain with REGISTERED / UNREGISTERED events,
phy_device subscribes, and psec ownership moves from fwnode_mdio probe
into the notifier handler. Concretely with respect to points raised in
the earlier thread:
- fwnode_mdio loses PSE awareness entirely, so the MDIO bus scan no
longer sees -EPROBE_DEFER from PSE lookup. Consistent with
Andrew's point that bus and device lifecycles are separate.
- psec is acquired at PSE register time, before
regulator_late_cleanup (30s) can run. Carlo's admin_state_synced
guard (his patch 1) therefore isn't needed in this model. psec
resolution happens eagerly on the REGISTERED event rather than
lazily on first ethtool access, so his patch 2 is also not needed.
And because fwnode_mdio no longer looks up PSE at all, the
non-fatal EPROBE_DEFER handling there (patch 3) drops out. This
series is a different architectural shape, not an increment on
his v2.
- Oleksij's concern about lazy resolution dropping UAPI
notifications is addressed: the notifier fires at register time,
so boot-time observer semantics are preserved.
- One caveat I already owe a fix for in v2: the attach helper in
phy_device currently treats every error from of_pse_control_get()
as retry-on-notifier, including non-transient ones. Carlo's v2
patch 3 was careful to differentiate -EPROBE_DEFER from bad-DT
errors at the fwnode_mdio lookup site (which matches his msg 1
concern about catching broken bindings at boot rather than
silently later). I need to preserve that discrimination at the
notifier-handler site -- phydev_warn() on anything other than
-EPROBE_DEFER. Trivial, but worth flagging.
- The DSA genphy force-bind sequence Carlo hit
(phy_attach_direct -> device_bind_driver -> deferred retry
skipped) does not apply, because psec attachment is not tied to
phy_probe.
- Patch 1 of this series scopes the regulator handle held by
pse_control to its own kref lifetime, fixing a latent
dangling-handle sequence that the notifier unregister path makes
reachable. This is a separate regulator-lifetime bug from the one
Carlo's patch 1 addresses.
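
The error discrimination promised for v2 in the caveat above might look roughly like the following userspace sketch. It is not taken from the series: classify_pse_lookup() and the enum are hypothetical names, fprintf() stands in for phydev_warn(), and EPROBE_DEFER is defined locally because it is a kernel-internal errno (517) that userspace headers do not provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517    /* kernel-internal errno value */
#endif

enum pse_attach_result {
    PSE_ATTACHED,           /* lookup succeeded, psec can be stored */
    PSE_WAIT_FOR_NOTIFIER,  /* controller not there yet, retry on REGISTERED */
    PSE_ATTACH_FAILED,      /* permanent error, e.g. a broken binding */
};

/* err is a kernel-style return code: 0 on success, negative errno on failure. */
enum pse_attach_result classify_pse_lookup(int err)
{
    assert(err <= 0);

    if (err == 0)
        return PSE_ATTACHED;
    if (err == -EPROBE_DEFER)
        return PSE_WAIT_FOR_NOTIFIER;

    /* Anything else is not transient: report it instead of retrying silently. */
    fprintf(stderr, "PSE lookup failed: %d\n", err);
    return PSE_ATTACH_FAILED;
}
```

With this split, only -EPROBE_DEFER leaves the PHY waiting for a later REGISTERED event, while an error such as -EINVAL from a bad device tree is reported at the point where it is first seen.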
Validated end-to-end on a Cudy C200P (MT7621 DSA + i2c IP804AR as
module), with lockdep active, across the i2c driver unbind/rebind
cycle that triggers UNREGISTERED -> REGISTERED on the live system.
The cover letter has the full evidence.
I would welcome your view on whether this is the shape you had in
mind on 9 Apr, or whether the MDI-based binding you raised with
Maxime is the better endpoint and we should be reshaping around that.
Happy to keep this RFC as the scaffolding either way.
Carlo -- your debugging work on the DSA phy_attach_direct interaction
is what made it clear this kind of approach was needed; thanks for
laying that groundwork. Would value your thoughts on the tradeoffs.
Best regards,
Corey
On Thu, 23 Apr 2026 09:48:53 +0000
Corey Leavitt <corey@leavitt.info> wrote:

> Hi Kory,
>
> Thanks for the pointer -- I had not seen Carlo's thread; I should have
> searched lore before sending and will do so before v2. Adding Carlo on
> cc.
>
> Having read it end-to-end, my read of the state as of 2026-04-13 was
> that the conversation had narrowed to two open directions: propagate
> EPROBE_DEFER further up into phylink/MAC probe (Andrew/Russell), or
> resolve psec at PSE controller register time (your msg on 9 Apr,
> "save the phandle ... then at PSE register time look for each PHY and
> try to resolve every unresolved phandle"). Nothing concrete had been
> posted for either.
>
> This RFC implements the second direction. pse_core publishes a
> BLOCKING_NOTIFIER chain with REGISTERED / UNREGISTERED events,
> phy_device subscribes, and psec ownership moves from fwnode_mdio probe
> into the notifier handler. Concretely with respect to points raised in
> the earlier thread:
>
> - fwnode_mdio loses PSE awareness entirely, so the MDIO bus scan no
> longer sees -EPROBE_DEFER from PSE lookup. Consistent with
> Andrew's point that bus and device lifecycles are separate.
>
> - psec is acquired at PSE register time, before
> regulator_late_cleanup (30s) can run. Carlo's admin_state_synced
> guard (his patch 1) therefore isn't needed in this model. psec
> resolution happens eagerly on the REGISTERED event rather than
> lazily on first ethtool access, so his patch 2 is also not needed.
> And because fwnode_mdio no longer looks up PSE at all, the
> non-fatal EPROBE_DEFER handling there (patch 3) drops out. This
> series is a different architectural shape, not an increment on
> his v2.
>
> - Oleksij's concern about lazy resolution dropping UAPI
> notifications is addressed: the notifier fires at register time,
> so boot-time observer semantics are preserved.
>
> - One caveat I already owe a fix for in v2: the attach helper in
> phy_device currently treats every error from of_pse_control_get()
> as retry-on-notifier, including non-transient ones. Carlo's v2
> patch 3 was careful to differentiate -EPROBE_DEFER from bad-DT
> errors at the fwnode_mdio lookup site (which matches his msg 1
> concern about catching broken bindings at boot rather than
> silently later). I need to preserve that discrimination at the
> notifier-handler site -- phydev_warn() on anything other than
> -EPROBE_DEFER. Trivial, but worth flagging.
>
> - The DSA genphy force-bind sequence Carlo hit
> (phy_attach_direct -> device_bind_driver -> deferred retry
> skipped) does not apply, because psec attachment is not tied to
> phy_probe.
>
> - Patch 1 of this series scopes the regulator handle held by
> pse_control to its own kref lifetime, fixing a latent
> dangling-handle sequence that the notifier unregister path makes
> reachable. This is a separate regulator-lifetime bug from the one
> Carlo's patch 1 addresses.

This seems to provide a solution to all our problems, and it is well
designed. Carlo, it would be nice if you could test it on your side.

> Validated end-to-end on a Cudy C200P (MT7621 DSA + i2c IP804AR as
> module), with lockdep active, across the i2c driver unbind/rebind
> cycle that triggers UNREGISTERED -> REGISTERED on the live system.
> The cover letter has the full evidence.
>
> I would welcome your view on whether this is the shape you had in
> mind on 9 Apr, or whether the MDI-based binding you raised with
> Maxime is the better endpoint and we should be reshaping around that.
> Happy to keep this RFC as the scaffolding either way.

After a second thought, MDI-based binding is currently not in the pipe
and we should solve this without it for now.

Regards,
--
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

Hi,

I tested the series on a Hasivo S1100WP-8GT-SE with hs104 PSE chips.
Boot is clean, no more probe loop. ethtool --set-pse on/off works, and
a PD plugged into a port gets power. So the fix works for me.

One thing I noticed: rmmod alone does not work while ports are
attached. Each PHY that holds a psec keeps a ref on the PSE driver, so
the module count stays above 0 and rmmod fails. Unbinding the i2c
devices first works:

    echo 0-000d > /sys/bus/i2c/drivers/hasivo-hs104/unbind
    echo 0-0015 > /sys/bus/i2c/drivers/hasivo-hs104/unbind
    rmmod hasivo-hs104

After unbind your PSE_UNREGISTERED notifier fires, the PHYs drop the
psec, and the module count goes to 0. Then rmmod works.

Not a big deal, but maybe worth a short note in the cover letter so
people know they need to unbind first. Or if there is a clean way to
skip the per-consumer module_get and just rely on the unbind path, that
would be nice too. What do other folks think?

One thing my test does not cover yet is the SFP path change in sfp.c,
since the S1100WP-8GT-SE is copper only. I do have an
S600WP-5GT-2SX-SE that uses the same hs104 chips plus SFP+ cages, but
it is not on my desk right now. I can run that part on it later.

Anyway, thanks for the work on this. Happy to test a v2.

Carlo

Tested-by: Carlo Szelinsky <github@szelinsky.de>

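The rmmod behaviour Carlo reports can be modelled with a small userspace sketch. It is a toy model under stated assumptions, not kernel code: demo_module, demo_pse, demo_psec_get(), demo_pse_unbind() and demo_rmmod() are all hypothetical names, and the real try_module_get() and notifier paths are reduced to plain counters.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_module {
    int refcnt;     /* module use count */
    bool loaded;
};

struct demo_pse {
    struct demo_module *owner;  /* driver module providing this controller */
    bool bound;                 /* i2c device currently bound to the driver */
    int consumers;              /* PHYs holding a psec on this controller */
};

/* A PHY taking a psec pins the provider module, as a per-consumer module ref does. */
bool demo_psec_get(struct demo_pse *pse)
{
    if (!pse->bound)
        return false;
    pse->owner->refcnt++;
    pse->consumers++;
    return true;
}

/* Unbind: the UNREGISTERED event makes every consumer drop its psec. */
void demo_pse_unbind(struct demo_pse *pse)
{
    pse->owner->refcnt -= pse->consumers;
    pse->consumers = 0;
    pse->bound = false;
    assert(pse->owner->refcnt >= 0);
}

/* Module removal is refused while any reference is still held. */
bool demo_rmmod(struct demo_module *mod)
{
    if (mod->refcnt > 0)
        return false;
    mod->loaded = false;
    return true;
}
```

With two controllers sharing one driver module, as with the two hs104 devices above, demo_rmmod() keeps failing until both have been unbound, which matches the unbind-then-rmmod order in the report.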
Hi Carlo,

Thanks for testing, and for the clear writeup of the rmmod path. Will
add your Tested-by to v2.

On the rmmod / module_get question: I think you are right that the
per-consumer try_module_get is structurally redundant once the notifier
exists, since PSE_UNREGISTERED already gives us a synchronous
invalidation point at unregister time. The reason I did not strip it in
v1 was caution: I want to walk every pse_control_get / pse_control_put
consumer and confirm no path holds a transient reference outside of
rtnl-lock (or some equivalent serialization) such that an in-flight use
could race the unregister notifier. That audit is on my plate now. I
will share findings on this thread before v2, and let the result drive
whether the removal lands in v2 or as a separate discussion.

Glad to hear the probe loop is gone on your bench and that ethtool plus
PD power are behaving. I expect to have an SFP-capable board on my own
bench later this week, so I will cover the sfp.c path change before
sending v2; a retest from you on the SFP chassis whenever it is
convenient is still very welcome as independent confirmation.

Kory, Jakub, Russell, others: input welcome on whether dropping
try_module_get in favor of pure notifier-based invalidation is the
direction you would prefer here, or whether the belt-and-suspenders
refcount is worth keeping.

Thanks,
Corey

Hi Corey,

just checking in on this one. Did you get a chance to continue with the
series, or is there anything I can help with to move it forward? I'm
happy to test a v2, and I can still run the SFP path on the
S600WP-5GT-2SX-SE once it's back on my desk.

Kory, Jakub, Russell :-) it would be great to hear your view on the
approach so Corey can plan the next version. The series fixed the probe
loop in my testing and I'd really like to see it land.

Thanks,
Carlo

Hello Carlo,

On Mon, 15 Jun 2026 20:08:12 +0200
Carlo Szelinsky <github@szelinsky.de> wrote:

> Hi Corey,
>
> just checking in on this one. Did you get a chance to continue with the
> series, or is there anything I can help with to move it forward? I'm
> happy to test a v2, and I can still run the SFP path on the
> S600WP-5GT-2SX-SE once it's back on my desk.
>
> Kory, Jakub, Russell :-) it would be great to hear your view on the
> approach so Corey can plan the next version. The series fixed the probe
> loop in my testing and I'd really like to see it land.

I haven't heard from Corey since this patch series, but I am in favor
of this notifier design.
Corey, do you have time to continue this work? If not, would it be okay
for Carlo to continue it for you?

Regards,
--
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com