[net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider

Christian Marangi posted 6 patches 9 months ago
[net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Christian Marangi 9 months ago
On using OF PCS implementation, it can happen that the PCS driver still
needs to be probed at the time the MAC driver calls .mac_select_pcs.

In such case, PCS provider will return probe defer as the PCS can't be
found in the global provider list. It's expected the MAC driver to
retry probing at a later time when the PCS driver is actually probed.

To handle this, check the return value of phylink_validate (that will
call mac_select_pcs to validate the interface) in phylink_create and
make phylink_create return -EPROBE_DEFER. The MAC driver will check this
return value and retry probing.

PCS can be removed unexpectedly at later time (driver gets removed) and
not available anymore. In such case, PCS provider will also return probe
defer as the PCS can't be found in the global provider list. (as the PCS
driver on removal will delete itself as PCS provider)

To communicate correct error message, in phylink_major_config() detect
this condition and treat -EPROBE_DEFER error from .mac_select_pcs as
-ENOENT as probe defer is only allowed at probe stage.

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
---
 drivers/net/phy/phylink.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 7f71547e89fe..c6d9e4efed13 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -1395,6 +1395,15 @@ static void phylink_major_config(struct phylink *pl, bool restart,
 	if (pl->mac_ops->mac_select_pcs) {
 		pcs = pl->mac_ops->mac_select_pcs(pl->config, state->interface);
 		if (IS_ERR(pcs)) {
+			/* PCS can be removed unexpectedly and not available
+			 * anymore.
+			 * PCS provider will return probe defer as the PCS
+			 * can't be found in the global provider list.
+			 * In such case, return -ENOENT as a more symbolic name
+			 * for the error message.
+			 */
+			if (PTR_ERR(pcs) == -EPROBE_DEFER)
+				pcs = ERR_PTR(-ENOENT);
 			phylink_err(pl,
 				    "mac_select_pcs unexpectedly failed: %pe\n",
 				    pcs);
@@ -2004,7 +2013,18 @@ struct phylink *phylink_create(struct phylink_config *config,
 
 	linkmode_fill(pl->supported);
 	linkmode_copy(pl->link_config.advertising, pl->supported);
-	phylink_validate(pl, pl->supported, &pl->link_config);
+	ret = phylink_validate(pl, pl->supported, &pl->link_config);
+	/* The PCS might not available at the time phylink_create
+	 * is called. Check this and communicate to the MAC driver
+	 * that probe should be retried later.
+	 *
+	 * Notice that this can only happen in probe stage and PCS
+	 * is expected to be avaialble in phylink_major_config.
+	 */
+	if (ret == -EPROBE_DEFER) {
+		kfree(pl);
+		return ERR_PTR(ret);
+	}
 
 	ret = phylink_parse_mode(pl, fwnode);
 	if (ret < 0) {
-- 
2.48.1
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Russell King (Oracle) 9 months ago
On Wed, Mar 19, 2025 at 12:58:39AM +0100, Christian Marangi wrote:
> diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
> index 7f71547e89fe..c6d9e4efed13 100644
> --- a/drivers/net/phy/phylink.c
> +++ b/drivers/net/phy/phylink.c
> @@ -1395,6 +1395,15 @@ static void phylink_major_config(struct phylink *pl, bool restart,
>  	if (pl->mac_ops->mac_select_pcs) {
>  		pcs = pl->mac_ops->mac_select_pcs(pl->config, state->interface);
>  		if (IS_ERR(pcs)) {
> +			/* PCS can be removed unexpectedly and not available
> +			 * anymore.
> +			 * PCS provider will return probe defer as the PCS
> +			 * can't be found in the global provider list.
> +			 * In such case, return -ENOENT as a more symbolic name
> +			 * for the error message.
> +			 */
> +			if (PTR_ERR(pcs) == -EPROBE_DEFER)
> +				pcs = ERR_PTR(-ENOENT);

I don't particularly like the idea of returning -EPROBE_DEFER from
mac_select_pcs()... there is no way *ever* that such an error code
could be handled.

>  	linkmode_fill(pl->supported);
>  	linkmode_copy(pl->link_config.advertising, pl->supported);
> -	phylink_validate(pl, pl->supported, &pl->link_config);
> +	ret = phylink_validate(pl, pl->supported, &pl->link_config);
> +	/* The PCS might not available at the time phylink_create
> +	 * is called. Check this and communicate to the MAC driver
> +	 * that probe should be retried later.
> +	 *
> +	 * Notice that this can only happen in probe stage and PCS
> +	 * is expected to be avaialble in phylink_major_config.
> +	 */
> +	if (ret == -EPROBE_DEFER) {
> +		kfree(pl);
> +		return ERR_PTR(ret);
> +	}

This does not solve the problem - what if the interface mode is
currently not one that requires a PCS that may not yet be probed?

I don't like the idea that mac_select_pcs() might be doing a complex
lookup - that could make scanning the interface modes (as
phylink_validate_mask() does) quite slow and unreliable, and phylink
currently assumes that a PCS that is validated as present will remain
present.

If it goes away by the time phylink_major_config() is called, then we
leave the phylink state no longer reflecting how the hardware is
programmed, but we still continue to call mac_link_up() - which should
probably be fixed.

Given that netdev is severely backlogged, I'm not inclined to add to
the netdev maintainers workloads by trying to fix this until after
the merge window - it looks like they're at least one week behind.
Consequently, I'm expecting that most patches that have been
submitted during this week will be dropped from patchwork, which
means submitting patches this week is likely not useful.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Christian Marangi 9 months ago
On Wed, Mar 19, 2025 at 03:58:14PM +0000, Russell King (Oracle) wrote:
> On Wed, Mar 19, 2025 at 12:58:39AM +0100, Christian Marangi wrote:
> > diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
> > index 7f71547e89fe..c6d9e4efed13 100644
> > --- a/drivers/net/phy/phylink.c
> > +++ b/drivers/net/phy/phylink.c
> > @@ -1395,6 +1395,15 @@ static void phylink_major_config(struct phylink *pl, bool restart,
> >  	if (pl->mac_ops->mac_select_pcs) {
> >  		pcs = pl->mac_ops->mac_select_pcs(pl->config, state->interface);
> >  		if (IS_ERR(pcs)) {
> > +			/* PCS can be removed unexpectedly and not available
> > +			 * anymore.
> > +			 * PCS provider will return probe defer as the PCS
> > +			 * can't be found in the global provider list.
> > +			 * In such case, return -ENOENT as a more symbolic name
> > +			 * for the error message.
> > +			 */
> > +			if (PTR_ERR(pcs) == -EPROBE_DEFER)
> > +				pcs = ERR_PTR(-ENOENT);
> 
> I don't particularly like the idea of returning -EPROBE_DEFER from
> mac_select_pcs()... there is no way *ever* that such an error code
> could be handled.
>

Maybe this wasn't clear enough, the idea here is that at major_config
under normal situation this case should never happen unless the driver
was removed. In such case the PCS provider returns a EPROBE_DEFER that
in this case is assumed driver not present anymore. Hence phylink fails
to apply the configuration similar to the other fail case in the same
function.

The principle here is not "we need to wait for PCS" but react on the
fact that it was removed in the meantime. (something that should not
happen as the PCS driver is expected to dev_close the interface)

> >  	linkmode_fill(pl->supported);
> >  	linkmode_copy(pl->link_config.advertising, pl->supported);
> > -	phylink_validate(pl, pl->supported, &pl->link_config);
> > +	ret = phylink_validate(pl, pl->supported, &pl->link_config);
> > +	/* The PCS might not available at the time phylink_create
> > +	 * is called. Check this and communicate to the MAC driver
> > +	 * that probe should be retried later.
> > +	 *
> > +	 * Notice that this can only happen in probe stage and PCS
> > +	 * is expected to be avaialble in phylink_major_config.
> > +	 */
> > +	if (ret == -EPROBE_DEFER) {
> > +		kfree(pl);
> > +		return ERR_PTR(ret);
> > +	}
> 
> This does not solve the problem - what if the interface mode is
> currently not one that requires a PCS that may not yet be probed?

Mhhh but what are the actual real world scenario for this? If a MAC
needs a dedicated PCS to handle multiple mode then it will probably
follow this new implementation and register as a provider.

An option to handle your corner case might be an OP that wait for each
supported interface by the MAC and make sure there is a possible PCS for
it. And Ideally place it in the codeflow of validate_pcs ?

> 
> I don't like the idea that mac_select_pcs() might be doing a complex
> lookup - that could make scanning the interface modes (as
> phylink_validate_mask() does) quite slow and unreliable, and phylink
> currently assumes that a PCS that is validated as present will remain
> present.

The assumption "will remain present" is already very fragile with the
current PCS so I feel this should be changed or improved. Honestly every
PCS currently implemented can be removed and phylink will stay in an
undefined state.

Also the complex lookup in 99% of the time is really checking one/2 max
PCS for a single interface and we are really checking a list and a
bitmap, nothing fancy that might introduce delay waiting for something.

> 
> If it goes away by the time phylink_major_config() is called, then we
> leave the phylink state no longer reflecting how the hardware is
> programmed, but we still continue to call mac_link_up() - which should
> probably be fixed.

Again, the idea to prevent these kind of chicken-egg problem is to
enforce correct removal on the PCS driver side.

> 
> Given that netdev is severely backlogged, I'm not inclined to add to
> the netdev maintainers workloads by trying to fix this until after
> the merge window - it looks like they're at least one week behind.
> Consequently, I'm expecting that most patches that have been
> submitted during this week will be dropped from patchwork, which
> means submitting patches this week is likely not useful.
> 

Ok I will send next revision as RFC to not increase the "load" but IMHO
it's worth to discuss this... I really feel we need to fix the PCS
situation ASAP or more driver will come. (there are already 3 in queue
as stressed in the cover letter)

-- 
	Ansuel
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Russell King (Oracle) 9 months ago
On Wed, Mar 19, 2025 at 05:18:50PM +0100, Christian Marangi wrote:
> > >  	linkmode_fill(pl->supported);
> > >  	linkmode_copy(pl->link_config.advertising, pl->supported);
> > > -	phylink_validate(pl, pl->supported, &pl->link_config);
> > > +	ret = phylink_validate(pl, pl->supported, &pl->link_config);
> > > +	/* The PCS might not available at the time phylink_create
> > > +	 * is called. Check this and communicate to the MAC driver
> > > +	 * that probe should be retried later.
> > > +	 *
> > > +	 * Notice that this can only happen in probe stage and PCS
> > > +	 * is expected to be avaialble in phylink_major_config.
> > > +	 */
> > > +	if (ret == -EPROBE_DEFER) {
> > > +		kfree(pl);
> > > +		return ERR_PTR(ret);
> > > +	}
> > 
> > This does not solve the problem - what if the interface mode is
> > currently not one that requires a PCS that may not yet be probed?
> 
> Mhhh but what are the actual real world scenario for this? If a MAC
> needs a dedicated PCS to handle multiple mode then it will probably
> follow this new implementation and register as a provider.
> 
> An option to handle your corner case might be an OP that wait for each
> supported interface by the MAC and make sure there is a possible PCS for
> it. And Ideally place it in the codeflow of validate_pcs ?

I think you've fallen in to the trap of the stupid drivers that
implement mac_select_pcs() as:

static struct phylink_pcs *foo_mac_select_pcs(struct phylink_config *config,
					      phy_interface_t interface)
{
	struct foo_private *priv = phylink_to_foo(config);

	return priv->pcs;
}

but what drivers can (and should) be doing is looking at the interface
argument, and working out which interface to return.

Phylink is not designed to be single interface mode, single PCS driver
despite what many MAC drivers do. Checking the phylink_validate()
return code doesn't mean that all PCS exist for the MAC.

> > I don't like the idea that mac_select_pcs() might be doing a complex
> > lookup - that could make scanning the interface modes (as
> > phylink_validate_mask() does) quite slow and unreliable, and phylink
> > currently assumes that a PCS that is validated as present will remain
> > present.
> 
> The assumption "will remain present" is already very fragile with the
> current PCS so I feel this should be changed or improved. Honestly every
> PCS currently implemented can be removed and phylink will stay in an
> undefined state.

The fragility is because of the way networking works - there's nothing
phylink can do about this.

I take issue with "every PCS currently implemented" because it's
actually not a correct statement.

XPCS as used by stmmac does not fall into this.
The PCS used by mvneta and mvpp2 do not fall into this.
The PCS used by the Marvell DSA driver do not fall into this.

It's only relatively recently with pcs-lynx and others that people have
wanted them to be separate driver-model devices that this problem has
occurred, and I've been pushing back on it saying we need to find a
proper solution to it. I really haven't liked that we've merged drivers
that cause this fragility without addressing that fragility.

I've got to the point where I'm now saying no to new drivers that fail
to address this, so we're at a crunch time when it needs to be
addressed.

We need to think about how to get around this fragility. The need to
pre-validate the link modes comes from the netdev ethtool user
interface itself - the need to tell userspace what link modes can be
supported _before_ they get used. This API hasn't been designed with
the idea that parts of a netdev might vanish at any particular time.

> > If it goes away by the time phylink_major_config() is called, then we
> > leave the phylink state no longer reflecting how the hardware is
> > programmed, but we still continue to call mac_link_up() - which should
> > probably be fixed.
> 
> Again, the idea to prevent these kind of chicken-egg problem is to
> enforce correct removal on the PCS driver side.
> 
> > Given that netdev is severely backlogged, I'm not inclined to add to
> > the netdev maintainers workloads by trying to fix this until after
> > the merge window - it looks like they're at least one week behind.
> > Consequently, I'm expecting that most patches that have been
> > submitted during this week will be dropped from patchwork, which
> > means submitting patches this week is likely not useful.
> 
> Ok I will send next revision as RFC to not increase the "load" but IMHO
> it's worth to discuss this... I really feel we need to fix the PCS
> situation ASAP or more driver will come. (there are already 3 in queue
> as stressed in the cover letter)

Yes, we do need to fix it, but we need to recognise _all_ the issues
it creates by doing this, and how we handle it properly.

Right now, it's up to the MAC driver to get all the PCS it needs
during its probe function, and *not* in the mac_select_pcs() method
which has no way to propagate an error to anywhere sensible that
could handle an EPROBE_DEFER response.

My thoughts are that if a PCS goes away after a MAC driver has "got"
it, then:

1. we need to recognise that those PHY interfaces and/or link modes
   are no longer available.
2. if the PCS was in-use, then the link needs to be taken down at
   minimum and the .pcs_disable() method needs to be called to
   release any resources that .pcs_enable() enabled (e.g. irq masks,
   power enables, etc.)
3. the MAC driver needs to be notified that the PCS pointer it
   stashed is no longer valid, so it doesn't return it for
   mac_select_pcs().

There's probably a bunch more that needs to happen, and maybe need
to consider how to deal with "pcs came back".. but I haven't thought
that through yet.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Christian Marangi 9 months ago
On Wed, Mar 19, 2025 at 05:02:50PM +0000, Russell King (Oracle) wrote:
> On Wed, Mar 19, 2025 at 05:18:50PM +0100, Christian Marangi wrote:
> > > >  	linkmode_fill(pl->supported);
> > > >  	linkmode_copy(pl->link_config.advertising, pl->supported);
> > > > -	phylink_validate(pl, pl->supported, &pl->link_config);
> > > > +	ret = phylink_validate(pl, pl->supported, &pl->link_config);
> > > > +	/* The PCS might not available at the time phylink_create
> > > > +	 * is called. Check this and communicate to the MAC driver
> > > > +	 * that probe should be retried later.
> > > > +	 *
> > > > +	 * Notice that this can only happen in probe stage and PCS
> > > > +	 * is expected to be avaialble in phylink_major_config.
> > > > +	 */
> > > > +	if (ret == -EPROBE_DEFER) {
> > > > +		kfree(pl);
> > > > +		return ERR_PTR(ret);
> > > > +	}
> > > 
> > > This does not solve the problem - what if the interface mode is
> > > currently not one that requires a PCS that may not yet be probed?
> > 
> > Mhhh but what are the actual real world scenario for this? If a MAC
> > needs a dedicated PCS to handle multiple mode then it will probably
> > follow this new implementation and register as a provider.
> > 
> > An option to handle your corner case might be an OP that wait for each
> > supported interface by the MAC and make sure there is a possible PCS for
> > it. And Ideally place it in the codeflow of validate_pcs ?
> 
> I think you've fallen in to the trap of the stupid drivers that
> implement mac_select_pcs() as:
> 
> static struct phylink_pcs *foo_mac_select_pcs(struct phylink_config *config,
> 					      phy_interface_t interface)
> {
> 	struct foo_private *priv = phylink_to_foo(config);
> 
> 	return priv->pcs;
> }
> 
> but what drivers can (and should) be doing is looking at the interface
> argument, and working out which interface to return.
> 
> Phylink is not designed to be single interface mode, single PCS driver
> despite what many MAC drivers do. Checking the phylink_validate()
> return code doesn't mean that all PCS exist for the MAC.
>
> > > I don't like the idea that mac_select_pcs() might be doing a complex
> > > lookup - that could make scanning the interface modes (as
> > > phylink_validate_mask() does) quite slow and unreliable, and phylink
> > > currently assumes that a PCS that is validated as present will remain
> > > present.
> > 
> > The assumption "will remain present" is already very fragile with the
> > current PCS so I feel this should be changed or improved. Honestly every
> > PCS currently implemented can be removed and phylink will stay in an
> > undefined state.
> 
> The fragility is because of the way networking works - there's nothing
> phylink can do about this.
> 
> I take issue with "every PCS currently implemented" because it's
> actually not a correct statement.
> 
> XPCS as used by stmmac does not fall into this.
> The PCS used by mvneta and mvpp2 do not fall into this.
> The PCS used by the Marvell DSA driver do not fall into this.
> 
> It's only relatively recently with pcs-lynx and others that people have
> wanted them to be separate driver-model devices that this problem has
> occurred, and I've been pushing back on it saying we need to find a
> proper solution to it. I really haven't liked that we've merged drivers
> that cause this fragility without addressing that fragility.
> 
> I've got to the point where I'm now saying no to new drivers that fail
> to address this, so we're at a crunch time when it needs to be
> addressed.
> 
> We need to think about how to get around this fragility. The need to
> pre-validate the link modes comes from the netdev ethtool user
> interface itself - the need to tell userspace what link modes can be
> supported _before_ they get used. This API hasn't been designed with
> the idea that parts of a netdev might vanish at any particular time.
>
> > > If it goes away by the time phylink_major_config() is called, then we
> > > leave the phylink state no longer reflecting how the hardware is
> > > programmed, but we still continue to call mac_link_up() - which should
> > > probably be fixed.
> > 
> > Again, the idea to prevent these kind of chicken-egg problem is to
> > enforce correct removal on the PCS driver side.
> > 
> > > Given that netdev is severely backlogged, I'm not inclined to add to
> > > the netdev maintainers workloads by trying to fix this until after
> > > the merge window - it looks like they're at least one week behind.
> > > Consequently, I'm expecting that most patches that have been
> > > submitted during this week will be dropped from patchwork, which
> > > means submitting patches this week is likely not useful.
> > 
> > Ok I will send next revision as RFC to not increase the "load" but IMHO
> > it's worth to discuss this... I really feel we need to fix the PCS
> > situation ASAP or more driver will come. (there are already 3 in queue
> > as stressed in the cover letter)
> 
> Yes, we do need to fix it, but we need to recognise _all_ the issues
> it creates by doing this, and how we handle it properly.
> 
> Right now, it's up to the MAC driver to get all the PCS it needs
> during its probe function, and *not* in the mac_select_pcs() method
> which has no way to propagate an error to anywhere sensible that
> could handle an EPROBE_DEFER response.
> 
> My thoughts are that if a PCS goes away after a MAC driver has "got"
> it, then:
> 
> 1. we need to recognise that those PHY interfaces and/or link modes
>    are no longer available.
> 2. if the PCS was in-use, then the link needs to be taken down at
>    minimum and the .pcs_disable() method needs to be called to
>    release any resources that .pcs_enable() enabled (e.g. irq masks,
>    power enables, etc.)
> 3. the MAC driver needs to be notified that the PCS pointer it
>    stashed is no longer valid, so it doesn't return it for
>    mac_select_pcs().

But why we need all these indirect handling and checks if we can
make use of .remove and shutdown the interface. A removal of a PCS
should cause the entire link to go down, isn't a dev_close enough to
propagate this? If and when the interface will came up checks are done
again and it will fail to go UP if PCS can't be found.

I know it's a drastic approach to call dev_close but link is down anyway
so lets reinit everything from scratch. It should handle point 2 and 3
right?

For point 1, additional entry like available_interface? And gets updated
once a PCS gets removed??? Or if we don't like the parsing hell we map
every interface to a PCS pointer? (not worth the wasted space IMHO)

> 
> There's probably a bunch more that needs to happen, and maybe need
> to consider how to deal with "pcs came back".. but I haven't thought
> that through yet.
>

Current approach supports PCS came back as we check the global provider
list and the PCS is reachable again there.
(we tasted various scenario with unbind/bind while the interface was
up/down)

> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

-- 
	Ansuel
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Russell King (Oracle) 9 months ago
On Wed, Mar 19, 2025 at 06:35:21PM +0100, Christian Marangi wrote:
> On Wed, Mar 19, 2025 at 05:02:50PM +0000, Russell King (Oracle) wrote:
> > My thoughts are that if a PCS goes away after a MAC driver has "got"
> > it, then:
> > 
> > 1. we need to recognise that those PHY interfaces and/or link modes
> >    are no longer available.
> > 2. if the PCS was in-use, then the link needs to be taken down at
> >    minimum and the .pcs_disable() method needs to be called to
> >    release any resources that .pcs_enable() enabled (e.g. irq masks,
> >    power enables, etc.)
> > 3. the MAC driver needs to be notified that the PCS pointer it
> >    stashed is no longer valid, so it doesn't return it for
> >    mac_select_pcs().
> 
> But why we need all these indirect handling and checks if we can
> make use of .remove and shutdown the interface. A removal of a PCS
> should cause the entire link to go down, isn't a dev_close enough to
> propagate this? If and when the interface will came up checks are done
> again and it will fail to go UP if PCS can't be found.
> 
> I know it's a drastic approach to call dev_close but link is down anyway
> so lets reinit everything from scratch. It should handle point 2 and 3
> right?

Let's look at what dev_close() does. This is how it's documented:

 * dev_close() - shutdown an interface
 * @dev: device to shutdown
 *
 * This function moves an active device into down state. A
 * %NETDEV_GOING_DOWN is sent to the netdev notifier chain. The device
 * is then deactivated and finally a %NETDEV_DOWN is sent to the notifier
 * chain.

So, this is equivalent to userspace doing:

# ip li set dev ethX down

and nothing prevents userspace doing:

# ip li set dev ethX up

after that call to dev_close() has returned.

If this happens, then the netdev driver's .ndo_open will be called,
which will then call phylink_start(), and that will attempt to bring
the link back up. That will call .mac_select_pcs(), which _if_ the
PCS is still "published" means it is _still_ accessible.

So your call that results in dev_close() with the PCS still being
published is ineffectual.

It's *no* different from this crap in the stmmac driver:

        stmmac_stop_all_dma(priv);
        stmmac_mac_set(priv, priv->ioaddr, false);
        unregister_netdev(ndev);

because *until* that unregister_netdev() call has completed, _userspace_
still has control over the netdev, and can do whatever it damn well
pleases.

Look, this is very very very simple.

If something is published to another part of the code, it is
discoverable, and it can be used or manipulated by new users.

If we wish to take something away, then first, it must be
unpublished to prevent new users discovering the resource. Then
existing users need to be dealt with in a safe way. Only at that
point can we be certain that there are no users, and thus the
underlying device begin to be torn down.

It's entirely logical!

> For point 1, additional entry like available_interface? And gets updated
> once a PCS gets removed??? Or if we don't like the parsing hell we map
> every interface to a PCS pointer? (not worth the wasted space IMHO)

At the moment, MAC drivers that I've updated will do things like:

                phy_interface_or(priv->phylink_config.supported_interfaces,
                                 priv->phylink_config.supported_interfaces,
                                 pcs->supported_interfaces);

phylink_config.supported_interfaces is the set of interface modes that
the MAC _and_ PCS subsystem supports. It's not just the MAC, it's both
together.

So, if a PCS is going away, then clearing the interface modes that the
PCS was providing would make sense - but there's a problem here. What
if the PCS is a bought-in bit of IP where the driver supports many modes
but the MAC doesn't use it for all those modes. So... which interface
modes get cleared is up to the MAC driver to decide.

> > There's probably a bunch more that needs to happen, and maybe need
> > to consider how to deal with "pcs came back".. but I haven't thought
> > that through yet.
> 
> Current approach supports PCS came back as we check the global provider
> list and the PCS is reachable again there.
> (we tasted various scenario with unbind/bind while the interface was
> up/down)

... because you look up the PCS in the mac_select_pcs() callback which
leads to a different race to what we have today, this time inside the
phylink code which thankfully phylink prints an error which is *NEVER*
supposed to happen.

Just trading one problem for another problem.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Christian Marangi 8 months, 3 weeks ago
On Wed, Mar 19, 2025 at 07:31:04PM +0000, Russell King (Oracle) wrote:
> On Wed, Mar 19, 2025 at 06:35:21PM +0100, Christian Marangi wrote:
> > On Wed, Mar 19, 2025 at 05:02:50PM +0000, Russell King (Oracle) wrote:
> > > My thoughts are that if a PCS goes away after a MAC driver has "got"
> > > it, then:
> > > 
> > > 1. we need to recognise that those PHY interfaces and/or link modes
> > >    are no longer available.
> > > 2. if the PCS was in-use, then the link needs to be taken down at
> > >    minimum and the .pcs_disable() method needs to be called to
> > >    release any resources that .pcs_enable() enabled (e.g. irq masks,
> > >    power enables, etc.)
> > > 3. the MAC driver needs to be notified that the PCS pointer it
> > >    stashed is no longer valid, so it doesn't return it for
> > >    mac_select_pcs().
> > 
> > But why we need all these indirect handling and checks if we can
> > make use of .remove and shutdown the interface. A removal of a PCS
> > should cause the entire link to go down, isn't a dev_close enough to
> > propagate this? If and when the interface will came up checks are done
> > again and it will fail to go UP if PCS can't be found.
> > 
> > I know it's a drastic approach to call dev_close but link is down anyway
> > so lets reinit everything from scratch. It should handle point 2 and 3
> > right?
> 
> Let's look at what dev_close() does. This is how it's documented:
> 
>  * dev_close() - shutdown an interface
>  * @dev: device to shutdown
>  *
>  * This function moves an active device into down state. A
>  * %NETDEV_GOING_DOWN is sent to the netdev notifier chain. The device
>  * is then deactivated and finally a %NETDEV_DOWN is sent to the notifier
>  * chain.
> 
> So, this is equivalent to userspace doing:
> 
> # ip li set dev ethX down
> 
> and nothing prevents userspace doing:
> 
> # ip li set dev ethX up
> 
> after that call to dev_close() has returned.
> 
> If this happens, then the netdev driver's .ndo_open will be called,
> which will then call phylink_start(), and that will attempt to bring
> the link back up. That will call .mac_select_pcs(), which _if_ the
> PCS is still "published" means it is _still_ accessible.
> 
> So your call that results in dev_close() with the PCS still being
> published is ineffectual.
> 
> It's *no* different from this crap in the stmmac driver:
> 
>         stmmac_stop_all_dma(priv);
>         stmmac_mac_set(priv, priv->ioaddr, false);
>         unregister_netdev(ndev);
> 
> because *until* that unregister_netdev() call has completed, _userspace_
> still has control over the netdev, and can do whatever it damn well
> pleases.
> 
> Look, this is very very very simple.
> 
> If something is published to another part of the code, it is
> discoverable, and it can be used or manipulated by new users.
> 
> If we wish to take something away, then first, it must be
> unpublished to prevent new users discovering the resource. Then
> existing users need to be dealt with in a safe way. Only at that
> point can we be certain that there are no users, and thus the
> underlying device begin to be torn down.
> 
> It's entirely logical!
>

OK so (I think this was also suggested in the more specific PCS patch)
- 1. unpublish the PCS from the provider
- 2. put down the link...

I feel point 2 is the big effort here to solve. Mainly your problem is
the fact that phylink_major_config should not handle PROBE_DEFER and
should always have all the expected PCS available. (returned from
mac_select_pcs)

So the validation MUST ALWAYS be done before reaching that code path.

That means that when a PCS is removed, the entire phylink should be
refreshed and reevaluated. And at the same time lock userspace from
doing anything fancy (as there might be a possibility for
phylink_major_config)

Daniel at some point in the brainstorm process suggested that we might
need something like phylink_impair() to lock it while it's getting
""refreshed"". Do you think that might be a good path for this?

One of the first implementation of this called phylink_stop (not
dev_stop) so maybe I should reconsider keeping everything phylink
related. But that wouldn't put the interface down from userspace if I'm
not wrong.

It's point 3 (of the old list) "the MAC driver needs to be notified that
the PCS pointer it stashed is no longer valid, so it doesn't return it for
mac_select_pcs()." my problem. I still feel MAC should not track PCS but
only react on the presence (or absence) of them.

And this point is really connected to point 1 so I guess point 1 is the
first to handle, before this. (I also feel it will magically solved once
point 1 is handled)

> > For point 1, additional entry like available_interface? And gets updated
> > once a PCS gets removed??? Or if we don't like the parsing hell we map
> > every interface to a PCS pointer? (not worth the wasted space IMHO)
> 
> At the moment, MAC drivers that I've updated will do things like:
> 
>                 phy_interface_or(priv->phylink_config.supported_interfaces,
>                                  priv->phylink_config.supported_interfaces,
>                                  pcs->supported_interfaces);
> 
> phylink_config.supported_interfaces is the set of interface modes that
> the MAC _and_ PCS subsystem supports. It's not just the MAC, it's both
> together.
> 
> So, if a PCS is going away, then clearing the interface modes that the
> PCS was providing would make sense - but there's a problem here. What
> if the PCS is a bought-in bit of IP where the driver supports many modes
> but the MAC doesn't use it for all those modes. So... which interface
> modes get cleared is up to the MAC driver to decide.
> 

Should we add an OP to handle removal of PCS from a MAC? Like
.mac_release_pcs ? I might be wrong but isn't that giving too much
freedom to the driver?

I need to recheck how the interface validation work and what values are
used but with this removal thing on the table, supported_interfaces OR
with the PCS supported_interface might be problematic and maybe the
original values should be stored somewhere.

> > > There's probably a bunch more that needs to happen, and maybe need
> > > to consider how to deal with "pcs came back".. but I haven't thought
> > > that through yet.
> > 
> > Current approach supports PCS came back as we check the global provider
> > list and the PCS is reachable again there.
> > (we tasted various scenario with unbind/bind while the interface was
> > up/down)
> 
> ... because you look up the PCS in the mac_select_pcs() callback which
> leads to a different race to what we have today, this time inside the
> phylink code which thankfully phylink prints an error which is *NEVER*
> supposed to happen.
>

I want to make sure tho you are ok with the usage of .mac_select_pcs
for re-evaluation task.

Maybe a better approach is to introduce .mac_get_pcs and enforce the
usage only on validation phase? (aka in phylink_validate_mac_and_pcs)

AFAIK in that phase .mac_select_pcs can return errors if the requested
interface is not possible for one reason or another.

What do you think? In short 2 additional OP that with the select one
result in:

- get
- select
- release

-- 
	Ansuel
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Russell King (Oracle) 8 months, 3 weeks ago
On Thu, Mar 27, 2025 at 06:37:19PM +0100, Christian Marangi wrote:
> On Wed, Mar 19, 2025 at 07:31:04PM +0000, Russell King (Oracle) wrote:
> > If we wish to take something away, then first, it must be
> > unpublished to prevent new users discovering the resource. Then
> > existing users need to be dealt with in a safe way. Only at that
> > point can we be certain that there are no users, and thus the
> > underlying device begin to be torn down.
> > 
> > It's entirely logical!
> 
> OK so (I think this was also suggested in the more specific PCS patch)
> - 1. unpublish the PCS from the provider
> - 2. put down the link...
> 
> I feel point 2 is the big effort here to solve. Mainly your problem is
> the fact that phylink_major_config should not handle PROBE_DEFER and
> should always have all the expected PCS available. (returned from
> mac_select_pcs)
> 
> So the validation MUST ALWAYS be done before reaching that code path.

Yes, but there's a sting in the tail - e.g. if we take away link modes
that are advertised on some media by a PCS going away.

Here's a theoretical situation:

- separate PCS for SGMII and 10GBASE-R with their own drivers, muxed
  onto the same pins.
- PHY which switches its host interface between these two modes.

when both PCS are available, the PHY could be advertising 10base-T,
100base-Tx, 1000base-T and 10000base-T.

If one of these two PCS drivers becomes no longer available, then we
need to revalidate and update the PHY's advertisement.

> That means that when a PCS is removed, the entire phylink should be
> refreshed and reevaluated. And at the same time lock userspace from
> doing anything fancy (as there might be a possibility for
> phylink_major_config)

Taking the rtnl lock prevents userspace interfering with the interface
configuration.

> Daniel at some point in the brainstorm process suggested that we might
> need something like phylink_impair() to lock it while it's getting
> ""refreshed"". Do you think that might be a good path for this?

Taking the rtnl should be sufficient.

> One of the first implementation of this called phylink_stop (not
> dev_stop) so maybe I should reconsider keeping everything phylink
> related. But that wouldn't put the interface down from userspace if I'm
> not wrong.
> 
> It's point 3 (of the old list) "the MAC driver needs to be notified that
> the PCS pointer it stashed is no longer valid, so it doesn't return it for
> mac_select_pcs()." my problem. I still feel MAC should not track PCS but
> only react on the presence (or absence) of them.
> 
> And this point is really connected to point 1 so I guess point 1 is the
> first to handle, before this. (I also feel it will magically solved once
> point 1 is handled)

I'm wondering whether the mac_select_pcs() interface needs to be
revised - it's going to be difficult to do because there's many drivers
that blindly return a PCS irrespective of the interface mode.

I'm thinking that MAC drivers should register a PCS against a set of
interface modes that it wants to use it for (the interface modes that
a PCS supports is not necessarily the interface modes a MAC driver
wants to use it for.) That means we could handle mac_select_pcs()
entirely within phylink, which avoids storing PCS pointers in the MAC
driver.

I don't think this fully addresses the issues though.

However..

> > > For point 1, additional entry like available_interface? And gets updated
> > > once a PCS gets removed??? Or if we don't like the parsing hell we map
> > > every interface to a PCS pointer? (not worth the wasted space IMHO)
> > 
> > At the moment, MAC drivers that I've updated will do things like:
> > 
> >                 phy_interface_or(priv->phylink_config.supported_interfaces,
> >                                  priv->phylink_config.supported_interfaces,
> >                                  pcs->supported_interfaces);
> > 
> > phylink_config.supported_interfaces is the set of interface modes that
> > the MAC _and_ PCS subsystem supports. It's not just the MAC, it's both
> > together.
> > 
> > So, if a PCS is going away, then clearing the interface modes that the
> > PCS was providing would make sense - but there's a problem here. What
> > if the PCS is a bought-in bit of IP where the driver supports many modes
> > but the MAC doesn't use it for all those modes. So... which interface
> > modes get cleared is up to the MAC driver to decide.

.. we would then know which modes to clear from
phylink_config.supported_interfaces.

> Should we add an OP to handle removal of PCS from a MAC? Like
> .mac_release_pcs ? I might be wrong but isn't that giving too much
> freedom to the driver?
> 
> I need to recheck how the interface validation work and what values are
> used but with this removal thing on the table, supported_interfaces OR
> with the PCS supported_interface might be problematic and maybe the
> original values should be stored somewhere.

That thought also crossed my mind too.

> > > > There's probably a bunch more that needs to happen, and maybe need
> > > > to consider how to deal with "pcs came back".. but I haven't thought
> > > > that through yet.
> > > 
> > > Current approach supports PCS came back as we check the global provider
> > > list and the PCS is reachable again there.
> > > (we tasted various scenario with unbind/bind while the interface was
> > > up/down)
> > 
> > ... because you look up the PCS in the mac_select_pcs() callback which
> > leads to a different race to what we have today, this time inside the
> > phylink code which thankfully phylink prints an error which is *NEVER*
> > supposed to happen.
> >
> 
> I want to make sure tho you are ok with the usage of .mac_select_pcs
> for re-evaluation task.
> 
> Maybe a better approach is to introduce .mac_get_pcs and enforce the
> usage only on validation phase? (aka in phylink_validate_mac_and_pcs)
> 
> AFAIK in that phase .mac_select_pcs can return errors if the requested
> interface is not possible for one reason or another.

The problem is that the validation phase could happen in the distant
past in relation to when we actually use the results.

Consider the above case with SGMII + 10GBASE-R. We're running at
10G speeds, so we're using the 10GBASE-R PCS, and have been running for
a year. The cabling deteriorates, and the PHY renegotiates switching to
1G speed now wanting to use the SGMII PCS but someone's removed the
driver! The validation would've happened before the 10G link came up
but the use is now a significant time in the future.

In that scenario, if the SGMII PCS driver is available, then switching
to 1G speed is possible. If the driver isn't available, then the result
is that the link has gone down. That's the same result if we stop
advertising the slower speeds - the PHY basically wouldn't be able to
establish link at a slower speed, so the link has gone down.

So, maybe than re-evaluate phylink, maybe just keep track of which
PHY interface modes we can no longer support, and if we attempt to
switch to one of those, force the link down. However, with:

f1ae32a709e0 ("net: phylink: force link down on major_config failure")

we now have the mechanics to do this - if mac_select_pcs returns an
error for the interface mode at major_config time, we force the link
down until such time that we do have a successful major configuration.
Maybe we should just rely on that rather than trying to do a full
re-evaluation and reprogram things like PHY advertisements.

Maybe in this case, we need a way to inform phylib - if I remember
correctly, there is a way to get the PHY to signal "remote fault" to
the link partner, and this kind of situation seems an ideal use, but
I'd need to check 802.3.

Yes, the MAC driver still needs to be aware of a PCS going away so it
can properly respond to mac_select_pcs().

Part of the issue here is that phylink wasn't designed from the start
to cope with PCS as separate drivers - it always assumed that the MAC
was in control of the PCS and would ensure that the PCS would remain
available. E.g. through a PCS specific create() method causing a direct
module relationship without the involvement of the driver model, or the
PCS driver being built-in to the MAC driver.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Re: [net-next PATCH 3/6] net: phylink: Correctly handle PCS probe defer from PCS provider
Posted by Russell King (Oracle) 8 months, 3 weeks ago
On Thu, Mar 27, 2025 at 06:37:19PM +0100, Christian Marangi wrote:
> OK so (I think this was also suggested in the more specific PCS patch)
> - 1. unpublish the PCS from the provider
> - 2. put down the link...
> 
> I feel point 2 is the big effort here to solve. Mainly your problem is
> the fact that phylink_major_config should not handle PROBE_DEFER and
> should always have all the expected PCS available. (returned from
> mac_select_pcs)

I'm going to do a quick reply here, because I'm on a high latency LTE
connection right now (seems Three's backhaul network for downstream is
being overloaded.)

I mad ea comment previously, and implemented it:

https://lore.kernel.org/r/E1twkqO-0006FI-Gm@rmk-PC.armlinux.org.uk

Remainder too painful to try and reply to at the moment.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!