Moving forward with stacked PHYs

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Moving forward with stacked PHYs
@ 2024-11-19 10:51 Maxime Chevallier
  2024-11-21  8:47 ` Oleksij Rempel
  0 siblings, 1 reply; 5+ messages in thread
From: Maxime Chevallier @ 2024-11-19 10:51 UTC (permalink / raw)
  To: netdev, pabeni, kuba, edumazet, davem, Andrew Lunn, Russell King,
	Romain Gantois, Florian Fainelli, Vladimir Oltean,
	Heiner Kallweit, Oleksij Rempel, Kory Maincent
  Cc: Thomas Petazzoni

Hi Netdev,

With Romain, we're currently trying to work on the stacked PHY problem,
where we'd like to get a proper support for Copper SFPs that are driven
by a media converter :

     RGMII       SGMII  +sfp----+
MAC ------- PHY ------- | PHY   |
                        +-------+

This is one of the cases where we have multiple PHYs on the link, on my
side I've been working on PHY muxes with parallel PHYs on the link :

       +-- PHY
MAC ---|
       +-- PHY

Both of these use-cases share one common issue, which is that some parts
of the kernel will try to directly access netdev->phydev, assuming there's
one and only PHY device that sits behind a MAC.

In the past months, I've worked on introducing the phy_link_topology, that
keeps track of all the PHYs that are sitting behind a netdev, and I've
used that infrastructure for some netlink commands that would access
the netdev->phydev field and replace that with :
  - A way to list the PHYs behind a netdev
  - Allowing netlink commands to target a specific PHY, and not "just"
    the netdev->phydev PHY.

On that front, there's more work coming to improve on that :
  - It would be good if the stats reported through ethtool would account
    for all the PHYs on the link, so that we would get a full overview
    of all stats from all components of the link
  - A netlink notification to indicate that a new PHY was found on the
    link is also in the plans.

There's a lot more work to do however, as these user-facing commands aren't
the only place where netdev->phydev is used.

The first place are the MAC drivers. For this, there seems to be 2 options :

 - move to phylink. The phylink API is pretty well designed as it makes
   the phydev totally irrelevant for the MAC driver. The MAC driver doesn't
   even need to know if it's connected to a PHY or something else (and SFP
   cage for example). That's nice, but there are still plenty of drivers
   that don't use phylink.

 - Introduce a new set of helpers that would abstact away the phydev from
   the netdev, but in a more straightforward way. MAC drivers that directly
   access netdev->phydev to configure the PHY's internal state or get some
   PHY info would instead be converted to using a set of helpers that will
   take the netdev as a parameter :

 static void ftgmac100_adjust_link(struct net_device *netdev)
 {
+	int phy_link, phy_speed, phy_duplex;
 	struct ftgmac100 *priv = netdev_priv(netdev);
 	struct phy_device *phydev = netdev->phydev;
 	bool tx_pause, rx_pause;
 	int new_speed;

+	netdev_phy_link_settings(netdev, &phy_link, &phy_speed, &phy_duplex);
+
 	/* We store "no link" as speed 0 */
-	if (!phydev->link)
+	if (!phy_link)
 		new_speed = 0;
 	else
-		new_speed = phydev->speed;
+		new_speed = phy_speed;
[...]

   The above is just an example of such helpers, Romain is currently
   investigating this and going over all the MAC drivers out there to
   assess to what extent they are directly accessing the netdev->phydev,
   and wrapping that with phydev-independent helpers.

   The idea here is to avoid accessing phydev directly from the MAC
   driver, and have a helper query the parameters for us. This helper
   could make use of netdev->phydev, but also use phy_link_topology
   to get an aggregated version of these parameters, if there are chained
   PHYs for example.

There are other parts of the kernel that accesses the netdev->phydev. There
are a few places in DSA where this is done, but the same helpers as the ones
introduced for MAC drivers could be used. The other remaining part would
be the Timestamping code, but Köry Maincent is currently working on an
series to help deal with the various timestamping sources, that will also
help on removing this strong netdev->phydev dependency.

Finally, there's the whole work of actually configuring the PHY themselves
correctly when they are chained to one another, and getting the logic right
for the link-up/down detection, getting the ksettings values aggregated
correctly, etc.

We have some local patches as well for that, to handle the stacked PHY
state-machine synchronisation, link-reporting and negociation but
it's still WIP to cover all the corner-cases and hard-to-test features, for
example how to deal with WoL or EEE in these situations.

Please don't hesitate to give any comments, remarks or point some shortcomings
if you see any. We've tried to keep this mail short but we'll be happy to
discuss any details. We're really looking for some feedback on that overall
approach before sending any RFC, but of course if you prefer to discuss over
some actual code we can move forward and send RFC code when it's a bit more
polished.

Thanks a lot,

Maxime Chevallier & Romain Gantois

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Moving forward with stacked PHYs
  2024-11-19 10:51 Moving forward with stacked PHYs Maxime Chevallier
@ 2024-11-21  8:47 ` Oleksij Rempel
  2024-11-21 14:00   ` Andrew Lunn
  0 siblings, 1 reply; 5+ messages in thread
From: Oleksij Rempel @ 2024-11-21  8:47 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: netdev, pabeni, kuba, edumazet, davem, Andrew Lunn, Russell King,
	Romain Gantois, Florian Fainelli, Vladimir Oltean,
	Heiner Kallweit, Kory Maincent, Thomas Petazzoni

Hi Maxime,

On Tue, Nov 19, 2024 at 11:51:36AM +0100, Maxime Chevallier wrote:
> Hi Netdev,
> 
> With Romain, we're currently trying to work on the stacked PHY problem,
> where we'd like to get a proper support for Copper SFPs that are driven
> by a media converter :
> 
>      RGMII       SGMII  +sfp----+
> MAC ------- PHY ------- | PHY   |
>                         +-------+
> 
> This is one of the cases where we have multiple PHYs on the link, on my
> side I've been working on PHY muxes with parallel PHYs on the link :
> 
> 
>        +-- PHY
> MAC ---|
>        +-- PHY
>        
> Both of these use-cases share one common issue, which is that some parts
> of the kernel will try to directly access netdev->phydev, assuming there's
> one and only PHY device that sits behind a MAC.
> 
> In the past months, I've worked on introducing the phy_link_topology, that
> keeps track of all the PHYs that are sitting behind a netdev, and I've
> used that infrastructure for some netlink commands that would access
> the netdev->phydev field and replace that with :
>   - A way to list the PHYs behind a netdev
>   - Allowing netlink commands to target a specific PHY, and not "just"
>     the netdev->phydev PHY.
>     
> On that front, there's more work coming to improve on that :
>   - It would be good if the stats reported through ethtool would account
>     for all the PHYs on the link, so that we would get a full overview
>     of all stats from all components of the link

Ack, i'm working right now on the standardized PHY stats. With some work on
kernel side it would be possible to get with something like this:
ethtool -S eth1 --all-groups

>   - A netlink notification to indicate that a new PHY was found on the
>     link is also in the plans.

This will be cool. I was thinking about netlink notification for some
health issues, to avoid polling on the user side.

> There's a lot more work to do however, as these user-facing commands aren't
> the only place where netdev->phydev is used.
> 
> The first place are the MAC drivers. For this, there seems to be 2 options :
> 
>  - move to phylink. The phylink API is pretty well designed as it makes
>    the phydev totally irrelevant for the MAC driver. The MAC driver doesn't
>    even need to know if it's connected to a PHY or something else (and SFP
>    cage for example). That's nice, but there are still plenty of drivers
>    that don't use phylink.

This would be nice, but this is too much work. Last week I was porting lan78xx
to PHYlink. Plain porting is some hours of work, the 80% of time is
testing and fixing different issues. I do no think there are enough
resources to port all of drivers.

>  - Introduce a new set of helpers that would abstact away the phydev from
>    the netdev, but in a more straightforward way. MAC drivers that directly
>    access netdev->phydev to configure the PHY's internal state or get some
>    PHY info would instead be converted to using a set of helpers that will
>    take the netdev as a parameter :
>    
>  static void ftgmac100_adjust_link(struct net_device *netdev)
>  {

> +	int phy_link, phy_speed, phy_duplex;
>  	struct ftgmac100 *priv = netdev_priv(netdev);
>  	struct phy_device *phydev = netdev->phydev;
>  	bool tx_pause, rx_pause;
>  	int new_speed;
>  
> +	netdev_phy_link_settings(netdev, &phy_link, &phy_speed, &phy_duplex);
> +
>  	/* We store "no link" as speed 0 */
> -	if (!phydev->link)
> +	if (!phy_link)
>  		new_speed = 0;
>  	else
> -		new_speed = phydev->speed;
> +		new_speed = phy_speed;
> [...]
> 
>    The above is just an example of such helpers, Romain is currently
>    investigating this and going over all the MAC drivers out there to
>    assess to what extent they are directly accessing the netdev->phydev,
>    and wrapping that with phydev-independent helpers.
> 
>    The idea here is to avoid accessing phydev directly from the MAC
>    driver, and have a helper query the parameters for us. This helper
>    could make use of netdev->phydev, but also use phy_link_topology
>    to get an aggregated version of these parameters, if there are chained
>    PHYs for example.

Sounds good, introduce the helper and rename net->phydev to something
different. This should explode everything what is using net->phydev
directly.

> There are other parts of the kernel that accesses the netdev->phydev. There
> are a few places in DSA where this is done, but the same helpers as the ones
> introduced for MAC drivers could be used. The other remaining part would
> be the Timestamping code, but Köry Maincent is currently working on an
> series to help deal with the various timestamping sources, that will also
> help on removing this strong netdev->phydev dependency.
>    
> Finally, there's the whole work of actually configuring the PHY themselves
> correctly when they are chained to one another, and getting the logic right
> for the link-up/down detection, getting the ksettings values aggregated
> correctly, etc.
> 
> We have some local patches as well for that, to handle the stacked PHY
> state-machine synchronisation, link-reporting and negociation but
> it's still WIP to cover all the corner-cases and hard-to-test features, for
> example how to deal with WoL or EEE in these situations.

I assume, even with stacked PHYs, some kind of power management would
be needed. The WoL is probably less interesting, since all attached PHYs are
under linux control, so we can suspend or resume them.

The EEE on other hand, may help to reduce run time power consumption.
Still, it the user space should know about it, since it may be critical
for time critical tasks.

I guess, we would need some kind of PHY related metadata. For example,
MAC triggered EEE, would add some latency, but should work fine with
PTP. PHY backed EEE (like SmartEEE, etc) would add latency and affect
PTP if time stamps are made by the MAC and not by the PHY. All of this
information is HW specific. But, I assume, it is not withing the scope
of your current work.

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Moving forward with stacked PHYs
  2024-11-21  8:47 ` Oleksij Rempel
@ 2024-11-21 14:00   ` Andrew Lunn
  2024-11-25  8:46     ` Maxime Chevallier
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Lunn @ 2024-11-21 14:00 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Maxime Chevallier, netdev, pabeni, kuba, edumazet, davem,
	Russell King, Romain Gantois, Florian Fainelli, Vladimir Oltean,
	Heiner Kallweit, Kory Maincent, Thomas Petazzoni

> > On that front, there's more work coming to improve on that :
> >   - It would be good if the stats reported through ethtool would account
> >     for all the PHYs on the link, so that we would get a full overview
> >     of all stats from all components of the link
> 
> Ack, i'm working right now on the standardized PHY stats. With some work on
> kernel side it would be possible to get with something like this:
> ethtool -S eth1 --all-groups

Naming becomes interesting, when you have multiple PHYs. You need to
indicate which PHY the stats are from. So it will need nested netlink
properties. I don't think we have anything like that at the moment for
ethtool statistics, but it does exist in other places, e.g. cable
testing results.

> >   - A netlink notification to indicate that a new PHY was found on the
> >     link is also in the plans.
> 
> This will be cool. I was thinking about netlink notification for some
> health issues, to avoid polling on the user side.

We need to think about when we send the notification. During
enumeration of the MDIO bus, during probe, or when the PHY is
connected to its upstream? What are the userspaces user cases for this
notification?

> > There's a lot more work to do however, as these user-facing commands aren't
> > the only place where netdev->phydev is used.
> > 
> > The first place are the MAC drivers. For this, there seems to be 2 options :
> > 
> >  - move to phylink. The phylink API is pretty well designed as it makes
> >    the phydev totally irrelevant for the MAC driver. The MAC driver doesn't
> >    even need to know if it's connected to a PHY or something else (and SFP
> >    cage for example). That's nice, but there are still plenty of drivers
> >    that don't use phylink.
> 
> This would be nice, but this is too much work. Last week I was porting lan78xx
> to PHYlink. Plain porting is some hours of work, the 80% of time is
> testing and fixing different issues. I do no think there are enough
> resources to port all of drivers.

It is clear that we the Maintainers cannot convert all MAC drivers to
phylink. But i would be happy to say if you want to support multiple
PHYs, you need to use phylink. To some extent, it is already that
way. I don't think there are any systems with SFPs using phylib.  So i
would not change the phylib API. Keep netdev->phydev meaning the first
PHY, and maybe encode the assumption that it is the only PHY with a
netdev_warn() if that assumption gets violated.

> > There are other parts of the kernel that accesses the netdev->phydev. There
> > are a few places in DSA where this is done.

Many of the DSA drivers are now phylink, not phylib. Any using SFPs
will be phylink. I would leave those using phylib alone, same as other
MAC drivers.

> > Finally, there's the whole work of actually configuring the PHY themselves
> > correctly when they are chained to one another, and getting the logic right
> > for the link-up/down detection, getting the ksettings values aggregated
> > correctly, etc.
> > 
> > We have some local patches as well for that, to handle the stacked PHY
> > state-machine synchronisation, link-reporting and negociation but
> > it's still WIP to cover all the corner-cases and hard-to-test features, for
> > example how to deal with WoL or EEE in these situations.
> 
> I assume, even with stacked PHYs, some kind of power management would
> be needed. The WoL is probably less interesting, since all attached PHYs are
> under linux control, so we can suspend or resume them.
> 
> The EEE on other hand, may help to reduce run time power consumption.
> Still, it the user space should know about it, since it may be critical
> for time critical tasks.

We also need to consider where in the chain EEE, WoL, Pause etc are
relevant. Both EEE and Pause it about negotiation, and that happen in
the PHY closet to the media. WoL in theory could happen anywhere, but
ideally you want it as close to the media as possible, so you can
power off intermediary PHYs. But can PHYs in SFP actually wake the
system? I don't think i have ever seen that, there is no interrupt pin
on the SFP cage.  However WoL is currently a mess and MAC driver
writers get is wrong. I would say WoL is another area we need a new
API moving as much code into the core as possible, same as we have
done for EEE, phylink handling of Pause etc. But that is probably out
of scope for this work, but should be kept in mind.

	Andrew

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Moving forward with stacked PHYs
  2024-11-21 14:00   ` Andrew Lunn
@ 2024-11-25  8:46     ` Maxime Chevallier
  2024-11-25 16:34       ` Andrew Lunn
  0 siblings, 1 reply; 5+ messages in thread
From: Maxime Chevallier @ 2024-11-25  8:46 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Oleksij Rempel, netdev, pabeni, kuba, edumazet, davem,
	Russell King, Romain Gantois, Florian Fainelli, Vladimir Oltean,
	Heiner Kallweit, Kory Maincent, Thomas Petazzoni

Hi Andrew,

On Thu, 21 Nov 2024 15:00:00 +0100
Andrew Lunn <andrew@lunn.ch> wrote:

> > > On that front, there's more work coming to improve on that :
> > >   - It would be good if the stats reported through ethtool would account
> > >     for all the PHYs on the link, so that we would get a full overview
> > >     of all stats from all components of the link  
> > 
> > Ack, i'm working right now on the standardized PHY stats. With some work on
> > kernel side it would be possible to get with something like this:
> > ethtool -S eth1 --all-groups  
> 
> Naming becomes interesting, when you have multiple PHYs. You need to
> indicate which PHY the stats are from. So it will need nested netlink
> properties. I don't think we have anything like that at the moment for
> ethtool statistics, but it does exist in other places, e.g. cable
> testing results.
> 
> > >   - A netlink notification to indicate that a new PHY was found on the
> > >     link is also in the plans.  
> > 
> > This will be cool. I was thinking about netlink notification for some
> > health issues, to avoid polling on the user side.  
> 
> We need to think about when we send the notification. During
> enumeration of the MDIO bus, during probe, or when the PHY is
> connected to its upstream?

The way I see this is based on the phy_link_topology. What is done
currently is that the SFP PHY is added to the topology when :

 - The .connect_phy() SFP upstream op is called, but ONLY if the
upstream PHY is attached to its netdev (otherwise, the upstream PHY
isn't in the topology)

 - Alternatively if the SFP phy is already attached to its upstream,
both the upstream PHY and the SFP PHY will be added to the tpopology
when the upstream PHY gets attached to its netdev.

The notification would be sent at that time. We can't really send it
before the PHYs are part of the topology because at that point we don't
know to which netdev it belongs.

> What are the userspaces user cases for this
> notification?

The way I see that, based on the appereance of PHYs, userspace may want
to re-ajust configuration, especially if :
 - The PHY is attached in .ndo_open() and
 - The PHY provides some kind of offloading capability (Timestamping,
maybe more such as macsec)

In that case, it's possible that userspace is interested in knowing
that a new PHY is here to re-adjust the offloads to the PHY.

But maybe a more correct approach would be a per-feature notif, such as
"there's a new timestamper on the link".

Some user also reported not so long ago the need to know when the SFP
PHY was attached to reconfigure the advertising.

> > > There's a lot more work to do however, as these user-facing commands aren't
> > > the only place where netdev->phydev is used.
> > > 
> > > The first place are the MAC drivers. For this, there seems to be 2 options :
> > > 
> > >  - move to phylink. The phylink API is pretty well designed as it makes
> > >    the phydev totally irrelevant for the MAC driver. The MAC driver doesn't
> > >    even need to know if it's connected to a PHY or something else (and SFP
> > >    cage for example). That's nice, but there are still plenty of drivers
> > >    that don't use phylink.  
> > 
> > This would be nice, but this is too much work. Last week I was porting lan78xx
> > to PHYlink. Plain porting is some hours of work, the 80% of time is
> > testing and fixing different issues. I do no think there are enough
> > resources to port all of drivers.  
> 
> It is clear that we the Maintainers cannot convert all MAC drivers to
> phylink. But i would be happy to say if you want to support multiple
> PHYs, you need to use phylink. To some extent, it is already that
> way. I don't think there are any systems with SFPs using phylib.  So i
> would not change the phylib API. Keep netdev->phydev meaning the first
> PHY, and maybe encode the assumption that it is the only PHY with a
> netdev_warn() if that assumption gets violated.

This is fine by me, this makes the path towards better isolation
between netdev and phydev much clearer, thanks Andrew :)

> > > There are other parts of the kernel that accesses the netdev->phydev. There
> > > are a few places in DSA where this is done.  
> 
> Many of the DSA drivers are now phylink, not phylib. Any using SFPs
> will be phylink. I would leave those using phylib alone, same as other
> MAC drivers.

No problem

> > > Finally, there's the whole work of actually configuring the PHY themselves
> > > correctly when they are chained to one another, and getting the logic right
> > > for the link-up/down detection, getting the ksettings values aggregated
> > > correctly, etc.
> > > 
> > > We have some local patches as well for that, to handle the stacked PHY
> > > state-machine synchronisation, link-reporting and negociation but
> > > it's still WIP to cover all the corner-cases and hard-to-test features, for
> > > example how to deal with WoL or EEE in these situations.  
> > 
> > I assume, even with stacked PHYs, some kind of power management would
> > be needed. The WoL is probably less interesting, since all attached PHYs are
> > under linux control, so we can suspend or resume them.
> > 
> > The EEE on other hand, may help to reduce run time power consumption.
> > Still, it the user space should know about it, since it may be critical
> > for time critical tasks.  
> 
> We also need to consider where in the chain EEE, WoL, Pause etc are
> relevant. Both EEE and Pause it about negotiation, and that happen in
> the PHY closet to the media. WoL in theory could happen anywhere, but
> ideally you want it as close to the media as possible, so you can
> power off intermediary PHYs. But can PHYs in SFP actually wake the
> system? I don't think i have ever seen that, there is no interrupt pin
> on the SFP cage.  However WoL is currently a mess and MAC driver
> writers get is wrong. I would say WoL is another area we need a new
> API moving as much code into the core as possible, same as we have
> done for EEE, phylink handling of Pause etc. But that is probably out
> of scope for this work, but should be kept in mind.

Indeed, to me at least it's a bit out of scope, but yes that's
something worth keeping in mind.

Thanks a lot for the answers Andrew and Oleksij.

Best regards,

Maxime

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Moving forward with stacked PHYs
  2024-11-25  8:46     ` Maxime Chevallier
@ 2024-11-25 16:34       ` Andrew Lunn
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Lunn @ 2024-11-25 16:34 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Oleksij Rempel, netdev, pabeni, kuba, edumazet, davem,
	Russell King, Romain Gantois, Florian Fainelli, Vladimir Oltean,
	Heiner Kallweit, Kory Maincent, Thomas Petazzoni

> The way I see this is based on the phy_link_topology. What is done
> currently is that the SFP PHY is added to the topology when :
> 
>  - The .connect_phy() SFP upstream op is called, but ONLY if the
> upstream PHY is attached to its netdev (otherwise, the upstream PHY
> isn't in the topology)
> 
>  - Alternatively if the SFP phy is already attached to its upstream,
> both the upstream PHY and the SFP PHY will be added to the tpopology
> when the upstream PHY gets attached to its netdev.
> 
> The notification would be sent at that time. We can't really send it
> before the PHYs are part of the topology because at that point we don't
> know to which netdev it belongs.

I agree we cannot notify until we know where it goes in the chain. But
i was wondering if this is too early? Is it guaranteed that
phy_init_hw() is complete? We don't want userspace trying to
reconfigure the PHY of it then to be overwritten.

It would be good if the commit message adding the notification
explains what has already happened when the notification is sent, what
state the PHY is in.

> The way I see that, based on the appereance of PHYs, userspace may want
> to re-ajust configuration, especially if :
>  - The PHY is attached in .ndo_open() and
>  - The PHY provides some kind of offloading capability (Timestamping,
> maybe more such as macsec)
> 
> In that case, it's possible that userspace is interested in knowing
> that a new PHY is here to re-adjust the offloads to the PHY.

So do we need to be at the point that we know its EEE capabilities? It
has registers its MACSec capabilities and timestamper? Again, this
should be part of the commit message. What can a user of the
notification expect to have happened, and what has not yet happened,
or is ongoing and might race with.

	Andrew

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-11-25 16:34 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-19 10:51 Moving forward with stacked PHYs Maxime Chevallier
2024-11-21  8:47 ` Oleksij Rempel
2024-11-21 14:00   ` Andrew Lunn
2024-11-25  8:46     ` Maxime Chevallier
2024-11-25 16:34       ` Andrew Lunn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).