[net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
@ 2025-04-16 15:28 Alexander Duyck
  2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 15:28 UTC (permalink / raw)
  To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni

Address two issues found in the phylink code.

The first issue is the fact that there were unused defines that were
referencing deprecated macros themselves. Since they aren't used we might
as well drop them.

The second issue which is more the main reason for this submission is the
fact that the BMC was losing link when we would call phylink_resume. This
is fixed by adding a new boolean value link_balanced which will allow us
to avoid doing an immediate force of the link up/down and instead defer it
until after we have checked the actual link state.

---

Alexander Duyck (2):
      net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES
      net: phylink: Fix issues with link balancing w/ BMC present

 drivers/net/phy/phylink.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

--

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES
  2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck
@ 2025-04-16 15:28 ` Alexander Duyck
  2025-04-22  8:32   ` Russell King (Oracle)
  2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck
  2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn
  2 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 15:28 UTC (permalink / raw)
  To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni

From: Alexander Duyck <alexanderduyck@fb.com>

The defines for SUPPORTED_INTERFACES and ADVERTISED_INTERFACES both appear
to be unused. I couldn't find anything that actually references them in the
original diff that added them and it seems like they have persisted despite
using deprecated defines that aren't supposed to be used as per the
ethtool.h header that defines the bits they are composed of.

Since they are unused, and not supposed to be used anymore I am just
dropping the lines of code since they seem to just be occupying space.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
---
 drivers/net/phy/phylink.c |    7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index b68369e2342b..942ce114dabd 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -24,13 +24,6 @@
 #include "sfp.h"
 #include "swphy.h"
 
-#define SUPPORTED_INTERFACES \
-	(SUPPORTED_TP | SUPPORTED_MII | SUPPORTED_FIBRE | \
-	 SUPPORTED_BNC | SUPPORTED_AUI | SUPPORTED_Backplane)
-#define ADVERTISED_INTERFACES \
-	(ADVERTISED_TP | ADVERTISED_MII | ADVERTISED_FIBRE | \
-	 ADVERTISED_BNC | ADVERTISED_AUI | ADVERTISED_Backplane)
-
 enum {
 	PHYLINK_DISABLE_STOPPED,
 	PHYLINK_DISABLE_LINK,



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES
  2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
@ 2025-04-22  8:32   ` Russell King (Oracle)
  0 siblings, 0 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-22  8:32 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 08:28:53AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexanderduyck@fb.com>
> 
> The defines for SUPPORTED_INTERFACES and ADVERTISED_INTERFACES both appear
> to be unused. I couldn't find anything that actually references them in the
> original diff that added them and it seems like they have persisted despite
> using deprecated defines that aren't supposed to be used as per the
> ethtool.h header that defines the bits they are composed of.
> 
> Since they are unused, and not supposed to be used anymore I am just
> dropping the lines of code since they seem to just be occupying space.
> 
> Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Please send this separately, it can be applied independently of patch 2.

Thanks!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck
  2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
@ 2025-04-16 15:29 ` Alexander Duyck
  2025-04-16 16:01   ` Russell King (Oracle)
  2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn
  2 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 15:29 UTC (permalink / raw)
  To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni

From: Alexander Duyck <alexanderduyck@fb.com>

This change is meant to address the fact that there are link imbalances
introduced when using phylink on a system with a BMC. Specifically there
are two issues.

The first issue is that if we lose link after the first call to
phylink_start but before it gets to the phylink_resolve we will end up with
the phylink interface assuming the link was always down and not calling
phylink_link_down resulting in a stuck interface.

The second issue is that when a BMC is present we are currently forcing the
link down. This results in us bouncing the link for a fraction of a second
and that will result in dropped packets for the BMC.

The third issue is just an extra "Link Down" message that is seen when
calling phylink_resume. This is addressed by identifying that the link
isn't balanced and just not displaying the down message in such a case.

To resolve these issues this change introduces a new boolean variable
link_balanced. This value will be set to 0 initially when we create the
phylink interface, and again when we bring down the link and unbalance it
in phylink_suspend. When it is set to 0 it will force us to trigger the
phylink_link_up/down call which will have us write to the hardware. As a
result we can avoid the two issues and it should not rearm without another
call to phylink_suspend.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
---
 drivers/net/phy/phylink.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 942ce114dabd..2b9ab343942e 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -45,6 +45,7 @@ struct phylink {
 	struct phylink_pcs *pcs;
 	struct device *dev;
 	unsigned int old_link_state:1;
+	unsigned int link_balanced:1;
 
 	unsigned long phylink_disable_state; /* bitmask of disables */
 	struct phy_device *phydev;
@@ -1553,7 +1554,8 @@ static void phylink_link_down(struct phylink *pl)
 
 	pl->mac_ops->mac_link_down(pl->config, pl->act_link_an_mode,
 				   pl->cur_interface);
-	phylink_info(pl, "Link is Down\n");
+	if (pl->link_balanced)
+		phylink_info(pl, "Link is Down\n");
 }
 
 static bool phylink_link_is_up(struct phylink *pl)
@@ -1658,12 +1660,13 @@ static void phylink_resolve(struct work_struct *w)
 	if (pl->major_config_failed)
 		link_state.link = false;
 
-	if (link_state.link != cur_link_state) {
+	if (link_state.link != cur_link_state || !pl->link_balanced) {
 		pl->old_link_state = link_state.link;
 		if (!link_state.link)
 			phylink_link_down(pl);
 		else
 			phylink_link_up(pl, link_state);
+		pl->link_balanced = true;
 	}
 	if (!link_state.link && retrigger) {
 		pl->link_failed = false;
@@ -2546,6 +2549,7 @@ void phylink_suspend(struct phylink *pl, bool mac_wol)
 			netif_carrier_off(pl->netdev);
 		else
 			pl->old_link_state = false;
+		pl->link_balanced = false;
 
 		/* We do not call mac_link_down() here as we want the
 		 * link to remain up to receive the WoL packets.
@@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl)
 	if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) {
 		/* Wake-on-Lan enabled, MAC handling */
 
-		/* Call mac_link_down() so we keep the overall state balanced.
-		 * Do this under the state_mutex lock for consistency. This
-		 * will cause a "Link Down" message to be printed during
-		 * resume, which is harmless - the true link state will be
-		 * printed when we run a resolve.
-		 */
-		mutex_lock(&pl->state_mutex);
-		phylink_link_down(pl);
-		mutex_unlock(&pl->state_mutex);
-
 		/* Re-apply the link parameters so that all the settings get
 		 * restored to the MAC.
 		 */



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck
@ 2025-04-16 16:01   ` Russell King (Oracle)
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
                       ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 16:01 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <alexanderduyck@fb.com>
> 
> This change is meant to address the fact that there are link imbalances
> introduced when using phylink on a system with a BMC. Specifically there
> are two issues.
> 
> The first issue is that if we lose link after the first call to
> phylink_start but before it gets to the phylink_resolve we will end up with
> the phylink interface assuming the link was always down and not calling
> phylink_link_down resulting in a stuck interface.

That is intentional.

phylink strictly orders .mac_link_down and .mac_link_up, and starts from
an initial position that the link _will_ be considered to be down. So,
it is intentional that .mac_link_down will _never_ be called after
phylink_start().

> The second issue is that when a BMC is present we are currently forcing the
> link down. This results in us bouncing the link for a fraction of a second
> and that will result in dropped packets for the BMC.

... but you don't explain how that happens.

> The third issue is just an extra "Link Down" message that is seen when
> calling phylink_resume. This is addressed by identifying that the link
> isn't balanced and just not displaying the down message in such a case.

Hmm, this one is an error, but is not as simple as "don't print the
message" as it results in a violation of the rule I mentioned above.
We need phylink_suspend() to record the state of the link at that
point, and avoid calling phylink_link_down() if the link was down
prior to suspend.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-16 16:01   ` Russell King (Oracle)
@ 2025-04-16 16:16     ` Russell King (Oracle)
  2025-04-16 17:14       ` Russell King (Oracle)
                         ` (2 more replies)
  2025-04-16 17:12     ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle)
  2025-04-16 19:03     ` Alexander Duyck
  2 siblings, 3 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 16:16 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, Alexander Duyck
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang,
	netdev, Paolo Abeni

When WoL is enabled, we update the software state in phylink to
indicate that the link is down, and disable the resolver from
bringing the link back up.

On resume, we attempt to bring the overall state into consistency
by calling the .mac_link_down() method, but this is wrong if the
link was already down, as phylink strictly orders the .mac_link_up()
and .mac_link_down() methods - and this would break that ordering.

Fixes: f97493657c63 ("net: phylink: add suspend/resume support")
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---

To fix the suspend/resume with link down, this is what I think we
should do. Untested at the moment.

 drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 69ca765485db..d2c59ee16ebc 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -81,6 +81,7 @@ struct phylink {
 	unsigned int pcs_state;
 
 	bool link_failed;
+	bool suspend_link_up;
 	bool major_config_failed;
 	bool mac_supports_eee_ops;
 	bool mac_supports_eee;
@@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol)
 		/* Stop the resolver bringing the link up */
 		__set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state);
 
-		/* Disable the carrier, to prevent transmit timeouts,
-		 * but one would hope all packets have been sent. This
-		 * also means phylink_resolve() will do nothing.
-		 */
-		if (pl->netdev)
-			netif_carrier_off(pl->netdev);
-		else
+		pl->suspend_link_up = phylink_link_is_up(pl);
+		if (pl->suspend_link_up) {
+			/* Disable the carrier, to prevent transmit timeouts,
+			 * but one would hope all packets have been sent. This
+			 * also means phylink_resolve() will do nothing.
+			 */
+			if (pl->netdev)
+				netif_carrier_off(pl->netdev);
 			pl->old_link_state = false;
+		}
 
 		/* We do not call mac_link_down() here as we want the
 		 * link to remain up to receive the WoL packets.
@@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl)
 	if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) {
 		/* Wake-on-Lan enabled, MAC handling */
 
-		/* Call mac_link_down() so we keep the overall state balanced.
-		 * Do this under the state_mutex lock for consistency. This
-		 * will cause a "Link Down" message to be printed during
-		 * resume, which is harmless - the true link state will be
-		 * printed when we run a resolve.
-		 */
-		mutex_lock(&pl->state_mutex);
-		phylink_link_down(pl);
-		mutex_unlock(&pl->state_mutex);
+		if (pl->suspend_link_up) {
+			/* Call mac_link_down() so we keep the overall state
+			 * balanced. Do this under the state_mutex lock for
+			 * consistency. This will cause a "Link Down" message
+			 * to be printed during resume, which is harmless -
+			 * the true link state will be printed when we run a
+			 * resolve.
+			 */
+			mutex_lock(&pl->state_mutex);
+			phylink_link_down(pl);
+			mutex_unlock(&pl->state_mutex);
+		}
 
 		/* Re-apply the link parameters so that all the settings get
 		 * restored to the MAC.
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
@ 2025-04-16 17:14       ` Russell King (Oracle)
  2025-04-17 14:30       ` Alexander H Duyck
  2025-04-23  0:00       ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 17:14 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, Alexander Duyck
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang,
	netdev, Paolo Abeni

On Wed, Apr 16, 2025 at 05:16:01PM +0100, Russell King (Oracle) wrote:
> When WoL is enabled, we update the software state in phylink to
> indicate that the link is down, and disable the resolver from
> bringing the link back up.
> 
> On resume, we attempt to bring the overall state into consistency
> by calling the .mac_link_down() method, but this is wrong if the
> link was already down, as phylink strictly orders the .mac_link_up()
> and .mac_link_down() methods - and this would break that ordering.
> 
> Fixes: f97493657c63 ("net: phylink: add suspend/resume support")
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> ---
> 
> To fix the suspend/resume with link down, this is what I think we
> should do. Untested at the moment.

Tested on NVidia Jetson Xavier NX, this eliminates the incorrect call
to phylink_mac_link_down() when resuming and the link was previously
down.

Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
  2025-04-16 17:14       ` Russell King (Oracle)
@ 2025-04-17 14:30       ` Alexander H Duyck
  2025-04-17 14:35         ` Russell King (Oracle)
  2025-04-23  0:00       ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 44+ messages in thread
From: Alexander H Duyck @ 2025-04-17 14:30 UTC (permalink / raw)
  To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang,
	netdev, Paolo Abeni

On Wed, 2025-04-16 at 17:16 +0100, Russell King (Oracle) wrote:
> When WoL is enabled, we update the software state in phylink to
> indicate that the link is down, and disable the resolver from
> bringing the link back up.
> 
> On resume, we attempt to bring the overall state into consistency
> by calling the .mac_link_down() method, but this is wrong if the
> link was already down, as phylink strictly orders the .mac_link_up()
> and .mac_link_down() methods - and this would break that ordering.
> 
> Fixes: f97493657c63 ("net: phylink: add suspend/resume support")
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> ---
> 
> To fix the suspend/resume with link down, this is what I think we
> should do. Untested at the moment.
> 
>  drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++----------------
>  1 file changed, 22 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
> index 69ca765485db..d2c59ee16ebc 100644
> --- a/drivers/net/phy/phylink.c
> +++ b/drivers/net/phy/phylink.c
> @@ -81,6 +81,7 @@ struct phylink {
>  	unsigned int pcs_state;
>  
>  	bool link_failed;
> +	bool suspend_link_up;
>  	bool major_config_failed;
>  	bool mac_supports_eee_ops;
>  	bool mac_supports_eee;

I'm pretty sure this extra bit of state isn't needed.

> @@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol)
>  		/* Stop the resolver bringing the link up */
>  		__set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state);
>  
> -		/* Disable the carrier, to prevent transmit timeouts,
> -		 * but one would hope all packets have been sent. This
> -		 * also means phylink_resolve() will do nothing.
> -		 */
> -		if (pl->netdev)
> -			netif_carrier_off(pl->netdev);
> -		else

This is the only spot where we weren't setting netif_carrier_on/off and
old_link_state together. I suspect you could just carry old_link_state
without needing to add a new argument. Basically you would just need to
drop the "else" portion of this statement.

In the grand scheme of things with the exception of this one spot
old_link_state is essentially the actual MAC/PCS link state whereas
netif_carrier_off is the administrative state.

> +		pl->suspend_link_up = phylink_link_is_up(pl);
> +		if (pl->suspend_link_up) {
> +			/* Disable the carrier, to prevent transmit timeouts,
> +			 * but one would hope all packets have been sent. This
> +			 * also means phylink_resolve() will do nothing.
> +			 */
> +			if (pl->netdev)
> +				netif_carrier_off(pl->netdev);
>  			pl->old_link_state = false;
> +		}
>  
>  		/* We do not call mac_link_down() here as we want the
>  		 * link to remain up to receive the WoL packets.
> @@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl)
>  	if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) {
>  		/* Wake-on-Lan enabled, MAC handling */
>  
> -		/* Call mac_link_down() so we keep the overall state balanced.
> -		 * Do this under the state_mutex lock for consistency. This
> -		 * will cause a "Link Down" message to be printed during
> -		 * resume, which is harmless - the true link state will be
> -		 * printed when we run a resolve.
> -		 */
> -		mutex_lock(&pl->state_mutex);
> -		phylink_link_down(pl);
> -		mutex_unlock(&pl->state_mutex);
> +		if (pl->suspend_link_up) {
> +			/* Call mac_link_down() so we keep the overall state
> +			 * balanced. Do this under the state_mutex lock for
> +			 * consistency. This will cause a "Link Down" message
> +			 * to be printed during resume, which is harmless -
> +			 * the true link state will be printed when we run a
> +			 * resolve.
> +			 */
> +			mutex_lock(&pl->state_mutex);
> +			phylink_link_down(pl);
> +			mutex_unlock(&pl->state_mutex);
> +		}

You should be able to do all of this with just old_link_state. The only
thing that would have to change is that you would need to set
old_link_state to false after the if statement.

I'm assuming part of the reason for forcing the link down here also has
to do with the fact that you are using phylink_mac_initial_config which
calls phylink_major_config after this?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 14:30       ` Alexander H Duyck
@ 2025-04-17 14:35         ` Russell King (Oracle)
  2025-04-17 15:23           ` Russell King (Oracle)
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-17 14:35 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote:
> On Wed, 2025-04-16 at 17:16 +0100, Russell King (Oracle) wrote:
> > When WoL is enabled, we update the software state in phylink to
> > indicate that the link is down, and disable the resolver from
> > bringing the link back up.
> > 
> > On resume, we attempt to bring the overall state into consistency
> > by calling the .mac_link_down() method, but this is wrong if the
> > link was already down, as phylink strictly orders the .mac_link_up()
> > and .mac_link_down() methods - and this would break that ordering.
> > 
> > Fixes: f97493657c63 ("net: phylink: add suspend/resume support")
> > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> > ---
> > 
> > To fix the suspend/resume with link down, this is what I think we
> > should do. Untested at the moment.
> > 
> >  drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++----------------
> >  1 file changed, 22 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
> > index 69ca765485db..d2c59ee16ebc 100644
> > --- a/drivers/net/phy/phylink.c
> > +++ b/drivers/net/phy/phylink.c
> > @@ -81,6 +81,7 @@ struct phylink {
> >  	unsigned int pcs_state;
> >  
> >  	bool link_failed;
> > +	bool suspend_link_up;
> >  	bool major_config_failed;
> >  	bool mac_supports_eee_ops;
> >  	bool mac_supports_eee;
> 
> I'm pretty sure this extra bit of state isn't needed.
> 
> > @@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol)
> >  		/* Stop the resolver bringing the link up */
> >  		__set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state);
> >  
> > -		/* Disable the carrier, to prevent transmit timeouts,
> > -		 * but one would hope all packets have been sent. This
> > -		 * also means phylink_resolve() will do nothing.
> > -		 */
> > -		if (pl->netdev)
> > -			netif_carrier_off(pl->netdev);
> > -		else
> 
> This is the only spot where we weren't setting netif_carrier_on/off and
> old_link_state together. I suspect you could just carry old_link_state
> without needing to add a new argument. Basically you would just need to
> drop the "else" portion of this statement.
> 
> In the grand scheme of things with the exception of this one spot
> old_link_state is essentially the actual MAC/PCS link state whereas
> netif_carrier_off is the administrative state.

Sorry to say, but you have that wrong. Neither are the administrative
state.

> > +		pl->suspend_link_up = phylink_link_is_up(pl);
> > +		if (pl->suspend_link_up) {
> > +			/* Disable the carrier, to prevent transmit timeouts,
> > +			 * but one would hope all packets have been sent. This
> > +			 * also means phylink_resolve() will do nothing.
> > +			 */
> > +			if (pl->netdev)
> > +				netif_carrier_off(pl->netdev);
> >  			pl->old_link_state = false;
> > +		}
> >  
> >  		/* We do not call mac_link_down() here as we want the
> >  		 * link to remain up to receive the WoL packets.
> > @@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl)
> >  	if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) {
> >  		/* Wake-on-Lan enabled, MAC handling */
> >  
> > -		/* Call mac_link_down() so we keep the overall state balanced.
> > -		 * Do this under the state_mutex lock for consistency. This
> > -		 * will cause a "Link Down" message to be printed during
> > -		 * resume, which is harmless - the true link state will be
> > -		 * printed when we run a resolve.
> > -		 */
> > -		mutex_lock(&pl->state_mutex);
> > -		phylink_link_down(pl);
> > -		mutex_unlock(&pl->state_mutex);
> > +		if (pl->suspend_link_up) {
> > +			/* Call mac_link_down() so we keep the overall state
> > +			 * balanced. Do this under the state_mutex lock for
> > +			 * consistency. This will cause a "Link Down" message
> > +			 * to be printed during resume, which is harmless -
> > +			 * the true link state will be printed when we run a
> > +			 * resolve.
> > +			 */
> > +			mutex_lock(&pl->state_mutex);
> > +			phylink_link_down(pl);
> > +			mutex_unlock(&pl->state_mutex);
> > +		}
> 
> You should be able to do all of this with just old_link_state. The only
> thing that would have to change is that you would need to set
> old_link_state to false after the if statement.

Nope.

> I'm assuming part of the reason for forcing the link down here also has
> to do with the fact that you are using phylink_mac_initial_config which
> calls phylink_major_config after this?

Another of phylink's guarantees is that it won't do the mac_config()
etc with the link up. So, in order to ensure that everything is
correctly programmed after resume, it needs mac_config() etc called
which means the link needs to come down first.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 14:35         ` Russell King (Oracle)
@ 2025-04-17 15:23           ` Russell King (Oracle)
  2025-04-17 17:06             ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-17 15:23 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote:
> On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote:
> > This is the only spot where we weren't setting netif_carrier_on/off and
> > old_link_state together. I suspect you could just carry old_link_state
> > without needing to add a new argument. Basically you would just need to
> > drop the "else" portion of this statement.
> > 
> > In the grand scheme of things with the exception of this one spot
> > old_link_state is essentially the actual MAC/PCS link state whereas
> > netif_carrier_off is the administrative state.
> 
> Sorry to say, but you have that wrong. Neither are the administrative
> state.

To add to this (sorry, I rushed that reply), old_link_state is used when
we aren't bound to a network device, otherwise the netif carrier state
is used. Changes in the netif carrier state provoke netlink messages to
userspace to inform userspace of changes to the link state.

Moreover, it controls whether the network stack queues packets for
transmission to the driver, and also whether the transmit watchdog is
allowed to time out. It _probably_ also lets the packet schedulers
know that the link state has changed.

So, the netif carrier state is not "administrative".

It isn't strictly necessary for old_link_state to remain in sync with
the netif carrier state, but it is desirable to avoid errors - but
obviously the netif carrier state only exists when we're bound to a
network device.

> > > -		/* Call mac_link_down() so we keep the overall state balanced.
> > > -		 * Do this under the state_mutex lock for consistency. This
> > > -		 * will cause a "Link Down" message to be printed during
> > > -		 * resume, which is harmless - the true link state will be
> > > -		 * printed when we run a resolve.
> > > -		 */
> > > -		mutex_lock(&pl->state_mutex);
> > > -		phylink_link_down(pl);
> > > -		mutex_unlock(&pl->state_mutex);
> > > +		if (pl->suspend_link_up) {
> > > +			/* Call mac_link_down() so we keep the overall state
> > > +			 * balanced. Do this under the state_mutex lock for
> > > +			 * consistency. This will cause a "Link Down" message
> > > +			 * to be printed during resume, which is harmless -
> > > +			 * the true link state will be printed when we run a
> > > +			 * resolve.
> > > +			 */
> > > +			mutex_lock(&pl->state_mutex);
> > > +			phylink_link_down(pl);
> > > +			mutex_unlock(&pl->state_mutex);
> > > +		}
> > 
> > You should be able to do all of this with just old_link_state. The only
> > thing that would have to change is that you would need to set
> > old_link_state to false after the if statement.
> 
> Nope.

And still nope - what we need to know here is what was the link state
_before_ we called netif_carrier_off() or set old_link_state to false.

I somehow suspect that you don't understand what all this code is trying
to do, so let me explain it.

In the suspend function, when WoL is enabled, and we're using MAC-based
WoL, we need the link to the MAC to remain up, so it can receive packets
to check whether they are the appropriate wake packet. However, we don't
want packets to be queued for transmission either.

So, we have the case where we want to avoid notifying the MAC that the
link has gone down, but we also want to call netif_carrier_off() to stop
the network stack queueing packets.

To achieve this, several things work in unison:

- we take the state mutex to prevent the resolver from running while we
  fiddle with various state.
- we disable the resolver (which, if that's the only thing that happens,
  will provoke the resolver to take the link down.)
- we lie to the resolver about the link state, by calling
  netif_carrier_off() and/or setting old_link_state to false. This
  means the resolver believes the link is _already_ down, and thus
  because of the strict ordering I've previously mentioned, will *not*
  call mac_link_down().
- we release the lock.

There is now no way that the resolver will call either mac_link_up() or
mac_link_down() - which is what we want here.

When we resume, we need to unwind this, but taking care that the network
layers may need to be reprogrammed. That is done by "completing" the
link-down that was done in the suspend function by calling
mac_link_down() (which should only have been done if the link wasn't
already down - hence depends on state _prior_ to the modifications that
phylink_suspend() made.)

It can then re-setup the MAC/PCS by calling the config functions, and
then release the resolver to then make a decision about the state of
the link.

With the fix I posted, this guarantees that that the contract I talked
about previously is maintained.

I'm sorry that this doesn't "fit" your special case, but this is what
all users so far expect from phylink.

Now, how about explaining what's going on in fbnic_mac_link_up_asic()
and fbnic_mac_link_down_asic(), and what in there triggers the BMC
to press the panic stations do-not-press red buttin?

If you wish to have confirmation of what netif_carrier does, then I'm
sure your co-maintainer and core Linux networking maintainer, Jakub,
would be happy to help.

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 15:23           ` Russell King (Oracle)
@ 2025-04-17 17:06             ` Alexander Duyck
  2025-04-17 17:27               ` Russell King (Oracle)
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-17 17:06 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote:
> > On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote:
> > > This is the only spot where we weren't setting netif_carrier_on/off and
> > > old_link_state together. I suspect you could just carry old_link_state
> > > without needing to add a new argument. Basically you would just need to
> > > drop the "else" portion of this statement.
> > >
> > > In the grand scheme of things with the exception of this one spot
> > > old_link_state is essentially the actual MAC/PCS link state whereas
> > > netif_carrier_off is the administrative state.
> >
> > Sorry to say, but you have that wrong. Neither are the administrative
> > state.
>
> To add to this (sorry, I rushed that reply), old_link_state is used when
> we aren't bound to a network device, otherwise the netif carrier state
> is used. Changes in the netif carrier state provoke netlink messages to
> userspace to inform userspace of changes to the link state.
>
> Moreover, it controls whether the network stack queues packets for
> transmission to the driver, and also whether the transmit watchdog is
> allowed to time out. It _probably_ also lets the packet schedulers
> know that the link state has changed.
>
> So, the netif carrier state is not "administrative".

I have always sort of assumed that netif_carrier_on/off was the
logical AND of the administrative state of the NIC and the state of
the actual MAC/PCS/PHY. That is why I refer to it as an administrative
state. You end up with ifconfig up/down toggling netif carrier even if
the underlying link state hasn't changed. This has been the case for
many of the high end NICs for a while now, especially for the
multi-host ones, as the firmware won't change the actual physical
state of the link. It leaves it in place while the NIC port on the
host is no longer active.

> It isn't strictly necessary for old_link_state to remain in sync with
> the netif carrier state, but it is desirable to avoid errors - but
> obviously the netif carrier state only exists when we're bound to a
> network device.

From what I can tell this is the only spot where the two diverge,
prior to this patch. The general thought I was having was to update
the netif_carrier state in the suspend, and then old_link_state in the
resume. I realize now the concern is that setting the
phylink_disable_state is essentially the same as setting the link
down. So really we have 3 states we are tracking,
netif_carrier_ok/old_link_state, phylink_disable_state, and if we want
the link state to change while disabled. So we do need one additional
piece of state in the event that there isn't a netdev in order to
handle the case where "pl->phylink_disable_state &&
!!pl->phylink_disable_state == pl->old_link_state".

> > > > -         /* Call mac_link_down() so we keep the overall state balanced.
> > > > -          * Do this under the state_mutex lock for consistency. This
> > > > -          * will cause a "Link Down" message to be printed during
> > > > -          * resume, which is harmless - the true link state will be
> > > > -          * printed when we run a resolve.
> > > > -          */
> > > > -         mutex_lock(&pl->state_mutex);
> > > > -         phylink_link_down(pl);
> > > > -         mutex_unlock(&pl->state_mutex);
> > > > +         if (pl->suspend_link_up) {
> > > > +                 /* Call mac_link_down() so we keep the overall state
> > > > +                  * balanced. Do this under the state_mutex lock for
> > > > +                  * consistency. This will cause a "Link Down" message
> > > > +                  * to be printed during resume, which is harmless -
> > > > +                  * the true link state will be printed when we run a
> > > > +                  * resolve.
> > > > +                  */
> > > > +                 mutex_lock(&pl->state_mutex);
> > > > +                 phylink_link_down(pl);
> > > > +                 mutex_unlock(&pl->state_mutex);
> > > > +         }
> > >
> > > You should be able to do all of this with just old_link_state. The only
> > > thing that would have to change is that you would need to set
> > > old_link_state to false after the if statement.
> >
> > Nope.
>
> And still nope - what we need to know here is what was the link state
> _before_ we called netif_carrier_off() or set old_link_state to false.
>
> I somehow suspect that you don't understand what all this code is trying
> to do, so let me explain it.
>
> In the suspend function, when WoL is enabled, and we're using MAC-based
> WoL, we need the link to the MAC to remain up, so it can receive packets
> to check whether they are the appropriate wake packet. However, we don't
> want packets to be queued for transmission either.

That is the same as what we are looking for. We aren't queueing any
packets for transmission ourselves. We just want to leave the MAC
enabled, however we also want to leave it enabled when we resume.

> So, we have the case where we want to avoid notifying the MAC that the
> link has gone down, but we also want to call netif_carrier_off() to stop
> the network stack queueing packets.
>
> To achieve this, several things work in unison:
>
> - we take the state mutex to prevent the resolver from running while we
>   fiddle with various state.
> - we disable the resolver (which, if that's the only thing that happens,
>   will provoke the resolver to take the link down.)
> - we lie to the resolver about the link state, by calling
>   netif_carrier_off() and/or setting old_link_state to false. This
>   means the resolver believes the link is _already_ down, and thus
>   because of the strict ordering I've previously mentioned, will *not*
>   call mac_link_down().
> - we release the lock.
>
> There is now no way that the resolver will call either mac_link_up() or
> mac_link_down() - which is what we want here.

The part that I think I missed here was that if we set
phylink_disable_state we then set link_state.link to false in
phylink_resolve. I wonder if we couldn't just have a flag that sets
cur_link_state to false in the "if(pl->phylink_disable_state)" case
and simplify this to indicate we won't force the link down"

> When we resume, we need to unwind this, but taking care that the network
> layers may need to be reprogrammed. That is done by "completing" the
> link-down that was done in the suspend function by calling
> mac_link_down() (which should only have been done if the link wasn't
> already down - hence depends on state _prior_ to the modifications that
> phylink_suspend() made.)
>
> It can then re-setup the MAC/PCS by calling the config functions, and
> then release the resolver to then make a decision about the state of
> the link.
>
> With the fix I posted, this guarantees that that the contract I talked
> about previously is maintained.
>
> I'm sorry that this doesn't "fit" your special case, but this is what
> all users so far expect from phylink.

I wasn't trying to "fit" it so much as understand it. So this worked
before because you were clearing either netif_carrier_off or
old_link_state and the resolve function followed the same logic. The
piece I missed is that setting the bit in phylink_disable_state is
forcing the link to resolve to false which in turn would call
mac_link_down if the old state is considered to be up and a netdev
isn't present.

> Now, how about explaining what's going on in fbnic_mac_link_up_asic()
> and fbnic_mac_link_down_asic(), and what in there triggers the BMC
> to press the panic stations do-not-press red buttin?

So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are:

1. In fbnic_mac_link_down_asic we assume that if we are being told
that the link is down by phylink that it really means the link is down
and we need to shut off the MAC and flush any packets that are in the
Tx FIFO. The issue isn't so much the call itself, it is the fact that
we are being called in phylink_resume to rebalance the link that will
be going from UP->UP. The rebalancing is introducing an extra down
step. This could be resolved by adding an extra check to the line in
phylink_resume that is calling the mac_link_down so that it doesn't
get triggered and stall the link. In my test code I am now calling it
"pl->rolling_start".

2. In phylink_start/phylink_resume since we are coming at this from a
"rolling start" due to the BMC we have the MAC up and the
netif_carrier in the "off" state. As a result if we end up losing the
link between mac_prepare and pcs_get_state the MAC gets into a bad
state where netif_carrier is off, the MAC is stuck in the "up" state,
and we have stale packets clogging up the Tx FIFO.

As far as the BMC it isn't so much wanting to hit the big red button
as our platform team. Basically losing of packets is very problematic
for them, think about ssh sessions that suddenly lock up during boot,
and they can demonstrate that none of the other vendors that have the
MAC/PCS/PHY buried in their firmware have this issue. I was really
hoping to avoid going that route as the whole point of this was to
keep the code open source and visible.

> If you wish to have confirmation of what netif_carrier does, then I'm
> sure your co-maintainer and core Linux networking maintainer, Jakub,
> would be happy to help.

I'm well aware of what netif_carrier does, no need to be patronizing.
I have been doing this for over 16 years. The difference is for me the
1G stuff was 16 years ago, most of the stuff I deal with now is 25G or
higher and multi-host which is a different world with different
requirements. This is why we are having our communication issues in
understanding where each other is coming from.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 17:06             ` Alexander Duyck
@ 2025-04-17 17:27               ` Russell King (Oracle)
  2025-04-17 19:49                 ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-17 17:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 10:06:45AM -0700, Alexander Duyck wrote:
> On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote:
> > > > This is the only spot where we weren't setting netif_carrier_on/off and
> > > > old_link_state together. I suspect you could just carry old_link_state
> > > > without needing to add a new argument. Basically you would just need to
> > > > drop the "else" portion of this statement.
> > > >
> > > > In the grand scheme of things with the exception of this one spot
> > > > old_link_state is essentially the actual MAC/PCS link state whereas
> > > > netif_carrier_off is the administrative state.
> > >
> > > Sorry to say, but you have that wrong. Neither are the administrative
> > > state.
> >
> > To add to this (sorry, I rushed that reply), old_link_state is used when
> > we aren't bound to a network device, otherwise the netif carrier state
> > is used. Changes in the netif carrier state provoke netlink messages to
> > userspace to inform userspace of changes to the link state.
> >
> > Moreover, it controls whether the network stack queues packets for
> > transmission to the driver, and also whether the transmit watchdog is
> > allowed to time out. It _probably_ also lets the packet schedulers
> > know that the link state has changed.
> >
> > So, the netif carrier state is not "administrative".
> 
> I have always sort of assumed that netif_carrier_on/off was the
> logical AND of the administrative state of the NIC and the state of
> the actual MAC/PCS/PHY. That is why I refer to it as an administrative
> state. You end up with ifconfig up/down toggling netif carrier even if
> the underlying link state hasn't changed. This has been the case for
> many of the high end NICs for a while now, especially for the
> multi-host ones, as the firmware won't change the actual physical
> state of the link. It leaves it in place while the NIC port on the
> host is no longer active.
> 
> > It isn't strictly necessary for old_link_state to remain in sync with
> > the netif carrier state, but it is desirable to avoid errors - but
> > obviously the netif carrier state only exists when we're bound to a
> > network device.
> 
> From what I can tell this is the only spot where the two diverge,
> prior to this patch. The general thought I was having was to update
> the netif_carrier state in the suspend, and then old_link_state in the
> resume. I realize now the concern is that setting the
> phylink_disable_state is essentially the same as setting the link
> down. So really we have 3 states we are tracking,
> netif_carrier_ok/old_link_state, phylink_disable_state, and if we want
> the link state to change while disabled. So we do need one additional
> piece of state in the event that there isn't a netdev in order to
> handle the case where "pl->phylink_disable_state &&
> !!pl->phylink_disable_state == pl->old_link_state".
> 
> > > > > -         /* Call mac_link_down() so we keep the overall state balanced.
> > > > > -          * Do this under the state_mutex lock for consistency. This
> > > > > -          * will cause a "Link Down" message to be printed during
> > > > > -          * resume, which is harmless - the true link state will be
> > > > > -          * printed when we run a resolve.
> > > > > -          */
> > > > > -         mutex_lock(&pl->state_mutex);
> > > > > -         phylink_link_down(pl);
> > > > > -         mutex_unlock(&pl->state_mutex);
> > > > > +         if (pl->suspend_link_up) {
> > > > > +                 /* Call mac_link_down() so we keep the overall state
> > > > > +                  * balanced. Do this under the state_mutex lock for
> > > > > +                  * consistency. This will cause a "Link Down" message
> > > > > +                  * to be printed during resume, which is harmless -
> > > > > +                  * the true link state will be printed when we run a
> > > > > +                  * resolve.
> > > > > +                  */
> > > > > +                 mutex_lock(&pl->state_mutex);
> > > > > +                 phylink_link_down(pl);
> > > > > +                 mutex_unlock(&pl->state_mutex);
> > > > > +         }
> > > >
> > > > You should be able to do all of this with just old_link_state. The only
> > > > thing that would have to change is that you would need to set
> > > > old_link_state to false after the if statement.
> > >
> > > Nope.
> >
> > And still nope - what we need to know here is what was the link state
> > _before_ we called netif_carrier_off() or set old_link_state to false.
> >
> > I somehow suspect that you don't understand what all this code is trying
> > to do, so let me explain it.
> >
> > In the suspend function, when WoL is enabled, and we're using MAC-based
> > WoL, we need the link to the MAC to remain up, so it can receive packets
> > to check whether they are the appropriate wake packet. However, we don't
> > want packets to be queued for transmission either.
> 
> That is the same as what we are looking for. We aren't queueing any
> packets for transmission ourselves. We just want to leave the MAC
> enabled, however we also want to leave it enabled when we resume.
> 
> > So, we have the case where we want to avoid notifying the MAC that the
> > link has gone down, but we also want to call netif_carrier_off() to stop
> > the network stack queueing packets.
> >
> > To achieve this, several things work in unison:
> >
> > - we take the state mutex to prevent the resolver from running while we
> >   fiddle with various state.
> > - we disable the resolver (which, if that's the only thing that happens,
> >   will provoke the resolver to take the link down.)
> > - we lie to the resolver about the link state, by calling
> >   netif_carrier_off() and/or setting old_link_state to false. This
> >   means the resolver believes the link is _already_ down, and thus
> >   because of the strict ordering I've previously mentioned, will *not*
> >   call mac_link_down().
> > - we release the lock.
> >
> > There is now no way that the resolver will call either mac_link_up() or
> > mac_link_down() - which is what we want here.
> 
> The part that I think I missed here was that if we set
> phylink_disable_state we then set link_state.link to false in
> phylink_resolve. I wonder if we couldn't just have a flag that sets
> cur_link_state to false in the "if(pl->phylink_disable_state)" case
> and simplify this to indicate we won't force the link down"

Then how does phylink_stop() end up calling .mac_link_down() ?

> So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are:
> 
> 1. In fbnic_mac_link_down_asic we assume that if we are being told
> that the link is down by phylink that it really means the link is down
> and we need to shut off the MAC and flush any packets that are in the
> Tx FIFO. The issue isn't so much the call itself, it is the fact that
> we are being called in phylink_resume to rebalance the link that will
> be going from UP->UP. The rebalancing is introducing an extra down
> step. This could be resolved by adding an extra check to the line in
> phylink_resume that is calling the mac_link_down so that it doesn't
> get triggered and stall the link. In my test code I am now calling it
> "pl->rolling_start".

You're not addessing the issue I've already stated here.

If the link was up, and we *don't* call mac_link_down(), we then *can*
*not* call phylink_mac_initial_config(). It's as simple as that. The
MAC must see link down before being configured.

So, if the link was up, and we don't call mac_link_down() then we must
also *not* call phylink_mac_initial_config(). I've no idea what will
break with that change.

> 2. In phylink_start/phylink_resume since we are coming at this from a
> "rolling start" due to the BMC we have the MAC up and the
> netif_carrier in the "off" state. As a result if we end up losing the
> link between mac_prepare and pcs_get_state the MAC gets into a bad
> state where netif_carrier is off, the MAC is stuck in the "up" state,
> and we have stale packets clogging up the Tx FIFO.
> 
> As far as the BMC it isn't so much wanting to hit the big red button
> as our platform team. Basically losing of packets is very problematic
> for them, think about ssh sessions that suddenly lock up during boot,
> and they can demonstrate that none of the other vendors that have the
> MAC/PCS/PHY buried in their firmware have this issue. I was really
> hoping to avoid going that route as the whole point of this was to
> keep the code open source and visible.

The problem I have is not the idea, but the implementation. You want
to change the phylink behaviour in a way that affects *all* existing
users. I don't want that because of the guarantees of that contract
I've previously stated that I've given to existing users.

As things currently stand, you have a currently unique new case, and
you're trying to force your solution on all users potentially breaking
them. I have no real issue with supporting the new case, but *not* at
the expense of regressing existing phylink users.

Yet, when I point out this, you seem to be dead against *not* affecting
other users. This is where the problem is, and where we fundamentally
disagree.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 17:27               ` Russell King (Oracle)
@ 2025-04-17 19:49                 ` Alexander Duyck
  2025-04-22  9:51                   ` Russell King (Oracle)
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-17 19:49 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 10:27 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Thu, Apr 17, 2025 at 10:06:45AM -0700, Alexander Duyck wrote:
> > On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle)
> > <linux@armlinux.org.uk> wrote:
> > >
> > > On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote:

[...]

> > > So, we have the case where we want to avoid notifying the MAC that the
> > > link has gone down, but we also want to call netif_carrier_off() to stop
> > > the network stack queueing packets.
> > >
> > > To achieve this, several things work in unison:
> > >
> > > - we take the state mutex to prevent the resolver from running while we
> > >   fiddle with various state.
> > > - we disable the resolver (which, if that's the only thing that happens,
> > >   will provoke the resolver to take the link down.)
> > > - we lie to the resolver about the link state, by calling
> > >   netif_carrier_off() and/or setting old_link_state to false. This
> > >   means the resolver believes the link is _already_ down, and thus
> > >   because of the strict ordering I've previously mentioned, will *not*
> > >   call mac_link_down().
> > > - we release the lock.
> > >
> > > There is now no way that the resolver will call either mac_link_up() or
> > > mac_link_down() - which is what we want here.
> >
> > The part that I think I missed here was that if we set
> > phylink_disable_state we then set link_state.link to false in
> > phylink_resolve. I wonder if we couldn't just have a flag that sets
> > cur_link_state to false in the "if(pl->phylink_disable_state)" case
> > and simplify this to indicate we won't force the link down"
>
> Then how does phylink_stop() end up calling .mac_link_down() ?

What I was saying is that we could add an additional state flag that
we set before you write to the phylink_disable_state. You would
essentially set the state to true if you want to preserve the current
state, and if it is true you would set cur_link_statte to false in
phylink_resolve ignoring the actual current link state.

So in phylink_stop you would set it to false, and in phylink_suspend
you would set it to true. With that change phylink_stop could force
the link down, whereas phylink_suspend would keep it up,
phylink_suspend could deal with netif_carrier_off, and phylink_resume
could deal with old_link_state.

Anyway it wasn't as if I was saying we need to make that change. I
have a bad habit of thinking out loud and it tends to carry over into
emails..

> > So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are:
> >
> > 1. In fbnic_mac_link_down_asic we assume that if we are being told
> > that the link is down by phylink that it really means the link is down
> > and we need to shut off the MAC and flush any packets that are in the
> > Tx FIFO. The issue isn't so much the call itself, it is the fact that
> > we are being called in phylink_resume to rebalance the link that will
> > be going from UP->UP. The rebalancing is introducing an extra down
> > step. This could be resolved by adding an extra check to the line in
> > phylink_resume that is calling the mac_link_down so that it doesn't
> > get triggered and stall the link. In my test code I am now calling it
> > "pl->rolling_start".
>
> You're not addessing the issue I've already stated here.
>
> If the link was up, and we *don't* call mac_link_down(), we then *can*
> *not* call phylink_mac_initial_config(). It's as simple as that. The
> MAC must see link down before being configured.
>
> So, if the link was up, and we don't call mac_link_down() then we must
> also *not* call phylink_mac_initial_config(). I've no idea what will
> break with that change.

Sorry, mentioning it didn't occur to me as I have been dealing with
the "rolling start" since the beginning. In mac_prepare I deal with
this. Specifically the code in mac_prepare will check to see if the
link state is currently up with the desired configuration already or
not. If it is, it sets a flag that will keep us from doing any changes
that would be destructive to the link. If the link is down it
basically clears the way for a full reinitialization.

> > 2. In phylink_start/phylink_resume since we are coming at this from a
> > "rolling start" due to the BMC we have the MAC up and the
> > netif_carrier in the "off" state. As a result if we end up losing the
> > link between mac_prepare and pcs_get_state the MAC gets into a bad
> > state where netif_carrier is off, the MAC is stuck in the "up" state,
> > and we have stale packets clogging up the Tx FIFO.
> >
> > As far as the BMC it isn't so much wanting to hit the big red button
> > as our platform team. Basically losing of packets is very problematic
> > for them, think about ssh sessions that suddenly lock up during boot,
> > and they can demonstrate that none of the other vendors that have the
> > MAC/PCS/PHY buried in their firmware have this issue. I was really
> > hoping to avoid going that route as the whole point of this was to
> > keep the code open source and visible.
>
> The problem I have is not the idea, but the implementation. You want
> to change the phylink behaviour in a way that affects *all* existing
> users. I don't want that because of the guarantees of that contract
> I've previously stated that I've given to existing users.

That is why I mentioned "renegotiating the terms" in my other email. I
would be okay with adding a phylink_config variable for the
"rolling_start".

Essentially with "rolling_start" we would have:
1. phylink_resume would not call phylink_link_down to rebalance the state
2. phylink_resolve would have to call mac_link_up or mac_link_down
following the first call after phylink_start/resume.
3. mac_prepare would have to take on the responsibility of doing
mac_link_down prior to any destructive configuration change.
4. Calls to mac_link_up/mac_link_down would be idempotent essentially
meaning that calling up on a link that is already up has no effect
(other than maybe updating autoneg settings), and calling down on a
down link likewise would have no effect.

> As things currently stand, you have a currently unique new case, and
> you're trying to force your solution on all users potentially breaking
> them. I have no real issue with supporting the new case, but *not* at
> the expense of regressing existing phylink users.
>
> Yet, when I point out this, you seem to be dead against *not* affecting
> other users. This is where the problem is, and where we fundamentally
> disagree.

I am not sure where that impression came from. I realized after you
mentioned it that the change was too broad and I had mentioned
limiting it with a config option on the other thread.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-17 19:49                 ` Alexander Duyck
@ 2025-04-22  9:51                   ` Russell King (Oracle)
  2025-04-22 15:30                     ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-22  9:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Thu, Apr 17, 2025 at 12:49:07PM -0700, Alexander Duyck wrote:
> What I was saying is that we could add an additional state flag that
> we set before you write to the phylink_disable_state. You would
> essentially set the state to true if you want to preserve the current
> state, and if it is true you would set cur_link_statte to false in
> phylink_resolve ignoring the actual current link state.
> 
> So in phylink_stop you would set it to false, and in phylink_suspend
> you would set it to true. With that change phylink_stop could force
> the link down, whereas phylink_suspend would keep it up,
> phylink_suspend could deal with netif_carrier_off, and phylink_resume
> could deal with old_link_state.

I really don't like the idea that the netif carrier state differs from
old_link_state. These need to be the same to ensure that drivers which
use phylink in PHYLINK_NETDEV mode (which uses netif carrier for link
state tracking) vs PHYLINK_DEV mode (which uses old_link_state) see the
same behaviour, becomes much harder to guarantee if we start treating
these differently in the code depending on something other than which
PHYLINK*DEV is in use. It's really not something I want to entertain.

> > So, if the link was up, and we don't call mac_link_down() then we must
> > also *not* call phylink_mac_initial_config(). I've no idea what will
> > break with that change.
> 
> Sorry, mentioning it didn't occur to me as I have been dealing with
> the "rolling start" since the beginning. In mac_prepare I deal with
> this. Specifically the code in mac_prepare will check to see if the
> link state is currently up with the desired configuration already or
> not. If it is, it sets a flag that will keep us from doing any changes
> that would be destructive to the link. If the link is down it
> basically clears the way for a full reinitialization.

I would much rather avoid any of the "setup" calls (that means
mac_prepare(), mac_config(), mac_finish(), pcs_config() etc) and
mac_link_up() if we're going to add support for "rolling start" to
phylink.

That basically means that the MAC needs to be fully configured to
process packets before phylink_start() or phylink_resume() is called.

This, however, makes me wonder why you'd even want to use phylink in
this situation, as phylink will be doing virtually nothing for fbnic.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-22  9:51                   ` Russell King (Oracle)
@ 2025-04-22 15:30                     ` Alexander Duyck
  2025-04-22 16:00                       ` Andrew Lunn
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-22 15:30 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Tue, Apr 22, 2025 at 2:51 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Thu, Apr 17, 2025 at 12:49:07PM -0700, Alexander Duyck wrote:

...

> > Sorry, mentioning it didn't occur to me as I have been dealing with
> > the "rolling start" since the beginning. In mac_prepare I deal with
> > this. Specifically the code in mac_prepare will check to see if the
> > link state is currently up with the desired configuration already or
> > not. If it is, it sets a flag that will keep us from doing any changes
> > that would be destructive to the link. If the link is down it
> > basically clears the way for a full reinitialization.
>
> I would much rather avoid any of the "setup" calls (that means
> mac_prepare(), mac_config(), mac_finish(), pcs_config() etc) and
> mac_link_up() if we're going to add support for "rolling start" to
> phylink.

It would be fine by me, at least in the case where the link is already
up and in the correct mode. For the other cases we would still want to
do this in case there is some incomplete setup or something like that.
We would essentially just pull the block out of mac_prepare and place
it wherever is needed to get things up and running.

> That basically means that the MAC needs to be fully configured to
> process packets before phylink_start() or phylink_resume() is called.
>
> This, however, makes me wonder why you'd even want to use phylink in
> this situation, as phylink will be doing virtually nothing for fbnic.

It wasn't that we had wanted to initially, it was more that we were
told that we must use it during the early driver review. I couldn't
really argue against it as our setup fits the model since we have a
MAC/PCS/FEC/PMA/PMD/AN and such. A good example is that I am pretty
certain we will end up using the XPCS driver eventually, however to do
so we must update it as it supports more than XLGMII which is why
somebody added those speeds without adding the correct interfaces.
There are 25G, 50G, and 100G modes that are likely supported if I am
not mistaken. Can't blame them though as that is essentially the state
we are in right now for fbnic. However I was holding off on enabling
the PCS until we can get the MAC pieces sorted out.

For us to fit we are going to have to expand things quite a bit as we
need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and
multi-host type behavior at a minimum. I had more-or-less assumed
there was a desire to push the interface support up to 100G or more
and that was one motivation for pushing us into phylink. By pushing us
in it was a way to get there with us as the lead test/development
vehicle since we are one of the first high speed NICs to actually
expose most of the hardware and not hide it behind firmware.

That said,  I have come to see some of the advantages for us as well.
Things like the QSFP support seems like it should be a light lift as I
just have to add support for basic SFF-8636, which isn't too
dissimilar to SFF-8472, and the rest seems to mostly fall in place
with the device picking up the interface mode from the QSFP module as
there isn't much needed for a DA cable.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-22 15:30                     ` Alexander Duyck
@ 2025-04-22 16:00                       ` Andrew Lunn
  2025-04-22 20:09                         ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-22 16:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Russell King (Oracle), Heiner Kallweit, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

> For us to fit we are going to have to expand things quite a bit as we
> need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and
> multi-host type behavior at a minimum. I had more-or-less assumed
> there was a desire to push the interface support up to 100G or more
> and that was one motivation for pushing us into phylink. By pushing us
> in it was a way to get there with us as the lead test/development
> vehicle since we are one of the first high speed NICs to actually
> expose most of the hardware and not hide it behind firmware.
> 
> That said,  I have come to see some of the advantages for us as well.
> Things like the QSFP support seems like it should be a light lift as I
> just have to add support for basic SFF-8636, which isn't too
> dissimilar to SFF-8472, and the rest seems to mostly fall in place
> with the device picking up the interface mode from the QSFP module as
> there isn't much needed for a DA cable.

You should also get hwmon for the SFP for free. ethtool
--dump-module-eeprom will need a little work in sfp.c, but less work
than a whole MAC driver implementation. With that in place firmware
upgrade of the SFP should be easy. And we have a good quirk
infrastructure in place for dealing with SFPs, which all seem broken
in some way. No need to reinvent that.

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-22 16:00                       ` Andrew Lunn
@ 2025-04-22 20:09                         ` Alexander Duyck
  0 siblings, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-22 20:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King (Oracle), Heiner Kallweit, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni

On Tue, Apr 22, 2025 at 9:00 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > For us to fit we are going to have to expand things quite a bit as we
> > need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and
> > multi-host type behavior at a minimum. I had more-or-less assumed
> > there was a desire to push the interface support up to 100G or more
> > and that was one motivation for pushing us into phylink. By pushing us
> > in it was a way to get there with us as the lead test/development
> > vehicle since we are one of the first high speed NICs to actually
> > expose most of the hardware and not hide it behind firmware.
> >
> > That said,  I have come to see some of the advantages for us as well.
> > Things like the QSFP support seems like it should be a light lift as I
> > just have to add support for basic SFF-8636, which isn't too
> > dissimilar to SFF-8472, and the rest seems to mostly fall in place
> > with the device picking up the interface mode from the QSFP module as
> > there isn't much needed for a DA cable.
>
> You should also get hwmon for the SFP for free. ethtool
> --dump-module-eeprom will need a little work in sfp.c, but less work
> than a whole MAC driver implementation. With that in place firmware
> upgrade of the SFP should be easy. And we have a good quirk
> infrastructure in place for dealing with SFPs, which all seem broken
> in some way. No need to reinvent that.

Actually the hwmon will need some work as I recall. The issue is that
it is reliant on the a2 page support and we don't have that in QSFP
since it is based on SFF-8636.

As far as the QSFP modules we are using we also don't have to deal
with FW upgrades and such since we are just dealing with direct attach
modules, and the quirks for now don't do much for us either for much
the same reason. However there is always the potential for us to use
something other than basic direct-attach cables in the future.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
  2025-04-16 17:14       ` Russell King (Oracle)
  2025-04-17 14:30       ` Alexander H Duyck
@ 2025-04-23  0:00       ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 44+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-04-23  0:00 UTC (permalink / raw)
  To: Russell King
  Cc: andrew, hkallweit1, alexander.duyck, davem, edumazet, kuba,
	qiangqing.zhang, netdev, pabeni

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 16 Apr 2025 17:16:01 +0100 you wrote:
> When WoL is enabled, we update the software state in phylink to
> indicate that the link is down, and disable the resolver from
> bringing the link back up.
> 
> On resume, we attempt to bring the overall state into consistency
> by calling the .mac_link_down() method, but this is wrong if the
> link was already down, as phylink strictly orders the .mac_link_up()
> and .mac_link_down() methods - and this would break that ordering.
> 
> [...]

Here is the summary with links:
  - [net] net: phylink: fix suspend/resume with WoL enabled and link down
    https://git.kernel.org/netdev/net/c/4c8925cb9db1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 16:01   ` Russell King (Oracle)
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
@ 2025-04-16 17:12     ` Russell King (Oracle)
  2025-04-16 19:03     ` Alexander Duyck
  2 siblings, 0 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 17:12 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 05:01:09PM +0100, Russell King (Oracle) wrote:
> On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <alexanderduyck@fb.com>
> > 
> > This change is meant to address the fact that there are link imbalances
> > introduced when using phylink on a system with a BMC. Specifically there
> > are two issues.
> > 
> > The first issue is that if we lose link after the first call to
> > phylink_start but before it gets to the phylink_resolve we will end up with
> > the phylink interface assuming the link was always down and not calling
> > phylink_link_down resulting in a stuck interface.
> 
> That is intentional.
> 
> phylink strictly orders .mac_link_down and .mac_link_up, and starts from
> an initial position that the link _will_ be considered to be down. So,
> it is intentional that .mac_link_down will _never_ be called after
> phylink_start().
> 
> > The second issue is that when a BMC is present we are currently forcing the
> > link down. This results in us bouncing the link for a fraction of a second
> > and that will result in dropped packets for the BMC.
> 
> ... but you don't explain how that happens.
> 
> > The third issue is just an extra "Link Down" message that is seen when
> > calling phylink_resume. This is addressed by identifying that the link
> > isn't balanced and just not displaying the down message in such a case.
> 
> Hmm, this one is an error, but is not as simple as "don't print the
> message" as it results in a violation of the rule I mentioned above.
> We need phylink_suspend() to record the state of the link at that
> point, and avoid calling phylink_link_down() if the link was down
> prior to suspend.

Okay, confirmed on nvidia Jetson Xavier NX:

[   11.838132] dwc-eth-dwmac 2490000.ethernet eth0: Adding VLAN ID 0 is not supported
[   15.299757] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx

LAN cable was unplugged:
[   50.436587] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down

Then the system was suspended using rtcwake for 3 seconds:

[   54.736849] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
[   54.736898] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
[   54.741078] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down

This shouldn't happen. With the patch I posted, this second "Link is Down"
message is not printed, and .mac_link_down() will not be called.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 16:01   ` Russell King (Oracle)
  2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
  2025-04-16 17:12     ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle)
@ 2025-04-16 19:03     ` Alexander Duyck
  2025-04-16 19:19       ` Russell King (Oracle)
  2 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 19:03 UTC (permalink / raw)
  To: Russell King (Oracle); +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <alexanderduyck@fb.com>
> >
> > This change is meant to address the fact that there are link imbalances
> > introduced when using phylink on a system with a BMC. Specifically there
> > are two issues.
> >
> > The first issue is that if we lose link after the first call to
> > phylink_start but before it gets to the phylink_resolve we will end up with
> > the phylink interface assuming the link was always down and not calling
> > phylink_link_down resulting in a stuck interface.
>
> That is intentional.
>
> phylink strictly orders .mac_link_down and .mac_link_up, and starts from
> an initial position that the link _will_ be considered to be down. So,
> it is intentional that .mac_link_down will _never_ be called after
> phylink_start().

Well the issue is that with a BMC present the link may be up before we
even start using phylink. So if the link is lost while we are going
through phylink_start we will end up in an in-between state where the
link is physically down, but the MAC is still configured as though the
link is up. This will be problematic as the MAC should essentially be
discarding frames for transmit if the link is down to avoid blocking
internal Tx FIFOs.

Everything for our driver has to be a light touch as bringing down the
BMC link for any reason other than the physical link being lost is
essentially a criteria for failure as the BMC is the most essential
link on the system. So, for example the code I have in the driver for
handling a major config is currently checking in mac_prepare to verify
if we already have link based on the requested interface mode and FEC
settings and if we do the mac_config and pcs_config steps read an
internal state flag and do nothing, then we just go through to
mac_finish where we write the necessary bits to make sure the PCS/PMA
and PMA/PMD connections are enabled which essentially becomes a no-op
if the link is already enabled.

> > The second issue is that when a BMC is present we are currently forcing the
> > link down. This results in us bouncing the link for a fraction of a second
> > and that will result in dropped packets for the BMC.
>
> ... but you don't explain how that happens.

It was right there in the patch. It was the lines I removed:
@@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl)
        if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) {
                /* Wake-on-Lan enabled, MAC handling */

-               /* Call mac_link_down() so we keep the overall state balanced.
-                * Do this under the state_mutex lock for consistency. This
-                * will cause a "Link Down" message to be printed during
-                * resume, which is harmless - the true link state will be
-                * printed when we run a resolve.
-                */
-               mutex_lock(&pl->state_mutex);
-               phylink_link_down(pl);
-               mutex_unlock(&pl->state_mutex);
-
                /* Re-apply the link parameters so that all the settings get
                 * restored to the MAC.
                 */

From a BMC perspective this forcing of the link down even if it is for
a fraction of a second is unacceptable as it can break up ssh sessions
to the BMC, especially if somebody is doing a bunch of configuration
changes on the NIC as it results in dropped frames. When compared to
the firmware based NIC approaches such as Broadcom and Nvidia/Mellanox
it is a huge negative as the BMC link is static with those NICs and
doesn't bounce no matter if the interface is being configured up/down,
the driver proved/removed ect.

The issue, even with your recent patch, is that it will still force
the link down if the link was previously up. That is the piece I need
to avoid to prevent the BMC from losing link. Ideally what I need is
to have a check of the current link state and then sync back up rather
than force the phylink state on the MAC and then clean things up after
the fact.

> > The third issue is just an extra "Link Down" message that is seen when
> > calling phylink_resume. This is addressed by identifying that the link
> > isn't balanced and just not displaying the down message in such a case.
>
> Hmm, this one is an error, but is not as simple as "don't print the
> message" as it results in a violation of the rule I mentioned above.
> We need phylink_suspend() to record the state of the link at that
> point, and avoid calling phylink_link_down() if the link was down
> prior to suspend.

Looking at your fix it doesn't resolve my core issue. All it does is
address the "third issue".

In my case the link is up and needs to stay up when I resume. The fact
that we are forcing the link down causes our BMC to drop connection
which makes it difficult to remotely manage the interface as changes
to the interface such as a simple ifconfig down; ifconfig up are
enough to cause the BMC to drop packets and potentially the ssh
session if we start losing enough of them.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 19:03     ` Alexander Duyck
@ 2025-04-16 19:19       ` Russell King (Oracle)
  2025-04-16 20:05         ` Russell King (Oracle)
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 19:19 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 12:03:05PM -0700, Alexander Duyck wrote:
> On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote:
> > > From: Alexander Duyck <alexanderduyck@fb.com>
> > >
> > > This change is meant to address the fact that there are link imbalances
> > > introduced when using phylink on a system with a BMC. Specifically there
> > > are two issues.
> > >
> > > The first issue is that if we lose link after the first call to
> > > phylink_start but before it gets to the phylink_resolve we will end up with
> > > the phylink interface assuming the link was always down and not calling
> > > phylink_link_down resulting in a stuck interface.
> >
> > That is intentional.
> >
> > phylink strictly orders .mac_link_down and .mac_link_up, and starts from
> > an initial position that the link _will_ be considered to be down. So,
> > it is intentional that .mac_link_down will _never_ be called after
> > phylink_start().
> 
> Well the issue is that with a BMC present the link may be up before we
> even start using phylink. So if the link is lost while we are going
> through phylink_start we will end up in an in-between state where the
> link is physically down, but the MAC is still configured as though the
> link is up. This will be problematic as the MAC should essentially be
> discarding frames for transmit if the link is down to avoid blocking
> internal Tx FIFOs.

So, when a Linux network driver probes, it starts out in administrative
state *DOWN*. When the administrator configures the network driver,
.ndo_open is called, which is expected to configure the network adapter.

Part of that process is to call phylink_start() as one of the last
steps, which detects whether the link is up or not. If the link is up,
then phylink will call netif_carrier_on() and .mac_link_up(). This
tells the core networking layers that the network interface is now
ready to start sending packets, and it will begin queueing packets for
the network driver to process - not before.

Prior to .ndo_open being called, the networking layers do not expect
traffic from the network device no matter what physical state the
media link is in. If .ndo_open fails, the same applies - no traffic is
expected to be passed to the core network layers from the network
layers because as far as the network stack is concerned, the interface
is still administratively down.

Thus, the fact that your BMC thinks that the link is up is irrelevant.

So, start off in a state that packet activity is suspended even if the
link is up at probe time. Only start packet activity (reception and
transmission) once .mac_link_up() has been called. Stop that activity
when .mac_link_down() is subsequently called.

There have been lots of NICs out there where the physical link doesn't
follow the adminstrative state of the network interface. This is not a
problem. It may be desirable that it does, but a desire is not the same
as "it must".

> > > The second issue is that when a BMC is present we are currently forcing the
> > > link down. This results in us bouncing the link for a fraction of a second
> > > and that will result in dropped packets for the BMC.
> >
> > ... but you don't explain how that happens.
> 
> It was right there in the patch. It was the lines I removed:

... thus further breaking phylink guarantees.

Sorry, but no.

> The issue, even with your recent patch, is that it will still force
> the link down if the link was previously up. That is the piece I need
> to avoid to prevent the BMC from losing link. Ideally what I need is
> to have a check of the current link state and then sync back up rather
> than force the phylink state on the MAC and then clean things up after
> the fact.

So don't force the link, just stop packet activity. As stated above,
nothing requires that the physical link is forced down just because
.mac_link_down() has been called.

This makes me wonder what happens to your BMC with your ideas if
userspace takes down the network interface. It sounds like all hell
breaks loose because you've taken the link down, and that's seen as
a critical failure... So taking down a network interface becomes a
critical failure - yet it's a *normal* userspace operation.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 19:19       ` Russell King (Oracle)
@ 2025-04-16 20:05         ` Russell King (Oracle)
  2025-04-16 22:58           ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-16 20:05 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote:
> So, when a Linux network driver probes, it starts out in administrative
> state *DOWN*. When the administrator configures the network driver,
> .ndo_open is called, which is expected to configure the network adapter.
> 
> Part of that process is to call phylink_start() as one of the last
> steps, which detects whether the link is up or not. If the link is up,
> then phylink will call netif_carrier_on() and .mac_link_up(). This
> tells the core networking layers that the network interface is now
> ready to start sending packets, and it will begin queueing packets for
> the network driver to process - not before.
> 
> Prior to .ndo_open being called, the networking layers do not expect
> traffic from the network device no matter what physical state the
> media link is in. If .ndo_open fails, the same applies - no traffic is
> expected to be passed to the core network layers from the network
> layers because as far as the network stack is concerned, the interface
> is still administratively down.
> 
> Thus, the fact that your BMC thinks that the link is up is irrelevant.
> 
> So, start off in a state that packet activity is suspended even if the
> link is up at probe time. Only start packet activity (reception and
> transmission) once .mac_link_up() has been called. Stop that activity
> when .mac_link_down() is subsequently called.
> 
> There have been lots of NICs out there where the physical link doesn't
> follow the adminstrative state of the network interface. This is not a
> problem. It may be desirable that it does, but a desire is not the same
> as "it must".

Let me be crystal clear on this.

Phylink has a contract with all existing users. That contract is:

Initial state: link down.

Driver calls phylink_start() in its .ndo_open method.

Phylink does configuration of the PHY and link according to the
chosen link parameters by calling into the MAC, PCS, and phylib as
appropriate.

If the link is then discovered to be up (it might have been already
up before phylink_start() was called), phylink will call the various
components such as PCS and MAC to inform them that the link is now up.
This will mean calling the .mac_link_up() method. Otherwise (if the
link is discovered to be down when the interface is brought up) no
call to either .mac_link_up() nor .mac_link_down() will be made.

If the link _subsequently_ goes down, then phylink deals with that
and calls .mac_link_down() - only if .mac_link_up() was previously
called (that's one of the bugs you discovered, that on resume it
gets called anyway. I've submitted a fix for that contract breach,
which only affects a very small number of drivers - stmmac, ucc_geth
and your fbnic out of 22 total ethernet users plus however many DSA
users we have.)

Only if .mac_link_down() has been called, if the link subsequently
comes back up, then the same process happens as before resulting in
.mac_link_up() being called.

If the interface is taken down, then .mac_link_down() will be called
if and only if .mac_link_up() had been called.

The ordering of .mac_link_up() / .mac_link_down() is a strict
contract term with phylink users.

The reason for this contract: phylink users may have ordering
requirements.

For example, on mac_link_down(), they may wait for packet activity to
stop, and then place the MAC in reset. If called without a previous
.mac_link_up call, the wait stage may time out due to the MAC being
in reset. (Ocelot may suffer with this.)

Another example is fs_enet which also relies on this strict ordering
as described above.

There could be others - there are some that call into firmware on
calls to .mac_link_up() / .mac_link_down() and misordering those
depends on what the firmware is doing, which we have no visibility
of.

As I stated, this is the contract that phylink gave to users, and
the contract still stands, and can't be broken to behave differently
(e.g. calling .mac_link_down() after phylink_start() without an
intervening call to .mac_link_up()) otherwise existing users will
break. Bugs that go against that contract will be fixed, but the
contract will not be intentionally broken.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present
  2025-04-16 20:05         ` Russell King (Oracle)
@ 2025-04-16 22:58           ` Alexander Duyck
  0 siblings, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 22:58 UTC (permalink / raw)
  To: Russell King (Oracle); +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 1:05 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote:
> > So, when a Linux network driver probes, it starts out in administrative
> > state *DOWN*. When the administrator configures the network driver,
> > .ndo_open is called, which is expected to configure the network adapter.
> >
> > Part of that process is to call phylink_start() as one of the last
> > steps, which detects whether the link is up or not. If the link is up,
> > then phylink will call netif_carrier_on() and .mac_link_up(). This
> > tells the core networking layers that the network interface is now
> > ready to start sending packets, and it will begin queueing packets for
> > the network driver to process - not before.
> >
> > Prior to .ndo_open being called, the networking layers do not expect
> > traffic from the network device no matter what physical state the
> > media link is in. If .ndo_open fails, the same applies - no traffic is
> > expected to be passed to the core network layers from the network
> > layers because as far as the network stack is concerned, the interface
> > is still administratively down.
> >
> > Thus, the fact that your BMC thinks that the link is up is irrelevant.
> >
> > So, start off in a state that packet activity is suspended even if the
> > link is up at probe time. Only start packet activity (reception and
> > transmission) once .mac_link_up() has been called. Stop that activity
> > when .mac_link_down() is subsequently called.
> >
> > There have been lots of NICs out there where the physical link doesn't
> > follow the adminstrative state of the network interface. This is not a
> > problem. It may be desirable that it does, but a desire is not the same
> > as "it must".
>
> Let me be crystal clear on this.
>
> Phylink has a contract with all existing users. That contract is:
>
> Initial state: link down.
>
> Driver calls phylink_start() in its .ndo_open method.
>
> Phylink does configuration of the PHY and link according to the
> chosen link parameters by calling into the MAC, PCS, and phylib as
> appropriate.
>
> If the link is then discovered to be up (it might have been already
> up before phylink_start() was called), phylink will call the various
> components such as PCS and MAC to inform them that the link is now up.
> This will mean calling the .mac_link_up() method. Otherwise (if the
> link is discovered to be down when the interface is brought up) no
> call to either .mac_link_up() nor .mac_link_down() will be made.
>
> If the link _subsequently_ goes down, then phylink deals with that
> and calls .mac_link_down() - only if .mac_link_up() was previously
> called (that's one of the bugs you discovered, that on resume it
> gets called anyway. I've submitted a fix for that contract breach,
> which only affects a very small number of drivers - stmmac, ucc_geth
> and your fbnic out of 22 total ethernet users plus however many DSA
> users we have.)
>
> Only if .mac_link_down() has been called, if the link subsequently
> comes back up, then the same process happens as before resulting in
> .mac_link_up() being called.
>
> If the interface is taken down, then .mac_link_down() will be called
> if and only if .mac_link_up() had been called.
>
> The ordering of .mac_link_up() / .mac_link_down() is a strict
> contract term with phylink users.
>
> The reason for this contract: phylink users may have ordering
> requirements.
>
> For example, on mac_link_down(), they may wait for packet activity to
> stop, and then place the MAC in reset. If called without a previous
> .mac_link_up call, the wait stage may time out due to the MAC being
> in reset. (Ocelot may suffer with this.)
>
> Another example is fs_enet which also relies on this strict ordering
> as described above.
>
> There could be others - there are some that call into firmware on
> calls to .mac_link_up() / .mac_link_down() and misordering those
> depends on what the firmware is doing, which we have no visibility
> of.
>
> As I stated, this is the contract that phylink gave to users, and
> the contract still stands, and can't be broken to behave differently
> (e.g. calling .mac_link_down() after phylink_start() without an
> intervening call to .mac_link_up()) otherwise existing users will
> break. Bugs that go against that contract will be fixed, but the
> contract will not be intentionally broken.

The issue is as that stands the contract is inherently broken if a BMC
is present.

1. There is still the link loss during the phylink_start issue which
will essentially leave the MAC up if the link fails. This will cause
the Tx FIFO on the NIC to essentially be left to hang without flushing
out the stale packets that may have queued up when the link dropped.
This is one of the reasons why the mac_link_down call still needs to
propagate all the way back down to the hardware.

2. We already have precedent for the link being up when WOL is in use.
One concern I would have with your patch is if it will impact that or
not. I suspected part of the mac_link_down is related to cleaning up
something related to the link setup for WOL configuring the link for
some speed.

3. While it may be a part of the contract, isn't there some way we can
"renegotiate the terms"? The fact is in the case of our driver we are
essentially doing a hand-off from FW to the OS for the link when we
call ndo_open. When we are done in ndo_stop we hand ownership back
over to the firmware. Those hand-offs need to be free of link bounces.
This pattern looks very much like the WOL setup with the only
exception being trying to avoid these link bounces. I'm just wondering
if there isn't some way we can add a phylink_config value indicating
that there is a BMC present on the link so we can handle those cases.
The general idea would be to update my patch so instead of
pl->link_balanced being just set to false in suspend and at creation
time we could essentially just set it to "pl->link_balanced =
!pl->config.mac_bmc_handoff".

The fact is the code with the diff I provided did everything I needed.
I could load/unload and ifconfig up/down the driver all day and it
didn't drop a packet. As it stood the definition of the mac_config
code more or less called out that we weren't supposed to change things
unless we needed to and I had followed that. I suspect the overall
change should be small, smaller now following your fix of the message
issue, and it shouldn't impact anything other than our driver if I add
the config flag checks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck
  2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
  2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck
@ 2025-04-19 18:11 ` Andrew Lunn
  2025-04-20 18:18   ` Alexander Duyck
  2 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-19 18:11 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni

On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote:
> Address two issues found in the phylink code.
> 
> The first issue is the fact that there were unused defines that were
> referencing deprecated macros themselves. Since they aren't used we might
> as well drop them.
> 
> The second issue which is more the main reason for this submission is the
> fact that the BMC was losing link when we would call phylink_resume. This
> is fixed by adding a new boolean value link_balanced which will allow us
> to avoid doing an immediate force of the link up/down and instead defer it
> until after we have checked the actual link state.

I'm wondering if we have jumped straight into the weeds without having
a good overall big picture of what we are trying to achieve. But maybe
it is just me, and this is just for my edification...

As i've said a few times we don't have a good story around networking
and BMCs. Traditionally, all the details have been hidden away in the
NIC firmware, and linux is pretty much unaware it is going on, at
least from the Host side. fbnic is changing that, and we need
Linux/phylink to understand this.

Since this is all pretty new to me, i went and quickly read:

https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf

Hopefully i now have a better big picture.

Figure 2 answers a few questions for me. One was, do we actually have
a three port switch in here? And i would say no. We have something
similar, but not a switch. There is no back to back MAC on the host
PCI interface. We do have back to back MAC on the NC-SI port, but it
appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is
controlling BMC NC-SI MAC.

Not having a switch means when we are talking about the MAC, PCS, PHY
etc, we are talking about the media side MAC, PCS, PHY. Given that
phylink is just as often used with switches with a conduit interface
and switch ports, that is an important point.

Figure 2 also hints at there being three different life cycles all
interacting with each other. Our normal model in phylink is that the
Network Controller is passive, it is told what to do by
Linux/phylink. However, in this setup, that is not true. The Network
Controller is active, it has firmware running on it. The Network
Controller and the Management Controller life cycle probably starts at
about the same time, when the PSU starts generating standby power. The
host life cycle starts later, when the BMC decides to power up the
host.

The NC-SI protocol defines messages between the Management Controller
and the Network Controller. One of these messages is how to configure
the media side. See section 8.4.21. It lists different networks speeds
which can be negotiated, duplex, and pause, and if to use
auto-neg. There is not enough details to fully specific link modes
above 1000BaseT, all you can request for example is 40G, but you
cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a
way to ask the network controller what it actually supports. So i
guess you normally just ask for everything up to 100G, and you are
happy when Get Link Status response command says the link it 10BaseT
Half.

The Network Controller needs to be smart enough to get the link up and
running. So it basically has a phylink implementation, with a PCS
driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc.

Some text from the document, which is pretty relevant to the
discussion.

  The Set Link command may be used by the Management Controller to
  configure the external network interface associated with the channel
  by using the provided settings. Upon receiving this command, while
  the host NC driver is not operational, the channel shall attempt to
  set the link to the configuration specified by the parameters. Upon
  successful completion of this command, link settings specified in
  the command should be used by the network controller as long as the
  host NC driver does not overwrite the link settings.

  In the absence of an operational host NC driver, the NC should
  attempt to make the requested link state change even if it requires
  the NC to drop the current link. The channel shall send a response
  packet to the Management Controller within the required response
  time. However, the requested link state changes may take an
  unspecified amount of time to complete.

  The actual link settings are controlled by the host NC driver when
  it is operational. When the host NC driver is operational, link
  settings specified by the MC using the Set Link command may be
  overwritten by the host NC driver. The link settings are not
  restored by the NC if the host NC driver becomes non
  operational.

There is a very clear indication that the host is in control, or the
host is not in control. So one obvious question to me is, should
phylink have ops into the MAC driver to say it is taking over control,
and relinquishing control? The Linux model is that when the interface
is admin down, you can use ethtool to preconfigure things, but they
don't take affect until the link is admin up. So with admin down, we
have a host NC driver, but it is not operational, hence the Network
Controller is in control of the link at the Management Controllers
bequest. It is only with admin up that phylink takes control of the
Network Controller, and it releases it with admin down. Having these
ops would also help with suspend/resume. Suspend does not change the
admin up/down status, but the host clearly needs to hand over control
of the media to the Network Controller, and take it back again on
resume.

Also, if we have these ops, we know that admin down/suspend does not
mean media down. The presence of these ops triggers different state
transitions in the phylink state machine so that it simply hands off
control of the media, but otherwise leaves it alone.

With this in place, i think we can avoid all the unbalanced state?

What is potentially more interesting is when phylink takes control. Do
we have enough information about the system to say its current
configuration is the wanted configuration? Or are we forced to do a
ground up reconfiguration, which will include a media down/up? I had a
quick scan of the document and i did not find anything which says the
host is not allowed/is allowed to do disruptive things, but the text
quoted above says 'The actual link settings are controlled by the host
NC driver when it is operational'. Controlling the link settings is a
disruptive operation, so the management controller needs to be
tolerant to such changes.

So, can we ignore the weeds for the moment, and think about the big
picture?

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn
@ 2025-04-20 18:18   ` Alexander Duyck
  2025-04-20 21:58     ` Andrew Lunn
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-20 18:18 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni

On Sat, Apr 19, 2025 at 11:11 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote:
> > Address two issues found in the phylink code.
> >
> > The first issue is the fact that there were unused defines that were
> > referencing deprecated macros themselves. Since they aren't used we might
> > as well drop them.
> >
> > The second issue which is more the main reason for this submission is the
> > fact that the BMC was losing link when we would call phylink_resume. This
> > is fixed by adding a new boolean value link_balanced which will allow us
> > to avoid doing an immediate force of the link up/down and instead defer it
> > until after we have checked the actual link state.
>
> I'm wondering if we have jumped straight into the weeds without having
> a good overall big picture of what we are trying to achieve. But maybe
> it is just me, and this is just for my edification...
>
> As i've said a few times we don't have a good story around networking
> and BMCs. Traditionally, all the details have been hidden away in the
> NIC firmware, and linux is pretty much unaware it is going on, at
> least from the Host side. fbnic is changing that, and we need
> Linux/phylink to understand this.
>
> Since this is all pretty new to me, i went and quickly read:
>
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf
>
> Hopefully i now have a better big picture.
>
> Figure 2 answers a few questions for me. One was, do we actually have
> a three port switch in here? And i would say no. We have something
> similar, but not a switch. There is no back to back MAC on the host
> PCI interface. We do have back to back MAC on the NC-SI port, but it
> appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is
> controlling BMC NC-SI MAC.
>
> Not having a switch means when we are talking about the MAC, PCS, PHY
> etc, we are talking about the media side MAC, PCS, PHY. Given that
> phylink is just as often used with switches with a conduit interface
> and switch ports, that is an important point.
>
> Figure 2 also hints at there being three different life cycles all
> interacting with each other. Our normal model in phylink is that the
> Network Controller is passive, it is told what to do by
> Linux/phylink. However, in this setup, that is not true. The Network
> Controller is active, it has firmware running on it. The Network
> Controller and the Management Controller life cycle probably starts at
> about the same time, when the PSU starts generating standby power. The
> host life cycle starts later, when the BMC decides to power up the
> host.
>
> The NC-SI protocol defines messages between the Management Controller
> and the Network Controller. One of these messages is how to configure
> the media side. See section 8.4.21. It lists different networks speeds
> which can be negotiated, duplex, and pause, and if to use
> auto-neg. There is not enough details to fully specific link modes
> above 1000BaseT, all you can request for example is 40G, but you
> cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a
> way to ask the network controller what it actually supports. So i
> guess you normally just ask for everything up to 100G, and you are
> happy when Get Link Status response command says the link it 10BaseT
> Half.

In our case this is one of the reasons why the firmware knows what the
interface and media type are before we do. It basically has it coded
into the EEPROM so that it will identify the type as CR and normally
it informs the BMC and the host of the speed/FEC of the port as well.
So the BMC could configure this, but it is normally getting this info
from the NIC as it is setup to bring the link up before the BMC even
gets a chance to ask to setup the config.

> The Network Controller needs to be smart enough to get the link up and
> running. So it basically has a phylink implementation, with a PCS
> driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc.
>
> Some text from the document, which is pretty relevant to the
> discussion.
>
>   The Set Link command may be used by the Management Controller to
>   configure the external network interface associated with the channel
>   by using the provided settings. Upon receiving this command, while
>   the host NC driver is not operational, the channel shall attempt to
>   set the link to the configuration specified by the parameters. Upon
>   successful completion of this command, link settings specified in
>   the command should be used by the network controller as long as the
>   host NC driver does not overwrite the link settings.
>
>   In the absence of an operational host NC driver, the NC should
>   attempt to make the requested link state change even if it requires
>   the NC to drop the current link. The channel shall send a response
>   packet to the Management Controller within the required response
>   time. However, the requested link state changes may take an
>   unspecified amount of time to complete.
>
>   The actual link settings are controlled by the host NC driver when
>   it is operational. When the host NC driver is operational, link
>   settings specified by the MC using the Set Link command may be
>   overwritten by the host NC driver. The link settings are not
>   restored by the NC if the host NC driver becomes non
>   operational.
>
> There is a very clear indication that the host is in control, or the
> host is not in control. So one obvious question to me is, should
> phylink have ops into the MAC driver to say it is taking over control,
> and relinquishing control? The Linux model is that when the interface
> is admin down, you can use ethtool to preconfigure things, but they
> don't take affect until the link is admin up. So with admin down, we
> have a host NC driver, but it is not operational, hence the Network
> Controller is in control of the link at the Management Controllers
> bequest. It is only with admin up that phylink takes control of the
> Network Controller, and it releases it with admin down. Having these
> ops would also help with suspend/resume. Suspend does not change the
> admin up/down status, but the host clearly needs to hand over control
> of the media to the Network Controller, and take it back again on
> resume.

Yes, this more-or-less describes the current setup in fbnic. The only
piece that is probably missing would be the heartbeat we maintain so
that the NIC doesn't revoke access due to the OS/driver potentially
being hung. The other thing involved in this that you didn't mention
is that the MC is also managing the Rx filter configuration. So when
we take ownership it is both the Rx Filters and MAC/PCS/PHY that we
are taking ownership of.

The current pattern in fbnic is for us to do most of this on the tail
end of __fbnic_open and unwind it near the start of fbnic_stop.
Essentially the pattern is xmit_ownership, init heartbeat, init PTP,
start phylink, configure Rx filters. In the case of close it is the
reverse with us tearing down the filters, disabling phylink, disabling
PTP, and then releasing ownership.

> Also, if we have these ops, we know that admin down/suspend does not
> mean media down. The presence of these ops triggers different state
> transitions in the phylink state machine so that it simply hands off
> control of the media, but otherwise leaves it alone.
>
> With this in place, i think we can avoid all the unbalanced state?

As I understand it right now the main issue is that Phylink assumes
that it has to take the link down in order to perform a major
configuration in phylink_start/phylink_resume. However in the case of
fbnic what I do a check in mac_prepare to see if I even really need to
take the link down or not to support the requested mode. If not then I
skip the destructive settings update and tweak a few minor things that
can be updated without taking the link down. So I think the change I
am making is that mac_prepare before assumed it was just disabling
things from the SerDes down, and I think I pulled that up to the MAC
layer which may be a departure from previous expectations.

> What is potentially more interesting is when phylink takes control. Do
> we have enough information about the system to say its current
> configuration is the wanted configuration? Or are we forced to do a
> ground up reconfiguration, which will include a media down/up? I had a
> quick scan of the document and i did not find anything which says the
> host is not allowed/is allowed to do disruptive things, but the text
> quoted above says 'The actual link settings are controlled by the host
> NC driver when it is operational'. Controlling the link settings is a
> disruptive operation, so the management controller needs to be
> tolerant to such changes.

In the case of fbnic we have a test we can do to verify we are linked
at the correct fec/modulation/lane width via a check of signal outputs
from the PCS. Basically we have one master bit that says link is
present, but then we can inspect the individual values on a per
FEC/lane/modulation basis. So we can do a link check for a specific
interface type and verify if the link is up and configured for the
settings we want.

Historically speaking, drivers in the past did disruptive things while
the BMC was up. That is one of the things that caught me off-guard
with Meta's requirements. I had been used to dealing with 1G BMC on
most enterprise hardware and in the case of those we end up bouncing
the link state when the link is configured. I had gotten quite used to
seeing that on igb some time ago as I have had a 1G system with igb
and a BMC for years and got used to the SoL console stalling as the
igb reconfigured the link.

The requirement that the BMC not lose link comes more out of the
multi-host setups that have been in place in the data center
environment for the last decade or so where there was only one link
but multiple systems all sharing that link, including the BMC. So it
is not strictly a BMC requirement, but more of a multi-host
requirement.

> So, can we ignore the weeds for the moment, and think about the big
> picture?

So big picture wise we really have 2 issues:
1. The BMC handling doesn't currently exist, so we need to extend
handling/hand-off for link up before we start, and link up after we
stop.
2. Expectations for our 25G+ interfaces to behave like multi-host NICs
that are sharing a link via firmware. Specifically that
loading/unloading the driver or ifconfig up/down on the host interface
should not cause the link to bounce and/or drop packets for any other
connections, which in this case includes the BMC.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-20 18:18   ` Alexander Duyck
@ 2025-04-20 21:58     ` Andrew Lunn
  2025-04-21 15:51       ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-20 21:58 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni

> >   The actual link settings are controlled by the host NC driver when
> >   it is operational. When the host NC driver is operational, link
> >   settings specified by the MC using the Set Link command may be
> >   overwritten by the host NC driver. The link settings are not
> >   restored by the NC if the host NC driver becomes non
> >   operational.
> >
> > There is a very clear indication that the host is in control, or the
> > host is not in control. So one obvious question to me is, should
> > phylink have ops into the MAC driver to say it is taking over control,
> > and relinquishing control? The Linux model is that when the interface
> > is admin down, you can use ethtool to preconfigure things, but they
> > don't take affect until the link is admin up. So with admin down, we
> > have a host NC driver, but it is not operational, hence the Network
> > Controller is in control of the link at the Management Controllers
> > bequest. It is only with admin up that phylink takes control of the
> > Network Controller, and it releases it with admin down. Having these
> > ops would also help with suspend/resume. Suspend does not change the
> > admin up/down status, but the host clearly needs to hand over control
> > of the media to the Network Controller, and take it back again on
> > resume.
> 
> Yes, this more-or-less describes the current setup in fbnic. The only
> piece that is probably missing would be the heartbeat we maintain so
> that the NIC doesn't revoke access due to the OS/driver potentially
> being hung.

That probably goes against the last sentence i quoted above. I do
however understand why you would want it. Can the host driver know if
the Network Controller has taken back control? Or does the heartbeat
also act as a watchdog, the host does not need to care, it is about to
experience a BMC induced reboot?

> The other thing involved in this that you didn't mention
> is that the MC is also managing the Rx filter configuration. So when
> we take ownership it is both the Rx Filters and MAC/PCS/PHY that we
> are taking ownership of.

That does not seem consistent with the standard. The Set Link command
i quoted above makes it clear that when the host driver is active, it
is in control of the media. However the Set VLAN Filter command,
Enable VLAN command, Set MAC Address command, Enable Broadcast Filter
command, say nothing about differences when the Host driver is
operational or not. It just seems to assume the Management Controller
and the host share the resources, and try not to stomp over each
other. Does fbnic not follow the standard in this respect? However,
from a phylink perspective, i don't think this matters, phylink is not
involved with any of this.

> The current pattern in fbnic is for us to do most of this on the tail
> end of __fbnic_open and unwind it near the start of fbnic_stop.
> Essentially the pattern is xmit_ownership, init heartbeat, init PTP,
> start phylink, configure Rx filters. In the case of close it is the
> reverse with us tearing down the filters, disabling phylink, disabling
> PTP, and then releasing ownership.
> 
> > Also, if we have these ops, we know that admin down/suspend does not
> > mean media down. The presence of these ops triggers different state
> > transitions in the phylink state machine so that it simply hands off
> > control of the media, but otherwise leaves it alone.
> >
> > With this in place, i think we can avoid all the unbalanced state?
> 
> As I understand it right now the main issue is that Phylink assumes
> that it has to take the link down in order to perform a major
> configuration in phylink_start/phylink_resume.

Well, as i said, my reading of the standard is that the host can make
disruptive media changes, so you have to be able to live with
disruptive media changes. If you have to live with it, the path of
lease resistance is just to accept it.

> The requirement that the BMC not lose link comes more out of the
> multi-host setups that have been in place in the data center
> environment for the last decade or so where there was only one link
> but multiple systems all sharing that link, including the BMC. So it
> is not strictly a BMC requirement, but more of a multi-host
> requirement.

Is this actually standardised somewhere? I see there is a draft of an
update to NC-SI Specification, but i don't think the section about
controlling the link has changed. Also, the standard talks about how
you connect one Management Controller to multiple Network
Controllers. There is nothing about multiple Management Controllers
connected to one Network Controller. Or i'm i missing something, like
one Management Controller is controlling all the host connected to the
Network Controller?

> > So, can we ignore the weeds for the moment, and think about the big
> > picture?
> 
> So big picture wise we really have 2 issues:
> 1. The BMC handling doesn't currently exist, so we need to extend
> handling/hand-off for link up before we start, and link up after we
> stop.

Agreed, and that fits with DSP0222.

> 2. Expectations for our 25G+ interfaces to behave like multi-host NICs
> that are sharing a link via firmware. Specifically that
> loading/unloading the driver or ifconfig up/down on the host interface
> should not cause the link to bounce and/or drop packets for any other
> connections, which in this case includes the BMC.

For this, it would be nice to point to some standard which describes
this, so we have a generic, vendor agnostic, description of how this
is supposed to work.

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-20 21:58     ` Andrew Lunn
@ 2025-04-21 15:51       ` Alexander Duyck
  2025-04-21 16:50         ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-21 15:51 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni

On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > >   The actual link settings are controlled by the host NC driver when
> > >   it is operational. When the host NC driver is operational, link
> > >   settings specified by the MC using the Set Link command may be
> > >   overwritten by the host NC driver. The link settings are not
> > >   restored by the NC if the host NC driver becomes non
> > >   operational.
> > >
> > > There is a very clear indication that the host is in control, or the
> > > host is not in control. So one obvious question to me is, should
> > > phylink have ops into the MAC driver to say it is taking over control,
> > > and relinquishing control? The Linux model is that when the interface
> > > is admin down, you can use ethtool to preconfigure things, but they
> > > don't take affect until the link is admin up. So with admin down, we
> > > have a host NC driver, but it is not operational, hence the Network
> > > Controller is in control of the link at the Management Controllers
> > > bequest. It is only with admin up that phylink takes control of the
> > > Network Controller, and it releases it with admin down. Having these
> > > ops would also help with suspend/resume. Suspend does not change the
> > > admin up/down status, but the host clearly needs to hand over control
> > > of the media to the Network Controller, and take it back again on
> > > resume.
> >
> > Yes, this more-or-less describes the current setup in fbnic. The only
> > piece that is probably missing would be the heartbeat we maintain so
> > that the NIC doesn't revoke access due to the OS/driver potentially
> > being hung.
>
> That probably goes against the last sentence i quoted above. I do
> however understand why you would want it. Can the host driver know if
> the Network Controller has taken back control? Or does the heartbeat
> also act as a watchdog, the host does not need to care, it is about to
> experience a BMC induced reboot?

The heartbeat acts as a two way watchdog of sorts. Basically if the
firmware doesn't receive one every 2 minutes (was 10 seconds, but
there are BIOS bugs that cause things to literally lock up for over 90
seconds in early UEFI boot that we just had to work around), it will
evict the host and reject all further heartbeat requests. Likewise if
the FW receives a heartbeat without the host first claiming ownership
it will send back a rejection. If the host sends a heartbeat and the
FW rejects it or we don't receive a response, then we know we have
somehow lost ownership and we in turn will back off and reinitialize
ourselves. Currently the two cases we have seen this get triggered is
either due to a OS/BIOS CPU lockup and/or hang, or due to the FW
crashing and rebooting.

> > The other thing involved in this that you didn't mention
> > is that the MC is also managing the Rx filter configuration. So when
> > we take ownership it is both the Rx Filters and MAC/PCS/PHY that we
> > are taking ownership of.
>
> That does not seem consistent with the standard. The Set Link command
> i quoted above makes it clear that when the host driver is active, it
> is in control of the media. However the Set VLAN Filter command,
> Enable VLAN command, Set MAC Address command, Enable Broadcast Filter
> command, say nothing about differences when the Host driver is
> operational or not. It just seems to assume the Management Controller
> and the host share the resources, and try not to stomp over each
> other. Does fbnic not follow the standard in this respect? However,
> from a phylink perspective, i don't think this matters, phylink is not
> involved with any of this.

Yes, there are going to be deviations from the standard and for me it
is kind of a blurred line as I haven't done the 1G stuff in a while so
I keep defaulting to multi-host NIC thinking. The host driver is
managing the Rx filter config and the MAC/PCS directly and indirectly
ordering the FW to configure the PHY, at least until we can implement
the code to allow direct access in the 2 host mode. The FW effectively
becomes a relay between the host and the external entities such as the
SFP, BMC, GPIO, etc. On 1G there were checks that the host driver had
to perform to make sure not to touch the link but the Rx filter config
was more inline in front of the host filtering rather than
incorporated as a part of it as it is in our setup.

What we did in fbnic was to reserve the first 4 MAC TCAM addresses for
the BMC. One complication we ran into though is that by default the
BMC will run in multicast promiscuous mode. So we had to add logic to
allow the host to reorder things so that the multicast promisc filter
is at the end of the multicast address list routed only to BMC, and
the BMC gets included as a destination for all the multicast filters
added by the host. In addition we have to notify the FW of our MAC
config so that it can direct BMC traffic to the host as needed.

> > The current pattern in fbnic is for us to do most of this on the tail
> > end of __fbnic_open and unwind it near the start of fbnic_stop.
> > Essentially the pattern is xmit_ownership, init heartbeat, init PTP,
> > start phylink, configure Rx filters. In the case of close it is the
> > reverse with us tearing down the filters, disabling phylink, disabling
> > PTP, and then releasing ownership.
> >
> > > Also, if we have these ops, we know that admin down/suspend does not
> > > mean media down. The presence of these ops triggers different state
> > > transitions in the phylink state machine so that it simply hands off
> > > control of the media, but otherwise leaves it alone.
> > >
> > > With this in place, i think we can avoid all the unbalanced state?
> >
> > As I understand it right now the main issue is that Phylink assumes
> > that it has to take the link down in order to perform a major
> > configuration in phylink_start/phylink_resume.
>
> Well, as i said, my reading of the standard is that the host can make
> disruptive media changes, so you have to be able to live with
> disruptive media changes. If you have to live with it, the path of
> lease resistance is just to accept it.

The problem is that precedent has been set by the other multi-host NIC
vendors that didn't go w/ phylink. The argument will be that we either
move the MAC/PCS/PHY to firmware and make it work like the other
vendors or we are essentially DOA.

> > The requirement that the BMC not lose link comes more out of the
> > multi-host setups that have been in place in the data center
> > environment for the last decade or so where there was only one link
> > but multiple systems all sharing that link, including the BMC. So it
> > is not strictly a BMC requirement, but more of a multi-host
> > requirement.
>
> Is this actually standardised somewhere? I see there is a draft of an
> update to NC-SI Specification, but i don't think the section about
> controlling the link has changed. Also, the standard talks about how
> you connect one Management Controller to multiple Network
> Controllers. There is nothing about multiple Management Controllers
> connected to one Network Controller. Or i'm i missing something, like
> one Management Controller is controlling all the host connected to the
> Network Controller?

As far as a standard I can't say if it is. By and large our
requirements have been coming out of the fact that Broadcom or
Nvidia/Mellanox did it this way so we are expected to do it the same
way. So essentially I am dealing with project/platform requirements
more than a standard.

The issue isn't that we have multiple management controllers, it is
that we have setups where one physical MAC/PCS/PHY is shared between 4
PCIe connections to 4 seperate hosts. This is one of the reasons why
so many NICs now have the proprietary FW blob you have to talk to in
order to configure anything. This is also what we were trying to get
away from as the shared resources have a bad habit of creating other
issues such as noisy neighbor problems.

> > > So, can we ignore the weeds for the moment, and think about the big
> > > picture?
> >
> > So big picture wise we really have 2 issues:
> > 1. The BMC handling doesn't currently exist, so we need to extend
> > handling/hand-off for link up before we start, and link up after we
> > stop.
>
> Agreed, and that fits with DSP0222.
>
> > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs
> > that are sharing a link via firmware. Specifically that
> > loading/unloading the driver or ifconfig up/down on the host interface
> > should not cause the link to bounce and/or drop packets for any other
> > connections, which in this case includes the BMC.
>
> For this, it would be nice to point to some standard which describes
> this, so we have a generic, vendor agnostic, description of how this
> is supposed to work.
>
>         Andrew

The problem here is this is more-or-less a bit of a "wild west" in
terms of the spec setup. From what I can tell OCP 3.0 defines how to
set up the PCIe bifurcation but doesn't explain what the expected
behavior is for the shared ports. One thing we might look into would
be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB
(Virtual Ethernet Bridging) as that wouldn't be too far off from what
inspired most of the logic in the hardware. Essentially the only
difference is that instead of supporting VFs most of these NICs are
supporting multiple PFs.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-21 15:51       ` Alexander Duyck
@ 2025-04-21 16:50         ` Alexander Duyck
  2025-04-22  1:21           ` Jakub Kicinski
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-21 16:50 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni

On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >

...

> >
> > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs
> > > that are sharing a link via firmware. Specifically that
> > > loading/unloading the driver or ifconfig up/down on the host interface
> > > should not cause the link to bounce and/or drop packets for any other
> > > connections, which in this case includes the BMC.
> >
> > For this, it would be nice to point to some standard which describes
> > this, so we have a generic, vendor agnostic, description of how this
> > is supposed to work.
> >
> >         Andrew
>
> The problem here is this is more-or-less a bit of a "wild west" in
> terms of the spec setup. From what I can tell OCP 3.0 defines how to
> set up the PCIe bifurcation but doesn't explain what the expected
> behavior is for the shared ports. One thing we might look into would
> be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB
> (Virtual Ethernet Bridging) as that wouldn't be too far off from what
> inspired most of the logic in the hardware. Essentially the only
> difference is that instead of supporting VFs most of these NICs are
> supporting multiple PFs.

So looking at 802.1Q-2022 section 40 I wonder if we don't need to
essentially define ourselves as an edge relay as our setup is pretty
close to what is depicted in figure 40-1. In our case an S-channel
essentially represents 2 SerDes lanes on an QSFP cable, with the
switch playing the role of the EVB bridge.

Anyway I think that is probably the spec we need to dig into if we are
looking for how the link is being shared and such. I'll try to do some
more reading myself to get caught up on all this as the last time I
had been reading through this it was called VEB instead of EVB.. :-/

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-21 16:50         ` Alexander Duyck
@ 2025-04-22  1:21           ` Jakub Kicinski
  2025-04-22 13:49             ` Andrew Lunn
  0 siblings, 1 reply; 44+ messages in thread
From: Jakub Kicinski @ 2025-04-22  1:21 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Andrew Lunn, netdev, linux, hkallweit1, davem, pabeni

On Mon, 21 Apr 2025 09:50:25 -0700 Alexander Duyck wrote:
> On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck wrote:
> > On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote:  
> > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs
> > > > that are sharing a link via firmware. Specifically that
> > > > loading/unloading the driver or ifconfig up/down on the host interface
> > > > should not cause the link to bounce and/or drop packets for any other
> > > > connections, which in this case includes the BMC.  
> > >
> > > For this, it would be nice to point to some standard which describes
> > > this, so we have a generic, vendor agnostic, description of how this
> > > is supposed to work.
> >
> > The problem here is this is more-or-less a bit of a "wild west" in
> > terms of the spec setup. From what I can tell OCP 3.0 defines how to
> > set up the PCIe bifurcation but doesn't explain what the expected
> > behavior is for the shared ports. One thing we might look into would
> > be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB
> > (Virtual Ethernet Bridging) as that wouldn't be too far off from what
> > inspired most of the logic in the hardware. Essentially the only
> > difference is that instead of supporting VFs most of these NICs are
> > supporting multiple PFs.  
> 
> So looking at 802.1Q-2022 section 40 I wonder if we don't need to
> essentially define ourselves as an edge relay as our setup is pretty
> close to what is depicted in figure 40-1. In our case an S-channel
> essentially represents 2 SerDes lanes on an QSFP cable, with the
> switch playing the role of the EVB bridge.
> 
> Anyway I think that is probably the spec we need to dig into if we are
> looking for how the link is being shared and such. I'll try to do some
> more reading myself to get caught up on all this as the last time I
> had been reading through this it was called VEB instead of EVB.. :-/

Interesting. My gut feeling is that even if we make Linux and the NIC
behave nicely according to 802.1Q, we'll also need to make some changes
on the BMC side. And there we may encounter pushback as the status quo
works quite trivially for devices with PHY control in FW.

To paraphrase what was already said upthread - in my mind the question
is whether we want to support "picking up" an already running link with
phylink. Whether the use case is BMC or avoiding link flap when picking
up from uboot is of secondary importance.

Andrew, the standard could help us build a common higher layer, if
needed but I'm not sure how much it impacts the phylink side. The only
question for phylink is 1 bit of policy whether link can be "recovered"
and left up on close, no?

BTW Saeed posted a devlink param to "keep link up" recently:
https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/
Intel has ethtool priv flags to the same effect, in their 40G and 100G
drivers, but with reverse polarity:
https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag
These are all for this exact use case. In the past Ido added module
power policy, which is the only truly generic configurable, and one we
should probably build upon:
https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy
I'm not sure if this is expected to include PCS or it's just telling
the module to keep the laser on..

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22  1:21           ` Jakub Kicinski
@ 2025-04-22 13:49             ` Andrew Lunn
  2025-04-22 15:28               ` Jakub Kicinski
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-22 13:49 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni

On Mon, Apr 21, 2025 at 06:21:43PM -0700, Jakub Kicinski wrote:
> On Mon, 21 Apr 2025 09:50:25 -0700 Alexander Duyck wrote:
> > On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck wrote:
> > > On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote:  
> > > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs
> > > > > that are sharing a link via firmware. Specifically that
> > > > > loading/unloading the driver or ifconfig up/down on the host interface
> > > > > should not cause the link to bounce and/or drop packets for any other
> > > > > connections, which in this case includes the BMC.  
> > > >
> > > > For this, it would be nice to point to some standard which describes
> > > > this, so we have a generic, vendor agnostic, description of how this
> > > > is supposed to work.
> > >
> > > The problem here is this is more-or-less a bit of a "wild west" in
> > > terms of the spec setup. From what I can tell OCP 3.0 defines how to
> > > set up the PCIe bifurcation but doesn't explain what the expected
> > > behavior is for the shared ports. One thing we might look into would
> > > be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB
> > > (Virtual Ethernet Bridging) as that wouldn't be too far off from what
> > > inspired most of the logic in the hardware. Essentially the only
> > > difference is that instead of supporting VFs most of these NICs are
> > > supporting multiple PFs.  
> > 
> > So looking at 802.1Q-2022 section 40 I wonder if we don't need to
> > essentially define ourselves as an edge relay as our setup is pretty
> > close to what is depicted in figure 40-1. In our case an S-channel
> > essentially represents 2 SerDes lanes on an QSFP cable, with the
> > switch playing the role of the EVB bridge.
> > 
> > Anyway I think that is probably the spec we need to dig into if we are
> > looking for how the link is being shared and such. I'll try to do some
> > more reading myself to get caught up on all this as the last time I
> > had been reading through this it was called VEB instead of EVB.. :-/
> 
> Interesting. My gut feeling is that even if we make Linux and the NIC
> behave nicely according to 802.1Q, we'll also need to make some changes
> on the BMC side. And there we may encounter pushback as the status quo
> works quite trivially for devices with PHY control in FW.

As i see it, we have two things stacked on top of each other. We have
what is standardised for NC-SI, DSP0222. That gives a basis, and then
there is vendor stuff on top for multi-host, which is more strict.

Linux should have generic support for DSP0222. I've seen vendors hack
around with WoL to make it work. It would be nice to replace that hack
with a method to tell phylink to enable support for DSP0222. A
standardised method, since as additional ops, or a flag. phylink can
then separate admin down from carrier down when needed.

Then we have vendor stuff on top. 

> BTW Saeed posted a devlink param to "keep link up" recently:
> https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/
> Intel has ethtool priv flags to the same effect, in their 40G and 100G
> drivers, but with reverse polarity:
> https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag
> These are all for this exact use case. In the past Ido added module
> power policy, which is the only truly generic configurable, and one we
> should probably build upon:
> https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy
> I'm not sure if this is expected to include PCS or it's just telling
> the module to keep the laser on..

Ideally, we want to define something vendor agnostic. And i would
prefer we talk about the high level concept, sharing the NIC with a
BMC and multiple hosts, rather than the low level, keep link up.

The whole concept of a multi-host NIC is new to me. So i at least need
to get up to speed with it. I've no idea if Russell has come across it
before, since it is not a SoC concept.

I don't really want to agree to anything until i do have that concept
understood. That is part of why i asked about a standard. It is a
dense document answering a lot of questions. Without a standard, i
need to ask a lot of questions.

I also think there is a lot more to it than just keeping the laser
on. For NC-SI, DSP0222 that probably does cover a big chunk of the
problem, but for multi-host, my gut is telling me there is more to it.

Let me do some research and thinking about multi-host.

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 13:49             ` Andrew Lunn
@ 2025-04-22 15:28               ` Jakub Kicinski
  2025-04-22 16:49                 ` Andrew Lunn
  0 siblings, 1 reply; 44+ messages in thread
From: Jakub Kicinski @ 2025-04-22 15:28 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni

On Tue, 22 Apr 2025 15:49:02 +0200 Andrew Lunn wrote:
> > Interesting. My gut feeling is that even if we make Linux and the NIC
> > behave nicely according to 802.1Q, we'll also need to make some changes
> > on the BMC side. And there we may encounter pushback as the status quo
> > works quite trivially for devices with PHY control in FW.  
> 
> As i see it, we have two things stacked on top of each other. We have
> what is standardised for NC-SI, DSP0222. That gives a basis, and then
> there is vendor stuff on top for multi-host, which is more strict.

The multi-host piece I'd completely ignore. Real multi-host NICs
cannot do without firmware. One component needs to be in control
of global resources, and it can't be a host that may get randomly
rebooted.

> Linux should have generic support for DSP0222. I've seen vendors hack
> around with WoL to make it work. It would be nice to replace that hack
> with a method to tell phylink to enable support for DSP0222. A
> standardised method, since as additional ops, or a flag. phylink can
> then separate admin down from carrier down when needed.

I'm no DSP0222 expert but in my experience it is rather limited in
defining its interactions with the host. It is also irrelevant what
exact tap / agent is asking us to keep the link up. The point is
_some_ agent is running either on the NIC or leeching off the NIC,
and is requesting the link to stay up. It could even be a VF/VM on
the same host.

FWIW in the NFP we just had a "forced" bit which told the host not
to touch physical link status:
https://elixir.bootlin.com/linux/v6.15-rc1/source/drivers/net/ethernet/netronome/nfp/nfp_port.h#L62

A possible extension would be to have a bitmap of possible agent types,
so that the user has more info about what agent is keeping the link up.
But at the end of the day all the link mgmt cares about is whether such
bitmap is empty or not.

> Then we have vendor stuff on top. 
> 
> > BTW Saeed posted a devlink param to "keep link up" recently:
> > https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/
> > Intel has ethtool priv flags to the same effect, in their 40G and 100G
> > drivers, but with reverse polarity:
> > https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag
> > These are all for this exact use case. In the past Ido added module
> > power policy, which is the only truly generic configurable, and one we
> > should probably build upon:
> > https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy
> > I'm not sure if this is expected to include PCS or it's just telling
> > the module to keep the laser on..  
> 
> Ideally, we want to define something vendor agnostic. And i would
> prefer we talk about the high level concept, sharing the NIC with a
> BMC and multiple hosts, rather than the low level, keep link up.
> 
> The whole concept of a multi-host NIC is new to me. So i at least need
> to get up to speed with it. I've no idea if Russell has come across it
> before, since it is not a SoC concept.
> 
> I don't really want to agree to anything until i do have that concept
> understood. That is part of why i asked about a standard. It is a
> dense document answering a lot of questions. Without a standard, i
> need to ask a lot of questions.

Don't hesitate to ask the questions, your last reply contains no
question marks :)

> I also think there is a lot more to it than just keeping the laser
> on. For NC-SI, DSP0222 that probably does cover a big chunk of the
> problem, but for multi-host, my gut is telling me there is more to it.
> 
> Let me do some research and thinking about multi-host.

The only case that could make multi-host relevant here is if each host
had its own port but they shared some resources like clocks. Making it
a not-really multi-host. I know there are NICs on the market which have
similar limitations on single host. If you configure port breakout all
sub-ports need to run at the same rate. Programming of shared resources
is orthogonal in my mind to a subservient agent asking us not to link
down.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 15:28               ` Jakub Kicinski
@ 2025-04-22 16:49                 ` Andrew Lunn
  2025-04-22 17:30                   ` Russell King (Oracle)
  2025-04-22 21:29                   ` Alexander Duyck
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Lunn @ 2025-04-22 16:49 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni

> > The whole concept of a multi-host NIC is new to me. So i at least need
> > to get up to speed with it. I've no idea if Russell has come across it
> > before, since it is not a SoC concept.
> > 
> > I don't really want to agree to anything until i do have that concept
> > understood. That is part of why i asked about a standard. It is a
> > dense document answering a lot of questions. Without a standard, i
> > need to ask a lot of questions.
> 
> Don't hesitate to ask the questions, your last reply contains no
> question marks :)

O.K. Lets start with the basics. I assume the NIC has a PCIe connector
something like a 4.0 x4? Each of the four hosts in the system
contribute one PCIe lane. So from the host side it looks like a 4.0 x1
NIC?

There are not 4 host MACs connected to a 5 port switch. Rather, each
host gets its own subset of queues, DMA engines etc, for one shared
MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
interface.

You must have a minimum of firmware on the NIC to get the MAC into a
state the BMC can inject/receive frames, configure the PCS, gpios to
the SFP, enough I2C to figure out what the module is, what quirks are
needed etc.

NC-SI, with Linux controlling the hardware, implies you need to be
able to hand off control of the GPIOs, I2C, PCS to Linux. But with
multi-host, it makes no sense for all 4 hosts to be trying to control
the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
likely to me, one host gets put in change of everything below the
queues to the MAC. The others just know there is link, nothing more.

This actually circles back to the discussion about fixed-link. The one
host in control of all the lower hardware has the complete
picture. The other 3 maybe just need a fixed link. They don't get to
see what is going on below the MAC, and as a result there is no
ethtool support to change anything, and so no conflicting
configuration? And since they cannot control any of that, they cannot
put the link down. So 3/4 of the problem is solved.

phylink is however not expecting that when phylink_start() is called,
it might or might not have to drive the hardware depending on if it
wins an election to control the hardware. And if it losses, it needs
to ditch all its configuration for a PCS, SPF, etc and swap to a
fixed-link. Do we want to teach phylink all this, or put all phylink
stuff into open(), rather than spread across probe() and open(). Being
in open(), you basically construct a different phylink configuration
depending on if you win the election or not.

Is one host in the position to control the complete media
configuration? Could you split the QSFP into four, each host gets its
own channel, and it gets to choose how to use that channel, different
FEC schemes, bit rates?

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 16:49                 ` Andrew Lunn
@ 2025-04-22 17:30                   ` Russell King (Oracle)
  2025-04-22 18:13                     ` Andrew Lunn
  2025-04-22 21:29                   ` Alexander Duyck
  1 sibling, 1 reply; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-22 17:30 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem,
	pabeni

On Tue, Apr 22, 2025 at 06:49:54PM +0200, Andrew Lunn wrote:
> > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > to get up to speed with it. I've no idea if Russell has come across it
> > > before, since it is not a SoC concept.
> > > 
> > > I don't really want to agree to anything until i do have that concept
> > > understood. That is part of why i asked about a standard. It is a
> > > dense document answering a lot of questions. Without a standard, i
> > > need to ask a lot of questions.
> > 
> > Don't hesitate to ask the questions, your last reply contains no
> > question marks :)
> 
> O.K. Lets start with the basics. I assume the NIC has a PCIe connector
> something like a 4.0 x4? Each of the four hosts in the system
> contribute one PCIe lane. So from the host side it looks like a 4.0 x1
> NIC?
> 
> There are not 4 host MACs connected to a 5 port switch. Rather, each
> host gets its own subset of queues, DMA engines etc, for one shared
> MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> interface.
> 
> You must have a minimum of firmware on the NIC to get the MAC into a
> state the BMC can inject/receive frames, configure the PCS, gpios to
> the SFP, enough I2C to figure out what the module is, what quirks are
> needed etc.

This all makes sense, but at this point, I have to ask something that
seems to be fundamental to me:

  Should any of the hosts accessing the NIC through those PCIe x1
  interfaces have any knowledge or control of anything behind "their"
  view of the MAC?

I would say no, they should not, because if they do, they can interfere
with other hosts. Surely only the BMC should have permission to access
the layers of hardware behind the MAC?

What should a host know about the setup? Maybe the speed of their
network connection through the MAC. I state it that way rather than
"the speed of the media" because if there is some control over the
traffic from each "host" then the media speed is irrelevant.

> NC-SI, with Linux controlling the hardware, implies you need to be
> able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> multi-host, it makes no sense for all 4 hosts to be trying to control
> the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> likely to me, one host gets put in change of everything below the
> queues to the MAC. The others just know there is link, nothing more.

Ouch. Yes - if we have four independent hosts trying to access the same
I2C hardware as another host on the same hardware, then that sounds
like a recipe for a trainwreck.

> This actually circles back to the discussion about fixed-link. The one
> host in control of all the lower hardware has the complete
> picture. The other 3 maybe just need a fixed link. They don't get to
> see what is going on below the MAC, and as a result there is no
> ethtool support to change anything, and so no conflicting
> configuration? And since they cannot control any of that, they cannot
> put the link down. So 3/4 of the problem is solved.

Should one host have control, or should the BMC have control? I don't
actually know what you're talking about w.r.t. DSP0222 or whatever it
was, nor NC-SI - I don't have these documents.

> phylink is however not expecting that when phylink_start() is called,
> it might or might not have to drive the hardware depending on if it
> wins an election to control the hardware. And if it losses, it needs
> to ditch all its configuration for a PCS, SPF, etc and swap to a
> fixed-link. Do we want to teach phylink all this, or put all phylink
> stuff into open(), rather than spread across probe() and open(). Being
> in open(), you basically construct a different phylink configuration
> depending on if you win the election or not.

That sounds very complicated and all very new stuff.

> Is one host in the position to control the complete media
> configuration? Could you split the QSFP into four, each host gets its
> own channel, and it gets to choose how to use that channel, different
> FEC schemes, bit rates?

Yes, each channel in a QSFPs have separate LOS status bits accessible
over I2C. It's been a while since I looked at this, but I seem to
remember there aren't hardware pins for LOS, TX_DISABLE etc - that's
all over I2C.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 17:30                   ` Russell King (Oracle)
@ 2025-04-22 18:13                     ` Andrew Lunn
  2025-04-22 18:50                       ` Russell King (Oracle)
  2025-04-22 23:51                       ` Jakub Kicinski
  0 siblings, 2 replies; 44+ messages in thread
From: Andrew Lunn @ 2025-04-22 18:13 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem,
	pabeni

> Should one host have control, or should the BMC have control? I don't
> actually know what you're talking about w.r.t. DSP0222 or whatever it
> was, nor NC-SI - I don't have these documents.

I gave a reference to it a few email back in the conversation:

https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf

Linux has an implementation of the protocol in net/nsci

      Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 18:13                     ` Andrew Lunn
@ 2025-04-22 18:50                       ` Russell King (Oracle)
  2025-04-22 23:51                       ` Jakub Kicinski
  1 sibling, 0 replies; 44+ messages in thread
From: Russell King (Oracle) @ 2025-04-22 18:50 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem,
	pabeni

On Tue, Apr 22, 2025 at 08:13:43PM +0200, Andrew Lunn wrote:
> > Should one host have control, or should the BMC have control? I don't
> > actually know what you're talking about w.r.t. DSP0222 or whatever it
> > was, nor NC-SI - I don't have these documents.
> 
> I gave a reference to it a few email back in the conversation:
> 
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf
> 
> Linux has an implementation of the protocol in net/nsci

Hmm. I don't think I have bandwidth to read that spec any time soon,
sorry.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 18:13                     ` Andrew Lunn
  2025-04-22 18:50                       ` Russell King (Oracle)
@ 2025-04-22 23:51                       ` Jakub Kicinski
  1 sibling, 0 replies; 44+ messages in thread
From: Jakub Kicinski @ 2025-04-22 23:51 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King (Oracle), Alexander Duyck, netdev, hkallweit1, davem,
	pabeni

On Tue, 22 Apr 2025 20:13:43 +0200 Andrew Lunn wrote:
> > Should one host have control, or should the BMC have control? I don't
> > actually know what you're talking about w.r.t. DSP0222 or whatever it
> > was, nor NC-SI - I don't have these documents.  
> 
> I gave a reference to it a few email back in the conversation:
> 
> https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf
> 
> Linux has an implementation of the protocol in net/nsci

But to be clear the implementation is for when Linux runs as the BMC.
It does not interact in any way with the host AFAIU.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 16:49                 ` Andrew Lunn
  2025-04-22 17:30                   ` Russell King (Oracle)
@ 2025-04-22 21:29                   ` Alexander Duyck
  2025-04-22 22:26                     ` Andrew Lunn
  1 sibling, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-22 21:29 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > to get up to speed with it. I've no idea if Russell has come across it
> > > before, since it is not a SoC concept.
> > >
> > > I don't really want to agree to anything until i do have that concept
> > > understood. That is part of why i asked about a standard. It is a
> > > dense document answering a lot of questions. Without a standard, i
> > > need to ask a lot of questions.
> >
> > Don't hesitate to ask the questions, your last reply contains no
> > question marks :)
>
> O.K. Lets start with the basics. I assume the NIC has a PCIe connector
> something like a 4.0 x4? Each of the four hosts in the system
> contribute one PCIe lane. So from the host side it looks like a 4.0 x1
> NIC?

More like 5.0 x16 split in to 4 5.0 x4 NICs.

> There are not 4 host MACs connected to a 5 port switch. Rather, each
> host gets its own subset of queues, DMA engines etc, for one shared
> MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> interface.

Yeah, that is the setup so far. Basically we are using one QSFP cable
and slicing it up. So instead of having a 100CR4 connection we might
have 2x50CR2 operating on the same cable, or 4x25CR.

> You must have a minimum of firmware on the NIC to get the MAC into a
> state the BMC can inject/receive frames, configure the PCS, gpios to
> the SFP, enough I2C to figure out what the module is, what quirks are
> needed etc.

The firmware isn't that smart. It isn't reading the QSFP itself to get
that info. It could, but it doesn't. It is essentially hands off as
there isn't any change needed for a direct attach cable. Basically it
is configuring the MAC, PCS, FEC, PMA, PMD with a pre-recorded setting
in the EEPROM for the NIC.

> NC-SI, with Linux controlling the hardware, implies you need to be
> able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> multi-host, it makes no sense for all 4 hosts to be trying to control
> the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> likely to me, one host gets put in change of everything below the
> queues to the MAC. The others just know there is link, nothing more.

Things are a bit simpler than that. With the direct-attach we don't
need to take any action on the SFP. Essentially the I2C and GPIOs are
all shared. As such we can read the QSFP state, but cannot modify it
directly. We aren't taking any actions to write to the I2C other than
bank/page which is handled all as a part of the read call.

> This actually circles back to the discussion about fixed-link. The one
> host in control of all the lower hardware has the complete
> picture. The other 3 maybe just need a fixed link. They don't get to
> see what is going on below the MAC, and as a result there is no
> ethtool support to change anything, and so no conflicting
> configuration? And since they cannot control any of that, they cannot
> put the link down. So 3/4 of the problem is solved.

Yeah, this is why I was headed down that path for a bit. However our
links are independent with the only shared bit being the PMD and the
SFP module. We can essentially configure everything else diffrerent
between the ports from there. So depending on what the cable supports
we can potentially run one lane or two, and in NRZ or PAM4 mode.

So for example one of our standard test items to run is to use a
QSFP-DD loopback plug and essentially cycle through all different port
configurations on all the different ports to make sure we don't have
configuration leaking over from one port to the other as the PMD is
shared between hosts 0 and 1, and hosts 2 and 3 if we have a 4 port
setup. We don't have to have all 4 MAC/PCS/PMA configured the same. We
can have a different config between ports, although in most cases it
will just end up being the same.

> phylink is however not expecting that when phylink_start() is called,
> it might or might not have to drive the hardware depending on if it
> wins an election to control the hardware. And if it losses, it needs
> to ditch all its configuration for a PCS, SPF, etc and swap to a
> fixed-link. Do we want to teach phylink all this, or put all phylink
> stuff into open(), rather than spread across probe() and open(). Being
> in open(), you basically construct a different phylink configuration
> depending on if you win the election or not.

We are getting a bit off into the weeds here. So there isn't any sort
of election. There is still a firmware that is sitting on the shared
bits. So the PMD, I2C to the QSFP, and GPIO from the QSFP are all
controlled via the firmware. To prevent any significant issues for now
we treat the QSFP as read-only from the hosts as we have to go through
the firmware to get access, and the PMD can only be configured via a
message to the FW asking for a specific bitrate/modulation and number
of lanes.

> Is one host in the position to control the complete media
> configuration? Could you split the QSFP into four, each host gets its
> own channel, and it gets to choose how to use that channel, different
> FEC schemes, bit rates?

So one thing to be aware of is that the QSFP can be electrically
separated so that it is one cable, but with either 2 (QSFP+/QSFP28) or
4 (QSFP-DD) separate sets of lanes. The cable defines the limits of
what we can do in terms of modulation and number of lanes, but we
don't have to configure anything directly on it. That is handled
through the PCS/PMA/PMD side of things.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 21:29                   ` Alexander Duyck
@ 2025-04-22 22:26                     ` Andrew Lunn
  2025-04-22 23:06                       ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-22 22:26 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote:
> On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > > to get up to speed with it. I've no idea if Russell has come across it
> > > > before, since it is not a SoC concept.
> > > >
> > > > I don't really want to agree to anything until i do have that concept
> > > > understood. That is part of why i asked about a standard. It is a
> > > > dense document answering a lot of questions. Without a standard, i
> > > > need to ask a lot of questions.
> > >
> > > Don't hesitate to ask the questions, your last reply contains no
> > > question marks :)
> >
> > O.K. Lets start with the basics. I assume the NIC has a PCIe connector
> > something like a 4.0 x4? Each of the four hosts in the system
> > contribute one PCIe lane. So from the host side it looks like a 4.0 x1
> > NIC?
> 
> More like 5.0 x16 split in to 4 5.0 x4 NICs.

O.K. Same thing, different scale.

> > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > host gets its own subset of queues, DMA engines etc, for one shared
> > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > interface.
> 
> Yeah, that is the setup so far. Basically we are using one QSFP cable
> and slicing it up. So instead of having a 100CR4 connection we might
> have 2x50CR2 operating on the same cable, or 4x25CR.

But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?

Or is there always 4 MACs, each MAC has its own queues, and you need
to place frames into the correct queue, and with a 2x50CR2 you also
need to load balance across those two queues?

I guess the queuing does not matter much to phylink, but how do you
represent multiple PCS lanes to phylink? Up until now, one netdev has
had one PCS lane. It now has 1, 2, or 4 lanes. None of the
phylink_pcs_op have a lane indicator.
 
> > NC-SI, with Linux controlling the hardware, implies you need to be
> > able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> > multi-host, it makes no sense for all 4 hosts to be trying to control
> > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> > likely to me, one host gets put in change of everything below the
> > queues to the MAC. The others just know there is link, nothing more.
> 
> Things are a bit simpler than that. With the direct-attach we don't
> need to take any action on the SFP. Essentially the I2C and GPIOs are
> all shared. As such we can read the QSFP state, but cannot modify it
> directly. We aren't taking any actions to write to the I2C other than
> bank/page which is handled all as a part of the read call.

That might work for direct-attach, but what about the general case? We
need to ensure whatever we add supports the general case.

The current SFP code expects a Linux I2C bus. Given how SFPs are
broken, it does 16 bytes reads at the most. When it needs to read more
than 16 bytes, i expect it will set the page once, read it back to
ensure the SFP actually implements the page, and then do multiple I2C
reads to read all the data it wants from that page. I don't see how
this is going to work when the I2C bus is shared.

> > This actually circles back to the discussion about fixed-link. The one
> > host in control of all the lower hardware has the complete
> > picture. The other 3 maybe just need a fixed link. They don't get to
> > see what is going on below the MAC, and as a result there is no
> > ethtool support to change anything, and so no conflicting
> > configuration? And since they cannot control any of that, they cannot
> > put the link down. So 3/4 of the problem is solved.
> 
> Yeah, this is why I was headed down that path for a bit. However our
> links are independent with the only shared bit being the PMD and the
> SFP module.

Yours might be, but what is the general case?

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 22:26                     ` Andrew Lunn
@ 2025-04-22 23:06                       ` Alexander Duyck
  2025-04-23 18:38                         ` Alexander Duyck
  2025-04-24 20:34                         ` Andrew Lunn
  0 siblings, 2 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-22 23:06 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Tue, Apr 22, 2025 at 3:26 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote:
> > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > > > The whole concept of a multi-host NIC is new to me. So i at least need
> > > > > to get up to speed with it. I've no idea if Russell has come across it
> > > > > before, since it is not a SoC concept.
> > > > >
> > > > > I don't really want to agree to anything until i do have that concept
> > > > > understood. That is part of why i asked about a standard. It is a
> > > > > dense document answering a lot of questions. Without a standard, i
> > > > > need to ask a lot of questions.
> > > >
> > > > Don't hesitate to ask the questions, your last reply contains no
> > > > question marks :)
> > >
> > > O.K. Lets start with the basics. I assume the NIC has a PCIe connector
> > > something like a 4.0 x4? Each of the four hosts in the system
> > > contribute one PCIe lane. So from the host side it looks like a 4.0 x1
> > > NIC?
> >
> > More like 5.0 x16 split in to 4 5.0 x4 NICs.
>
> O.K. Same thing, different scale.

Agreed.

> > > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > > host gets its own subset of queues, DMA engines etc, for one shared
> > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > > interface.
> >
> > Yeah, that is the setup so far. Basically we are using one QSFP cable
> > and slicing it up. So instead of having a 100CR4 connection we might
> > have 2x50CR2 operating on the same cable, or 4x25CR.
>
> But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?

Yes. Some confusion here may be that our hardware always has 4
MAC/PCS/PMA setups, one for each host. Depending on the NIC
configuration we may have either 4 hosts or 2 hosts present with 2
disabled. What they end up doing is routing 2 lanes from the QSFP to
one host and the other two to the other. So in the case of the QSFP28
or QSFP+ we can only support 2 hosts, and with the QSFP-DD we can
support 4.

> Or is there always 4 MACs, each MAC has its own queues, and you need
> to place frames into the correct queue, and with a 2x50CR2 you also
> need to load balance across those two queues?

Are you familiar with the concept of QSFP breakout cables? The general
idea is that one end of the cable is a QSFP connection and it will
break out into 4 SFP connections on the other end. That is actually
pretty close to the concept behind our NIC. We essentially have an
internalized breakout where the QSFP connection comes in, but we break
it into either 2 or 4 connections on our end. Our limit is 2 lanes per
host.

I did a quick search and came up with the following link to a Cisco
whitepaper that sort of explains the breakout cable concept. I will
try to see if I can find a spec somewhere that defines how to handle a
breakout cable:
https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/whitepaper-c11-744077.html

> I guess the queuing does not matter much to phylink, but how do you
> represent multiple PCS lanes to phylink? Up until now, one netdev has
> had one PCS lane. It now has 1, 2, or 4 lanes. None of the
> phylink_pcs_op have a lane indicator.

So the PCS isn't really much of a problem. There is only one PCS per
host. Where things get a bit messier is that the PMA/PMD setup is per
lane. So our PCS has vendor registers for setting up the PMA side of
things and we have to set them for 2 devices instead of just one.
Likewise we have to pass a lanes mask to the PMD to tell it which
lanes are being configured for what modulation and which lanes are
disabled.

> > > NC-SI, with Linux controlling the hardware, implies you need to be
> > > able to hand off control of the GPIOs, I2C, PCS to Linux. But with
> > > multi-host, it makes no sense for all 4 hosts to be trying to control
> > > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more
> > > likely to me, one host gets put in change of everything below the
> > > queues to the MAC. The others just know there is link, nothing more.
> >
> > Things are a bit simpler than that. With the direct-attach we don't
> > need to take any action on the SFP. Essentially the I2C and GPIOs are
> > all shared. As such we can read the QSFP state, but cannot modify it
> > directly. We aren't taking any actions to write to the I2C other than
> > bank/page which is handled all as a part of the read call.
>
> That might work for direct-attach, but what about the general case? We
> need to ensure whatever we add supports the general case.

I agree, but at the same time I am just letting you know the
limitations of our hardware setup. There isn't anything to really
control on the QSFP. It is mostly just there to provide the media and
that is about it. No PHY on it to load FW for.

> The current SFP code expects a Linux I2C bus. Given how SFPs are
> broken, it does 16 bytes reads at the most. When it needs to read more
> than 16 bytes, i expect it will set the page once, read it back to
> ensure the SFP actually implements the page, and then do multiple I2C
> reads to read all the data it wants from that page. I don't see how
> this is going to work when the I2C bus is shared.

The general idea is that we have to cache the page and bank in the
driver and pass those as arguments to the firmware when we perform a
read. Basically it will take a lock on the I2C, set the page and bank,
perform the read, and then release the lock. With that all 4 hosts can
read the I2C from the QSFP without causing any side effects.

> > > This actually circles back to the discussion about fixed-link. The one
> > > host in control of all the lower hardware has the complete
> > > picture. The other 3 maybe just need a fixed link. They don't get to
> > > see what is going on below the MAC, and as a result there is no
> > > ethtool support to change anything, and so no conflicting
> > > configuration? And since they cannot control any of that, they cannot
> > > put the link down. So 3/4 of the problem is solved.
> >
> > Yeah, this is why I was headed down that path for a bit. However our
> > links are independent with the only shared bit being the PMD and the
> > SFP module.
>
> Yours might be, but what is the general case?

I will do some digging into the breakout cable path. That seems like
the most likely setup that would be similar and more general.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 23:06                       ` Alexander Duyck
@ 2025-04-23 18:38                         ` Alexander Duyck
  2025-04-24 20:34                         ` Andrew Lunn
  1 sibling, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-23 18:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Tue, Apr 22, 2025 at 4:06 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Tue, Apr 22, 2025 at 3:26 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote:
> > > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > > >

...

> > Or is there always 4 MACs, each MAC has its own queues, and you need
> > to place frames into the correct queue, and with a 2x50CR2 you also
> > need to load balance across those two queues?
>
> Are you familiar with the concept of QSFP breakout cables? The general
> idea is that one end of the cable is a QSFP connection and it will
> break out into 4 SFP connections on the other end. That is actually
> pretty close to the concept behind our NIC. We essentially have an
> internalized breakout where the QSFP connection comes in, but we break
> it into either 2 or 4 connections on our end. Our limit is 2 lanes per
> host.
>
> I did a quick search and came up with the following link to a Cisco
> whitepaper that sort of explains the breakout cable concept. I will
> try to see if I can find a spec somewhere that defines how to handle a
> breakout cable:
> https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/whitepaper-c11-744077.html
>

So I have done some more digging. I'm wondering if the annex 109B,C
and 135A,C,E,G are meant to essentially explain how to hook up a
breakout style connection without mentioning splitting things up or
sharing.

As far as the QSFP cables I was able to find a bit more. Specifically
in SFF-8636 they have a mention of a "Separable module" in Byte 113
bits 6-4 that was added to support breakout cables. Getting into CMIS
(https://www.oiforum.com/wp-content/uploads/OIF-CMIS-05.2.pdf) there
is quite a bit more mentions of partitioning the module to support
multiple "host interface" instances on the same cable, with section
6.2 really getting into the separation of datapaths for "applications"
and how to go about configuring multiple distinct links sharing the
same module.

The only issue is that CMIS doesn't really apply until we get to
QSFP-DD cables, so the QSFP28 and QSFP+ may end up being a bit of a
wild west when it comes to breaking out the connection. That said
though Byte 113 may be enough as it seems like partitioning it is just
a matter of selecting 1/2/4 lanes versus QSFP-DD where we can swap
between different modulation modes and such.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-22 23:06                       ` Alexander Duyck
  2025-04-23 18:38                         ` Alexander Duyck
@ 2025-04-24 20:34                         ` Andrew Lunn
  2025-04-24 23:40                           ` Alexander Duyck
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-24 20:34 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

Sorry for the delay, busy with $DAY_JOB

> > > > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > > > host gets its own subset of queues, DMA engines etc, for one shared
> > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > > > interface.
> > >
> > > Yeah, that is the setup so far. Basically we are using one QSFP cable
> > > and slicing it up. So instead of having a 100CR4 connection we might
> > > have 2x50CR2 operating on the same cable, or 4x25CR.
> >
> > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?
> 
> Yes. Some confusion here may be that our hardware always has 4
> MAC/PCS/PMA setups, one for each host. Depending on the NIC
> configuration we may have either 4 hosts or 2 hosts present with 2
> disabled.

So with 2 hosts, each host has two netdevs? If you were to dedicate
the whole card to one host, you would have 4 netdevs? It is upto
whatever is above to perform load balancing over those?

If you always have 4 MAC/PCS, then the PCS is only ever used with a
single lane? The MAC does not support 100000baseKR4 for example, but
250000baseKR1?

> The general idea is that we have to cache the page and bank in the
> driver and pass those as arguments to the firmware when we perform a
> read. Basically it will take a lock on the I2C, set the page and bank,
> perform the read, and then release the lock. With that all 4 hosts can
> read the I2C from the QSFP without causing any side effects.

I assume your hardware team have not actually implemented I2C, they
have licensed it. Hence there is probably already a driver for it in
drivers/i2c/busses, maybe one of the i2c-designware-? However, you are
not going to use it, you are going to reinvent the wheel so you can
parse the transactions going over it, look for reads and writes to
address 127? Humm, i suppose you could have a virtual I2C driver doing
this stacked on top of the real I2C driver. Is this something other
network drivers are going to need? Should it be somewhere in
drivers/net/phy? The hard bit is how you do the mutex in an agnostic
way. But it looks like hardware spinlocks would work:
https://docs.kernel.org/locking/hwspinlock.html

And actually, it is more complex than caching the page.

  This specification defines functions in Pages 00h-02h. Pages 03-7Fh
  are reserved for future use. Writing the value of a non-supported
  Page shall not be accepted by the transceiver. The Page Select byte
  shall revert to 0 and read / write operations shall be to the
  unpaged A2h memory map.

So i expect the SFP driver to do a write followed by a read to know if
it needs to return EOPNOTSUPP to user space because the SFP does not
implement the page.

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-24 20:34                         ` Andrew Lunn
@ 2025-04-24 23:40                           ` Alexander Duyck
  2025-04-25 13:11                             ` Andrew Lunn
  0 siblings, 1 reply; 44+ messages in thread
From: Alexander Duyck @ 2025-04-24 23:40 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Thu, Apr 24, 2025 at 1:34 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> Sorry for the delay, busy with $DAY_JOB

No problem. I have a number of my own issues I am dealing with here in
terms of code cleanup stuff anyway.

> > > > > There are not 4 host MACs connected to a 5 port switch. Rather, each
> > > > > host gets its own subset of queues, DMA engines etc, for one shared
> > > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C
> > > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like
> > > > > interface.
> > > >
> > > > Yeah, that is the setup so far. Basically we are using one QSFP cable
> > > > and slicing it up. So instead of having a 100CR4 connection we might
> > > > have 2x50CR2 operating on the same cable, or 4x25CR.
> > >
> > > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs?
> >
> > Yes. Some confusion here may be that our hardware always has 4
> > MAC/PCS/PMA setups, one for each host. Depending on the NIC
> > configuration we may have either 4 hosts or 2 hosts present with 2
> > disabled.
>
> So with 2 hosts, each host has two netdevs? If you were to dedicate
> the whole card to one host, you would have 4 netdevs? It is upto
> whatever is above to perform load balancing over those?

Just to be clear when I say "host" in this case I am referring to a
system running Linux, not "host" in the CMIS regard as that is the
NIC/FW I think.

Anyway we have 2 scenarios. Our main use case is to route one x4 to
each host. So in that case we only have one PCIe connection and it
will only show up as one netdev. In our manufacturing test case we
have a riser and use PCIe bifurcation to split up a x16 to 4 x4 and
all 4 endpoints can show up on the one host.

> If you always have 4 MAC/PCS, then the PCS is only ever used with a
> single lane? The MAC does not support 100000baseKR4 for example, but
> 250000baseKR1?

In our 2 host setup we are normally running a QSFP28 or QSFP+ cable
that has 4 lanes. So we effectively are cutting the cable in half to
provide 2 lanes to each host. This allows us to support either
50baseCR2 or 100baseCR2 as the upper limit for the cable depending on
if it is running NRZ or PAM4 modulation. In these setups things are
fairly rigid as we can only select to use 1 or 2 lanes, no selection
for modulation due to the nature of the cable spec.

In our 4 host setup we are configured to connect a QSFP-DD cable. With
that the cable has 8 lanes, with each host getting 2 of that and
seeing the same limitations as the 2 host setup mentioned earlier.

In theory we could do something like you call out in your example, but
we haven't configured a board combination for that yet. Basically it
would require a specific board and EEPROM combination to route the
lanes so that we had one lane per host instead of 2 which is the
current configs.

> > The general idea is that we have to cache the page and bank in the
> > driver and pass those as arguments to the firmware when we perform a
> > read. Basically it will take a lock on the I2C, set the page and bank,
> > perform the read, and then release the lock. With that all 4 hosts can
> > read the I2C from the QSFP without causing any side effects.
>
> I assume your hardware team have not actually implemented I2C, they
> have licensed it. Hence there is probably already a driver for it in
> drivers/i2c/busses, maybe one of the i2c-designware-? However, you are
> not going to use it, you are going to reinvent the wheel so you can
> parse the transactions going over it, look for reads and writes to
> address 127? Humm, i suppose you could have a virtual I2C driver doing
> this stacked on top of the real I2C driver. Is this something other
> network drivers are going to need? Should it be somewhere in
> drivers/net/phy? The hard bit is how you do the mutex in an agnostic
> way. But it looks like hardware spinlocks would work:
> https://docs.kernel.org/locking/hwspinlock.html

Part of the issue would be who owns the I2C driver. Both the firmware
and the host would need access to it. Rather than having to do a
handoff for that it is easier to have the firmware maintain the driver
and just process the requests for us via mailbox IPC calls.

One other point of contention is that we don't have a central firmware
managing things. We have one instance of the firmware running per
host. So the 4 firmware instances will be competing with each other
over access to the QSFP, so they have their own mutex that they
maintain to determine who can have master access to the I2C bus for
the QSFP with each having their own I2C device to connect to the bus.

> And actually, it is more complex than caching the page.

Yeah, that was generally my thought on it, and that is if I even need
to do that. From what I have seen most of the QSFP28/+ direct attach
cables we are working with are very simplistic. Seems like they only
had page 0. It wasn't until I started getting into the QSFP-DD stuff
for the 4 host NIC that I started running into the need for multi page
support. So for example the hwmon sensors don't do much for us as
direct attach cables don't really bother implementing them. A call to
"ethtool -m" on on of our systems usually yields 0.00 degrees C and
0.000 volts.

Also the ethtool API already had get_module_eeprom_by_page which is a
very good fit for our model since it allowed for atomic access based
on the page and bank number.

>   This specification defines functions in Pages 00h-02h. Pages 03-7Fh
>   are reserved for future use. Writing the value of a non-supported
>   Page shall not be accepted by the transceiver. The Page Select byte
>   shall revert to 0 and read / write operations shall be to the
>   unpaged A2h memory map.
>
> So i expect the SFP driver to do a write followed by a read to know if
> it needs to return EOPNOTSUPP to user space because the SFP does not
> implement the page.

I guess it could do EOPNOTSUPP too, we had used EADDRNOTAVAIL to
indicate that case. This is one of the reasons why our firmware API
requires the bank and page be passed in the message to perform an QSFP
I2C read. It is able to verify it on its end and if it isn't supported
it returns an error.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-24 23:40                           ` Alexander Duyck
@ 2025-04-25 13:11                             ` Andrew Lunn
  2025-04-25 15:41                               ` Alexander Duyck
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Lunn @ 2025-04-25 13:11 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

> Part of the issue would be who owns the I2C driver. Both the firmware
> and the host would need access to it. Rather than having to do a
> handoff for that it is easier to have the firmware maintain the driver
> and just process the requests for us via mailbox IPC calls.

How do you incorporate that into sfp.c? sfp_i2c_configure() and
sfp_i2c_get() expect a Linux I2C bus, since this is supposed to be a
plain simple I2C bus. I'm not sure we want to encourage other
abstractions of an I2C bus than the current Linux I2C bus abstraction.

	Andrew

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
  2025-04-25 13:11                             ` Andrew Lunn
@ 2025-04-25 15:41                               ` Alexander Duyck
  0 siblings, 0 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-25 15:41 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni

On Fri, Apr 25, 2025 at 6:12 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > Part of the issue would be who owns the I2C driver. Both the firmware
> > and the host would need access to it. Rather than having to do a
> > handoff for that it is easier to have the firmware maintain the driver
> > and just process the requests for us via mailbox IPC calls.
>
> How do you incorporate that into sfp.c? sfp_i2c_configure() and
> sfp_i2c_get() expect a Linux I2C bus, since this is supposed to be a
> plain simple I2C bus. I'm not sure we want to encourage other
> abstractions of an I2C bus than the current Linux I2C bus abstraction.

We would just present it as an i2c bus. Essentially all I have to do
is have the xfer call package the i2c request into a message and send
it over the IPC mailbox and then wait for a completion. It isn't as if
the i2c is anything all that complex. Since our operations are all
essentially atomic anyway we can do the read operations without any
issues, and for now we essentially just ignore writes to the SFP cage
since the DAC cables don't have anything to really write to anyway.

The bigger issues will be dealing with page/bank and CMIS when the
time comes. That is likely to be more work then how we deal with the
i2c. For i2c it would just be a matter of making sure our abstraction
we are presenting of the firmware interface can fit into whatever the
model is we come up with.

For now I can essentially just fake it by passing messages back and
forth to the firmware and caching bytes 126 and 127 so that we keep
track of what page/bank is being accessed by the host. One thing I
will have to double check though is how we report back a failure if
trying to set page/bank to an unsupported value. For now the FW was
just rejecting the read attempt as we had to pass data to trigger it.
We may need to add support for a 0 byte transaction to support setting
page and bank just to test if we can set them or not if we are going
to fake the extra verification step.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-04-25 15:42 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck
2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
2025-04-22  8:32   ` Russell King (Oracle)
2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck
2025-04-16 16:01   ` Russell King (Oracle)
2025-04-16 16:16     ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle)
2025-04-16 17:14       ` Russell King (Oracle)
2025-04-17 14:30       ` Alexander H Duyck
2025-04-17 14:35         ` Russell King (Oracle)
2025-04-17 15:23           ` Russell King (Oracle)
2025-04-17 17:06             ` Alexander Duyck
2025-04-17 17:27               ` Russell King (Oracle)
2025-04-17 19:49                 ` Alexander Duyck
2025-04-22  9:51                   ` Russell King (Oracle)
2025-04-22 15:30                     ` Alexander Duyck
2025-04-22 16:00                       ` Andrew Lunn
2025-04-22 20:09                         ` Alexander Duyck
2025-04-23  0:00       ` patchwork-bot+netdevbpf
2025-04-16 17:12     ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle)
2025-04-16 19:03     ` Alexander Duyck
2025-04-16 19:19       ` Russell King (Oracle)
2025-04-16 20:05         ` Russell King (Oracle)
2025-04-16 22:58           ` Alexander Duyck
2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn
2025-04-20 18:18   ` Alexander Duyck
2025-04-20 21:58     ` Andrew Lunn
2025-04-21 15:51       ` Alexander Duyck
2025-04-21 16:50         ` Alexander Duyck
2025-04-22  1:21           ` Jakub Kicinski
2025-04-22 13:49             ` Andrew Lunn
2025-04-22 15:28               ` Jakub Kicinski
2025-04-22 16:49                 ` Andrew Lunn
2025-04-22 17:30                   ` Russell King (Oracle)
2025-04-22 18:13                     ` Andrew Lunn
2025-04-22 18:50                       ` Russell King (Oracle)
2025-04-22 23:51                       ` Jakub Kicinski
2025-04-22 21:29                   ` Alexander Duyck
2025-04-22 22:26                     ` Andrew Lunn
2025-04-22 23:06                       ` Alexander Duyck
2025-04-23 18:38                         ` Alexander Duyck
2025-04-24 20:34                         ` Andrew Lunn
2025-04-24 23:40                           ` Alexander Duyck
2025-04-25 13:11                             ` Andrew Lunn
2025-04-25 15:41                               ` Alexander Duyck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).