* [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap
@ 2025-04-16 15:28 Alexander Duyck
2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Alexander Duyck @ 2025-04-16 15:28 UTC (permalink / raw)
To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni
Address two issues found in the phylink code.
The first issue is the fact that there were unused defines that were
referencing deprecated macros themselves. Since they aren't used we might
as well drop them.
The second issue which is more the main reason for this submission is the
fact that the BMC was losing link when we would call phylink_resume. This
is fixed by adding a new boolean value link_balanced which will allow us
to avoid doing an immediate force of the link up/down and instead defer it
until after we have checked the actual link state.
---
Alexander Duyck (2):
net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES
net: phylink: Fix issues with link balancing w/ BMC present
drivers/net/phy/phylink.c | 25 ++++++-------------------
1 file changed, 6 insertions(+), 19 deletions(-)
--
^ permalink raw reply [flat|nested] 44+ messages in thread* [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES 2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck @ 2025-04-16 15:28 ` Alexander Duyck 2025-04-22 8:32 ` Russell King (Oracle) 2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck 2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn 2 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-16 15:28 UTC (permalink / raw) To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni From: Alexander Duyck <alexanderduyck@fb.com> The defines for SUPPORTED_INTERFACES and ADVERTISED_INTERFACES both appear to be unused. I couldn't find anything that actually references them in the original diff that added them and it seems like they have persisted despite using deprecated defines that aren't supposed to be used as per the ethtool.h header that defines the bits they are composed of. Since they are unused, and not supposed to be used anymore I am just dropping the lines of code since they seem to just be occupying space. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> --- drivers/net/phy/phylink.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index b68369e2342b..942ce114dabd 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -24,13 +24,6 @@ #include "sfp.h" #include "swphy.h" -#define SUPPORTED_INTERFACES \ - (SUPPORTED_TP | SUPPORTED_MII | SUPPORTED_FIBRE | \ - SUPPORTED_BNC | SUPPORTED_AUI | SUPPORTED_Backplane) -#define ADVERTISED_INTERFACES \ - (ADVERTISED_TP | ADVERTISED_MII | ADVERTISED_FIBRE | \ - ADVERTISED_BNC | ADVERTISED_AUI | ADVERTISED_Backplane) - enum { PHYLINK_DISABLE_STOPPED, PHYLINK_DISABLE_LINK, ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES 2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck @ 2025-04-22 8:32 ` Russell King (Oracle) 0 siblings, 0 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-22 8:32 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 08:28:53AM -0700, Alexander Duyck wrote: > From: Alexander Duyck <alexanderduyck@fb.com> > > The defines for SUPPORTED_INTERFACES and ADVERTISED_INTERFACES both appear > to be unused. I couldn't find anything that actually references them in the > original diff that added them and it seems like they have persisted despite > using deprecated defines that aren't supposed to be used as per the > ethtool.h header that defines the bits they are composed of. > > Since they are unused, and not supposed to be used anymore I am just > dropping the lines of code since they seem to just be occupying space. > > Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Please send this separately, it can be applied independently of patch 2. Thanks! -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck 2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck @ 2025-04-16 15:29 ` Alexander Duyck 2025-04-16 16:01 ` Russell King (Oracle) 2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn 2 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-16 15:29 UTC (permalink / raw) To: netdev; +Cc: linux, andrew, hkallweit1, davem, kuba, pabeni From: Alexander Duyck <alexanderduyck@fb.com> This change is meant to address the fact that there are link imbalances introduced when using phylink on a system with a BMC. Specifically there are two issues. The first issue is that if we lose link after the first call to phylink_start but before it gets to the phylink_resolve we will end up with the phylink interface assuming the link was always down and not calling phylink_link_down resulting in a stuck interface. The second issue is that when a BMC is present we are currently forcing the link down. This results in us bouncing the link for a fraction of a second and that will result in dropped packets for the BMC. The third issue is just an extra "Link Down" message that is seen when calling phylink_resume. This is addressed by identifying that the link isn't balanced and just not displaying the down message in such a case. To resolve these issues this change introduces a new boolean variable link_balanced. This value will be set to 0 initially when we create the phylink interface, and again when we bring down the link and unbalance it in phylink_suspend. When it is set to 0 it will force us to trigger the phylink_link_up/down call which will have us write to the hardware. As a result we can avoid the two issues and it should not rearm without another call to phylink_suspend. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> --- drivers/net/phy/phylink.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 942ce114dabd..2b9ab343942e 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -45,6 +45,7 @@ struct phylink { struct phylink_pcs *pcs; struct device *dev; unsigned int old_link_state:1; + unsigned int link_balanced:1; unsigned long phylink_disable_state; /* bitmask of disables */ struct phy_device *phydev; @@ -1553,7 +1554,8 @@ static void phylink_link_down(struct phylink *pl) pl->mac_ops->mac_link_down(pl->config, pl->act_link_an_mode, pl->cur_interface); - phylink_info(pl, "Link is Down\n"); + if (pl->link_balanced) + phylink_info(pl, "Link is Down\n"); } static bool phylink_link_is_up(struct phylink *pl) @@ -1658,12 +1660,13 @@ static void phylink_resolve(struct work_struct *w) if (pl->major_config_failed) link_state.link = false; - if (link_state.link != cur_link_state) { + if (link_state.link != cur_link_state || !pl->link_balanced) { pl->old_link_state = link_state.link; if (!link_state.link) phylink_link_down(pl); else phylink_link_up(pl, link_state); + pl->link_balanced = true; } if (!link_state.link && retrigger) { pl->link_failed = false; @@ -2546,6 +2549,7 @@ void phylink_suspend(struct phylink *pl, bool mac_wol) netif_carrier_off(pl->netdev); else pl->old_link_state = false; + pl->link_balanced = false; /* We do not call mac_link_down() here as we want the * link to remain up to receive the WoL packets. @@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl) if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { /* Wake-on-Lan enabled, MAC handling */ - /* Call mac_link_down() so we keep the overall state balanced. - * Do this under the state_mutex lock for consistency. This - * will cause a "Link Down" message to be printed during - * resume, which is harmless - the true link state will be - * printed when we run a resolve. - */ - mutex_lock(&pl->state_mutex); - phylink_link_down(pl); - mutex_unlock(&pl->state_mutex); - /* Re-apply the link parameters so that all the settings get * restored to the MAC. */ ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck @ 2025-04-16 16:01 ` Russell King (Oracle) 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 16:01 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > From: Alexander Duyck <alexanderduyck@fb.com> > > This change is meant to address the fact that there are link imbalances > introduced when using phylink on a system with a BMC. Specifically there > are two issues. > > The first issue is that if we lose link after the first call to > phylink_start but before it gets to the phylink_resolve we will end up with > the phylink interface assuming the link was always down and not calling > phylink_link_down resulting in a stuck interface. That is intentional. phylink strictly orders .mac_link_down and .mac_link_up, and starts from an initial position that the link _will_ be considered to be down. So, it is intentional that .mac_link_down will _never_ be called after phylink_start(). > The second issue is that when a BMC is present we are currently forcing the > link down. This results in us bouncing the link for a fraction of a second > and that will result in dropped packets for the BMC. ... but you don't explain how that happens. > The third issue is just an extra "Link Down" message that is seen when > calling phylink_resume. This is addressed by identifying that the link > isn't balanced and just not displaying the down message in such a case. Hmm, this one is an error, but is not as simple as "don't print the message" as it results in a violation of the rule I mentioned above. We need phylink_suspend() to record the state of the link at that point, and avoid calling phylink_link_down() if the link was down prior to suspend. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-16 16:01 ` Russell King (Oracle) @ 2025-04-16 16:16 ` Russell King (Oracle) 2025-04-16 17:14 ` Russell King (Oracle) ` (2 more replies) 2025-04-16 17:12 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle) 2025-04-16 19:03 ` Alexander Duyck 2 siblings, 3 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 16:16 UTC (permalink / raw) To: Andrew Lunn, Heiner Kallweit, Alexander Duyck Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni When WoL is enabled, we update the software state in phylink to indicate that the link is down, and disable the resolver from bringing the link back up. On resume, we attempt to bring the overall state into consistency by calling the .mac_link_down() method, but this is wrong if the link was already down, as phylink strictly orders the .mac_link_up() and .mac_link_down() methods - and this would break that ordering. Fixes: f97493657c63 ("net: phylink: add suspend/resume support") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> --- To fix the suspend/resume with link down, this is what I think we should do. Untested at the moment. drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++---------------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c index 69ca765485db..d2c59ee16ebc 100644 --- a/drivers/net/phy/phylink.c +++ b/drivers/net/phy/phylink.c @@ -81,6 +81,7 @@ struct phylink { unsigned int pcs_state; bool link_failed; + bool suspend_link_up; bool major_config_failed; bool mac_supports_eee_ops; bool mac_supports_eee; @@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol) /* Stop the resolver bringing the link up */ __set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state); - /* Disable the carrier, to prevent transmit timeouts, - * but one would hope all packets have been sent. This - * also means phylink_resolve() will do nothing. - */ - if (pl->netdev) - netif_carrier_off(pl->netdev); - else + pl->suspend_link_up = phylink_link_is_up(pl); + if (pl->suspend_link_up) { + /* Disable the carrier, to prevent transmit timeouts, + * but one would hope all packets have been sent. This + * also means phylink_resolve() will do nothing. + */ + if (pl->netdev) + netif_carrier_off(pl->netdev); pl->old_link_state = false; + } /* We do not call mac_link_down() here as we want the * link to remain up to receive the WoL packets. @@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl) if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { /* Wake-on-Lan enabled, MAC handling */ - /* Call mac_link_down() so we keep the overall state balanced. - * Do this under the state_mutex lock for consistency. This - * will cause a "Link Down" message to be printed during - * resume, which is harmless - the true link state will be - * printed when we run a resolve. - */ - mutex_lock(&pl->state_mutex); - phylink_link_down(pl); - mutex_unlock(&pl->state_mutex); + if (pl->suspend_link_up) { + /* Call mac_link_down() so we keep the overall state + * balanced. Do this under the state_mutex lock for + * consistency. This will cause a "Link Down" message + * to be printed during resume, which is harmless - + * the true link state will be printed when we run a + * resolve. + */ + mutex_lock(&pl->state_mutex); + phylink_link_down(pl); + mutex_unlock(&pl->state_mutex); + } /* Re-apply the link parameters so that all the settings get * restored to the MAC. -- 2.30.2 ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) @ 2025-04-16 17:14 ` Russell King (Oracle) 2025-04-17 14:30 ` Alexander H Duyck 2025-04-23 0:00 ` patchwork-bot+netdevbpf 2 siblings, 0 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 17:14 UTC (permalink / raw) To: Andrew Lunn, Heiner Kallweit, Alexander Duyck Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Wed, Apr 16, 2025 at 05:16:01PM +0100, Russell King (Oracle) wrote: > When WoL is enabled, we update the software state in phylink to > indicate that the link is down, and disable the resolver from > bringing the link back up. > > On resume, we attempt to bring the overall state into consistency > by calling the .mac_link_down() method, but this is wrong if the > link was already down, as phylink strictly orders the .mac_link_up() > and .mac_link_down() methods - and this would break that ordering. > > Fixes: f97493657c63 ("net: phylink: add suspend/resume support") > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> > --- > > To fix the suspend/resume with link down, this is what I think we > should do. Untested at the moment. Tested on NVidia Jetson Xavier NX, this eliminates the incorrect call to phylink_mac_link_down() when resuming and the link was previously down. Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) 2025-04-16 17:14 ` Russell King (Oracle) @ 2025-04-17 14:30 ` Alexander H Duyck 2025-04-17 14:35 ` Russell King (Oracle) 2025-04-23 0:00 ` patchwork-bot+netdevbpf 2 siblings, 1 reply; 44+ messages in thread From: Alexander H Duyck @ 2025-04-17 14:30 UTC (permalink / raw) To: Russell King (Oracle), Andrew Lunn, Heiner Kallweit Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Wed, 2025-04-16 at 17:16 +0100, Russell King (Oracle) wrote: > When WoL is enabled, we update the software state in phylink to > indicate that the link is down, and disable the resolver from > bringing the link back up. > > On resume, we attempt to bring the overall state into consistency > by calling the .mac_link_down() method, but this is wrong if the > link was already down, as phylink strictly orders the .mac_link_up() > and .mac_link_down() methods - and this would break that ordering. > > Fixes: f97493657c63 ("net: phylink: add suspend/resume support") > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> > --- > > To fix the suspend/resume with link down, this is what I think we > should do. Untested at the moment. > > drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++---------------- > 1 file changed, 22 insertions(+), 16 deletions(-) > > diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c > index 69ca765485db..d2c59ee16ebc 100644 > --- a/drivers/net/phy/phylink.c > +++ b/drivers/net/phy/phylink.c > @@ -81,6 +81,7 @@ struct phylink { > unsigned int pcs_state; > > bool link_failed; > + bool suspend_link_up; > bool major_config_failed; > bool mac_supports_eee_ops; > bool mac_supports_eee; I'm pretty sure this extra bit of state isn't needed. > @@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol) > /* Stop the resolver bringing the link up */ > __set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state); > > - /* Disable the carrier, to prevent transmit timeouts, > - * but one would hope all packets have been sent. This > - * also means phylink_resolve() will do nothing. > - */ > - if (pl->netdev) > - netif_carrier_off(pl->netdev); > - else This is the only spot where we weren't setting netif_carrier_on/off and old_link_state together. I suspect you could just carry old_link_state without needing to add a new argument. Basically you would just need to drop the "else" portion of this statement. In the grand scheme of things with the exception of this one spot old_link_state is essentially the actual MAC/PCS link state whereas netif_carrier_off is the administrative state. > + pl->suspend_link_up = phylink_link_is_up(pl); > + if (pl->suspend_link_up) { > + /* Disable the carrier, to prevent transmit timeouts, > + * but one would hope all packets have been sent. This > + * also means phylink_resolve() will do nothing. > + */ > + if (pl->netdev) > + netif_carrier_off(pl->netdev); > pl->old_link_state = false; > + } > > /* We do not call mac_link_down() here as we want the > * link to remain up to receive the WoL packets. > @@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl) > if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { > /* Wake-on-Lan enabled, MAC handling */ > > - /* Call mac_link_down() so we keep the overall state balanced. > - * Do this under the state_mutex lock for consistency. This > - * will cause a "Link Down" message to be printed during > - * resume, which is harmless - the true link state will be > - * printed when we run a resolve. > - */ > - mutex_lock(&pl->state_mutex); > - phylink_link_down(pl); > - mutex_unlock(&pl->state_mutex); > + if (pl->suspend_link_up) { > + /* Call mac_link_down() so we keep the overall state > + * balanced. Do this under the state_mutex lock for > + * consistency. This will cause a "Link Down" message > + * to be printed during resume, which is harmless - > + * the true link state will be printed when we run a > + * resolve. > + */ > + mutex_lock(&pl->state_mutex); > + phylink_link_down(pl); > + mutex_unlock(&pl->state_mutex); > + } You should be able to do all of this with just old_link_state. The only thing that would have to change is that you would need to set old_link_state to false after the if statement. I'm assuming part of the reason for forcing the link down here also has to do with the fact that you are using phylink_mac_initial_config which calls phylink_major_config after this? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 14:30 ` Alexander H Duyck @ 2025-04-17 14:35 ` Russell King (Oracle) 2025-04-17 15:23 ` Russell King (Oracle) 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-17 14:35 UTC (permalink / raw) To: Alexander H Duyck Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote: > On Wed, 2025-04-16 at 17:16 +0100, Russell King (Oracle) wrote: > > When WoL is enabled, we update the software state in phylink to > > indicate that the link is down, and disable the resolver from > > bringing the link back up. > > > > On resume, we attempt to bring the overall state into consistency > > by calling the .mac_link_down() method, but this is wrong if the > > link was already down, as phylink strictly orders the .mac_link_up() > > and .mac_link_down() methods - and this would break that ordering. > > > > Fixes: f97493657c63 ("net: phylink: add suspend/resume support") > > Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> > > --- > > > > To fix the suspend/resume with link down, this is what I think we > > should do. Untested at the moment. > > > > drivers/net/phy/phylink.c | 38 ++++++++++++++++++++++---------------- > > 1 file changed, 22 insertions(+), 16 deletions(-) > > > > diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c > > index 69ca765485db..d2c59ee16ebc 100644 > > --- a/drivers/net/phy/phylink.c > > +++ b/drivers/net/phy/phylink.c > > @@ -81,6 +81,7 @@ struct phylink { > > unsigned int pcs_state; > > > > bool link_failed; > > + bool suspend_link_up; > > bool major_config_failed; > > bool mac_supports_eee_ops; > > bool mac_supports_eee; > > I'm pretty sure this extra bit of state isn't needed. > > > @@ -2545,14 +2546,16 @@ void phylink_suspend(struct phylink *pl, bool mac_wol) > > /* Stop the resolver bringing the link up */ > > __set_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state); > > > > - /* Disable the carrier, to prevent transmit timeouts, > > - * but one would hope all packets have been sent. This > > - * also means phylink_resolve() will do nothing. > > - */ > > - if (pl->netdev) > > - netif_carrier_off(pl->netdev); > > - else > > This is the only spot where we weren't setting netif_carrier_on/off and > old_link_state together. I suspect you could just carry old_link_state > without needing to add a new argument. Basically you would just need to > drop the "else" portion of this statement. > > In the grand scheme of things with the exception of this one spot > old_link_state is essentially the actual MAC/PCS link state whereas > netif_carrier_off is the administrative state. Sorry to say, but you have that wrong. Neither are the administrative state. > > + pl->suspend_link_up = phylink_link_is_up(pl); > > + if (pl->suspend_link_up) { > > + /* Disable the carrier, to prevent transmit timeouts, > > + * but one would hope all packets have been sent. This > > + * also means phylink_resolve() will do nothing. > > + */ > > + if (pl->netdev) > > + netif_carrier_off(pl->netdev); > > pl->old_link_state = false; > > + } > > > > /* We do not call mac_link_down() here as we want the > > * link to remain up to receive the WoL packets. > > @@ -2603,15 +2606,18 @@ void phylink_resume(struct phylink *pl) > > if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { > > /* Wake-on-Lan enabled, MAC handling */ > > > > - /* Call mac_link_down() so we keep the overall state balanced. > > - * Do this under the state_mutex lock for consistency. This > > - * will cause a "Link Down" message to be printed during > > - * resume, which is harmless - the true link state will be > > - * printed when we run a resolve. > > - */ > > - mutex_lock(&pl->state_mutex); > > - phylink_link_down(pl); > > - mutex_unlock(&pl->state_mutex); > > + if (pl->suspend_link_up) { > > + /* Call mac_link_down() so we keep the overall state > > + * balanced. Do this under the state_mutex lock for > > + * consistency. This will cause a "Link Down" message > > + * to be printed during resume, which is harmless - > > + * the true link state will be printed when we run a > > + * resolve. > > + */ > > + mutex_lock(&pl->state_mutex); > > + phylink_link_down(pl); > > + mutex_unlock(&pl->state_mutex); > > + } > > You should be able to do all of this with just old_link_state. The only > thing that would have to change is that you would need to set > old_link_state to false after the if statement. Nope. > I'm assuming part of the reason for forcing the link down here also has > to do with the fact that you are using phylink_mac_initial_config which > calls phylink_major_config after this? Another of phylink's guarantees is that it won't do the mac_config() etc with the link up. So, in order to ensure that everything is correctly programmed after resume, it needs mac_config() etc called which means the link needs to come down first. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 14:35 ` Russell King (Oracle) @ 2025-04-17 15:23 ` Russell King (Oracle) 2025-04-17 17:06 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-17 15:23 UTC (permalink / raw) To: Alexander H Duyck Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote: > On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote: > > This is the only spot where we weren't setting netif_carrier_on/off and > > old_link_state together. I suspect you could just carry old_link_state > > without needing to add a new argument. Basically you would just need to > > drop the "else" portion of this statement. > > > > In the grand scheme of things with the exception of this one spot > > old_link_state is essentially the actual MAC/PCS link state whereas > > netif_carrier_off is the administrative state. > > Sorry to say, but you have that wrong. Neither are the administrative > state. To add to this (sorry, I rushed that reply), old_link_state is used when we aren't bound to a network device, otherwise the netif carrier state is used. Changes in the netif carrier state provoke netlink messages to userspace to inform userspace of changes to the link state. Moreover, it controls whether the network stack queues packets for transmission to the driver, and also whether the transmit watchdog is allowed to time out. It _probably_ also lets the packet schedulers know that the link state has changed. So, the netif carrier state is not "administrative". It isn't strictly necessary for old_link_state to remain in sync with the netif carrier state, but it is desirable to avoid errors - but obviously the netif carrier state only exists when we're bound to a network device. > > > - /* Call mac_link_down() so we keep the overall state balanced. > > > - * Do this under the state_mutex lock for consistency. This > > > - * will cause a "Link Down" message to be printed during > > > - * resume, which is harmless - the true link state will be > > > - * printed when we run a resolve. > > > - */ > > > - mutex_lock(&pl->state_mutex); > > > - phylink_link_down(pl); > > > - mutex_unlock(&pl->state_mutex); > > > + if (pl->suspend_link_up) { > > > + /* Call mac_link_down() so we keep the overall state > > > + * balanced. Do this under the state_mutex lock for > > > + * consistency. This will cause a "Link Down" message > > > + * to be printed during resume, which is harmless - > > > + * the true link state will be printed when we run a > > > + * resolve. > > > + */ > > > + mutex_lock(&pl->state_mutex); > > > + phylink_link_down(pl); > > > + mutex_unlock(&pl->state_mutex); > > > + } > > > > You should be able to do all of this with just old_link_state. The only > > thing that would have to change is that you would need to set > > old_link_state to false after the if statement. > > Nope. And still nope - what we need to know here is what was the link state _before_ we called netif_carrier_off() or set old_link_state to false. I somehow suspect that you don't understand what all this code is trying to do, so let me explain it. In the suspend function, when WoL is enabled, and we're using MAC-based WoL, we need the link to the MAC to remain up, so it can receive packets to check whether they are the appropriate wake packet. However, we don't want packets to be queued for transmission either. So, we have the case where we want to avoid notifying the MAC that the link has gone down, but we also want to call netif_carrier_off() to stop the network stack queueing packets. To achieve this, several things work in unison: - we take the state mutex to prevent the resolver from running while we fiddle with various state. - we disable the resolver (which, if that's the only thing that happens, will provoke the resolver to take the link down.) - we lie to the resolver about the link state, by calling netif_carrier_off() and/or setting old_link_state to false. This means the resolver believes the link is _already_ down, and thus because of the strict ordering I've previously mentioned, will *not* call mac_link_down(). - we release the lock. There is now no way that the resolver will call either mac_link_up() or mac_link_down() - which is what we want here. When we resume, we need to unwind this, but taking care that the network layers may need to be reprogrammed. That is done by "completing" the link-down that was done in the suspend function by calling mac_link_down() (which should only have been done if the link wasn't already down - hence depends on state _prior_ to the modifications that phylink_suspend() made.) It can then re-setup the MAC/PCS by calling the config functions, and then release the resolver to then make a decision about the state of the link. With the fix I posted, this guarantees that that the contract I talked about previously is maintained. I'm sorry that this doesn't "fit" your special case, but this is what all users so far expect from phylink. Now, how about explaining what's going on in fbnic_mac_link_up_asic() and fbnic_mac_link_down_asic(), and what in there triggers the BMC to press the panic stations do-not-press red buttin? If you wish to have confirmation of what netif_carrier does, then I'm sure your co-maintainer and core Linux networking maintainer, Jakub, would be happy to help. Thanks. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 15:23 ` Russell King (Oracle) @ 2025-04-17 17:06 ` Alexander Duyck 2025-04-17 17:27 ` Russell King (Oracle) 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-17 17:06 UTC (permalink / raw) To: Russell King (Oracle) Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote: > > On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote: > > > This is the only spot where we weren't setting netif_carrier_on/off and > > > old_link_state together. I suspect you could just carry old_link_state > > > without needing to add a new argument. Basically you would just need to > > > drop the "else" portion of this statement. > > > > > > In the grand scheme of things with the exception of this one spot > > > old_link_state is essentially the actual MAC/PCS link state whereas > > > netif_carrier_off is the administrative state. > > > > Sorry to say, but you have that wrong. Neither are the administrative > > state. > > To add to this (sorry, I rushed that reply), old_link_state is used when > we aren't bound to a network device, otherwise the netif carrier state > is used. Changes in the netif carrier state provoke netlink messages to > userspace to inform userspace of changes to the link state. > > Moreover, it controls whether the network stack queues packets for > transmission to the driver, and also whether the transmit watchdog is > allowed to time out. It _probably_ also lets the packet schedulers > know that the link state has changed. > > So, the netif carrier state is not "administrative". I have always sort of assumed that netif_carrier_on/off was the logical AND of the administrative state of the NIC and the state of the actual MAC/PCS/PHY. That is why I refer to it as an administrative state. You end up with ifconfig up/down toggling netif carrier even if the underlying link state hasn't changed. This has been the case for many of the high end NICs for a while now, especially for the multi-host ones, as the firmware won't change the actual physical state of the link. It leaves it in place while the NIC port on the host is no longer active. > It isn't strictly necessary for old_link_state to remain in sync with > the netif carrier state, but it is desirable to avoid errors - but > obviously the netif carrier state only exists when we're bound to a > network device. From what I can tell this is the only spot where the two diverge, prior to this patch. The general thought I was having was to update the netif_carrier state in the suspend, and then old_link_state in the resume. I realize now the concern is that setting the phylink_disable_state is essentially the same as setting the link down. So really we have 3 states we are tracking, netif_carrier_ok/old_link_state, phylink_disable_state, and if we want the link state to change while disabled. So we do need one additional piece of state in the event that there isn't a netdev in order to handle the case where "pl->phylink_disable_state && !!pl->phylink_disable_state == pl->old_link_state". > > > > - /* Call mac_link_down() so we keep the overall state balanced. > > > > - * Do this under the state_mutex lock for consistency. This > > > > - * will cause a "Link Down" message to be printed during > > > > - * resume, which is harmless - the true link state will be > > > > - * printed when we run a resolve. > > > > - */ > > > > - mutex_lock(&pl->state_mutex); > > > > - phylink_link_down(pl); > > > > - mutex_unlock(&pl->state_mutex); > > > > + if (pl->suspend_link_up) { > > > > + /* Call mac_link_down() so we keep the overall state > > > > + * balanced. Do this under the state_mutex lock for > > > > + * consistency. This will cause a "Link Down" message > > > > + * to be printed during resume, which is harmless - > > > > + * the true link state will be printed when we run a > > > > + * resolve. > > > > + */ > > > > + mutex_lock(&pl->state_mutex); > > > > + phylink_link_down(pl); > > > > + mutex_unlock(&pl->state_mutex); > > > > + } > > > > > > You should be able to do all of this with just old_link_state. The only > > > thing that would have to change is that you would need to set > > > old_link_state to false after the if statement. > > > > Nope. > > And still nope - what we need to know here is what was the link state > _before_ we called netif_carrier_off() or set old_link_state to false. > > I somehow suspect that you don't understand what all this code is trying > to do, so let me explain it. > > In the suspend function, when WoL is enabled, and we're using MAC-based > WoL, we need the link to the MAC to remain up, so it can receive packets > to check whether they are the appropriate wake packet. However, we don't > want packets to be queued for transmission either. That is the same as what we are looking for. We aren't queueing any packets for transmission ourselves. We just want to leave the MAC enabled, however we also want to leave it enabled when we resume. > So, we have the case where we want to avoid notifying the MAC that the > link has gone down, but we also want to call netif_carrier_off() to stop > the network stack queueing packets. > > To achieve this, several things work in unison: > > - we take the state mutex to prevent the resolver from running while we > fiddle with various state. > - we disable the resolver (which, if that's the only thing that happens, > will provoke the resolver to take the link down.) > - we lie to the resolver about the link state, by calling > netif_carrier_off() and/or setting old_link_state to false. This > means the resolver believes the link is _already_ down, and thus > because of the strict ordering I've previously mentioned, will *not* > call mac_link_down(). > - we release the lock. > > There is now no way that the resolver will call either mac_link_up() or > mac_link_down() - which is what we want here. The part that I think I missed here was that if we set phylink_disable_state we then set link_state.link to false in phylink_resolve. I wonder if we couldn't just have a flag that sets cur_link_state to false in the "if(pl->phylink_disable_state)" case and simplify this to indicate we won't force the link down" > When we resume, we need to unwind this, but taking care that the network > layers may need to be reprogrammed. That is done by "completing" the > link-down that was done in the suspend function by calling > mac_link_down() (which should only have been done if the link wasn't > already down - hence depends on state _prior_ to the modifications that > phylink_suspend() made.) > > It can then re-setup the MAC/PCS by calling the config functions, and > then release the resolver to then make a decision about the state of > the link. > > With the fix I posted, this guarantees that that the contract I talked > about previously is maintained. > > I'm sorry that this doesn't "fit" your special case, but this is what > all users so far expect from phylink. I wasn't trying to "fit" it so much as understand it. So this worked before because you were clearing either netif_carrier_off or old_link_state and the resolve function followed the same logic. The piece I missed is that setting the bit in phylink_disable_state is forcing the link to resolve to false which in turn would call mac_link_down if the old state is considered to be up and a netdev isn't present. > Now, how about explaining what's going on in fbnic_mac_link_up_asic() > and fbnic_mac_link_down_asic(), and what in there triggers the BMC > to press the panic stations do-not-press red buttin? So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are: 1. In fbnic_mac_link_down_asic we assume that if we are being told that the link is down by phylink that it really means the link is down and we need to shut off the MAC and flush any packets that are in the Tx FIFO. The issue isn't so much the call itself, it is the fact that we are being called in phylink_resume to rebalance the link that will be going from UP->UP. The rebalancing is introducing an extra down step. This could be resolved by adding an extra check to the line in phylink_resume that is calling the mac_link_down so that it doesn't get triggered and stall the link. In my test code I am now calling it "pl->rolling_start". 2. In phylink_start/phylink_resume since we are coming at this from a "rolling start" due to the BMC we have the MAC up and the netif_carrier in the "off" state. As a result if we end up losing the link between mac_prepare and pcs_get_state the MAC gets into a bad state where netif_carrier is off, the MAC is stuck in the "up" state, and we have stale packets clogging up the Tx FIFO. As far as the BMC it isn't so much wanting to hit the big red button as our platform team. Basically losing of packets is very problematic for them, think about ssh sessions that suddenly lock up during boot, and they can demonstrate that none of the other vendors that have the MAC/PCS/PHY buried in their firmware have this issue. I was really hoping to avoid going that route as the whole point of this was to keep the code open source and visible. > If you wish to have confirmation of what netif_carrier does, then I'm > sure your co-maintainer and core Linux networking maintainer, Jakub, > would be happy to help. I'm well aware of what netif_carrier does, no need to be patronizing. I have been doing this for over 16 years. The difference is for me the 1G stuff was 16 years ago, most of the stuff I deal with now is 25G or higher and multi-host which is a different world with different requirements. This is why we are having our communication issues in understanding where each other is coming from. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 17:06 ` Alexander Duyck @ 2025-04-17 17:27 ` Russell King (Oracle) 2025-04-17 19:49 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-17 17:27 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 10:06:45AM -0700, Alexander Duyck wrote: > On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle) > <linux@armlinux.org.uk> wrote: > > > > On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote: > > > On Thu, Apr 17, 2025 at 07:30:05AM -0700, Alexander H Duyck wrote: > > > > This is the only spot where we weren't setting netif_carrier_on/off and > > > > old_link_state together. I suspect you could just carry old_link_state > > > > without needing to add a new argument. Basically you would just need to > > > > drop the "else" portion of this statement. > > > > > > > > In the grand scheme of things with the exception of this one spot > > > > old_link_state is essentially the actual MAC/PCS link state whereas > > > > netif_carrier_off is the administrative state. > > > > > > Sorry to say, but you have that wrong. Neither are the administrative > > > state. > > > > To add to this (sorry, I rushed that reply), old_link_state is used when > > we aren't bound to a network device, otherwise the netif carrier state > > is used. Changes in the netif carrier state provoke netlink messages to > > userspace to inform userspace of changes to the link state. > > > > Moreover, it controls whether the network stack queues packets for > > transmission to the driver, and also whether the transmit watchdog is > > allowed to time out. It _probably_ also lets the packet schedulers > > know that the link state has changed. > > > > So, the netif carrier state is not "administrative". > > I have always sort of assumed that netif_carrier_on/off was the > logical AND of the administrative state of the NIC and the state of > the actual MAC/PCS/PHY. That is why I refer to it as an administrative > state. You end up with ifconfig up/down toggling netif carrier even if > the underlying link state hasn't changed. This has been the case for > many of the high end NICs for a while now, especially for the > multi-host ones, as the firmware won't change the actual physical > state of the link. It leaves it in place while the NIC port on the > host is no longer active. > > > It isn't strictly necessary for old_link_state to remain in sync with > > the netif carrier state, but it is desirable to avoid errors - but > > obviously the netif carrier state only exists when we're bound to a > > network device. > > From what I can tell this is the only spot where the two diverge, > prior to this patch. The general thought I was having was to update > the netif_carrier state in the suspend, and then old_link_state in the > resume. I realize now the concern is that setting the > phylink_disable_state is essentially the same as setting the link > down. So really we have 3 states we are tracking, > netif_carrier_ok/old_link_state, phylink_disable_state, and if we want > the link state to change while disabled. So we do need one additional > piece of state in the event that there isn't a netdev in order to > handle the case where "pl->phylink_disable_state && > !!pl->phylink_disable_state == pl->old_link_state". > > > > > > - /* Call mac_link_down() so we keep the overall state balanced. > > > > > - * Do this under the state_mutex lock for consistency. This > > > > > - * will cause a "Link Down" message to be printed during > > > > > - * resume, which is harmless - the true link state will be > > > > > - * printed when we run a resolve. > > > > > - */ > > > > > - mutex_lock(&pl->state_mutex); > > > > > - phylink_link_down(pl); > > > > > - mutex_unlock(&pl->state_mutex); > > > > > + if (pl->suspend_link_up) { > > > > > + /* Call mac_link_down() so we keep the overall state > > > > > + * balanced. Do this under the state_mutex lock for > > > > > + * consistency. This will cause a "Link Down" message > > > > > + * to be printed during resume, which is harmless - > > > > > + * the true link state will be printed when we run a > > > > > + * resolve. > > > > > + */ > > > > > + mutex_lock(&pl->state_mutex); > > > > > + phylink_link_down(pl); > > > > > + mutex_unlock(&pl->state_mutex); > > > > > + } > > > > > > > > You should be able to do all of this with just old_link_state. The only > > > > thing that would have to change is that you would need to set > > > > old_link_state to false after the if statement. > > > > > > Nope. > > > > And still nope - what we need to know here is what was the link state > > _before_ we called netif_carrier_off() or set old_link_state to false. > > > > I somehow suspect that you don't understand what all this code is trying > > to do, so let me explain it. > > > > In the suspend function, when WoL is enabled, and we're using MAC-based > > WoL, we need the link to the MAC to remain up, so it can receive packets > > to check whether they are the appropriate wake packet. However, we don't > > want packets to be queued for transmission either. > > That is the same as what we are looking for. We aren't queueing any > packets for transmission ourselves. We just want to leave the MAC > enabled, however we also want to leave it enabled when we resume. > > > So, we have the case where we want to avoid notifying the MAC that the > > link has gone down, but we also want to call netif_carrier_off() to stop > > the network stack queueing packets. > > > > To achieve this, several things work in unison: > > > > - we take the state mutex to prevent the resolver from running while we > > fiddle with various state. > > - we disable the resolver (which, if that's the only thing that happens, > > will provoke the resolver to take the link down.) > > - we lie to the resolver about the link state, by calling > > netif_carrier_off() and/or setting old_link_state to false. This > > means the resolver believes the link is _already_ down, and thus > > because of the strict ordering I've previously mentioned, will *not* > > call mac_link_down(). > > - we release the lock. > > > > There is now no way that the resolver will call either mac_link_up() or > > mac_link_down() - which is what we want here. > > The part that I think I missed here was that if we set > phylink_disable_state we then set link_state.link to false in > phylink_resolve. I wonder if we couldn't just have a flag that sets > cur_link_state to false in the "if(pl->phylink_disable_state)" case > and simplify this to indicate we won't force the link down" Then how does phylink_stop() end up calling .mac_link_down() ? > So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are: > > 1. In fbnic_mac_link_down_asic we assume that if we are being told > that the link is down by phylink that it really means the link is down > and we need to shut off the MAC and flush any packets that are in the > Tx FIFO. The issue isn't so much the call itself, it is the fact that > we are being called in phylink_resume to rebalance the link that will > be going from UP->UP. The rebalancing is introducing an extra down > step. This could be resolved by adding an extra check to the line in > phylink_resume that is calling the mac_link_down so that it doesn't > get triggered and stall the link. In my test code I am now calling it > "pl->rolling_start". You're not addessing the issue I've already stated here. If the link was up, and we *don't* call mac_link_down(), we then *can* *not* call phylink_mac_initial_config(). It's as simple as that. The MAC must see link down before being configured. So, if the link was up, and we don't call mac_link_down() then we must also *not* call phylink_mac_initial_config(). I've no idea what will break with that change. > 2. In phylink_start/phylink_resume since we are coming at this from a > "rolling start" due to the BMC we have the MAC up and the > netif_carrier in the "off" state. As a result if we end up losing the > link between mac_prepare and pcs_get_state the MAC gets into a bad > state where netif_carrier is off, the MAC is stuck in the "up" state, > and we have stale packets clogging up the Tx FIFO. > > As far as the BMC it isn't so much wanting to hit the big red button > as our platform team. Basically losing of packets is very problematic > for them, think about ssh sessions that suddenly lock up during boot, > and they can demonstrate that none of the other vendors that have the > MAC/PCS/PHY buried in their firmware have this issue. I was really > hoping to avoid going that route as the whole point of this was to > keep the code open source and visible. The problem I have is not the idea, but the implementation. You want to change the phylink behaviour in a way that affects *all* existing users. I don't want that because of the guarantees of that contract I've previously stated that I've given to existing users. As things currently stand, you have a currently unique new case, and you're trying to force your solution on all users potentially breaking them. I have no real issue with supporting the new case, but *not* at the expense of regressing existing phylink users. Yet, when I point out this, you seem to be dead against *not* affecting other users. This is where the problem is, and where we fundamentally disagree. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 17:27 ` Russell King (Oracle) @ 2025-04-17 19:49 ` Alexander Duyck 2025-04-22 9:51 ` Russell King (Oracle) 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-17 19:49 UTC (permalink / raw) To: Russell King (Oracle) Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 10:27 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Thu, Apr 17, 2025 at 10:06:45AM -0700, Alexander Duyck wrote: > > On Thu, Apr 17, 2025 at 8:23 AM Russell King (Oracle) > > <linux@armlinux.org.uk> wrote: > > > > > > On Thu, Apr 17, 2025 at 03:35:56PM +0100, Russell King (Oracle) wrote: [...] > > > So, we have the case where we want to avoid notifying the MAC that the > > > link has gone down, but we also want to call netif_carrier_off() to stop > > > the network stack queueing packets. > > > > > > To achieve this, several things work in unison: > > > > > > - we take the state mutex to prevent the resolver from running while we > > > fiddle with various state. > > > - we disable the resolver (which, if that's the only thing that happens, > > > will provoke the resolver to take the link down.) > > > - we lie to the resolver about the link state, by calling > > > netif_carrier_off() and/or setting old_link_state to false. This > > > means the resolver believes the link is _already_ down, and thus > > > because of the strict ordering I've previously mentioned, will *not* > > > call mac_link_down(). > > > - we release the lock. > > > > > > There is now no way that the resolver will call either mac_link_up() or > > > mac_link_down() - which is what we want here. > > > > The part that I think I missed here was that if we set > > phylink_disable_state we then set link_state.link to false in > > phylink_resolve. I wonder if we couldn't just have a flag that sets > > cur_link_state to false in the "if(pl->phylink_disable_state)" case > > and simplify this to indicate we won't force the link down" > > Then how does phylink_stop() end up calling .mac_link_down() ? What I was saying is that we could add an additional state flag that we set before you write to the phylink_disable_state. You would essentially set the state to true if you want to preserve the current state, and if it is true you would set cur_link_statte to false in phylink_resolve ignoring the actual current link state. So in phylink_stop you would set it to false, and in phylink_suspend you would set it to true. With that change phylink_stop could force the link down, whereas phylink_suspend would keep it up, phylink_suspend could deal with netif_carrier_off, and phylink_resume could deal with old_link_state. Anyway it wasn't as if I was saying we need to make that change. I have a bad habit of thinking out loud and it tends to carry over into emails.. > > So fbnic_mac_link_up_asic doesn't trigger any issues. The issues are: > > > > 1. In fbnic_mac_link_down_asic we assume that if we are being told > > that the link is down by phylink that it really means the link is down > > and we need to shut off the MAC and flush any packets that are in the > > Tx FIFO. The issue isn't so much the call itself, it is the fact that > > we are being called in phylink_resume to rebalance the link that will > > be going from UP->UP. The rebalancing is introducing an extra down > > step. This could be resolved by adding an extra check to the line in > > phylink_resume that is calling the mac_link_down so that it doesn't > > get triggered and stall the link. In my test code I am now calling it > > "pl->rolling_start". > > You're not addessing the issue I've already stated here. > > If the link was up, and we *don't* call mac_link_down(), we then *can* > *not* call phylink_mac_initial_config(). It's as simple as that. The > MAC must see link down before being configured. > > So, if the link was up, and we don't call mac_link_down() then we must > also *not* call phylink_mac_initial_config(). I've no idea what will > break with that change. Sorry, mentioning it didn't occur to me as I have been dealing with the "rolling start" since the beginning. In mac_prepare I deal with this. Specifically the code in mac_prepare will check to see if the link state is currently up with the desired configuration already or not. If it is, it sets a flag that will keep us from doing any changes that would be destructive to the link. If the link is down it basically clears the way for a full reinitialization. > > 2. In phylink_start/phylink_resume since we are coming at this from a > > "rolling start" due to the BMC we have the MAC up and the > > netif_carrier in the "off" state. As a result if we end up losing the > > link between mac_prepare and pcs_get_state the MAC gets into a bad > > state where netif_carrier is off, the MAC is stuck in the "up" state, > > and we have stale packets clogging up the Tx FIFO. > > > > As far as the BMC it isn't so much wanting to hit the big red button > > as our platform team. Basically losing of packets is very problematic > > for them, think about ssh sessions that suddenly lock up during boot, > > and they can demonstrate that none of the other vendors that have the > > MAC/PCS/PHY buried in their firmware have this issue. I was really > > hoping to avoid going that route as the whole point of this was to > > keep the code open source and visible. > > The problem I have is not the idea, but the implementation. You want > to change the phylink behaviour in a way that affects *all* existing > users. I don't want that because of the guarantees of that contract > I've previously stated that I've given to existing users. That is why I mentioned "renegotiating the terms" in my other email. I would be okay with adding a phylink_config variable for the "rolling_start". Essentially with "rolling_start" we would have: 1. phylink_resume would not call phylink_link_down to rebalance the state 2. phylink_resolve would have to call mac_link_up or mac_link_down following the first call after phylink_start/resume. 3. mac_prepare would have to take on the responsibility of doing mac_link_down prior to any destructive configuration change. 4. Calls to mac_link_up/mac_link_down would be idempotent essentially meaning that calling up on a link that is already up has no effect (other than maybe updating autoneg settings), and calling down on a down link likewise would have no effect. > As things currently stand, you have a currently unique new case, and > you're trying to force your solution on all users potentially breaking > them. I have no real issue with supporting the new case, but *not* at > the expense of regressing existing phylink users. > > Yet, when I point out this, you seem to be dead against *not* affecting > other users. This is where the problem is, and where we fundamentally > disagree. I am not sure where that impression came from. I realized after you mentioned it that the change was too broad and I had mentioned limiting it with a config option on the other thread. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-17 19:49 ` Alexander Duyck @ 2025-04-22 9:51 ` Russell King (Oracle) 2025-04-22 15:30 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-22 9:51 UTC (permalink / raw) To: Alexander Duyck Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Thu, Apr 17, 2025 at 12:49:07PM -0700, Alexander Duyck wrote: > What I was saying is that we could add an additional state flag that > we set before you write to the phylink_disable_state. You would > essentially set the state to true if you want to preserve the current > state, and if it is true you would set cur_link_statte to false in > phylink_resolve ignoring the actual current link state. > > So in phylink_stop you would set it to false, and in phylink_suspend > you would set it to true. With that change phylink_stop could force > the link down, whereas phylink_suspend would keep it up, > phylink_suspend could deal with netif_carrier_off, and phylink_resume > could deal with old_link_state. I really don't like the idea that the netif carrier state differs from old_link_state. These need to be the same to ensure that drivers which use phylink in PHYLINK_NETDEV mode (which uses netif carrier for link state tracking) vs PHYLINK_DEV mode (which uses old_link_state) see the same behaviour, becomes much harder to guarantee if we start treating these differently in the code depending on something other than which PHYLINK*DEV is in use. It's really not something I want to entertain. > > So, if the link was up, and we don't call mac_link_down() then we must > > also *not* call phylink_mac_initial_config(). I've no idea what will > > break with that change. > > Sorry, mentioning it didn't occur to me as I have been dealing with > the "rolling start" since the beginning. In mac_prepare I deal with > this. Specifically the code in mac_prepare will check to see if the > link state is currently up with the desired configuration already or > not. If it is, it sets a flag that will keep us from doing any changes > that would be destructive to the link. If the link is down it > basically clears the way for a full reinitialization. I would much rather avoid any of the "setup" calls (that means mac_prepare(), mac_config(), mac_finish(), pcs_config() etc) and mac_link_up() if we're going to add support for "rolling start" to phylink. That basically means that the MAC needs to be fully configured to process packets before phylink_start() or phylink_resume() is called. This, however, makes me wonder why you'd even want to use phylink in this situation, as phylink will be doing virtually nothing for fbnic. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-22 9:51 ` Russell King (Oracle) @ 2025-04-22 15:30 ` Alexander Duyck 2025-04-22 16:00 ` Andrew Lunn 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-22 15:30 UTC (permalink / raw) To: Russell King (Oracle) Cc: Andrew Lunn, Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Tue, Apr 22, 2025 at 2:51 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Thu, Apr 17, 2025 at 12:49:07PM -0700, Alexander Duyck wrote: ... > > Sorry, mentioning it didn't occur to me as I have been dealing with > > the "rolling start" since the beginning. In mac_prepare I deal with > > this. Specifically the code in mac_prepare will check to see if the > > link state is currently up with the desired configuration already or > > not. If it is, it sets a flag that will keep us from doing any changes > > that would be destructive to the link. If the link is down it > > basically clears the way for a full reinitialization. > > I would much rather avoid any of the "setup" calls (that means > mac_prepare(), mac_config(), mac_finish(), pcs_config() etc) and > mac_link_up() if we're going to add support for "rolling start" to > phylink. It would be fine by me, at least in the case where the link is already up and in the correct mode. For the other cases we would still want to do this in case there is some incomplete setup or something like that. We would essentially just pull the block out of mac_prepare and place it wherever is needed to get things up and running. > That basically means that the MAC needs to be fully configured to > process packets before phylink_start() or phylink_resume() is called. > > This, however, makes me wonder why you'd even want to use phylink in > this situation, as phylink will be doing virtually nothing for fbnic. It wasn't that we had wanted to initially, it was more that we were told that we must use it during the early driver review. I couldn't really argue against it as our setup fits the model since we have a MAC/PCS/FEC/PMA/PMD/AN and such. A good example is that I am pretty certain we will end up using the XPCS driver eventually, however to do so we must update it as it supports more than XLGMII which is why somebody added those speeds without adding the correct interfaces. There are 25G, 50G, and 100G modes that are likely supported if I am not mistaken. Can't blame them though as that is essentially the state we are in right now for fbnic. However I was holding off on enabling the PCS until we can get the MAC pieces sorted out. For us to fit we are going to have to expand things quite a bit as we need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and multi-host type behavior at a minimum. I had more-or-less assumed there was a desire to push the interface support up to 100G or more and that was one motivation for pushing us into phylink. By pushing us in it was a way to get there with us as the lead test/development vehicle since we are one of the first high speed NICs to actually expose most of the hardware and not hide it behind firmware. That said, I have come to see some of the advantages for us as well. Things like the QSFP support seems like it should be a light lift as I just have to add support for basic SFF-8636, which isn't too dissimilar to SFF-8472, and the rest seems to mostly fall in place with the device picking up the interface mode from the QSFP module as there isn't much needed for a DA cable. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-22 15:30 ` Alexander Duyck @ 2025-04-22 16:00 ` Andrew Lunn 2025-04-22 20:09 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-22 16:00 UTC (permalink / raw) To: Alexander Duyck Cc: Russell King (Oracle), Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni > For us to fit we are going to have to expand things quite a bit as we > need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and > multi-host type behavior at a minimum. I had more-or-less assumed > there was a desire to push the interface support up to 100G or more > and that was one motivation for pushing us into phylink. By pushing us > in it was a way to get there with us as the lead test/development > vehicle since we are one of the first high speed NICs to actually > expose most of the hardware and not hide it behind firmware. > > That said, I have come to see some of the advantages for us as well. > Things like the QSFP support seems like it should be a light lift as I > just have to add support for basic SFF-8636, which isn't too > dissimilar to SFF-8472, and the rest seems to mostly fall in place > with the device picking up the interface mode from the QSFP module as > there isn't much needed for a DA cable. You should also get hwmon for the SFP for free. ethtool --dump-module-eeprom will need a little work in sfp.c, but less work than a whole MAC driver implementation. With that in place firmware upgrade of the SFP should be easy. And we have a good quirk infrastructure in place for dealing with SFPs, which all seem broken in some way. No need to reinvent that. Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-22 16:00 ` Andrew Lunn @ 2025-04-22 20:09 ` Alexander Duyck 0 siblings, 0 replies; 44+ messages in thread From: Alexander Duyck @ 2025-04-22 20:09 UTC (permalink / raw) To: Andrew Lunn Cc: Russell King (Oracle), Heiner Kallweit, David S. Miller, Eric Dumazet, Jakub Kicinski, Joakim Zhang, netdev, Paolo Abeni On Tue, Apr 22, 2025 at 9:00 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > For us to fit we are going to have to expand things quite a bit as we > > need to add support for higher speeds, QSFP, QSFP-DD, FEC, BMC, and > > multi-host type behavior at a minimum. I had more-or-less assumed > > there was a desire to push the interface support up to 100G or more > > and that was one motivation for pushing us into phylink. By pushing us > > in it was a way to get there with us as the lead test/development > > vehicle since we are one of the first high speed NICs to actually > > expose most of the hardware and not hide it behind firmware. > > > > That said, I have come to see some of the advantages for us as well. > > Things like the QSFP support seems like it should be a light lift as I > > just have to add support for basic SFF-8636, which isn't too > > dissimilar to SFF-8472, and the rest seems to mostly fall in place > > with the device picking up the interface mode from the QSFP module as > > there isn't much needed for a DA cable. > > You should also get hwmon for the SFP for free. ethtool > --dump-module-eeprom will need a little work in sfp.c, but less work > than a whole MAC driver implementation. With that in place firmware > upgrade of the SFP should be easy. And we have a good quirk > infrastructure in place for dealing with SFPs, which all seem broken > in some way. No need to reinvent that. Actually the hwmon will need some work as I recall. The issue is that it is reliant on the a2 page support and we don't have that in QSFP since it is based on SFF-8636. As far as the QSFP modules we are using we also don't have to deal with FW upgrades and such since we are just dealing with direct attach modules, and the quirks for now don't do much for us either for much the same reason. However there is always the potential for us to use something other than basic direct-attach cables in the future. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) 2025-04-16 17:14 ` Russell King (Oracle) 2025-04-17 14:30 ` Alexander H Duyck @ 2025-04-23 0:00 ` patchwork-bot+netdevbpf 2 siblings, 0 replies; 44+ messages in thread From: patchwork-bot+netdevbpf @ 2025-04-23 0:00 UTC (permalink / raw) To: Russell King Cc: andrew, hkallweit1, alexander.duyck, davem, edumazet, kuba, qiangqing.zhang, netdev, pabeni Hello: This patch was applied to netdev/net.git (main) by Jakub Kicinski <kuba@kernel.org>: On Wed, 16 Apr 2025 17:16:01 +0100 you wrote: > When WoL is enabled, we update the software state in phylink to > indicate that the link is down, and disable the resolver from > bringing the link back up. > > On resume, we attempt to bring the overall state into consistency > by calling the .mac_link_down() method, but this is wrong if the > link was already down, as phylink strictly orders the .mac_link_up() > and .mac_link_down() methods - and this would break that ordering. > > [...] Here is the summary with links: - [net] net: phylink: fix suspend/resume with WoL enabled and link down https://git.kernel.org/netdev/net/c/4c8925cb9db1 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 16:01 ` Russell King (Oracle) 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) @ 2025-04-16 17:12 ` Russell King (Oracle) 2025-04-16 19:03 ` Alexander Duyck 2 siblings, 0 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 17:12 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 05:01:09PM +0100, Russell King (Oracle) wrote: > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > This change is meant to address the fact that there are link imbalances > > introduced when using phylink on a system with a BMC. Specifically there > > are two issues. > > > > The first issue is that if we lose link after the first call to > > phylink_start but before it gets to the phylink_resolve we will end up with > > the phylink interface assuming the link was always down and not calling > > phylink_link_down resulting in a stuck interface. > > That is intentional. > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > an initial position that the link _will_ be considered to be down. So, > it is intentional that .mac_link_down will _never_ be called after > phylink_start(). > > > The second issue is that when a BMC is present we are currently forcing the > > link down. This results in us bouncing the link for a fraction of a second > > and that will result in dropped packets for the BMC. > > ... but you don't explain how that happens. > > > The third issue is just an extra "Link Down" message that is seen when > > calling phylink_resume. This is addressed by identifying that the link > > isn't balanced and just not displaying the down message in such a case. > > Hmm, this one is an error, but is not as simple as "don't print the > message" as it results in a violation of the rule I mentioned above. > We need phylink_suspend() to record the state of the link at that > point, and avoid calling phylink_link_down() if the link was down > prior to suspend. Okay, confirmed on nvidia Jetson Xavier NX: [ 11.838132] dwc-eth-dwmac 2490000.ethernet eth0: Adding VLAN ID 0 is not supported [ 15.299757] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx LAN cable was unplugged: [ 50.436587] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down Then the system was suspended using rtcwake for 3 seconds: [ 54.736849] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found [ 54.736898] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported [ 54.741078] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down This shouldn't happen. With the patch I posted, this second "Link is Down" message is not printed, and .mac_link_down() will not be called. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 16:01 ` Russell King (Oracle) 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) 2025-04-16 17:12 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle) @ 2025-04-16 19:03 ` Alexander Duyck 2025-04-16 19:19 ` Russell King (Oracle) 2 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-16 19:03 UTC (permalink / raw) To: Russell King (Oracle); +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > This change is meant to address the fact that there are link imbalances > > introduced when using phylink on a system with a BMC. Specifically there > > are two issues. > > > > The first issue is that if we lose link after the first call to > > phylink_start but before it gets to the phylink_resolve we will end up with > > the phylink interface assuming the link was always down and not calling > > phylink_link_down resulting in a stuck interface. > > That is intentional. > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > an initial position that the link _will_ be considered to be down. So, > it is intentional that .mac_link_down will _never_ be called after > phylink_start(). Well the issue is that with a BMC present the link may be up before we even start using phylink. So if the link is lost while we are going through phylink_start we will end up in an in-between state where the link is physically down, but the MAC is still configured as though the link is up. This will be problematic as the MAC should essentially be discarding frames for transmit if the link is down to avoid blocking internal Tx FIFOs. Everything for our driver has to be a light touch as bringing down the BMC link for any reason other than the physical link being lost is essentially a criteria for failure as the BMC is the most essential link on the system. So, for example the code I have in the driver for handling a major config is currently checking in mac_prepare to verify if we already have link based on the requested interface mode and FEC settings and if we do the mac_config and pcs_config steps read an internal state flag and do nothing, then we just go through to mac_finish where we write the necessary bits to make sure the PCS/PMA and PMA/PMD connections are enabled which essentially becomes a no-op if the link is already enabled. > > The second issue is that when a BMC is present we are currently forcing the > > link down. This results in us bouncing the link for a fraction of a second > > and that will result in dropped packets for the BMC. > > ... but you don't explain how that happens. It was right there in the patch. It was the lines I removed: @@ -2596,16 +2600,6 @@ void phylink_resume(struct phylink *pl) if (test_bit(PHYLINK_DISABLE_MAC_WOL, &pl->phylink_disable_state)) { /* Wake-on-Lan enabled, MAC handling */ - /* Call mac_link_down() so we keep the overall state balanced. - * Do this under the state_mutex lock for consistency. This - * will cause a "Link Down" message to be printed during - * resume, which is harmless - the true link state will be - * printed when we run a resolve. - */ - mutex_lock(&pl->state_mutex); - phylink_link_down(pl); - mutex_unlock(&pl->state_mutex); - /* Re-apply the link parameters so that all the settings get * restored to the MAC. */ From a BMC perspective this forcing of the link down even if it is for a fraction of a second is unacceptable as it can break up ssh sessions to the BMC, especially if somebody is doing a bunch of configuration changes on the NIC as it results in dropped frames. When compared to the firmware based NIC approaches such as Broadcom and Nvidia/Mellanox it is a huge negative as the BMC link is static with those NICs and doesn't bounce no matter if the interface is being configured up/down, the driver proved/removed ect. The issue, even with your recent patch, is that it will still force the link down if the link was previously up. That is the piece I need to avoid to prevent the BMC from losing link. Ideally what I need is to have a check of the current link state and then sync back up rather than force the phylink state on the MAC and then clean things up after the fact. > > The third issue is just an extra "Link Down" message that is seen when > > calling phylink_resume. This is addressed by identifying that the link > > isn't balanced and just not displaying the down message in such a case. > > Hmm, this one is an error, but is not as simple as "don't print the > message" as it results in a violation of the rule I mentioned above. > We need phylink_suspend() to record the state of the link at that > point, and avoid calling phylink_link_down() if the link was down > prior to suspend. Looking at your fix it doesn't resolve my core issue. All it does is address the "third issue". In my case the link is up and needs to stay up when I resume. The fact that we are forcing the link down causes our BMC to drop connection which makes it difficult to remotely manage the interface as changes to the interface such as a simple ifconfig down; ifconfig up are enough to cause the BMC to drop packets and potentially the ssh session if we start losing enough of them. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 19:03 ` Alexander Duyck @ 2025-04-16 19:19 ` Russell King (Oracle) 2025-04-16 20:05 ` Russell King (Oracle) 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 19:19 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 12:03:05PM -0700, Alexander Duyck wrote: > On Wed, Apr 16, 2025 at 9:01 AM Russell King (Oracle) > <linux@armlinux.org.uk> wrote: > > > > On Wed, Apr 16, 2025 at 08:29:00AM -0700, Alexander Duyck wrote: > > > From: Alexander Duyck <alexanderduyck@fb.com> > > > > > > This change is meant to address the fact that there are link imbalances > > > introduced when using phylink on a system with a BMC. Specifically there > > > are two issues. > > > > > > The first issue is that if we lose link after the first call to > > > phylink_start but before it gets to the phylink_resolve we will end up with > > > the phylink interface assuming the link was always down and not calling > > > phylink_link_down resulting in a stuck interface. > > > > That is intentional. > > > > phylink strictly orders .mac_link_down and .mac_link_up, and starts from > > an initial position that the link _will_ be considered to be down. So, > > it is intentional that .mac_link_down will _never_ be called after > > phylink_start(). > > Well the issue is that with a BMC present the link may be up before we > even start using phylink. So if the link is lost while we are going > through phylink_start we will end up in an in-between state where the > link is physically down, but the MAC is still configured as though the > link is up. This will be problematic as the MAC should essentially be > discarding frames for transmit if the link is down to avoid blocking > internal Tx FIFOs. So, when a Linux network driver probes, it starts out in administrative state *DOWN*. When the administrator configures the network driver, .ndo_open is called, which is expected to configure the network adapter. Part of that process is to call phylink_start() as one of the last steps, which detects whether the link is up or not. If the link is up, then phylink will call netif_carrier_on() and .mac_link_up(). This tells the core networking layers that the network interface is now ready to start sending packets, and it will begin queueing packets for the network driver to process - not before. Prior to .ndo_open being called, the networking layers do not expect traffic from the network device no matter what physical state the media link is in. If .ndo_open fails, the same applies - no traffic is expected to be passed to the core network layers from the network layers because as far as the network stack is concerned, the interface is still administratively down. Thus, the fact that your BMC thinks that the link is up is irrelevant. So, start off in a state that packet activity is suspended even if the link is up at probe time. Only start packet activity (reception and transmission) once .mac_link_up() has been called. Stop that activity when .mac_link_down() is subsequently called. There have been lots of NICs out there where the physical link doesn't follow the adminstrative state of the network interface. This is not a problem. It may be desirable that it does, but a desire is not the same as "it must". > > > The second issue is that when a BMC is present we are currently forcing the > > > link down. This results in us bouncing the link for a fraction of a second > > > and that will result in dropped packets for the BMC. > > > > ... but you don't explain how that happens. > > It was right there in the patch. It was the lines I removed: ... thus further breaking phylink guarantees. Sorry, but no. > The issue, even with your recent patch, is that it will still force > the link down if the link was previously up. That is the piece I need > to avoid to prevent the BMC from losing link. Ideally what I need is > to have a check of the current link state and then sync back up rather > than force the phylink state on the MAC and then clean things up after > the fact. So don't force the link, just stop packet activity. As stated above, nothing requires that the physical link is forced down just because .mac_link_down() has been called. This makes me wonder what happens to your BMC with your ideas if userspace takes down the network interface. It sounds like all hell breaks loose because you've taken the link down, and that's seen as a critical failure... So taking down a network interface becomes a critical failure - yet it's a *normal* userspace operation. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 19:19 ` Russell King (Oracle) @ 2025-04-16 20:05 ` Russell King (Oracle) 2025-04-16 22:58 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-16 20:05 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote: > So, when a Linux network driver probes, it starts out in administrative > state *DOWN*. When the administrator configures the network driver, > .ndo_open is called, which is expected to configure the network adapter. > > Part of that process is to call phylink_start() as one of the last > steps, which detects whether the link is up or not. If the link is up, > then phylink will call netif_carrier_on() and .mac_link_up(). This > tells the core networking layers that the network interface is now > ready to start sending packets, and it will begin queueing packets for > the network driver to process - not before. > > Prior to .ndo_open being called, the networking layers do not expect > traffic from the network device no matter what physical state the > media link is in. If .ndo_open fails, the same applies - no traffic is > expected to be passed to the core network layers from the network > layers because as far as the network stack is concerned, the interface > is still administratively down. > > Thus, the fact that your BMC thinks that the link is up is irrelevant. > > So, start off in a state that packet activity is suspended even if the > link is up at probe time. Only start packet activity (reception and > transmission) once .mac_link_up() has been called. Stop that activity > when .mac_link_down() is subsequently called. > > There have been lots of NICs out there where the physical link doesn't > follow the adminstrative state of the network interface. This is not a > problem. It may be desirable that it does, but a desire is not the same > as "it must". Let me be crystal clear on this. Phylink has a contract with all existing users. That contract is: Initial state: link down. Driver calls phylink_start() in its .ndo_open method. Phylink does configuration of the PHY and link according to the chosen link parameters by calling into the MAC, PCS, and phylib as appropriate. If the link is then discovered to be up (it might have been already up before phylink_start() was called), phylink will call the various components such as PCS and MAC to inform them that the link is now up. This will mean calling the .mac_link_up() method. Otherwise (if the link is discovered to be down when the interface is brought up) no call to either .mac_link_up() nor .mac_link_down() will be made. If the link _subsequently_ goes down, then phylink deals with that and calls .mac_link_down() - only if .mac_link_up() was previously called (that's one of the bugs you discovered, that on resume it gets called anyway. I've submitted a fix for that contract breach, which only affects a very small number of drivers - stmmac, ucc_geth and your fbnic out of 22 total ethernet users plus however many DSA users we have.) Only if .mac_link_down() has been called, if the link subsequently comes back up, then the same process happens as before resulting in .mac_link_up() being called. If the interface is taken down, then .mac_link_down() will be called if and only if .mac_link_up() had been called. The ordering of .mac_link_up() / .mac_link_down() is a strict contract term with phylink users. The reason for this contract: phylink users may have ordering requirements. For example, on mac_link_down(), they may wait for packet activity to stop, and then place the MAC in reset. If called without a previous .mac_link_up call, the wait stage may time out due to the MAC being in reset. (Ocelot may suffer with this.) Another example is fs_enet which also relies on this strict ordering as described above. There could be others - there are some that call into firmware on calls to .mac_link_up() / .mac_link_down() and misordering those depends on what the firmware is doing, which we have no visibility of. As I stated, this is the contract that phylink gave to users, and the contract still stands, and can't be broken to behave differently (e.g. calling .mac_link_down() after phylink_start() without an intervening call to .mac_link_up()) otherwise existing users will break. Bugs that go against that contract will be fixed, but the contract will not be intentionally broken. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present 2025-04-16 20:05 ` Russell King (Oracle) @ 2025-04-16 22:58 ` Alexander Duyck 0 siblings, 0 replies; 44+ messages in thread From: Alexander Duyck @ 2025-04-16 22:58 UTC (permalink / raw) To: Russell King (Oracle); +Cc: netdev, andrew, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 1:05 PM Russell King (Oracle) <linux@armlinux.org.uk> wrote: > > On Wed, Apr 16, 2025 at 08:19:38PM +0100, Russell King (Oracle) wrote: > > So, when a Linux network driver probes, it starts out in administrative > > state *DOWN*. When the administrator configures the network driver, > > .ndo_open is called, which is expected to configure the network adapter. > > > > Part of that process is to call phylink_start() as one of the last > > steps, which detects whether the link is up or not. If the link is up, > > then phylink will call netif_carrier_on() and .mac_link_up(). This > > tells the core networking layers that the network interface is now > > ready to start sending packets, and it will begin queueing packets for > > the network driver to process - not before. > > > > Prior to .ndo_open being called, the networking layers do not expect > > traffic from the network device no matter what physical state the > > media link is in. If .ndo_open fails, the same applies - no traffic is > > expected to be passed to the core network layers from the network > > layers because as far as the network stack is concerned, the interface > > is still administratively down. > > > > Thus, the fact that your BMC thinks that the link is up is irrelevant. > > > > So, start off in a state that packet activity is suspended even if the > > link is up at probe time. Only start packet activity (reception and > > transmission) once .mac_link_up() has been called. Stop that activity > > when .mac_link_down() is subsequently called. > > > > There have been lots of NICs out there where the physical link doesn't > > follow the adminstrative state of the network interface. This is not a > > problem. It may be desirable that it does, but a desire is not the same > > as "it must". > > Let me be crystal clear on this. > > Phylink has a contract with all existing users. That contract is: > > Initial state: link down. > > Driver calls phylink_start() in its .ndo_open method. > > Phylink does configuration of the PHY and link according to the > chosen link parameters by calling into the MAC, PCS, and phylib as > appropriate. > > If the link is then discovered to be up (it might have been already > up before phylink_start() was called), phylink will call the various > components such as PCS and MAC to inform them that the link is now up. > This will mean calling the .mac_link_up() method. Otherwise (if the > link is discovered to be down when the interface is brought up) no > call to either .mac_link_up() nor .mac_link_down() will be made. > > If the link _subsequently_ goes down, then phylink deals with that > and calls .mac_link_down() - only if .mac_link_up() was previously > called (that's one of the bugs you discovered, that on resume it > gets called anyway. I've submitted a fix for that contract breach, > which only affects a very small number of drivers - stmmac, ucc_geth > and your fbnic out of 22 total ethernet users plus however many DSA > users we have.) > > Only if .mac_link_down() has been called, if the link subsequently > comes back up, then the same process happens as before resulting in > .mac_link_up() being called. > > If the interface is taken down, then .mac_link_down() will be called > if and only if .mac_link_up() had been called. > > The ordering of .mac_link_up() / .mac_link_down() is a strict > contract term with phylink users. > > The reason for this contract: phylink users may have ordering > requirements. > > For example, on mac_link_down(), they may wait for packet activity to > stop, and then place the MAC in reset. If called without a previous > .mac_link_up call, the wait stage may time out due to the MAC being > in reset. (Ocelot may suffer with this.) > > Another example is fs_enet which also relies on this strict ordering > as described above. > > There could be others - there are some that call into firmware on > calls to .mac_link_up() / .mac_link_down() and misordering those > depends on what the firmware is doing, which we have no visibility > of. > > As I stated, this is the contract that phylink gave to users, and > the contract still stands, and can't be broken to behave differently > (e.g. calling .mac_link_down() after phylink_start() without an > intervening call to .mac_link_up()) otherwise existing users will > break. Bugs that go against that contract will be fixed, but the > contract will not be intentionally broken. The issue is as that stands the contract is inherently broken if a BMC is present. 1. There is still the link loss during the phylink_start issue which will essentially leave the MAC up if the link fails. This will cause the Tx FIFO on the NIC to essentially be left to hang without flushing out the stale packets that may have queued up when the link dropped. This is one of the reasons why the mac_link_down call still needs to propagate all the way back down to the hardware. 2. We already have precedent for the link being up when WOL is in use. One concern I would have with your patch is if it will impact that or not. I suspected part of the mac_link_down is related to cleaning up something related to the link setup for WOL configuring the link for some speed. 3. While it may be a part of the contract, isn't there some way we can "renegotiate the terms"? The fact is in the case of our driver we are essentially doing a hand-off from FW to the OS for the link when we call ndo_open. When we are done in ndo_stop we hand ownership back over to the firmware. Those hand-offs need to be free of link bounces. This pattern looks very much like the WOL setup with the only exception being trying to avoid these link bounces. I'm just wondering if there isn't some way we can add a phylink_config value indicating that there is a BMC present on the link so we can handle those cases. The general idea would be to update my patch so instead of pl->link_balanced being just set to false in suspend and at creation time we could essentially just set it to "pl->link_balanced = !pl->config.mac_bmc_handoff". The fact is the code with the diff I provided did everything I needed. I could load/unload and ifconfig up/down the driver all day and it didn't drop a packet. As it stood the definition of the mac_config code more or less called out that we weren't supposed to change things unless we needed to and I had followed that. I suspect the overall change should be small, smaller now following your fix of the message issue, and it shouldn't impact anything other than our driver if I add the config flag checks. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck 2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck 2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck @ 2025-04-19 18:11 ` Andrew Lunn 2025-04-20 18:18 ` Alexander Duyck 2 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-19 18:11 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote: > Address two issues found in the phylink code. > > The first issue is the fact that there were unused defines that were > referencing deprecated macros themselves. Since they aren't used we might > as well drop them. > > The second issue which is more the main reason for this submission is the > fact that the BMC was losing link when we would call phylink_resume. This > is fixed by adding a new boolean value link_balanced which will allow us > to avoid doing an immediate force of the link up/down and instead defer it > until after we have checked the actual link state. I'm wondering if we have jumped straight into the weeds without having a good overall big picture of what we are trying to achieve. But maybe it is just me, and this is just for my edification... As i've said a few times we don't have a good story around networking and BMCs. Traditionally, all the details have been hidden away in the NIC firmware, and linux is pretty much unaware it is going on, at least from the Host side. fbnic is changing that, and we need Linux/phylink to understand this. Since this is all pretty new to me, i went and quickly read: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf Hopefully i now have a better big picture. Figure 2 answers a few questions for me. One was, do we actually have a three port switch in here? And i would say no. We have something similar, but not a switch. There is no back to back MAC on the host PCI interface. We do have back to back MAC on the NC-SI port, but it appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is controlling BMC NC-SI MAC. Not having a switch means when we are talking about the MAC, PCS, PHY etc, we are talking about the media side MAC, PCS, PHY. Given that phylink is just as often used with switches with a conduit interface and switch ports, that is an important point. Figure 2 also hints at there being three different life cycles all interacting with each other. Our normal model in phylink is that the Network Controller is passive, it is told what to do by Linux/phylink. However, in this setup, that is not true. The Network Controller is active, it has firmware running on it. The Network Controller and the Management Controller life cycle probably starts at about the same time, when the PSU starts generating standby power. The host life cycle starts later, when the BMC decides to power up the host. The NC-SI protocol defines messages between the Management Controller and the Network Controller. One of these messages is how to configure the media side. See section 8.4.21. It lists different networks speeds which can be negotiated, duplex, and pause, and if to use auto-neg. There is not enough details to fully specific link modes above 1000BaseT, all you can request for example is 40G, but you cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a way to ask the network controller what it actually supports. So i guess you normally just ask for everything up to 100G, and you are happy when Get Link Status response command says the link it 10BaseT Half. The Network Controller needs to be smart enough to get the link up and running. So it basically has a phylink implementation, with a PCS driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc. Some text from the document, which is pretty relevant to the discussion. The Set Link command may be used by the Management Controller to configure the external network interface associated with the channel by using the provided settings. Upon receiving this command, while the host NC driver is not operational, the channel shall attempt to set the link to the configuration specified by the parameters. Upon successful completion of this command, link settings specified in the command should be used by the network controller as long as the host NC driver does not overwrite the link settings. In the absence of an operational host NC driver, the NC should attempt to make the requested link state change even if it requires the NC to drop the current link. The channel shall send a response packet to the Management Controller within the required response time. However, the requested link state changes may take an unspecified amount of time to complete. The actual link settings are controlled by the host NC driver when it is operational. When the host NC driver is operational, link settings specified by the MC using the Set Link command may be overwritten by the host NC driver. The link settings are not restored by the NC if the host NC driver becomes non operational. There is a very clear indication that the host is in control, or the host is not in control. So one obvious question to me is, should phylink have ops into the MAC driver to say it is taking over control, and relinquishing control? The Linux model is that when the interface is admin down, you can use ethtool to preconfigure things, but they don't take affect until the link is admin up. So with admin down, we have a host NC driver, but it is not operational, hence the Network Controller is in control of the link at the Management Controllers bequest. It is only with admin up that phylink takes control of the Network Controller, and it releases it with admin down. Having these ops would also help with suspend/resume. Suspend does not change the admin up/down status, but the host clearly needs to hand over control of the media to the Network Controller, and take it back again on resume. Also, if we have these ops, we know that admin down/suspend does not mean media down. The presence of these ops triggers different state transitions in the phylink state machine so that it simply hands off control of the media, but otherwise leaves it alone. With this in place, i think we can avoid all the unbalanced state? What is potentially more interesting is when phylink takes control. Do we have enough information about the system to say its current configuration is the wanted configuration? Or are we forced to do a ground up reconfiguration, which will include a media down/up? I had a quick scan of the document and i did not find anything which says the host is not allowed/is allowed to do disruptive things, but the text quoted above says 'The actual link settings are controlled by the host NC driver when it is operational'. Controlling the link settings is a disruptive operation, so the management controller needs to be tolerant to such changes. So, can we ignore the weeds for the moment, and think about the big picture? Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn @ 2025-04-20 18:18 ` Alexander Duyck 2025-04-20 21:58 ` Andrew Lunn 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-20 18:18 UTC (permalink / raw) To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni On Sat, Apr 19, 2025 at 11:11 AM Andrew Lunn <andrew@lunn.ch> wrote: > > On Wed, Apr 16, 2025 at 08:28:46AM -0700, Alexander Duyck wrote: > > Address two issues found in the phylink code. > > > > The first issue is the fact that there were unused defines that were > > referencing deprecated macros themselves. Since they aren't used we might > > as well drop them. > > > > The second issue which is more the main reason for this submission is the > > fact that the BMC was losing link when we would call phylink_resume. This > > is fixed by adding a new boolean value link_balanced which will allow us > > to avoid doing an immediate force of the link up/down and instead defer it > > until after we have checked the actual link state. > > I'm wondering if we have jumped straight into the weeds without having > a good overall big picture of what we are trying to achieve. But maybe > it is just me, and this is just for my edification... > > As i've said a few times we don't have a good story around networking > and BMCs. Traditionally, all the details have been hidden away in the > NIC firmware, and linux is pretty much unaware it is going on, at > least from the Host side. fbnic is changing that, and we need > Linux/phylink to understand this. > > Since this is all pretty new to me, i went and quickly read: > > https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.1.0.pdf > > Hopefully i now have a better big picture. > > Figure 2 answers a few questions for me. One was, do we actually have > a three port switch in here? And i would say no. We have something > similar, but not a switch. There is no back to back MAC on the host > PCI interface. We do have back to back MAC on the NC-SI port, but it > appears Linux has no knowledge of the NIC NC-SI MAC, and the BMC is > controlling BMC NC-SI MAC. > > Not having a switch means when we are talking about the MAC, PCS, PHY > etc, we are talking about the media side MAC, PCS, PHY. Given that > phylink is just as often used with switches with a conduit interface > and switch ports, that is an important point. > > Figure 2 also hints at there being three different life cycles all > interacting with each other. Our normal model in phylink is that the > Network Controller is passive, it is told what to do by > Linux/phylink. However, in this setup, that is not true. The Network > Controller is active, it has firmware running on it. The Network > Controller and the Management Controller life cycle probably starts at > about the same time, when the PSU starts generating standby power. The > host life cycle starts later, when the BMC decides to power up the > host. > > The NC-SI protocol defines messages between the Management Controller > and the Network Controller. One of these messages is how to configure > the media side. See section 8.4.21. It lists different networks speeds > which can be negotiated, duplex, and pause, and if to use > auto-neg. There is not enough details to fully specific link modes > above 1000BaseT, all you can request for example is 40G, but you > cannot say CR4, KR4, SR4, or LR4. There also does not appear to be a > way to ask the network controller what it actually supports. So i > guess you normally just ask for everything up to 100G, and you are > happy when Get Link Status response command says the link it 10BaseT > Half. In our case this is one of the reasons why the firmware knows what the interface and media type are before we do. It basically has it coded into the EEPROM so that it will identify the type as CR and normally it informs the BMC and the host of the speed/FEC of the port as well. So the BMC could configure this, but it is normally getting this info from the NIC as it is setup to bring the link up before the BMC even gets a chance to ask to setup the config. > The Network Controller needs to be smart enough to get the link up and > running. So it basically has a phylink implementation, with a PCS > driver, 0 or more PHY drivers, SFP cage driver, SFP driver etc. > > Some text from the document, which is pretty relevant to the > discussion. > > The Set Link command may be used by the Management Controller to > configure the external network interface associated with the channel > by using the provided settings. Upon receiving this command, while > the host NC driver is not operational, the channel shall attempt to > set the link to the configuration specified by the parameters. Upon > successful completion of this command, link settings specified in > the command should be used by the network controller as long as the > host NC driver does not overwrite the link settings. > > In the absence of an operational host NC driver, the NC should > attempt to make the requested link state change even if it requires > the NC to drop the current link. The channel shall send a response > packet to the Management Controller within the required response > time. However, the requested link state changes may take an > unspecified amount of time to complete. > > The actual link settings are controlled by the host NC driver when > it is operational. When the host NC driver is operational, link > settings specified by the MC using the Set Link command may be > overwritten by the host NC driver. The link settings are not > restored by the NC if the host NC driver becomes non > operational. > > There is a very clear indication that the host is in control, or the > host is not in control. So one obvious question to me is, should > phylink have ops into the MAC driver to say it is taking over control, > and relinquishing control? The Linux model is that when the interface > is admin down, you can use ethtool to preconfigure things, but they > don't take affect until the link is admin up. So with admin down, we > have a host NC driver, but it is not operational, hence the Network > Controller is in control of the link at the Management Controllers > bequest. It is only with admin up that phylink takes control of the > Network Controller, and it releases it with admin down. Having these > ops would also help with suspend/resume. Suspend does not change the > admin up/down status, but the host clearly needs to hand over control > of the media to the Network Controller, and take it back again on > resume. Yes, this more-or-less describes the current setup in fbnic. The only piece that is probably missing would be the heartbeat we maintain so that the NIC doesn't revoke access due to the OS/driver potentially being hung. The other thing involved in this that you didn't mention is that the MC is also managing the Rx filter configuration. So when we take ownership it is both the Rx Filters and MAC/PCS/PHY that we are taking ownership of. The current pattern in fbnic is for us to do most of this on the tail end of __fbnic_open and unwind it near the start of fbnic_stop. Essentially the pattern is xmit_ownership, init heartbeat, init PTP, start phylink, configure Rx filters. In the case of close it is the reverse with us tearing down the filters, disabling phylink, disabling PTP, and then releasing ownership. > Also, if we have these ops, we know that admin down/suspend does not > mean media down. The presence of these ops triggers different state > transitions in the phylink state machine so that it simply hands off > control of the media, but otherwise leaves it alone. > > With this in place, i think we can avoid all the unbalanced state? As I understand it right now the main issue is that Phylink assumes that it has to take the link down in order to perform a major configuration in phylink_start/phylink_resume. However in the case of fbnic what I do a check in mac_prepare to see if I even really need to take the link down or not to support the requested mode. If not then I skip the destructive settings update and tweak a few minor things that can be updated without taking the link down. So I think the change I am making is that mac_prepare before assumed it was just disabling things from the SerDes down, and I think I pulled that up to the MAC layer which may be a departure from previous expectations. > What is potentially more interesting is when phylink takes control. Do > we have enough information about the system to say its current > configuration is the wanted configuration? Or are we forced to do a > ground up reconfiguration, which will include a media down/up? I had a > quick scan of the document and i did not find anything which says the > host is not allowed/is allowed to do disruptive things, but the text > quoted above says 'The actual link settings are controlled by the host > NC driver when it is operational'. Controlling the link settings is a > disruptive operation, so the management controller needs to be > tolerant to such changes. In the case of fbnic we have a test we can do to verify we are linked at the correct fec/modulation/lane width via a check of signal outputs from the PCS. Basically we have one master bit that says link is present, but then we can inspect the individual values on a per FEC/lane/modulation basis. So we can do a link check for a specific interface type and verify if the link is up and configured for the settings we want. Historically speaking, drivers in the past did disruptive things while the BMC was up. That is one of the things that caught me off-guard with Meta's requirements. I had been used to dealing with 1G BMC on most enterprise hardware and in the case of those we end up bouncing the link state when the link is configured. I had gotten quite used to seeing that on igb some time ago as I have had a 1G system with igb and a BMC for years and got used to the SoL console stalling as the igb reconfigured the link. The requirement that the BMC not lose link comes more out of the multi-host setups that have been in place in the data center environment for the last decade or so where there was only one link but multiple systems all sharing that link, including the BMC. So it is not strictly a BMC requirement, but more of a multi-host requirement. > So, can we ignore the weeds for the moment, and think about the big > picture? So big picture wise we really have 2 issues: 1. The BMC handling doesn't currently exist, so we need to extend handling/hand-off for link up before we start, and link up after we stop. 2. Expectations for our 25G+ interfaces to behave like multi-host NICs that are sharing a link via firmware. Specifically that loading/unloading the driver or ifconfig up/down on the host interface should not cause the link to bounce and/or drop packets for any other connections, which in this case includes the BMC. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-20 18:18 ` Alexander Duyck @ 2025-04-20 21:58 ` Andrew Lunn 2025-04-21 15:51 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-20 21:58 UTC (permalink / raw) To: Alexander Duyck; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni > > The actual link settings are controlled by the host NC driver when > > it is operational. When the host NC driver is operational, link > > settings specified by the MC using the Set Link command may be > > overwritten by the host NC driver. The link settings are not > > restored by the NC if the host NC driver becomes non > > operational. > > > > There is a very clear indication that the host is in control, or the > > host is not in control. So one obvious question to me is, should > > phylink have ops into the MAC driver to say it is taking over control, > > and relinquishing control? The Linux model is that when the interface > > is admin down, you can use ethtool to preconfigure things, but they > > don't take affect until the link is admin up. So with admin down, we > > have a host NC driver, but it is not operational, hence the Network > > Controller is in control of the link at the Management Controllers > > bequest. It is only with admin up that phylink takes control of the > > Network Controller, and it releases it with admin down. Having these > > ops would also help with suspend/resume. Suspend does not change the > > admin up/down status, but the host clearly needs to hand over control > > of the media to the Network Controller, and take it back again on > > resume. > > Yes, this more-or-less describes the current setup in fbnic. The only > piece that is probably missing would be the heartbeat we maintain so > that the NIC doesn't revoke access due to the OS/driver potentially > being hung. That probably goes against the last sentence i quoted above. I do however understand why you would want it. Can the host driver know if the Network Controller has taken back control? Or does the heartbeat also act as a watchdog, the host does not need to care, it is about to experience a BMC induced reboot? > The other thing involved in this that you didn't mention > is that the MC is also managing the Rx filter configuration. So when > we take ownership it is both the Rx Filters and MAC/PCS/PHY that we > are taking ownership of. That does not seem consistent with the standard. The Set Link command i quoted above makes it clear that when the host driver is active, it is in control of the media. However the Set VLAN Filter command, Enable VLAN command, Set MAC Address command, Enable Broadcast Filter command, say nothing about differences when the Host driver is operational or not. It just seems to assume the Management Controller and the host share the resources, and try not to stomp over each other. Does fbnic not follow the standard in this respect? However, from a phylink perspective, i don't think this matters, phylink is not involved with any of this. > The current pattern in fbnic is for us to do most of this on the tail > end of __fbnic_open and unwind it near the start of fbnic_stop. > Essentially the pattern is xmit_ownership, init heartbeat, init PTP, > start phylink, configure Rx filters. In the case of close it is the > reverse with us tearing down the filters, disabling phylink, disabling > PTP, and then releasing ownership. > > > Also, if we have these ops, we know that admin down/suspend does not > > mean media down. The presence of these ops triggers different state > > transitions in the phylink state machine so that it simply hands off > > control of the media, but otherwise leaves it alone. > > > > With this in place, i think we can avoid all the unbalanced state? > > As I understand it right now the main issue is that Phylink assumes > that it has to take the link down in order to perform a major > configuration in phylink_start/phylink_resume. Well, as i said, my reading of the standard is that the host can make disruptive media changes, so you have to be able to live with disruptive media changes. If you have to live with it, the path of lease resistance is just to accept it. > The requirement that the BMC not lose link comes more out of the > multi-host setups that have been in place in the data center > environment for the last decade or so where there was only one link > but multiple systems all sharing that link, including the BMC. So it > is not strictly a BMC requirement, but more of a multi-host > requirement. Is this actually standardised somewhere? I see there is a draft of an update to NC-SI Specification, but i don't think the section about controlling the link has changed. Also, the standard talks about how you connect one Management Controller to multiple Network Controllers. There is nothing about multiple Management Controllers connected to one Network Controller. Or i'm i missing something, like one Management Controller is controlling all the host connected to the Network Controller? > > So, can we ignore the weeds for the moment, and think about the big > > picture? > > So big picture wise we really have 2 issues: > 1. The BMC handling doesn't currently exist, so we need to extend > handling/hand-off for link up before we start, and link up after we > stop. Agreed, and that fits with DSP0222. > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs > that are sharing a link via firmware. Specifically that > loading/unloading the driver or ifconfig up/down on the host interface > should not cause the link to bounce and/or drop packets for any other > connections, which in this case includes the BMC. For this, it would be nice to point to some standard which describes this, so we have a generic, vendor agnostic, description of how this is supposed to work. Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-20 21:58 ` Andrew Lunn @ 2025-04-21 15:51 ` Alexander Duyck 2025-04-21 16:50 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-21 15:51 UTC (permalink / raw) To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > The actual link settings are controlled by the host NC driver when > > > it is operational. When the host NC driver is operational, link > > > settings specified by the MC using the Set Link command may be > > > overwritten by the host NC driver. The link settings are not > > > restored by the NC if the host NC driver becomes non > > > operational. > > > > > > There is a very clear indication that the host is in control, or the > > > host is not in control. So one obvious question to me is, should > > > phylink have ops into the MAC driver to say it is taking over control, > > > and relinquishing control? The Linux model is that when the interface > > > is admin down, you can use ethtool to preconfigure things, but they > > > don't take affect until the link is admin up. So with admin down, we > > > have a host NC driver, but it is not operational, hence the Network > > > Controller is in control of the link at the Management Controllers > > > bequest. It is only with admin up that phylink takes control of the > > > Network Controller, and it releases it with admin down. Having these > > > ops would also help with suspend/resume. Suspend does not change the > > > admin up/down status, but the host clearly needs to hand over control > > > of the media to the Network Controller, and take it back again on > > > resume. > > > > Yes, this more-or-less describes the current setup in fbnic. The only > > piece that is probably missing would be the heartbeat we maintain so > > that the NIC doesn't revoke access due to the OS/driver potentially > > being hung. > > That probably goes against the last sentence i quoted above. I do > however understand why you would want it. Can the host driver know if > the Network Controller has taken back control? Or does the heartbeat > also act as a watchdog, the host does not need to care, it is about to > experience a BMC induced reboot? The heartbeat acts as a two way watchdog of sorts. Basically if the firmware doesn't receive one every 2 minutes (was 10 seconds, but there are BIOS bugs that cause things to literally lock up for over 90 seconds in early UEFI boot that we just had to work around), it will evict the host and reject all further heartbeat requests. Likewise if the FW receives a heartbeat without the host first claiming ownership it will send back a rejection. If the host sends a heartbeat and the FW rejects it or we don't receive a response, then we know we have somehow lost ownership and we in turn will back off and reinitialize ourselves. Currently the two cases we have seen this get triggered is either due to a OS/BIOS CPU lockup and/or hang, or due to the FW crashing and rebooting. > > The other thing involved in this that you didn't mention > > is that the MC is also managing the Rx filter configuration. So when > > we take ownership it is both the Rx Filters and MAC/PCS/PHY that we > > are taking ownership of. > > That does not seem consistent with the standard. The Set Link command > i quoted above makes it clear that when the host driver is active, it > is in control of the media. However the Set VLAN Filter command, > Enable VLAN command, Set MAC Address command, Enable Broadcast Filter > command, say nothing about differences when the Host driver is > operational or not. It just seems to assume the Management Controller > and the host share the resources, and try not to stomp over each > other. Does fbnic not follow the standard in this respect? However, > from a phylink perspective, i don't think this matters, phylink is not > involved with any of this. Yes, there are going to be deviations from the standard and for me it is kind of a blurred line as I haven't done the 1G stuff in a while so I keep defaulting to multi-host NIC thinking. The host driver is managing the Rx filter config and the MAC/PCS directly and indirectly ordering the FW to configure the PHY, at least until we can implement the code to allow direct access in the 2 host mode. The FW effectively becomes a relay between the host and the external entities such as the SFP, BMC, GPIO, etc. On 1G there were checks that the host driver had to perform to make sure not to touch the link but the Rx filter config was more inline in front of the host filtering rather than incorporated as a part of it as it is in our setup. What we did in fbnic was to reserve the first 4 MAC TCAM addresses for the BMC. One complication we ran into though is that by default the BMC will run in multicast promiscuous mode. So we had to add logic to allow the host to reorder things so that the multicast promisc filter is at the end of the multicast address list routed only to BMC, and the BMC gets included as a destination for all the multicast filters added by the host. In addition we have to notify the FW of our MAC config so that it can direct BMC traffic to the host as needed. > > The current pattern in fbnic is for us to do most of this on the tail > > end of __fbnic_open and unwind it near the start of fbnic_stop. > > Essentially the pattern is xmit_ownership, init heartbeat, init PTP, > > start phylink, configure Rx filters. In the case of close it is the > > reverse with us tearing down the filters, disabling phylink, disabling > > PTP, and then releasing ownership. > > > > > Also, if we have these ops, we know that admin down/suspend does not > > > mean media down. The presence of these ops triggers different state > > > transitions in the phylink state machine so that it simply hands off > > > control of the media, but otherwise leaves it alone. > > > > > > With this in place, i think we can avoid all the unbalanced state? > > > > As I understand it right now the main issue is that Phylink assumes > > that it has to take the link down in order to perform a major > > configuration in phylink_start/phylink_resume. > > Well, as i said, my reading of the standard is that the host can make > disruptive media changes, so you have to be able to live with > disruptive media changes. If you have to live with it, the path of > lease resistance is just to accept it. The problem is that precedent has been set by the other multi-host NIC vendors that didn't go w/ phylink. The argument will be that we either move the MAC/PCS/PHY to firmware and make it work like the other vendors or we are essentially DOA. > > The requirement that the BMC not lose link comes more out of the > > multi-host setups that have been in place in the data center > > environment for the last decade or so where there was only one link > > but multiple systems all sharing that link, including the BMC. So it > > is not strictly a BMC requirement, but more of a multi-host > > requirement. > > Is this actually standardised somewhere? I see there is a draft of an > update to NC-SI Specification, but i don't think the section about > controlling the link has changed. Also, the standard talks about how > you connect one Management Controller to multiple Network > Controllers. There is nothing about multiple Management Controllers > connected to one Network Controller. Or i'm i missing something, like > one Management Controller is controlling all the host connected to the > Network Controller? As far as a standard I can't say if it is. By and large our requirements have been coming out of the fact that Broadcom or Nvidia/Mellanox did it this way so we are expected to do it the same way. So essentially I am dealing with project/platform requirements more than a standard. The issue isn't that we have multiple management controllers, it is that we have setups where one physical MAC/PCS/PHY is shared between 4 PCIe connections to 4 seperate hosts. This is one of the reasons why so many NICs now have the proprietary FW blob you have to talk to in order to configure anything. This is also what we were trying to get away from as the shared resources have a bad habit of creating other issues such as noisy neighbor problems. > > > So, can we ignore the weeds for the moment, and think about the big > > > picture? > > > > So big picture wise we really have 2 issues: > > 1. The BMC handling doesn't currently exist, so we need to extend > > handling/hand-off for link up before we start, and link up after we > > stop. > > Agreed, and that fits with DSP0222. > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs > > that are sharing a link via firmware. Specifically that > > loading/unloading the driver or ifconfig up/down on the host interface > > should not cause the link to bounce and/or drop packets for any other > > connections, which in this case includes the BMC. > > For this, it would be nice to point to some standard which describes > this, so we have a generic, vendor agnostic, description of how this > is supposed to work. > > Andrew The problem here is this is more-or-less a bit of a "wild west" in terms of the spec setup. From what I can tell OCP 3.0 defines how to set up the PCIe bifurcation but doesn't explain what the expected behavior is for the shared ports. One thing we might look into would be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB (Virtual Ethernet Bridging) as that wouldn't be too far off from what inspired most of the logic in the hardware. Essentially the only difference is that instead of supporting VFs most of these NICs are supporting multiple PFs. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-21 15:51 ` Alexander Duyck @ 2025-04-21 16:50 ` Alexander Duyck 2025-04-22 1:21 ` Jakub Kicinski 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-21 16:50 UTC (permalink / raw) To: Andrew Lunn; +Cc: netdev, linux, hkallweit1, davem, kuba, pabeni On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote: > > ... > > > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs > > > that are sharing a link via firmware. Specifically that > > > loading/unloading the driver or ifconfig up/down on the host interface > > > should not cause the link to bounce and/or drop packets for any other > > > connections, which in this case includes the BMC. > > > > For this, it would be nice to point to some standard which describes > > this, so we have a generic, vendor agnostic, description of how this > > is supposed to work. > > > > Andrew > > The problem here is this is more-or-less a bit of a "wild west" in > terms of the spec setup. From what I can tell OCP 3.0 defines how to > set up the PCIe bifurcation but doesn't explain what the expected > behavior is for the shared ports. One thing we might look into would > be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB > (Virtual Ethernet Bridging) as that wouldn't be too far off from what > inspired most of the logic in the hardware. Essentially the only > difference is that instead of supporting VFs most of these NICs are > supporting multiple PFs. So looking at 802.1Q-2022 section 40 I wonder if we don't need to essentially define ourselves as an edge relay as our setup is pretty close to what is depicted in figure 40-1. In our case an S-channel essentially represents 2 SerDes lanes on an QSFP cable, with the switch playing the role of the EVB bridge. Anyway I think that is probably the spec we need to dig into if we are looking for how the link is being shared and such. I'll try to do some more reading myself to get caught up on all this as the last time I had been reading through this it was called VEB instead of EVB.. :-/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-21 16:50 ` Alexander Duyck @ 2025-04-22 1:21 ` Jakub Kicinski 2025-04-22 13:49 ` Andrew Lunn 0 siblings, 1 reply; 44+ messages in thread From: Jakub Kicinski @ 2025-04-22 1:21 UTC (permalink / raw) To: Alexander Duyck; +Cc: Andrew Lunn, netdev, linux, hkallweit1, davem, pabeni On Mon, 21 Apr 2025 09:50:25 -0700 Alexander Duyck wrote: > On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck wrote: > > On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs > > > > that are sharing a link via firmware. Specifically that > > > > loading/unloading the driver or ifconfig up/down on the host interface > > > > should not cause the link to bounce and/or drop packets for any other > > > > connections, which in this case includes the BMC. > > > > > > For this, it would be nice to point to some standard which describes > > > this, so we have a generic, vendor agnostic, description of how this > > > is supposed to work. > > > > The problem here is this is more-or-less a bit of a "wild west" in > > terms of the spec setup. From what I can tell OCP 3.0 defines how to > > set up the PCIe bifurcation but doesn't explain what the expected > > behavior is for the shared ports. One thing we might look into would > > be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB > > (Virtual Ethernet Bridging) as that wouldn't be too far off from what > > inspired most of the logic in the hardware. Essentially the only > > difference is that instead of supporting VFs most of these NICs are > > supporting multiple PFs. > > So looking at 802.1Q-2022 section 40 I wonder if we don't need to > essentially define ourselves as an edge relay as our setup is pretty > close to what is depicted in figure 40-1. In our case an S-channel > essentially represents 2 SerDes lanes on an QSFP cable, with the > switch playing the role of the EVB bridge. > > Anyway I think that is probably the spec we need to dig into if we are > looking for how the link is being shared and such. I'll try to do some > more reading myself to get caught up on all this as the last time I > had been reading through this it was called VEB instead of EVB.. :-/ Interesting. My gut feeling is that even if we make Linux and the NIC behave nicely according to 802.1Q, we'll also need to make some changes on the BMC side. And there we may encounter pushback as the status quo works quite trivially for devices with PHY control in FW. To paraphrase what was already said upthread - in my mind the question is whether we want to support "picking up" an already running link with phylink. Whether the use case is BMC or avoiding link flap when picking up from uboot is of secondary importance. Andrew, the standard could help us build a common higher layer, if needed but I'm not sure how much it impacts the phylink side. The only question for phylink is 1 bit of policy whether link can be "recovered" and left up on close, no? BTW Saeed posted a devlink param to "keep link up" recently: https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/ Intel has ethtool priv flags to the same effect, in their 40G and 100G drivers, but with reverse polarity: https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag These are all for this exact use case. In the past Ido added module power policy, which is the only truly generic configurable, and one we should probably build upon: https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy I'm not sure if this is expected to include PCS or it's just telling the module to keep the laser on.. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 1:21 ` Jakub Kicinski @ 2025-04-22 13:49 ` Andrew Lunn 2025-04-22 15:28 ` Jakub Kicinski 0 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-22 13:49 UTC (permalink / raw) To: Jakub Kicinski; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni On Mon, Apr 21, 2025 at 06:21:43PM -0700, Jakub Kicinski wrote: > On Mon, 21 Apr 2025 09:50:25 -0700 Alexander Duyck wrote: > > On Mon, Apr 21, 2025 at 8:51 AM Alexander Duyck wrote: > > > On Sun, Apr 20, 2025 at 2:58 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > 2. Expectations for our 25G+ interfaces to behave like multi-host NICs > > > > > that are sharing a link via firmware. Specifically that > > > > > loading/unloading the driver or ifconfig up/down on the host interface > > > > > should not cause the link to bounce and/or drop packets for any other > > > > > connections, which in this case includes the BMC. > > > > > > > > For this, it would be nice to point to some standard which describes > > > > this, so we have a generic, vendor agnostic, description of how this > > > > is supposed to work. > > > > > > The problem here is this is more-or-less a bit of a "wild west" in > > > terms of the spec setup. From what I can tell OCP 3.0 defines how to > > > set up the PCIe bifurcation but doesn't explain what the expected > > > behavior is for the shared ports. One thing we might look into would > > > be the handling for VEPA(Virtual Ethernet Port Aggregator) or VEB > > > (Virtual Ethernet Bridging) as that wouldn't be too far off from what > > > inspired most of the logic in the hardware. Essentially the only > > > difference is that instead of supporting VFs most of these NICs are > > > supporting multiple PFs. > > > > So looking at 802.1Q-2022 section 40 I wonder if we don't need to > > essentially define ourselves as an edge relay as our setup is pretty > > close to what is depicted in figure 40-1. In our case an S-channel > > essentially represents 2 SerDes lanes on an QSFP cable, with the > > switch playing the role of the EVB bridge. > > > > Anyway I think that is probably the spec we need to dig into if we are > > looking for how the link is being shared and such. I'll try to do some > > more reading myself to get caught up on all this as the last time I > > had been reading through this it was called VEB instead of EVB.. :-/ > > Interesting. My gut feeling is that even if we make Linux and the NIC > behave nicely according to 802.1Q, we'll also need to make some changes > on the BMC side. And there we may encounter pushback as the status quo > works quite trivially for devices with PHY control in FW. As i see it, we have two things stacked on top of each other. We have what is standardised for NC-SI, DSP0222. That gives a basis, and then there is vendor stuff on top for multi-host, which is more strict. Linux should have generic support for DSP0222. I've seen vendors hack around with WoL to make it work. It would be nice to replace that hack with a method to tell phylink to enable support for DSP0222. A standardised method, since as additional ops, or a flag. phylink can then separate admin down from carrier down when needed. Then we have vendor stuff on top. > BTW Saeed posted a devlink param to "keep link up" recently: > https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/ > Intel has ethtool priv flags to the same effect, in their 40G and 100G > drivers, but with reverse polarity: > https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag > These are all for this exact use case. In the past Ido added module > power policy, which is the only truly generic configurable, and one we > should probably build upon: > https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy > I'm not sure if this is expected to include PCS or it's just telling > the module to keep the laser on.. Ideally, we want to define something vendor agnostic. And i would prefer we talk about the high level concept, sharing the NIC with a BMC and multiple hosts, rather than the low level, keep link up. The whole concept of a multi-host NIC is new to me. So i at least need to get up to speed with it. I've no idea if Russell has come across it before, since it is not a SoC concept. I don't really want to agree to anything until i do have that concept understood. That is part of why i asked about a standard. It is a dense document answering a lot of questions. Without a standard, i need to ask a lot of questions. I also think there is a lot more to it than just keeping the laser on. For NC-SI, DSP0222 that probably does cover a big chunk of the problem, but for multi-host, my gut is telling me there is more to it. Let me do some research and thinking about multi-host. Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 13:49 ` Andrew Lunn @ 2025-04-22 15:28 ` Jakub Kicinski 2025-04-22 16:49 ` Andrew Lunn 0 siblings, 1 reply; 44+ messages in thread From: Jakub Kicinski @ 2025-04-22 15:28 UTC (permalink / raw) To: Andrew Lunn; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni On Tue, 22 Apr 2025 15:49:02 +0200 Andrew Lunn wrote: > > Interesting. My gut feeling is that even if we make Linux and the NIC > > behave nicely according to 802.1Q, we'll also need to make some changes > > on the BMC side. And there we may encounter pushback as the status quo > > works quite trivially for devices with PHY control in FW. > > As i see it, we have two things stacked on top of each other. We have > what is standardised for NC-SI, DSP0222. That gives a basis, and then > there is vendor stuff on top for multi-host, which is more strict. The multi-host piece I'd completely ignore. Real multi-host NICs cannot do without firmware. One component needs to be in control of global resources, and it can't be a host that may get randomly rebooted. > Linux should have generic support for DSP0222. I've seen vendors hack > around with WoL to make it work. It would be nice to replace that hack > with a method to tell phylink to enable support for DSP0222. A > standardised method, since as additional ops, or a flag. phylink can > then separate admin down from carrier down when needed. I'm no DSP0222 expert but in my experience it is rather limited in defining its interactions with the host. It is also irrelevant what exact tap / agent is asking us to keep the link up. The point is _some_ agent is running either on the NIC or leeching off the NIC, and is requesting the link to stay up. It could even be a VF/VM on the same host. FWIW in the NFP we just had a "forced" bit which told the host not to touch physical link status: https://elixir.bootlin.com/linux/v6.15-rc1/source/drivers/net/ethernet/netronome/nfp/nfp_port.h#L62 A possible extension would be to have a bitmap of possible agent types, so that the user has more info about what agent is keeping the link up. But at the end of the day all the link mgmt cares about is whether such bitmap is empty or not. > Then we have vendor stuff on top. > > > BTW Saeed posted a devlink param to "keep link up" recently: > > https://lore.kernel.org/all/20250414195959.1375031-11-saeed@kernel.org/ > > Intel has ethtool priv flags to the same effect, in their 40G and 100G > > drivers, but with reverse polarity: > > https://docs.kernel.org/networking/device_drivers/ethernet/intel/i40e.html#setting-the-link-down-on-close-private-flag > > These are all for this exact use case. In the past Ido added module > > power policy, which is the only truly generic configurable, and one we > > should probably build upon: > > https://docs.kernel.org/networking/ethtool-netlink.html#c.ethtool_module_power_mode_policy > > I'm not sure if this is expected to include PCS or it's just telling > > the module to keep the laser on.. > > Ideally, we want to define something vendor agnostic. And i would > prefer we talk about the high level concept, sharing the NIC with a > BMC and multiple hosts, rather than the low level, keep link up. > > The whole concept of a multi-host NIC is new to me. So i at least need > to get up to speed with it. I've no idea if Russell has come across it > before, since it is not a SoC concept. > > I don't really want to agree to anything until i do have that concept > understood. That is part of why i asked about a standard. It is a > dense document answering a lot of questions. Without a standard, i > need to ask a lot of questions. Don't hesitate to ask the questions, your last reply contains no question marks :) > I also think there is a lot more to it than just keeping the laser > on. For NC-SI, DSP0222 that probably does cover a big chunk of the > problem, but for multi-host, my gut is telling me there is more to it. > > Let me do some research and thinking about multi-host. The only case that could make multi-host relevant here is if each host had its own port but they shared some resources like clocks. Making it a not-really multi-host. I know there are NICs on the market which have similar limitations on single host. If you configure port breakout all sub-ports need to run at the same rate. Programming of shared resources is orthogonal in my mind to a subservient agent asking us not to link down. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 15:28 ` Jakub Kicinski @ 2025-04-22 16:49 ` Andrew Lunn 2025-04-22 17:30 ` Russell King (Oracle) 2025-04-22 21:29 ` Alexander Duyck 0 siblings, 2 replies; 44+ messages in thread From: Andrew Lunn @ 2025-04-22 16:49 UTC (permalink / raw) To: Jakub Kicinski; +Cc: Alexander Duyck, netdev, linux, hkallweit1, davem, pabeni > > The whole concept of a multi-host NIC is new to me. So i at least need > > to get up to speed with it. I've no idea if Russell has come across it > > before, since it is not a SoC concept. > > > > I don't really want to agree to anything until i do have that concept > > understood. That is part of why i asked about a standard. It is a > > dense document answering a lot of questions. Without a standard, i > > need to ask a lot of questions. > > Don't hesitate to ask the questions, your last reply contains no > question marks :) O.K. Lets start with the basics. I assume the NIC has a PCIe connector something like a 4.0 x4? Each of the four hosts in the system contribute one PCIe lane. So from the host side it looks like a 4.0 x1 NIC? There are not 4 host MACs connected to a 5 port switch. Rather, each host gets its own subset of queues, DMA engines etc, for one shared MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C bus, and blinky LEDs. Plus you have the BMC connected via an RMII like interface. You must have a minimum of firmware on the NIC to get the MAC into a state the BMC can inject/receive frames, configure the PCS, gpios to the SFP, enough I2C to figure out what the module is, what quirks are needed etc. NC-SI, with Linux controlling the hardware, implies you need to be able to hand off control of the GPIOs, I2C, PCS to Linux. But with multi-host, it makes no sense for all 4 hosts to be trying to control the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more likely to me, one host gets put in change of everything below the queues to the MAC. The others just know there is link, nothing more. This actually circles back to the discussion about fixed-link. The one host in control of all the lower hardware has the complete picture. The other 3 maybe just need a fixed link. They don't get to see what is going on below the MAC, and as a result there is no ethtool support to change anything, and so no conflicting configuration? And since they cannot control any of that, they cannot put the link down. So 3/4 of the problem is solved. phylink is however not expecting that when phylink_start() is called, it might or might not have to drive the hardware depending on if it wins an election to control the hardware. And if it losses, it needs to ditch all its configuration for a PCS, SPF, etc and swap to a fixed-link. Do we want to teach phylink all this, or put all phylink stuff into open(), rather than spread across probe() and open(). Being in open(), you basically construct a different phylink configuration depending on if you win the election or not. Is one host in the position to control the complete media configuration? Could you split the QSFP into four, each host gets its own channel, and it gets to choose how to use that channel, different FEC schemes, bit rates? Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 16:49 ` Andrew Lunn @ 2025-04-22 17:30 ` Russell King (Oracle) 2025-04-22 18:13 ` Andrew Lunn 2025-04-22 21:29 ` Alexander Duyck 1 sibling, 1 reply; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-22 17:30 UTC (permalink / raw) To: Andrew Lunn Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 06:49:54PM +0200, Andrew Lunn wrote: > > > The whole concept of a multi-host NIC is new to me. So i at least need > > > to get up to speed with it. I've no idea if Russell has come across it > > > before, since it is not a SoC concept. > > > > > > I don't really want to agree to anything until i do have that concept > > > understood. That is part of why i asked about a standard. It is a > > > dense document answering a lot of questions. Without a standard, i > > > need to ask a lot of questions. > > > > Don't hesitate to ask the questions, your last reply contains no > > question marks :) > > O.K. Lets start with the basics. I assume the NIC has a PCIe connector > something like a 4.0 x4? Each of the four hosts in the system > contribute one PCIe lane. So from the host side it looks like a 4.0 x1 > NIC? > > There are not 4 host MACs connected to a 5 port switch. Rather, each > host gets its own subset of queues, DMA engines etc, for one shared > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > interface. > > You must have a minimum of firmware on the NIC to get the MAC into a > state the BMC can inject/receive frames, configure the PCS, gpios to > the SFP, enough I2C to figure out what the module is, what quirks are > needed etc. This all makes sense, but at this point, I have to ask something that seems to be fundamental to me: Should any of the hosts accessing the NIC through those PCIe x1 interfaces have any knowledge or control of anything behind "their" view of the MAC? I would say no, they should not, because if they do, they can interfere with other hosts. Surely only the BMC should have permission to access the layers of hardware behind the MAC? What should a host know about the setup? Maybe the speed of their network connection through the MAC. I state it that way rather than "the speed of the media" because if there is some control over the traffic from each "host" then the media speed is irrelevant. > NC-SI, with Linux controlling the hardware, implies you need to be > able to hand off control of the GPIOs, I2C, PCS to Linux. But with > multi-host, it makes no sense for all 4 hosts to be trying to control > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more > likely to me, one host gets put in change of everything below the > queues to the MAC. The others just know there is link, nothing more. Ouch. Yes - if we have four independent hosts trying to access the same I2C hardware as another host on the same hardware, then that sounds like a recipe for a trainwreck. > This actually circles back to the discussion about fixed-link. The one > host in control of all the lower hardware has the complete > picture. The other 3 maybe just need a fixed link. They don't get to > see what is going on below the MAC, and as a result there is no > ethtool support to change anything, and so no conflicting > configuration? And since they cannot control any of that, they cannot > put the link down. So 3/4 of the problem is solved. Should one host have control, or should the BMC have control? I don't actually know what you're talking about w.r.t. DSP0222 or whatever it was, nor NC-SI - I don't have these documents. > phylink is however not expecting that when phylink_start() is called, > it might or might not have to drive the hardware depending on if it > wins an election to control the hardware. And if it losses, it needs > to ditch all its configuration for a PCS, SPF, etc and swap to a > fixed-link. Do we want to teach phylink all this, or put all phylink > stuff into open(), rather than spread across probe() and open(). Being > in open(), you basically construct a different phylink configuration > depending on if you win the election or not. That sounds very complicated and all very new stuff. > Is one host in the position to control the complete media > configuration? Could you split the QSFP into four, each host gets its > own channel, and it gets to choose how to use that channel, different > FEC schemes, bit rates? Yes, each channel in a QSFPs have separate LOS status bits accessible over I2C. It's been a while since I looked at this, but I seem to remember there aren't hardware pins for LOS, TX_DISABLE etc - that's all over I2C. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 17:30 ` Russell King (Oracle) @ 2025-04-22 18:13 ` Andrew Lunn 2025-04-22 18:50 ` Russell King (Oracle) 2025-04-22 23:51 ` Jakub Kicinski 0 siblings, 2 replies; 44+ messages in thread From: Andrew Lunn @ 2025-04-22 18:13 UTC (permalink / raw) To: Russell King (Oracle) Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem, pabeni > Should one host have control, or should the BMC have control? I don't > actually know what you're talking about w.r.t. DSP0222 or whatever it > was, nor NC-SI - I don't have these documents. I gave a reference to it a few email back in the conversation: https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf Linux has an implementation of the protocol in net/nsci Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 18:13 ` Andrew Lunn @ 2025-04-22 18:50 ` Russell King (Oracle) 2025-04-22 23:51 ` Jakub Kicinski 1 sibling, 0 replies; 44+ messages in thread From: Russell King (Oracle) @ 2025-04-22 18:50 UTC (permalink / raw) To: Andrew Lunn Cc: Jakub Kicinski, Alexander Duyck, netdev, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 08:13:43PM +0200, Andrew Lunn wrote: > > Should one host have control, or should the BMC have control? I don't > > actually know what you're talking about w.r.t. DSP0222 or whatever it > > was, nor NC-SI - I don't have these documents. > > I gave a reference to it a few email back in the conversation: > > https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf > > Linux has an implementation of the protocol in net/nsci Hmm. I don't think I have bandwidth to read that spec any time soon, sorry. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 18:13 ` Andrew Lunn 2025-04-22 18:50 ` Russell King (Oracle) @ 2025-04-22 23:51 ` Jakub Kicinski 1 sibling, 0 replies; 44+ messages in thread From: Jakub Kicinski @ 2025-04-22 23:51 UTC (permalink / raw) To: Andrew Lunn Cc: Russell King (Oracle), Alexander Duyck, netdev, hkallweit1, davem, pabeni On Tue, 22 Apr 2025 20:13:43 +0200 Andrew Lunn wrote: > > Should one host have control, or should the BMC have control? I don't > > actually know what you're talking about w.r.t. DSP0222 or whatever it > > was, nor NC-SI - I don't have these documents. > > I gave a reference to it a few email back in the conversation: > > https://www.dmtf.org/sites/default/files/standards/documents/DSP0222_1.2.0.pdf > > Linux has an implementation of the protocol in net/nsci But to be clear the implementation is for when Linux runs as the BMC. It does not interact in any way with the host AFAIU. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 16:49 ` Andrew Lunn 2025-04-22 17:30 ` Russell King (Oracle) @ 2025-04-22 21:29 ` Alexander Duyck 2025-04-22 22:26 ` Andrew Lunn 1 sibling, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-22 21:29 UTC (permalink / raw) To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > The whole concept of a multi-host NIC is new to me. So i at least need > > > to get up to speed with it. I've no idea if Russell has come across it > > > before, since it is not a SoC concept. > > > > > > I don't really want to agree to anything until i do have that concept > > > understood. That is part of why i asked about a standard. It is a > > > dense document answering a lot of questions. Without a standard, i > > > need to ask a lot of questions. > > > > Don't hesitate to ask the questions, your last reply contains no > > question marks :) > > O.K. Lets start with the basics. I assume the NIC has a PCIe connector > something like a 4.0 x4? Each of the four hosts in the system > contribute one PCIe lane. So from the host side it looks like a 4.0 x1 > NIC? More like 5.0 x16 split in to 4 5.0 x4 NICs. > There are not 4 host MACs connected to a 5 port switch. Rather, each > host gets its own subset of queues, DMA engines etc, for one shared > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > interface. Yeah, that is the setup so far. Basically we are using one QSFP cable and slicing it up. So instead of having a 100CR4 connection we might have 2x50CR2 operating on the same cable, or 4x25CR. > You must have a minimum of firmware on the NIC to get the MAC into a > state the BMC can inject/receive frames, configure the PCS, gpios to > the SFP, enough I2C to figure out what the module is, what quirks are > needed etc. The firmware isn't that smart. It isn't reading the QSFP itself to get that info. It could, but it doesn't. It is essentially hands off as there isn't any change needed for a direct attach cable. Basically it is configuring the MAC, PCS, FEC, PMA, PMD with a pre-recorded setting in the EEPROM for the NIC. > NC-SI, with Linux controlling the hardware, implies you need to be > able to hand off control of the GPIOs, I2C, PCS to Linux. But with > multi-host, it makes no sense for all 4 hosts to be trying to control > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more > likely to me, one host gets put in change of everything below the > queues to the MAC. The others just know there is link, nothing more. Things are a bit simpler than that. With the direct-attach we don't need to take any action on the SFP. Essentially the I2C and GPIOs are all shared. As such we can read the QSFP state, but cannot modify it directly. We aren't taking any actions to write to the I2C other than bank/page which is handled all as a part of the read call. > This actually circles back to the discussion about fixed-link. The one > host in control of all the lower hardware has the complete > picture. The other 3 maybe just need a fixed link. They don't get to > see what is going on below the MAC, and as a result there is no > ethtool support to change anything, and so no conflicting > configuration? And since they cannot control any of that, they cannot > put the link down. So 3/4 of the problem is solved. Yeah, this is why I was headed down that path for a bit. However our links are independent with the only shared bit being the PMD and the SFP module. We can essentially configure everything else diffrerent between the ports from there. So depending on what the cable supports we can potentially run one lane or two, and in NRZ or PAM4 mode. So for example one of our standard test items to run is to use a QSFP-DD loopback plug and essentially cycle through all different port configurations on all the different ports to make sure we don't have configuration leaking over from one port to the other as the PMD is shared between hosts 0 and 1, and hosts 2 and 3 if we have a 4 port setup. We don't have to have all 4 MAC/PCS/PMA configured the same. We can have a different config between ports, although in most cases it will just end up being the same. > phylink is however not expecting that when phylink_start() is called, > it might or might not have to drive the hardware depending on if it > wins an election to control the hardware. And if it losses, it needs > to ditch all its configuration for a PCS, SPF, etc and swap to a > fixed-link. Do we want to teach phylink all this, or put all phylink > stuff into open(), rather than spread across probe() and open(). Being > in open(), you basically construct a different phylink configuration > depending on if you win the election or not. We are getting a bit off into the weeds here. So there isn't any sort of election. There is still a firmware that is sitting on the shared bits. So the PMD, I2C to the QSFP, and GPIO from the QSFP are all controlled via the firmware. To prevent any significant issues for now we treat the QSFP as read-only from the hosts as we have to go through the firmware to get access, and the PMD can only be configured via a message to the FW asking for a specific bitrate/modulation and number of lanes. > Is one host in the position to control the complete media > configuration? Could you split the QSFP into four, each host gets its > own channel, and it gets to choose how to use that channel, different > FEC schemes, bit rates? So one thing to be aware of is that the QSFP can be electrically separated so that it is one cable, but with either 2 (QSFP+/QSFP28) or 4 (QSFP-DD) separate sets of lanes. The cable defines the limits of what we can do in terms of modulation and number of lanes, but we don't have to configure anything directly on it. That is handled through the PCS/PMA/PMD side of things. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 21:29 ` Alexander Duyck @ 2025-04-22 22:26 ` Andrew Lunn 2025-04-22 23:06 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-22 22:26 UTC (permalink / raw) To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote: > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > The whole concept of a multi-host NIC is new to me. So i at least need > > > > to get up to speed with it. I've no idea if Russell has come across it > > > > before, since it is not a SoC concept. > > > > > > > > I don't really want to agree to anything until i do have that concept > > > > understood. That is part of why i asked about a standard. It is a > > > > dense document answering a lot of questions. Without a standard, i > > > > need to ask a lot of questions. > > > > > > Don't hesitate to ask the questions, your last reply contains no > > > question marks :) > > > > O.K. Lets start with the basics. I assume the NIC has a PCIe connector > > something like a 4.0 x4? Each of the four hosts in the system > > contribute one PCIe lane. So from the host side it looks like a 4.0 x1 > > NIC? > > More like 5.0 x16 split in to 4 5.0 x4 NICs. O.K. Same thing, different scale. > > There are not 4 host MACs connected to a 5 port switch. Rather, each > > host gets its own subset of queues, DMA engines etc, for one shared > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > > interface. > > Yeah, that is the setup so far. Basically we are using one QSFP cable > and slicing it up. So instead of having a 100CR4 connection we might > have 2x50CR2 operating on the same cable, or 4x25CR. But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs? Or is there always 4 MACs, each MAC has its own queues, and you need to place frames into the correct queue, and with a 2x50CR2 you also need to load balance across those two queues? I guess the queuing does not matter much to phylink, but how do you represent multiple PCS lanes to phylink? Up until now, one netdev has had one PCS lane. It now has 1, 2, or 4 lanes. None of the phylink_pcs_op have a lane indicator. > > NC-SI, with Linux controlling the hardware, implies you need to be > > able to hand off control of the GPIOs, I2C, PCS to Linux. But with > > multi-host, it makes no sense for all 4 hosts to be trying to control > > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more > > likely to me, one host gets put in change of everything below the > > queues to the MAC. The others just know there is link, nothing more. > > Things are a bit simpler than that. With the direct-attach we don't > need to take any action on the SFP. Essentially the I2C and GPIOs are > all shared. As such we can read the QSFP state, but cannot modify it > directly. We aren't taking any actions to write to the I2C other than > bank/page which is handled all as a part of the read call. That might work for direct-attach, but what about the general case? We need to ensure whatever we add supports the general case. The current SFP code expects a Linux I2C bus. Given how SFPs are broken, it does 16 bytes reads at the most. When it needs to read more than 16 bytes, i expect it will set the page once, read it back to ensure the SFP actually implements the page, and then do multiple I2C reads to read all the data it wants from that page. I don't see how this is going to work when the I2C bus is shared. > > This actually circles back to the discussion about fixed-link. The one > > host in control of all the lower hardware has the complete > > picture. The other 3 maybe just need a fixed link. They don't get to > > see what is going on below the MAC, and as a result there is no > > ethtool support to change anything, and so no conflicting > > configuration? And since they cannot control any of that, they cannot > > put the link down. So 3/4 of the problem is solved. > > Yeah, this is why I was headed down that path for a bit. However our > links are independent with the only shared bit being the PMD and the > SFP module. Yours might be, but what is the general case? Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 22:26 ` Andrew Lunn @ 2025-04-22 23:06 ` Alexander Duyck 2025-04-23 18:38 ` Alexander Duyck 2025-04-24 20:34 ` Andrew Lunn 0 siblings, 2 replies; 44+ messages in thread From: Alexander Duyck @ 2025-04-22 23:06 UTC (permalink / raw) To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 3:26 PM Andrew Lunn <andrew@lunn.ch> wrote: > > On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote: > > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > > > The whole concept of a multi-host NIC is new to me. So i at least need > > > > > to get up to speed with it. I've no idea if Russell has come across it > > > > > before, since it is not a SoC concept. > > > > > > > > > > I don't really want to agree to anything until i do have that concept > > > > > understood. That is part of why i asked about a standard. It is a > > > > > dense document answering a lot of questions. Without a standard, i > > > > > need to ask a lot of questions. > > > > > > > > Don't hesitate to ask the questions, your last reply contains no > > > > question marks :) > > > > > > O.K. Lets start with the basics. I assume the NIC has a PCIe connector > > > something like a 4.0 x4? Each of the four hosts in the system > > > contribute one PCIe lane. So from the host side it looks like a 4.0 x1 > > > NIC? > > > > More like 5.0 x16 split in to 4 5.0 x4 NICs. > > O.K. Same thing, different scale. Agreed. > > > There are not 4 host MACs connected to a 5 port switch. Rather, each > > > host gets its own subset of queues, DMA engines etc, for one shared > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > > > interface. > > > > Yeah, that is the setup so far. Basically we are using one QSFP cable > > and slicing it up. So instead of having a 100CR4 connection we might > > have 2x50CR2 operating on the same cable, or 4x25CR. > > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs? Yes. Some confusion here may be that our hardware always has 4 MAC/PCS/PMA setups, one for each host. Depending on the NIC configuration we may have either 4 hosts or 2 hosts present with 2 disabled. What they end up doing is routing 2 lanes from the QSFP to one host and the other two to the other. So in the case of the QSFP28 or QSFP+ we can only support 2 hosts, and with the QSFP-DD we can support 4. > Or is there always 4 MACs, each MAC has its own queues, and you need > to place frames into the correct queue, and with a 2x50CR2 you also > need to load balance across those two queues? Are you familiar with the concept of QSFP breakout cables? The general idea is that one end of the cable is a QSFP connection and it will break out into 4 SFP connections on the other end. That is actually pretty close to the concept behind our NIC. We essentially have an internalized breakout where the QSFP connection comes in, but we break it into either 2 or 4 connections on our end. Our limit is 2 lanes per host. I did a quick search and came up with the following link to a Cisco whitepaper that sort of explains the breakout cable concept. I will try to see if I can find a spec somewhere that defines how to handle a breakout cable: https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/whitepaper-c11-744077.html > I guess the queuing does not matter much to phylink, but how do you > represent multiple PCS lanes to phylink? Up until now, one netdev has > had one PCS lane. It now has 1, 2, or 4 lanes. None of the > phylink_pcs_op have a lane indicator. So the PCS isn't really much of a problem. There is only one PCS per host. Where things get a bit messier is that the PMA/PMD setup is per lane. So our PCS has vendor registers for setting up the PMA side of things and we have to set them for 2 devices instead of just one. Likewise we have to pass a lanes mask to the PMD to tell it which lanes are being configured for what modulation and which lanes are disabled. > > > NC-SI, with Linux controlling the hardware, implies you need to be > > > able to hand off control of the GPIOs, I2C, PCS to Linux. But with > > > multi-host, it makes no sense for all 4 hosts to be trying to control > > > the GPIOs, I2C, PCS, perform SFP firmware upgrade. So it seems more > > > likely to me, one host gets put in change of everything below the > > > queues to the MAC. The others just know there is link, nothing more. > > > > Things are a bit simpler than that. With the direct-attach we don't > > need to take any action on the SFP. Essentially the I2C and GPIOs are > > all shared. As such we can read the QSFP state, but cannot modify it > > directly. We aren't taking any actions to write to the I2C other than > > bank/page which is handled all as a part of the read call. > > That might work for direct-attach, but what about the general case? We > need to ensure whatever we add supports the general case. I agree, but at the same time I am just letting you know the limitations of our hardware setup. There isn't anything to really control on the QSFP. It is mostly just there to provide the media and that is about it. No PHY on it to load FW for. > The current SFP code expects a Linux I2C bus. Given how SFPs are > broken, it does 16 bytes reads at the most. When it needs to read more > than 16 bytes, i expect it will set the page once, read it back to > ensure the SFP actually implements the page, and then do multiple I2C > reads to read all the data it wants from that page. I don't see how > this is going to work when the I2C bus is shared. The general idea is that we have to cache the page and bank in the driver and pass those as arguments to the firmware when we perform a read. Basically it will take a lock on the I2C, set the page and bank, perform the read, and then release the lock. With that all 4 hosts can read the I2C from the QSFP without causing any side effects. > > > This actually circles back to the discussion about fixed-link. The one > > > host in control of all the lower hardware has the complete > > > picture. The other 3 maybe just need a fixed link. They don't get to > > > see what is going on below the MAC, and as a result there is no > > > ethtool support to change anything, and so no conflicting > > > configuration? And since they cannot control any of that, they cannot > > > put the link down. So 3/4 of the problem is solved. > > > > Yeah, this is why I was headed down that path for a bit. However our > > links are independent with the only shared bit being the PMD and the > > SFP module. > > Yours might be, but what is the general case? I will do some digging into the breakout cable path. That seems like the most likely setup that would be similar and more general. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 23:06 ` Alexander Duyck @ 2025-04-23 18:38 ` Alexander Duyck 2025-04-24 20:34 ` Andrew Lunn 1 sibling, 0 replies; 44+ messages in thread From: Alexander Duyck @ 2025-04-23 18:38 UTC (permalink / raw) To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Tue, Apr 22, 2025 at 4:06 PM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Tue, Apr 22, 2025 at 3:26 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > On Tue, Apr 22, 2025 at 02:29:48PM -0700, Alexander Duyck wrote: > > > On Tue, Apr 22, 2025 at 9:50 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > ... > > Or is there always 4 MACs, each MAC has its own queues, and you need > > to place frames into the correct queue, and with a 2x50CR2 you also > > need to load balance across those two queues? > > Are you familiar with the concept of QSFP breakout cables? The general > idea is that one end of the cable is a QSFP connection and it will > break out into 4 SFP connections on the other end. That is actually > pretty close to the concept behind our NIC. We essentially have an > internalized breakout where the QSFP connection comes in, but we break > it into either 2 or 4 connections on our end. Our limit is 2 lanes per > host. > > I did a quick search and came up with the following link to a Cisco > whitepaper that sort of explains the breakout cable concept. I will > try to see if I can find a spec somewhere that defines how to handle a > breakout cable: > https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/transceiver-modules/whitepaper-c11-744077.html > So I have done some more digging. I'm wondering if the annex 109B,C and 135A,C,E,G are meant to essentially explain how to hook up a breakout style connection without mentioning splitting things up or sharing. As far as the QSFP cables I was able to find a bit more. Specifically in SFF-8636 they have a mention of a "Separable module" in Byte 113 bits 6-4 that was added to support breakout cables. Getting into CMIS (https://www.oiforum.com/wp-content/uploads/OIF-CMIS-05.2.pdf) there is quite a bit more mentions of partitioning the module to support multiple "host interface" instances on the same cable, with section 6.2 really getting into the separation of datapaths for "applications" and how to go about configuring multiple distinct links sharing the same module. The only issue is that CMIS doesn't really apply until we get to QSFP-DD cables, so the QSFP28 and QSFP+ may end up being a bit of a wild west when it comes to breaking out the connection. That said though Byte 113 may be enough as it seems like partitioning it is just a matter of selecting 1/2/4 lanes versus QSFP-DD where we can swap between different modulation modes and such. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-22 23:06 ` Alexander Duyck 2025-04-23 18:38 ` Alexander Duyck @ 2025-04-24 20:34 ` Andrew Lunn 2025-04-24 23:40 ` Alexander Duyck 1 sibling, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-24 20:34 UTC (permalink / raw) To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni Sorry for the delay, busy with $DAY_JOB > > > > There are not 4 host MACs connected to a 5 port switch. Rather, each > > > > host gets its own subset of queues, DMA engines etc, for one shared > > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > > > > interface. > > > > > > Yeah, that is the setup so far. Basically we are using one QSFP cable > > > and slicing it up. So instead of having a 100CR4 connection we might > > > have 2x50CR2 operating on the same cable, or 4x25CR. > > > > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs? > > Yes. Some confusion here may be that our hardware always has 4 > MAC/PCS/PMA setups, one for each host. Depending on the NIC > configuration we may have either 4 hosts or 2 hosts present with 2 > disabled. So with 2 hosts, each host has two netdevs? If you were to dedicate the whole card to one host, you would have 4 netdevs? It is upto whatever is above to perform load balancing over those? If you always have 4 MAC/PCS, then the PCS is only ever used with a single lane? The MAC does not support 100000baseKR4 for example, but 250000baseKR1? > The general idea is that we have to cache the page and bank in the > driver and pass those as arguments to the firmware when we perform a > read. Basically it will take a lock on the I2C, set the page and bank, > perform the read, and then release the lock. With that all 4 hosts can > read the I2C from the QSFP without causing any side effects. I assume your hardware team have not actually implemented I2C, they have licensed it. Hence there is probably already a driver for it in drivers/i2c/busses, maybe one of the i2c-designware-? However, you are not going to use it, you are going to reinvent the wheel so you can parse the transactions going over it, look for reads and writes to address 127? Humm, i suppose you could have a virtual I2C driver doing this stacked on top of the real I2C driver. Is this something other network drivers are going to need? Should it be somewhere in drivers/net/phy? The hard bit is how you do the mutex in an agnostic way. But it looks like hardware spinlocks would work: https://docs.kernel.org/locking/hwspinlock.html And actually, it is more complex than caching the page. This specification defines functions in Pages 00h-02h. Pages 03-7Fh are reserved for future use. Writing the value of a non-supported Page shall not be accepted by the transceiver. The Page Select byte shall revert to 0 and read / write operations shall be to the unpaged A2h memory map. So i expect the SFP driver to do a write followed by a read to know if it needs to return EOPNOTSUPP to user space because the SFP does not implement the page. Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-24 20:34 ` Andrew Lunn @ 2025-04-24 23:40 ` Alexander Duyck 2025-04-25 13:11 ` Andrew Lunn 0 siblings, 1 reply; 44+ messages in thread From: Alexander Duyck @ 2025-04-24 23:40 UTC (permalink / raw) To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Thu, Apr 24, 2025 at 1:34 PM Andrew Lunn <andrew@lunn.ch> wrote: > > Sorry for the delay, busy with $DAY_JOB No problem. I have a number of my own issues I am dealing with here in terms of code cleanup stuff anyway. > > > > > There are not 4 host MACs connected to a 5 port switch. Rather, each > > > > > host gets its own subset of queues, DMA engines etc, for one shared > > > > > MAC. Below the MAC you have all the usual PCS, SFP cage, gpios, I2C > > > > > bus, and blinky LEDs. Plus you have the BMC connected via an RMII like > > > > > interface. > > > > > > > > Yeah, that is the setup so far. Basically we are using one QSFP cable > > > > and slicing it up. So instead of having a 100CR4 connection we might > > > > have 2x50CR2 operating on the same cable, or 4x25CR. > > > > > > But for 2x50CR2 you have two MACs? And for 4x25CR 4 MACs? > > > > Yes. Some confusion here may be that our hardware always has 4 > > MAC/PCS/PMA setups, one for each host. Depending on the NIC > > configuration we may have either 4 hosts or 2 hosts present with 2 > > disabled. > > So with 2 hosts, each host has two netdevs? If you were to dedicate > the whole card to one host, you would have 4 netdevs? It is upto > whatever is above to perform load balancing over those? Just to be clear when I say "host" in this case I am referring to a system running Linux, not "host" in the CMIS regard as that is the NIC/FW I think. Anyway we have 2 scenarios. Our main use case is to route one x4 to each host. So in that case we only have one PCIe connection and it will only show up as one netdev. In our manufacturing test case we have a riser and use PCIe bifurcation to split up a x16 to 4 x4 and all 4 endpoints can show up on the one host. > If you always have 4 MAC/PCS, then the PCS is only ever used with a > single lane? The MAC does not support 100000baseKR4 for example, but > 250000baseKR1? In our 2 host setup we are normally running a QSFP28 or QSFP+ cable that has 4 lanes. So we effectively are cutting the cable in half to provide 2 lanes to each host. This allows us to support either 50baseCR2 or 100baseCR2 as the upper limit for the cable depending on if it is running NRZ or PAM4 modulation. In these setups things are fairly rigid as we can only select to use 1 or 2 lanes, no selection for modulation due to the nature of the cable spec. In our 4 host setup we are configured to connect a QSFP-DD cable. With that the cable has 8 lanes, with each host getting 2 of that and seeing the same limitations as the 2 host setup mentioned earlier. In theory we could do something like you call out in your example, but we haven't configured a board combination for that yet. Basically it would require a specific board and EEPROM combination to route the lanes so that we had one lane per host instead of 2 which is the current configs. > > The general idea is that we have to cache the page and bank in the > > driver and pass those as arguments to the firmware when we perform a > > read. Basically it will take a lock on the I2C, set the page and bank, > > perform the read, and then release the lock. With that all 4 hosts can > > read the I2C from the QSFP without causing any side effects. > > I assume your hardware team have not actually implemented I2C, they > have licensed it. Hence there is probably already a driver for it in > drivers/i2c/busses, maybe one of the i2c-designware-? However, you are > not going to use it, you are going to reinvent the wheel so you can > parse the transactions going over it, look for reads and writes to > address 127? Humm, i suppose you could have a virtual I2C driver doing > this stacked on top of the real I2C driver. Is this something other > network drivers are going to need? Should it be somewhere in > drivers/net/phy? The hard bit is how you do the mutex in an agnostic > way. But it looks like hardware spinlocks would work: > https://docs.kernel.org/locking/hwspinlock.html Part of the issue would be who owns the I2C driver. Both the firmware and the host would need access to it. Rather than having to do a handoff for that it is easier to have the firmware maintain the driver and just process the requests for us via mailbox IPC calls. One other point of contention is that we don't have a central firmware managing things. We have one instance of the firmware running per host. So the 4 firmware instances will be competing with each other over access to the QSFP, so they have their own mutex that they maintain to determine who can have master access to the I2C bus for the QSFP with each having their own I2C device to connect to the bus. > And actually, it is more complex than caching the page. Yeah, that was generally my thought on it, and that is if I even need to do that. From what I have seen most of the QSFP28/+ direct attach cables we are working with are very simplistic. Seems like they only had page 0. It wasn't until I started getting into the QSFP-DD stuff for the 4 host NIC that I started running into the need for multi page support. So for example the hwmon sensors don't do much for us as direct attach cables don't really bother implementing them. A call to "ethtool -m" on on of our systems usually yields 0.00 degrees C and 0.000 volts. Also the ethtool API already had get_module_eeprom_by_page which is a very good fit for our model since it allowed for atomic access based on the page and bank number. > This specification defines functions in Pages 00h-02h. Pages 03-7Fh > are reserved for future use. Writing the value of a non-supported > Page shall not be accepted by the transceiver. The Page Select byte > shall revert to 0 and read / write operations shall be to the > unpaged A2h memory map. > > So i expect the SFP driver to do a write followed by a read to know if > it needs to return EOPNOTSUPP to user space because the SFP does not > implement the page. I guess it could do EOPNOTSUPP too, we had used EADDRNOTAVAIL to indicate that case. This is one of the reasons why our firmware API requires the bank and page be passed in the message to perform an QSFP I2C read. It is able to verify it on its end and if it isn't supported it returns an error. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-24 23:40 ` Alexander Duyck @ 2025-04-25 13:11 ` Andrew Lunn 2025-04-25 15:41 ` Alexander Duyck 0 siblings, 1 reply; 44+ messages in thread From: Andrew Lunn @ 2025-04-25 13:11 UTC (permalink / raw) To: Alexander Duyck; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni > Part of the issue would be who owns the I2C driver. Both the firmware > and the host would need access to it. Rather than having to do a > handoff for that it is easier to have the firmware maintain the driver > and just process the requests for us via mailbox IPC calls. How do you incorporate that into sfp.c? sfp_i2c_configure() and sfp_i2c_get() expect a Linux I2C bus, since this is supposed to be a plain simple I2C bus. I'm not sure we want to encourage other abstractions of an I2C bus than the current Linux I2C bus abstraction. Andrew ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap 2025-04-25 13:11 ` Andrew Lunn @ 2025-04-25 15:41 ` Alexander Duyck 0 siblings, 0 replies; 44+ messages in thread From: Alexander Duyck @ 2025-04-25 15:41 UTC (permalink / raw) To: Andrew Lunn; +Cc: Jakub Kicinski, netdev, linux, hkallweit1, davem, pabeni On Fri, Apr 25, 2025 at 6:12 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > Part of the issue would be who owns the I2C driver. Both the firmware > > and the host would need access to it. Rather than having to do a > > handoff for that it is easier to have the firmware maintain the driver > > and just process the requests for us via mailbox IPC calls. > > How do you incorporate that into sfp.c? sfp_i2c_configure() and > sfp_i2c_get() expect a Linux I2C bus, since this is supposed to be a > plain simple I2C bus. I'm not sure we want to encourage other > abstractions of an I2C bus than the current Linux I2C bus abstraction. We would just present it as an i2c bus. Essentially all I have to do is have the xfer call package the i2c request into a message and send it over the IPC mailbox and then wait for a completion. It isn't as if the i2c is anything all that complex. Since our operations are all essentially atomic anyway we can do the read operations without any issues, and for now we essentially just ignore writes to the SFP cage since the DAC cables don't have anything to really write to anyway. The bigger issues will be dealing with page/bank and CMIS when the time comes. That is likely to be more work then how we deal with the i2c. For i2c it would just be a matter of making sure our abstraction we are presenting of the firmware interface can fit into whatever the model is we come up with. For now I can essentially just fake it by passing messages back and forth to the firmware and caching bytes 126 and 127 so that we keep track of what page/bank is being accessed by the host. One thing I will have to double check though is how we report back a failure if trying to set page/bank to an unsupported value. For now the FW was just rejecting the read attempt as we had to pass data to trigger it. We may need to add support for a 0 byte transaction to support setting page and bank just to test if we can set them or not if we are going to fake the extra verification step. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2025-04-25 15:42 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-04-16 15:28 [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Alexander Duyck 2025-04-16 15:28 ` [net-next PATCH 1/2] net: phylink: Drop unused defines for SUPPORTED/ADVERTISED_INTERFACES Alexander Duyck 2025-04-22 8:32 ` Russell King (Oracle) 2025-04-16 15:29 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Alexander Duyck 2025-04-16 16:01 ` Russell King (Oracle) 2025-04-16 16:16 ` [PATCH net] net: phylink: fix suspend/resume with WoL enabled and link down Russell King (Oracle) 2025-04-16 17:14 ` Russell King (Oracle) 2025-04-17 14:30 ` Alexander H Duyck 2025-04-17 14:35 ` Russell King (Oracle) 2025-04-17 15:23 ` Russell King (Oracle) 2025-04-17 17:06 ` Alexander Duyck 2025-04-17 17:27 ` Russell King (Oracle) 2025-04-17 19:49 ` Alexander Duyck 2025-04-22 9:51 ` Russell King (Oracle) 2025-04-22 15:30 ` Alexander Duyck 2025-04-22 16:00 ` Andrew Lunn 2025-04-22 20:09 ` Alexander Duyck 2025-04-23 0:00 ` patchwork-bot+netdevbpf 2025-04-16 17:12 ` [net-next PATCH 2/2] net: phylink: Fix issues with link balancing w/ BMC present Russell King (Oracle) 2025-04-16 19:03 ` Alexander Duyck 2025-04-16 19:19 ` Russell King (Oracle) 2025-04-16 20:05 ` Russell King (Oracle) 2025-04-16 22:58 ` Alexander Duyck 2025-04-19 18:11 ` [net-next PATCH 0/2] net: phylink: Fix issue w/ BMC link flap Andrew Lunn 2025-04-20 18:18 ` Alexander Duyck 2025-04-20 21:58 ` Andrew Lunn 2025-04-21 15:51 ` Alexander Duyck 2025-04-21 16:50 ` Alexander Duyck 2025-04-22 1:21 ` Jakub Kicinski 2025-04-22 13:49 ` Andrew Lunn 2025-04-22 15:28 ` Jakub Kicinski 2025-04-22 16:49 ` Andrew Lunn 2025-04-22 17:30 ` Russell King (Oracle) 2025-04-22 18:13 ` Andrew Lunn 2025-04-22 18:50 ` Russell King (Oracle) 2025-04-22 23:51 ` Jakub Kicinski 2025-04-22 21:29 ` Alexander Duyck 2025-04-22 22:26 ` Andrew Lunn 2025-04-22 23:06 ` Alexander Duyck 2025-04-23 18:38 ` Alexander Duyck 2025-04-24 20:34 ` Andrew Lunn 2025-04-24 23:40 ` Alexander Duyck 2025-04-25 13:11 ` Andrew Lunn 2025-04-25 15:41 ` Alexander Duyck
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).