* racing ndo_open()/phylink*connect() with phy_probe()
@ 2024-04-30 21:02 Andrew Halaney
2024-04-30 21:42 ` Russell King (Oracle)
2024-05-01 15:07 ` Andrew Lunn
0 siblings, 2 replies; 9+ messages in thread
From: Andrew Halaney @ 2024-04-30 21:02 UTC (permalink / raw)
To: linux-kernel, linux-arm-kernel, linux-stm32, netdev
Cc: alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni,
mcoquelin.stm32, andrew, hkallweit1, linux
Hi,
I've been taking a look at the following error message:
qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19)
end1 is using a phy on the mdio bus of end0, not on its own bus. Something
like this devicetree snippet highlights the relationship:
// end0
ðernet0 {
phy-mode = "sgmii";
phy-handle = <&sgmii_phy0>;
mdio {
compatible = "snps,dwmac-mdio";
sgmii_phy0: phy@8 {
compatible = "ethernet-phy-id0141.0dd4";
reg = <0x8>;
device_type = "ethernet-phy";
};
sgmii_phy1: phy@a {
compatible = "ethernet-phy-id0141.0dd4";
reg = <0xa>;
device_type = "ethernet-phy";
};
};
};
// end1
ðernet1 {
phy-mode = "sgmii";
phy-handle = <&sgmii_phy1>;
};
Basically, NetworkManager is setting both interfaces to up, and end1's
phy doesn't seem to be ready when ndo_open() runs, returning
-ENODEV in phylink_fwnode_phy_connect() and bubbling that back up. This doesn't
happen very often, but by shoving things into the initramfs or anything to
speed up probe/ndo_open() it's easier to reproduce. Delaying probe()
of end0 and then setting end1 up is another easy way to reproduce.
My question after looking around for a while, is what is the expectation
of userspace in this situation?
NetworkManager retries 4 times right now (tunable via autoconnect-retries setting),
and if you're lucky that's good enough. You could tell it to retry infinitely,
that should get me out of my bind at least, but it's not an amazing solution.
I'm used to dealing with deferral issues in probe() callbacks, but
to me it seems that phylink and netdev sort of move the phy part of the problem
till later (so you don't get the kernel retrying for you, etc, like you do
with deferred probes) when you ndo_open(). I guess the logic there is that
the phys could be hot pluggable, so with phylink we delay getting at them
until ndo_open()? I also guess if the mac is going to create an mdio bus
with phys off of it that also gets dicey as the phy could -EPROBE_DEFER
when you're trying to probe the mac and confirm its phy is ready..
Is retrying the correct solution here from userspace, or am I missing something?
I thought the "does not support carrier detection" below might be my ticket out
of this, but I can still reproduce it even after patching[0] to make that work.
I guess NetworkManager still brings the device up, but doesn't "activate" the
connection until the carrier detect stuff is done.
[0] https://lore.kernel.org/netdev/20240429-stmmac-no-ethtool-begin-v1-1-04c629c1c142@redhat.com/
Thanks,
Andrew
Some logs to illustrate the issue with a little more context (without
the patch in [0], in the end that doesn't help):
[ 1.541839] fedora kernel: qcom-ethqos 23000000.ethernet end1: renamed from eth0
[ 1.545750] fedora NetworkManager[407]: <info> [1709251201.1647] device (eth0): driver '(null)' does not support carrier detection.
[ 1.545826] fedora NetworkManager[407]: <info> [1709251201.1657] device (eth0): driver 'unknown' does not support carrier detection.
[ 1.545854] fedora NetworkManager[407]: <info> [1709251201.1660] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
[ 1.545880] fedora NetworkManager[407]: <info> [1709251201.1678] device (eth0): interface index 2 renamed iface from 'eth0' to 'end1'
[ 1.551721] fedora NetworkManager[407]: <info> [1709251201.1794] device (end1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
[ 1.554944] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0
[ 1.555962] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-1
[ 1.557711] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-2
[ 1.558721] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-3
[ 1.560297] fedora kernel: qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19)
[ 1.573664] fedora NetworkManager[407]: <info> [1709251201.2013] device (end1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
[ 1.574344] fedora NetworkManager[407]: <info> [1709251201.2020] policy: auto-activating connection 'Wired Connection' (bfe920e8-6031-4129-bf5c-78198427076a)
[ 1.574733] fedora NetworkManager[407]: <info> [1709251201.2024] device (end1): Activation: starting connection 'Wired Connection' (bfe920e8-6031-4129-bf5c-78198427076a)
[ 1.578589] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0
[ 1.579351] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-1
[ 1.580102] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-2
[ 1.578337] fedora NetworkManager[407]: <info> [1709251201.2027] device (end1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
[ 1.578404] fedora NetworkManager[407]: <info> [1709251201.2030] manager: NetworkManager state is now CONNECTING
[ 1.578431] fedora NetworkManager[407]: <info> [1709251201.2033] device (end1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
[ 1.580965] fedora kernel: qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-3
[ 1.582390] fedora kernel: qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19)
[ 1.593993] fedora NetworkManager[407]: <info> [1709251201.2217] device (end1): state change: config -> failed (reason 'config-failed', sys-iface-state: 'managed')
[ 1.648395] fedora NetworkManager[407]: <info> [1709251201.2220] manager: NetworkManager state is now DISCONNECTED
[ 1.648634] fedora NetworkManager[407]: <warn> [1709251201.2222] device (end1): Activation: failed for connection 'Wired Connection'
[ 1.648926] fedora NetworkManager[407]: <info> [1709251201.2224] device (end1): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
[ 1.649179] fedora NetworkManager[407]: <info> [1709251201.2233] manager: startup complete
[ 1.834016] fedora NetworkManager[407]: <info> [1709251201.4617] device (eth0): driver '(null)' does not support carrier detection.
[ 1.836553] fedora NetworkManager[407]: <info> [1709251201.4624] device (eth0): driver 'unknown' does not support carrier detection.
[ 1.836628] fedora NetworkManager[407]: <info> [1709251201.4627] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/3)
[ 1.896859] fedora kernel: qcom-ethqos 23040000.ethernet end0: renamed from eth0
[ 1.902243] fedora NetworkManager[407]: <info> [1709251201.5299] device (eth0): interface index 3 renamed iface from 'eth0' to 'end0'
[ 1.911400] fedora kernel: qcom-ethqos 23040000.ethernet end0: Register MEM_TYPE_PAGE_POOL RxQ-0
[ 1.909632] fedora NetworkManager[407]: <info> [1709251201.5357] device (end0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-04-30 21:02 racing ndo_open()/phylink*connect() with phy_probe() Andrew Halaney @ 2024-04-30 21:42 ` Russell King (Oracle) 2024-05-02 17:43 ` Andrew Halaney 2024-05-01 15:07 ` Andrew Lunn 1 sibling, 1 reply; 9+ messages in thread From: Russell King (Oracle) @ 2024-04-30 21:42 UTC (permalink / raw) To: Andrew Halaney Cc: linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, andrew, hkallweit1 On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote: > Basically, NetworkManager is setting both interfaces to up, and end1's > phy doesn't seem to be ready when ndo_open() runs, returning > -ENODEV in phylink_fwnode_phy_connect() and bubbling that back up. This doesn't Let's get something clear - you're attributing phylink to this, but this is not the case. phylink doesn't deal directly with PHYs, it makes use of phylib for that, and merely passes back to its caller whatever status it gets from phylib. It's also not fair to attribute this to phylib as we will see later... There are a few reasons for phylink_fwnode_phy_connect() would return -ENODEV: 1) fwnode_get_phy_node() (a phylib function) returning an error, basically meaning the phy node isn't found. This would be a persistent error, so unlikely to be your issue. 2) fwnode_phy_find_device() (another phylib function) not finding the PHY device corresponding to the fwnode returned by the above on the MDIO bus. This is possible if the PHY has not been detected on the MDIO bus, but I suspect this is not the cause of your issue. 3) phy_attach_direct() (another phylib function) returning an error. This function calls phy_init_hw() which will attempt to talk to the hardware, and if that returns an error, it will be propagated up. (3) is the most likely scenario given your quoted DT description. I suspect that the stmmac/qcom-ethqos driver is what's at fault here. Your DT description shows that the PHYs are on one MDIO bus owned by one of the network interfaces. I suspect if that network interface is down, then the MDIO bus is not accessible. Therefore, when you attempt to open the _other_ network interface, accesses to its PHY fail with -ENODEV and that gets propagated all the way back up. What's more is if you did manage to get that network interface up (because the one with the MDIO bus on was up) then if you take that interface down, you'll end up with a phy_error() splat from phylib because the PHY it was using has become inaccessible. Basically, the network driver is buggy for this PHY setup. If a MDIO bus contains devices that are not owned by the network device owning that MDIO bus, then the MDIO bus _must_ be prepared to handle MDIO bus accesses _at_ _any_ _time_. This clearly is not the case here. It could also be the case that if the driver is using runtime PM, that when the network interface is runtime-PM suspended, it causes MDIO bus accesses to fail... that would be very chaotic though. In any case, I'm going to say... I don't think this is a phylink nor phylib issue, but a buggy network driver thinking that it has the right to shutdown MDIO bus access depending on its network interface state. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-04-30 21:42 ` Russell King (Oracle) @ 2024-05-02 17:43 ` Andrew Halaney 2024-05-02 21:26 ` Serge Semin 0 siblings, 1 reply; 9+ messages in thread From: Andrew Halaney @ 2024-05-02 17:43 UTC (permalink / raw) To: Russell King (Oracle) Cc: linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, andrew, hkallweit1 On Tue, Apr 30, 2024 at 10:42:58PM +0100, Russell King (Oracle) wrote: > On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote: > > Basically, NetworkManager is setting both interfaces to up, and end1's > > phy doesn't seem to be ready when ndo_open() runs, returning > > -ENODEV in phylink_fwnode_phy_connect() and bubbling that back up. This doesn't > > Let's get something clear - you're attributing phylink to this, but this > is not the case. phylink doesn't deal directly with PHYs, it makes use > of phylib for that, and merely passes back to its caller whatever status > it gets from phylib. It's also not fair to attribute this to phylib as > we will see later... Sorry for the delay, I wanted to try and test with some extra logs in the legit setup (not my "simulate via EPROBE_DEFER delays" approach) which is tedious with the initramfs (plus I wasted time failing to ftrace some stuff :P) to reconvince me of old notes. Thanks for the explanation above on the nuances between phylink and phylib, I really appreciate it. > > There are a few reasons for phylink_fwnode_phy_connect() would return > -ENODEV: > > 1) fwnode_get_phy_node() (a phylib function) returning an error, > basically meaning the phy node isn't found. This would be a persistent > error, so unlikely to be your issue. > > 2) fwnode_phy_find_device() (another phylib function) not finding the > PHY device corresponding to the fwnode returned by the above on the > MDIO bus. This is possible if the PHY has not been detected on the > MDIO bus, but I suspect this is not the cause of your issue. So I think we're in this case. I added some extra logs to see which of the cases we were hitting, as well as some extra logs in phy creation code etc to come to that conclusion: # end1 probe start (and finish) [ 1.424099] qcom-ethqos 23000000.ethernet: Adding to iommu group 2 ... [ 1.431267] qcom-ethqos 23000000.ethernet: Using 40/40 bits DMA host/device width # end0 probe start [ 1.440517] qcom-ethqos 23040000.ethernet: Adding to iommu group 3 ... [ 1.443502] qcom-ethqos 23040000.ethernet: Using 40/40 bits DMA host/device width # end0 starts making the mdio bus, and phy devices [ 1.443537] qcom-ethqos 23040000.ethernet: Before of_mdiobus_reg # create phy at addr 0x8, end0's phy [ 1.450118] Starting phy_create_device for addr: 8 # NetworkManager up'ed end1! and again. But the device we're needing # (0xa) isn't created yet [ 1.459743] qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0 ... [ 1.465168] Failed at fwnode_phy_find_device [ 1.465174] qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) [ 1.473687] qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0 ... [ 1.477637] Failed at fwnode_phy_find_device [ 1.477643] qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) # device created for 0x8, probe it [ 1.531617] Ending phy_create_device for addr: 8 [ 1.627462] Marvell 88E1510 stmmac-0:08: Starting probe [ 1.627644] hwmon hwmon0: temp1_input not attached to any thermal zone [ 1.627650] Marvell 88E1510 stmmac-0:08: Ending probe # device created for 0xa, probe it [ 1.628992] Starting phy_create_device for addr: a [ 1.632615] Ending phy_create_device for addr: a [ 1.731552] Marvell 88E1510 stmmac-0:0a: Starting probe [ 1.731732] hwmon hwmon1: temp1_input not attached to any thermal zone [ 1.731738] Marvell 88E1510 stmmac-0:0a: Ending probe # end0 is done probing now [ 1.732804] qcom-ethqos 23040000.ethernet: After of_mdiobus_reg [ 1.820725] qcom-ethqos 23040000.ethernet end0: renamed from eth0 # NetworkManager up's end0 [ 1.851805] qcom-ethqos 23040000.ethernet end0: Register MEM_TYPE_PAGE_POOL RxQ-0 ... [ 1.914980] qcom-ethqos 23040000.ethernet end0: PHY [stmmac-0:08] driver [Marvell 88E1510] (irq=233) ... [ 1.939432] qcom-ethqos 23040000.ethernet end0: configuring for phy/sgmii link mode ... [ 4.451765] qcom-ethqos 23040000.ethernet end0: Link is Up - 1Gbps/Full - flow control rx/tx So end1 is up'ed before end0 can finish making its mdio bus / phy devices, and therefore we fail to find it. I can easily simulate this situation as well by -EPROBE_DEFER'ing end0 for say 10 seconds. In playing around with this I also discovered that if end1's marvell phy -EPROBE_DEFERs for a bit, up'ing end1 results in matching against the Generic PHY driver, so then things don't work network wise. That's a similar topic, but probably should be discussed separately? Mentioning it now before I forget though. Here's some logs: # Probe end1 [ 8.245164] qcom-ethqos 23000000.ethernet: Adding to iommu group 8 ... [ 8.377010] qcom-ethqos 23000000.ethernet: Using 40/40 bits DMA host/device width # Probe end0 [ 8.396919] qcom-ethqos 23040000.ethernet: Adding to iommu group 9 ... [ 8.513481] qcom-ethqos 23040000.ethernet: Using 40/40 bits DMA host/device width [ 8.521475] qcom-ethqos 23040000.ethernet: Before of_mdiobus_reg [ 8.529283] Starting phy_create_device for addr: 8 [ 8.553872] Ending phy_create_device for addr: 8 [ 8.714637] Marvell 88E1510 stmmac-0:08: Ending probe [ 8.721627] Starting phy_create_device for addr: a [ 8.729064] Ending phy_create_device for addr: a [ 8.898759] qcom-ethqos 23040000.ethernet: After of_mdiobus_reg ... # NetworkManager ups end0 [ 9.028419] qcom-ethqos 23040000.ethernet end0: Register MEM_TYPE_PAGE_POOL RxQ-0 ... [ 9.092839] net end0: Before phylink_fwnode_phy_connect [ 9.164375] qcom-ethqos 23040000.ethernet end0: PHY [stmmac-0:08] driver [Marvell 88E1510] (irq=280) [ 9.174201] net end0: After phylink_fwnode_phy_connect ... # NetworkManager ups end1, get the Generic PHY instead of marvell... [ 9.257364] qcom-ethqos 23040000.ethernet end0: configuring for phy/sgmii link mode ... [ 9.317542] net end1: Before phylink_fwnode_phy_connect [ 9.404384] qcom-ethqos 23000000.ethernet end1: PHY [stmmac-0:0a] driver [Generic PHY] (irq=POLL) [ 9.414730] net end1: After phylink_fwnode_phy_connect ... [ 9.509450] qcom-ethqos 23000000.ethernet end1: configuring for phy/sgmii link mode # end0 comes up, end1 doesn't due to the wrong phy being selected here [ 11.672223] qcom-ethqos 23040000.ethernet end0: Link is Up - 1Gbps/Full - flow control rx/tx > > 3) phy_attach_direct() (another phylib function) returning an error. > This function calls phy_init_hw() which will attempt to talk to the > hardware, and if that returns an error, it will be propagated up. > > (3) is the most likely scenario given your quoted DT description. I > suspect that the stmmac/qcom-ethqos driver is what's at fault here. > > Your DT description shows that the PHYs are on one MDIO bus owned by > one of the network interfaces. I suspect if that network interface > is down, then the MDIO bus is not accessible. > > Therefore, when you attempt to open the _other_ network interface, > accesses to its PHY fail with -ENODEV and that gets propagated all > the way back up. > > What's more is if you did manage to get that network interface up > (because the one with the MDIO bus on was up) then if you take > that interface down, you'll end up with a phy_error() splat from > phylib because the PHY it was using has become inaccessible. > > Basically, the network driver is buggy for this PHY setup. If a > MDIO bus contains devices that are not owned by the network device > owning that MDIO bus, then the MDIO bus _must_ be prepared to handle > MDIO bus accesses _at_ _any_ _time_. This clearly is not the case > here. As far as I can tell, at least from the "link up/down" perspective, any combo works. If I boot (without NetworkManager doing things), I can play around with any combo of link up and down without any noticeable issue. > > It could also be the case that if the driver is using runtime PM, > that when the network interface is runtime-PM suspended, it causes > MDIO bus accesses to fail... that would be very chaotic though. > > In any case, I'm going to say... I don't think this is a phylink nor > phylib issue, but a buggy network driver thinking that it has the > right to shutdown MDIO bus access depending on its network interface > state. > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-05-02 17:43 ` Andrew Halaney @ 2024-05-02 21:26 ` Serge Semin 2024-05-03 1:25 ` Andrew Lunn 0 siblings, 1 reply; 9+ messages in thread From: Serge Semin @ 2024-05-02 21:26 UTC (permalink / raw) To: Andrew Halaney, Russell King (Oracle), Andrew Lunn Cc: linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1 Hi all On Thu, May 02, 2024 at 12:43:27PM -0500, Andrew Halaney wrote: > On Tue, Apr 30, 2024 at 10:42:58PM +0100, Russell King (Oracle) wrote: > > On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote: > > > Basically, NetworkManager is setting both interfaces to up, and end1's > > > phy doesn't seem to be ready when ndo_open() runs, returning > > > -ENODEV in phylink_fwnode_phy_connect() and bubbling that back up. This doesn't > > > > Let's get something clear - you're attributing phylink to this, but this > > is not the case. phylink doesn't deal directly with PHYs, it makes use > > of phylib for that, and merely passes back to its caller whatever status > > it gets from phylib. It's also not fair to attribute this to phylib as > > we will see later... > > Sorry for the delay, I wanted to try and test with some extra logs in > the legit setup (not my "simulate via EPROBE_DEFER delays" approach) > which is tedious with the initramfs (plus I wasted time failing to > ftrace some stuff :P) to reconvince me of old notes. Thanks for the > explanation above on the nuances between phylink and phylib, I really > appreciate it. > > > > > There are a few reasons for phylink_fwnode_phy_connect() would return > > -ENODEV: > > > > 1) fwnode_get_phy_node() (a phylib function) returning an error, > > basically meaning the phy node isn't found. This would be a persistent > > error, so unlikely to be your issue. > > > > 2) fwnode_phy_find_device() (another phylib function) not finding the > > PHY device corresponding to the fwnode returned by the above on the > > MDIO bus. This is possible if the PHY has not been detected on the > > MDIO bus, but I suspect this is not the cause of your issue. > > So I think we're in this case. I added some extra logs to see which > of the cases we were hitting, as well as some extra logs in phy creation > code etc to come to that conclusion: > > # end1 probe start (and finish) > [ 1.424099] qcom-ethqos 23000000.ethernet: Adding to iommu group 2 > ... > [ 1.431267] qcom-ethqos 23000000.ethernet: Using 40/40 bits DMA host/device width > > # end0 probe start > [ 1.440517] qcom-ethqos 23040000.ethernet: Adding to iommu group 3 > ... > [ 1.443502] qcom-ethqos 23040000.ethernet: Using 40/40 bits DMA host/device width > > # end0 starts making the mdio bus, and phy devices > [ 1.443537] qcom-ethqos 23040000.ethernet: Before of_mdiobus_reg > > # create phy at addr 0x8, end0's phy > [ 1.450118] Starting phy_create_device for addr: 8 > > # NetworkManager up'ed end1! and again. But the device we're needing > # (0xa) isn't created yet > [ 1.459743] qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0 > ... > [ 1.465168] Failed at fwnode_phy_find_device > [ 1.465174] qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) > [ 1.473687] qcom-ethqos 23000000.ethernet end1: Register MEM_TYPE_PAGE_POOL RxQ-0 > ... > [ 1.477637] Failed at fwnode_phy_find_device > [ 1.477643] qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) > > # device created for 0x8, probe it > [ 1.531617] Ending phy_create_device for addr: 8 > [ 1.627462] Marvell 88E1510 stmmac-0:08: Starting probe > [ 1.627644] hwmon hwmon0: temp1_input not attached to any thermal zone > [ 1.627650] Marvell 88E1510 stmmac-0:08: Ending probe > > # device created for 0xa, probe it > [ 1.628992] Starting phy_create_device for addr: a > [ 1.632615] Ending phy_create_device for addr: a > [ 1.731552] Marvell 88E1510 stmmac-0:0a: Starting probe > [ 1.731732] hwmon hwmon1: temp1_input not attached to any thermal zone > [ 1.731738] Marvell 88E1510 stmmac-0:0a: Ending probe > > # end0 is done probing now > [ 1.732804] qcom-ethqos 23040000.ethernet: After of_mdiobus_reg > [ 1.820725] qcom-ethqos 23040000.ethernet end0: renamed from eth0 > > # NetworkManager up's end0 > [ 1.851805] qcom-ethqos 23040000.ethernet end0: Register MEM_TYPE_PAGE_POOL RxQ-0 > ... > [ 1.914980] qcom-ethqos 23040000.ethernet end0: PHY [stmmac-0:08] driver [Marvell 88E1510] (irq=233) > ... > [ 1.939432] qcom-ethqos 23040000.ethernet end0: configuring for phy/sgmii link mode > ... > [ 4.451765] qcom-ethqos 23040000.ethernet end0: Link is Up - 1Gbps/Full - flow control rx/tx > > So end1 is up'ed before end0 can finish making its mdio bus / phy > devices, and therefore we fail to find it. I can easily simulate this > situation as well by -EPROBE_DEFER'ing end0 for say 10 seconds. AFAICS the problem is in the race between the end0 and end1 device probes. Right? If so then can't the order be fixed by adding the links between the OF-devices? As it's already done for various phandle-based references like "clocks", "gpios", "phys", etc? * Before this topic was raised I had thought it was working for any phandle-based dependencies, but apparently it wasn't and the supplier/consumer linkage was supposed to be implemented for each particular case. The "phy-handle" property lacks that feature support (see drivers/of/property.c:of_supplier_bindings and of_fwnode_add_links() for details). -Serge(y) > [...] ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-05-02 21:26 ` Serge Semin @ 2024-05-03 1:25 ` Andrew Lunn 2024-05-07 14:48 ` Andrew Halaney 0 siblings, 1 reply; 9+ messages in thread From: Andrew Lunn @ 2024-05-03 1:25 UTC (permalink / raw) To: Serge Semin Cc: Andrew Halaney, Russell King (Oracle), linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1 > AFAICS the problem is in the race between the end0 and end1 device > probes. Right? > If so then can't the order be fixed by adding the links between the > OF-devices? As it's already done for various phandle-based references > like "clocks", "gpios", "phys", etc? It gets tricky because an MDIO bus master device is often a sub device of an Ethernet MAC driver. Typically how it works is that a MAC driver probes. Part way through the probe it creates an MDIO bus driver, which enumerates the MDIO bus and creates the PHYs. Later in the MAC driver probe, or maybe when the MAC driver is opened, it follows the phy-handle to a PHY on its own MDIO bus. If you were to say it cannot probe the MAC driver until the MDIO bus driver is created and the PHYs created, you never get anywhere, because it is the act of probing the MAC driver which creates the PHYs which fulfils the phandle. You would need to differentiate between a phandle pointing deeper into the same branch of a DT tree, or pointing to a different branch of a DT tree. If it is deeper within the same branch, you need to probe in order to make progress. If it points to a different branch you need to defer until that sub-branch has successfully probed. And if you get two branches which are mutually dependent on each other, probe and hope EPROBE_DEFER solves it. Andrew ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-05-03 1:25 ` Andrew Lunn @ 2024-05-07 14:48 ` Andrew Halaney 0 siblings, 0 replies; 9+ messages in thread From: Andrew Halaney @ 2024-05-07 14:48 UTC (permalink / raw) To: Andrew Lunn Cc: Serge Semin, Russell King (Oracle), linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1 On Fri, May 03, 2024 at 03:25:19AM GMT, Andrew Lunn wrote: > > AFAICS the problem is in the race between the end0 and end1 device > > probes. Right? > > If so then can't the order be fixed by adding the links between the > > OF-devices? As it's already done for various phandle-based references > > like "clocks", "gpios", "phys", etc? Thanks for the pointer here Serge, I had no idea (still don't have much of an idea) on how this works. I think this makes sense to explore some more. Hopefully sometime this week I'll poke at this more. > > It gets tricky because an MDIO bus master device is often a sub device > of an Ethernet MAC driver. Typically how it works is that a MAC driver > probes. Part way through the probe it creates an MDIO bus driver, > which enumerates the MDIO bus and creates the PHYs. Later in the MAC > driver probe, or maybe when the MAC driver is opened, it follows the > phy-handle to a PHY on its own MDIO bus. > > If you were to say it cannot probe the MAC driver until the MDIO bus > driver is created and the PHYs created, you never get anywhere, > because it is the act of probing the MAC driver which creates the PHYs > which fulfils the phandle. > > You would need to differentiate between a phandle pointing deeper into > the same branch of a DT tree, or pointing to a different branch of a > DT tree. If it is deeper within the same branch, you need to probe in > order to make progress. If it points to a different branch you need to > defer until that sub-branch has successfully probed. And if you get > two branches which are mutually dependent on each other, probe and > hope EPROBE_DEFER solves it. I'll keep this relationship in mind. IIUC the fw_devlink stuff sort of handles cycles, but I need to look into how all that works further. At least in my example device, end0 is in this situation, whereas end1 is in the other situation, so I have a decent test setup for that. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-04-30 21:02 racing ndo_open()/phylink*connect() with phy_probe() Andrew Halaney 2024-04-30 21:42 ` Russell King (Oracle) @ 2024-05-01 15:07 ` Andrew Lunn 2024-05-01 15:19 ` Russell King (Oracle) 1 sibling, 1 reply; 9+ messages in thread From: Andrew Lunn @ 2024-05-01 15:07 UTC (permalink / raw) To: Andrew Halaney Cc: linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1, linux On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote: > Hi, > > I've been taking a look at the following error message: > > qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) > > end1 is using a phy on the mdio bus of end0, not on its own bus. Something > like this devicetree snippet highlights the relationship: > > // end0 > ðernet0 { > phy-mode = "sgmii"; > phy-handle = <&sgmii_phy0>; > > mdio { > compatible = "snps,dwmac-mdio"; > sgmii_phy0: phy@8 { > compatible = "ethernet-phy-id0141.0dd4"; > reg = <0x8>; > device_type = "ethernet-phy"; > }; > > sgmii_phy1: phy@a { > compatible = "ethernet-phy-id0141.0dd4"; > reg = <0xa>; > device_type = "ethernet-phy"; > }; > }; > }; > > // end1 > ðernet1 { > phy-mode = "sgmii"; > phy-handle = <&sgmii_phy1>; > }; I agree with Russell here. You need to determine where the ENODEV comes from. It seems unlikely, but one possibility is: static int stmmac_xgmac2_mdio_read_c22(struct mii_bus *bus, int phyaddr, int phyreg) { struct net_device *ndev = bus->priv; struct stmmac_priv *priv; u32 addr; priv = netdev_priv(ndev); /* Until ver 2.20 XGMAC does not support C22 addr >= 4 */ if (priv->synopsys_id < DWXGMAC_CORE_2_20 && phyaddr > MII_XGMAC_MAX_C22ADDR) return -ENODEV; Maybe if the interface is down, priv->synopsys_id has not been set yet? Your devices are at address 8 and 10, so if priv->synopsys_id where 0, this would give you the ENODEV. Once you know the root cause, we might be able to make suggests how to fix it. Andrew ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-05-01 15:07 ` Andrew Lunn @ 2024-05-01 15:19 ` Russell King (Oracle) 2024-05-01 15:23 ` Andrew Lunn 0 siblings, 1 reply; 9+ messages in thread From: Russell King (Oracle) @ 2024-05-01 15:19 UTC (permalink / raw) To: Andrew Lunn Cc: Andrew Halaney, linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1 On Wed, May 01, 2024 at 05:07:44PM +0200, Andrew Lunn wrote: > On Tue, Apr 30, 2024 at 04:02:19PM -0500, Andrew Halaney wrote: > > Hi, > > > > I've been taking a look at the following error message: > > > > qcom-ethqos 23000000.ethernet end1: __stmmac_open: Cannot attach to PHY (error: -19) > > > > end1 is using a phy on the mdio bus of end0, not on its own bus. Something > > like this devicetree snippet highlights the relationship: > > > > // end0 > > ðernet0 { > > phy-mode = "sgmii"; > > phy-handle = <&sgmii_phy0>; > > > > mdio { > > compatible = "snps,dwmac-mdio"; > > sgmii_phy0: phy@8 { > > compatible = "ethernet-phy-id0141.0dd4"; > > reg = <0x8>; > > device_type = "ethernet-phy"; > > }; > > > > sgmii_phy1: phy@a { > > compatible = "ethernet-phy-id0141.0dd4"; > > reg = <0xa>; > > device_type = "ethernet-phy"; > > }; > > }; > > }; > > > > // end1 > > ðernet1 { > > phy-mode = "sgmii"; > > phy-handle = <&sgmii_phy1>; > > }; > > I agree with Russell here. You need to determine where the ENODEV > comes from. > > It seems unlikely, but one possibility is: > > static int stmmac_xgmac2_mdio_read_c22(struct mii_bus *bus, int phyaddr, > int phyreg) > { > struct net_device *ndev = bus->priv; > struct stmmac_priv *priv; > u32 addr; > > priv = netdev_priv(ndev); > > /* Until ver 2.20 XGMAC does not support C22 addr >= 4 */ > if (priv->synopsys_id < DWXGMAC_CORE_2_20 && > phyaddr > MII_XGMAC_MAX_C22ADDR) > return -ENODEV; > > Maybe if the interface is down, priv->synopsys_id has not been set > yet? Your devices are at address 8 and 10, so if priv->synopsys_id > where 0, this would give you the ENODEV. Yes, I did look at that, but didn't read the DT snippet to realise that the addresses would be trapped by that. So, looking at where priv->synopsys_id is set... is in stmmac_hwif_init(), which is called from stmmac_hw_init(), which in turn is called from stmmac_dvr_probe(). This is the only path that leads here. This all happens before the MDIO bus is registered, so the value of priv->synopsys_id should be set by the time the MDIO bus is registered. So I'm unconvinced... but I think Andrew Halaney needs to delve deeper into debugging what is going on when the failure occurs. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: racing ndo_open()/phylink*connect() with phy_probe() 2024-05-01 15:19 ` Russell King (Oracle) @ 2024-05-01 15:23 ` Andrew Lunn 0 siblings, 0 replies; 9+ messages in thread From: Andrew Lunn @ 2024-05-01 15:23 UTC (permalink / raw) To: Russell King (Oracle) Cc: Andrew Halaney, linux-kernel, linux-arm-kernel, linux-stm32, netdev, alexandre.torgue, joabreu, davem, edumazet, kuba, pabeni, mcoquelin.stm32, hkallweit1 > > It seems unlikely, but one possibility is: > > > > static int stmmac_xgmac2_mdio_read_c22(struct mii_bus *bus, int phyaddr, > > int phyreg) > > { > > struct net_device *ndev = bus->priv; > > struct stmmac_priv *priv; > > u32 addr; > > > > priv = netdev_priv(ndev); > > > > /* Until ver 2.20 XGMAC does not support C22 addr >= 4 */ > > if (priv->synopsys_id < DWXGMAC_CORE_2_20 && > > phyaddr > MII_XGMAC_MAX_C22ADDR) > > return -ENODEV; > > > > Maybe if the interface is down, priv->synopsys_id has not been set > > yet? Your devices are at address 8 and 10, so if priv->synopsys_id > > where 0, this would give you the ENODEV. > > Yes, I did look at that, but didn't read the DT snippet to realise > that the addresses would be trapped by that. So, looking at where > priv->synopsys_id is set... is in stmmac_hwif_init(), which is > called from stmmac_hw_init(), which in turn is called from > stmmac_dvr_probe(). This is the only path that leads here. > > This all happens before the MDIO bus is registered, so the value of > priv->synopsys_id should be set by the time the MDIO bus is registered. > > So I'm unconvinced... Me too. I just wanted to give an example of why it is important to track down the source of the ENODEV. Andrew ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-05-07 14:48 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-04-30 21:02 racing ndo_open()/phylink*connect() with phy_probe() Andrew Halaney 2024-04-30 21:42 ` Russell King (Oracle) 2024-05-02 17:43 ` Andrew Halaney 2024-05-02 21:26 ` Serge Semin 2024-05-03 1:25 ` Andrew Lunn 2024-05-07 14:48 ` Andrew Halaney 2024-05-01 15:07 ` Andrew Lunn 2024-05-01 15:19 ` Russell King (Oracle) 2024-05-01 15:23 ` Andrew Lunn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).