* [BUG] SFP I2C timeout forces link down with PHY_ERROR @ 2024-05-28 16:57 Sean Anderson 2024-05-28 17:28 ` Russell King (Oracle) 0 siblings, 1 reply; 7+ messages in thread From: Sean Anderson @ 2024-05-28 16:57 UTC (permalink / raw) To: Andrew Lunn, Russell King, Andi Shyti, netdev@vger.kernel.org, linux-i2c Cc: Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Hi, I saw the following warning [1] twice when testing 1000Base-T SFP modules: [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed [ 1481.699910] ------------[ cut here ]------------ [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec <snip> [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down and a second time with some other errors too: [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! [ 65.998108] cdns-i2c ff030000.i2c: timeout waiting on completion [ 66.010558] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed [ 66.017856] ------------[ cut here ]------------ [ 66.022786] phy_check_link_status+0x0/0xcc: returned: -67 [ 66.028255] WARNING: CPU: 0 PID: 70 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2b8 <snip> [ 66.339533] macb ff0c0000.ethernet net1: Link is Down The chain of events is: - The I2C transaction times out for some reason (in the latter case due to a known hardware bug). - mdio-i2c converts the error response to a 0xffff return value - genphy_read_lpa sees that LPA_1000MSFAIL is set in MII_STAT1000 and returns -ENOLINK. This propagates up the calls stack. - phy_check_link_status returns -ENOLINK - phy_error_precise forces the link down with state = PHY_ERROR. The problem with this is that although the register read fails due to a temporary condition, the link goes down permanently (or at least until the admin cycles the interface state). I think some part of the stack should implement a retry mechanism, but I'm not sure which part. One idea could be to have mdio-i2c propagate negative errors instead of converting them to successful reads of 0xffff. But we would still need to handle that in the phy driver or in phy_state_machine. - Are I2C bus drivers supposed to be flaky like this? That is, are callers of i2c_transfer expected to handle the occasional spurious error? - Similarly, are MDIO bus drivers allowed to be flaky? - Is ETIMEDOUT even supposed to be recoverable? Maybe we should have cdns-i2c return EAGAIN instead so it gets retried by the bus arbitration logic in __i2c_transfer. - ENOLINK really seems like something which we could recover from by resetting the phy (or even just waiting a bit). Maybe we should have the phy state machine just switch to PHY_NOLINK? Of course, the best option would be to fix cdns-i2c to not be buggy, but the hardware itself is buggy in at least one of the above cases so that may not be practical. --Sean _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 16:57 [BUG] SFP I2C timeout forces link down with PHY_ERROR Sean Anderson @ 2024-05-28 17:28 ` Russell King (Oracle) 2024-05-28 17:50 ` Sean Anderson 0 siblings, 1 reply; 7+ messages in thread From: Russell King (Oracle) @ 2024-05-28 17:28 UTC (permalink / raw) To: Sean Anderson Cc: Andrew Lunn, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org First, note that phylib's policy is if it loses comms with the PHY, then the link will be forced down. This is out of control of the SFP or phylink code. I've seen bugs with the I2C emulation on some modules resulting in problems with various I2C controllers. Sometimes the problem is due to a bad I2C level shifter. Some I2C level shifter manufacturers will swear blind that their shifter doesn't lock up, but strangely, one can prove with an osciloscope that it _does_ lock up - and in a way that the only way to recover was to possibly unplug the module or poewr cycle the platform. My advice would be to investigate the hardware in the first instance. On Tue, May 28, 2024 at 12:57:25PM -0400, Sean Anderson wrote: > Hi, > > I saw the following warning [1] twice when testing 1000Base-T SFP > modules: > > [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion > [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed > [ 1481.699910] ------------[ cut here ]------------ > [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 > [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec > <snip> > [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down > > and a second time with some other errors too: > > [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! > [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! I2C driver bug? From what I can see, this occurs when there is further data to be read, and id->recv_count hits zero. The I2C controller is entirely in control of how many bytes are transferred from the remote device, and it should raise a NAK on the last byte before signalling a STOP condition during a read. > I think some part of the stack should implement a retry mechanism, but > I'm not sure which part. One idea could be to have mdio-i2c propagate > negative errors instead of converting them to successful reads of > 0xffff. That would unfortunately break phylib's PHY probing. > - Are I2C bus drivers supposed to be flaky like this? That is, are callers of > i2c_transfer expected to handle the occasional spurious error? I2C transfers - to some extent - are supposed to have a number of retries, but that's for the I2C device not responding to its address. Otherwise, the bus is supposed to be reliable (there is no form of error detection however - there's no CRCs or similar.) The problem with merely retrying the transaction is a register read from a PHY may have side-effects (such as the BMSR's LSTATUS bit which is latched in link-fail state until the next read. Or a register pointer could be incremented. So it's not simple to solve at bus level. > - Similarly, are MDIO bus drivers allowed to be flaky? No. I think the only realistic method would be for phylib to attempt to reprogram the PHY, but that would need lots of changes to phylib. Many drivers now do not check whether the PHY accesses they are performing succeeded or not, and rely on the failure being permanent. > Of course, the best option would be to fix cdns-i2c to not be buggy, but > the hardware itself is buggy in at least one of the above cases so that > may not be practical. Well, I don't think there's much option. If I2C drivers are flakey maybe its better to use GPIOs instead of the broken "inteligent" hardware. Maybe Andrew has a different view however. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 17:28 ` Russell King (Oracle) @ 2024-05-28 17:50 ` Sean Anderson 2024-05-28 17:52 ` Sean Anderson 2024-05-28 18:22 ` Russell King (Oracle) 0 siblings, 2 replies; 7+ messages in thread From: Sean Anderson @ 2024-05-28 17:50 UTC (permalink / raw) To: Russell King (Oracle) Cc: Andrew Lunn, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org On 5/28/24 13:28, Russell King (Oracle) wrote: > First, note that phylib's policy is if it loses comms with the PHY, > then the link will be forced down. This is out of control of the SFP > or phylink code. > > I've seen bugs with the I2C emulation on some modules resulting in > problems with various I2C controllers. > > Sometimes the problem is due to a bad I2C level shifter. Some I2C > level shifter manufacturers will swear blind that their shifter > doesn't lock up, but strangely, one can prove with an osciloscope > that it _does_ lock up - and in a way that the only way to recover > was to possibly unplug the module or poewr cycle the platform. Well, I haven't seen any case where the bus locks up. I've been able to recover just by doing ip link set net0 down ip link set net0 up which suggests that this is just a transient problem. > My advice would be to investigate the hardware in the first instance. I'll try to keep this in mind, but it's pretty infrequent and I probably won't be able to test anything until I can reproduce it better. > On Tue, May 28, 2024 at 12:57:25PM -0400, Sean Anderson wrote: >> Hi, >> >> I saw the following warning [1] twice when testing 1000Base-T SFP >> modules: >> >> [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion >> [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed >> [ 1481.699910] ------------[ cut here ]------------ >> [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 >> [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec >> <snip> >> [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down >> >> and a second time with some other errors too: >> >> [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! >> [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! > > I2C driver bug? From what I can see, this occurs when there is further > data to be read, and id->recv_count hits zero. The I2C controller is > entirely in control of how many bytes are transferred from the remote > device, and it should raise a NAK on the last byte before signalling a > STOP condition during a read. Commit bbf967b223b3 ("i2c: cadence: Handle transfer_size rollover") makes it seem like a hardware error. E.g. Linux thinks we're done but the hardware thinks there's still more data. I've added Alex to CC; maybe he can comment. >> I think some part of the stack should implement a retry mechanism, but >> I'm not sure which part. One idea could be to have mdio-i2c propagate >> negative errors instead of converting them to successful reads of >> 0xffff. > > That would unfortunately break phylib's PHY probing. > >> - Are I2C bus drivers supposed to be flaky like this? That is, are callers of >> i2c_transfer expected to handle the occasional spurious error? > > I2C transfers - to some extent - are supposed to have a number of > retries, but that's for the I2C device not responding to its address. > Otherwise, the bus is supposed to be reliable (there is no form of > error detection however - there's no CRCs or similar.) > > The problem with merely retrying the transaction is a register read > from a PHY may have side-effects (such as the BMSR's LSTATUS bit > which is latched in link-fail state until the next read. Or a > register pointer could be incremented. So it's not simple to solve > at bus level. OK... >> - Similarly, are MDIO bus drivers allowed to be flaky? > > No. > > I think the only realistic method would be for phylib to attempt to > reprogram the PHY, but that would need lots of changes to phylib. Would it? Maybe we just need something like if (err == -ENOLINK) { phy_init_hw(phydev); needs_aneg = true; phydev->state = PHY_UP; err = 0; } in the phy_state_machine switch statement under PHY_NOLINK and PHY_RUNNING. The phy_init_hw wouldn't even be necessary for this case (but would probably be a good idea in the general case where master/slave resolution fails). > Many drivers now do not check whether the PHY accesses they are > performing succeeded or not, and rely on the failure being permanent. Well, this driver does, which is how the error gets propagated all the way up to phy_state_machine. >> Of course, the best option would be to fix cdns-i2c to not be buggy, but >> the hardware itself is buggy in at least one of the above cases so that >> may not be practical. > > Well, I don't think there's much option. If I2C drivers are flakey maybe > its better to use GPIOs instead of the broken "inteligent" hardware. The CPU on this device is already underpowered, so I'd rather not resort to bitbanging. --Sean _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 17:50 ` Sean Anderson @ 2024-05-28 17:52 ` Sean Anderson 2024-05-28 18:14 ` Andrew Lunn 2024-05-28 18:22 ` Russell King (Oracle) 1 sibling, 1 reply; 7+ messages in thread From: Sean Anderson @ 2024-05-28 17:52 UTC (permalink / raw) To: Russell King (Oracle), Alex Williams Cc: Andrew Lunn, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org (forgot to CC Alex) On 5/28/24 13:50, Sean Anderson wrote: > On 5/28/24 13:28, Russell King (Oracle) wrote: >> First, note that phylib's policy is if it loses comms with the PHY, >> then the link will be forced down. This is out of control of the SFP >> or phylink code. >> >> I've seen bugs with the I2C emulation on some modules resulting in >> problems with various I2C controllers. >> >> Sometimes the problem is due to a bad I2C level shifter. Some I2C >> level shifter manufacturers will swear blind that their shifter >> doesn't lock up, but strangely, one can prove with an osciloscope >> that it _does_ lock up - and in a way that the only way to recover >> was to possibly unplug the module or poewr cycle the platform. > > Well, I haven't seen any case where the bus locks up. I've been able to > recover just by doing > > ip link set net0 down > ip link set net0 up > > which suggests that this is just a transient problem. > >> My advice would be to investigate the hardware in the first instance. > > I'll try to keep this in mind, but it's pretty infrequent and I probably > won't be able to test anything until I can reproduce it better. > >> On Tue, May 28, 2024 at 12:57:25PM -0400, Sean Anderson wrote: >>> Hi, >>> >>> I saw the following warning [1] twice when testing 1000Base-T SFP >>> modules: >>> >>> [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion >>> [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed >>> [ 1481.699910] ------------[ cut here ]------------ >>> [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 >>> [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec >>> <snip> >>> [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down >>> >>> and a second time with some other errors too: >>> >>> [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! >>> [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! >> >> I2C driver bug? From what I can see, this occurs when there is further >> data to be read, and id->recv_count hits zero. The I2C controller is >> entirely in control of how many bytes are transferred from the remote >> device, and it should raise a NAK on the last byte before signalling a >> STOP condition during a read. > > Commit bbf967b223b3 ("i2c: cadence: Handle transfer_size rollover") > makes it seem like a hardware error. E.g. Linux thinks we're done but > the hardware thinks there's still more data. I've added Alex to CC; > maybe he can comment. > >>> I think some part of the stack should implement a retry mechanism, but >>> I'm not sure which part. One idea could be to have mdio-i2c propagate >>> negative errors instead of converting them to successful reads of >>> 0xffff. >> >> That would unfortunately break phylib's PHY probing. >> >>> - Are I2C bus drivers supposed to be flaky like this? That is, are callers of >>> i2c_transfer expected to handle the occasional spurious error? >> >> I2C transfers - to some extent - are supposed to have a number of >> retries, but that's for the I2C device not responding to its address. >> Otherwise, the bus is supposed to be reliable (there is no form of >> error detection however - there's no CRCs or similar.) >> >> The problem with merely retrying the transaction is a register read >> from a PHY may have side-effects (such as the BMSR's LSTATUS bit >> which is latched in link-fail state until the next read. Or a >> register pointer could be incremented. So it's not simple to solve >> at bus level. > > OK... > >>> - Similarly, are MDIO bus drivers allowed to be flaky? >> >> No. >> >> I think the only realistic method would be for phylib to attempt to >> reprogram the PHY, but that would need lots of changes to phylib. > > Would it? Maybe we just need something like > > if (err == -ENOLINK) { > phy_init_hw(phydev); > needs_aneg = true; > phydev->state = PHY_UP; > err = 0; > } > > in the phy_state_machine switch statement under PHY_NOLINK and > PHY_RUNNING. The phy_init_hw wouldn't even be necessary for this case > (but would probably be a good idea in the general case where > master/slave resolution fails). > >> Many drivers now do not check whether the PHY accesses they are >> performing succeeded or not, and rely on the failure being permanent. > > Well, this driver does, which is how the error gets propagated all the > way up to phy_state_machine. > >>> Of course, the best option would be to fix cdns-i2c to not be buggy, but >>> the hardware itself is buggy in at least one of the above cases so that >>> may not be practical. >> >> Well, I don't think there's much option. If I2C drivers are flakey maybe >> its better to use GPIOs instead of the broken "inteligent" hardware. > > The CPU on this device is already underpowered, so I'd rather not resort > to bitbanging. > > --Sean _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 17:52 ` Sean Anderson @ 2024-05-28 18:14 ` Andrew Lunn 2024-05-30 16:56 ` Sean Anderson 0 siblings, 1 reply; 7+ messages in thread From: Andrew Lunn @ 2024-05-28 18:14 UTC (permalink / raw) To: Sean Anderson Cc: Russell King (Oracle), Alex Williams, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org On Tue, May 28, 2024 at 01:52:56PM -0400, Sean Anderson wrote: > (forgot to CC Alex) > > On 5/28/24 13:50, Sean Anderson wrote: > > On 5/28/24 13:28, Russell King (Oracle) wrote: > >> First, note that phylib's policy is if it loses comms with the PHY, > >> then the link will be forced down. This is out of control of the SFP > >> or phylink code. > >> > >> I've seen bugs with the I2C emulation on some modules resulting in > >> problems with various I2C controllers. > >> > >> Sometimes the problem is due to a bad I2C level shifter. Some I2C > >> level shifter manufacturers will swear blind that their shifter > >> doesn't lock up, but strangely, one can prove with an osciloscope > >> that it _does_ lock up - and in a way that the only way to recover > >> was to possibly unplug the module or poewr cycle the platform. > > > > Well, I haven't seen any case where the bus locks up. I've been able to > > recover just by doing > > > > ip link set net0 down > > ip link set net0 up > > > > which suggests that this is just a transient problem. If you look back over the history, i don't think you will find any reports to transient problems with real MDIO busses. Hence any error is considered fatal. Also, when you consider the design of MDIO, it is actually very hard for an error to be detected. It is basically a shift register, shifting out 64 bits for a write, or 48 bits for a read, followed by receiving 16 bits for a read. There is no protocol to indicate any sort of error. If there is no device at the address, the pullup means you receive 1s. End of story. With MDIO over I2C, it is I2C which has problems, not MDIO. Do you expect transient problems with I2C? I would also point out that MDIO is not idempotent. Reading an interrupt status register often clears it. Reading the link status clears the latched link status. If you need to retry the read of the interrupt status register, you cannot, the interrupt has been cleared, you have lost it, and probably your hardware no longer works because you don't know what interrupt to handle.... If you need to re-read the link status, you have lost the latched version, and you have missed a up or down event. > >> My advice would be to investigate the hardware in the first instance. I agree with Russell. Figure out why I2C is flaky. Since this is an SFP it maybe something as trivial as the contacts need cleaning. Or the resistors are wrong, or you have a cheap module which is out of spec. Andrew _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 18:14 ` Andrew Lunn @ 2024-05-30 16:56 ` Sean Anderson 0 siblings, 0 replies; 7+ messages in thread From: Sean Anderson @ 2024-05-30 16:56 UTC (permalink / raw) To: Andrew Lunn Cc: Russell King (Oracle), Alex Williams, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org On 5/28/24 14:14, Andrew Lunn wrote: > On Tue, May 28, 2024 at 01:52:56PM -0400, Sean Anderson wrote: >> (forgot to CC Alex) >> >> On 5/28/24 13:50, Sean Anderson wrote: >> > On 5/28/24 13:28, Russell King (Oracle) wrote: >> >> First, note that phylib's policy is if it loses comms with the PHY, >> >> then the link will be forced down. This is out of control of the SFP >> >> or phylink code. >> >> >> >> I've seen bugs with the I2C emulation on some modules resulting in >> >> problems with various I2C controllers. >> >> >> >> Sometimes the problem is due to a bad I2C level shifter. Some I2C >> >> level shifter manufacturers will swear blind that their shifter >> >> doesn't lock up, but strangely, one can prove with an osciloscope >> >> that it _does_ lock up - and in a way that the only way to recover >> >> was to possibly unplug the module or poewr cycle the platform. >> > >> > Well, I haven't seen any case where the bus locks up. I've been able to >> > recover just by doing >> > >> > ip link set net0 down >> > ip link set net0 up >> > >> > which suggests that this is just a transient problem. > > If you look back over the history, i don't think you will find any > reports to transient problems with real MDIO busses. Hence any error > is considered fatal. Also, when you consider the design of MDIO, it is > actually very hard for an error to be detected. It is basically a > shift register, shifting out 64 bits for a write, or 48 bits for a > read, followed by receiving 16 bits for a read. There is no protocol > to indicate any sort of error. If there is no device at the address, > the pullup means you receive 1s. End of story. Yes, I would expect the only time there could be transient problems would be with external MII (such as if someone jiggled the phy). > With MDIO over I2C, it is I2C which has problems, not MDIO. Do you > expect transient problems with I2C? Well, I2C is known to have devices which can get stuck and hang the bus (generally requiring some bit-banging from Linux to get things unstuck, or a reset of the device). So while I2C (like MDIO) is supposed to be completely reliable, there is a history of it being not quite perfect. That said, I did not expect to see these kinds of errors at all. I'll have a closer look at the controller driver when I have the time. Maybe there is some errata for this... > I would also point out that MDIO is not idempotent. Reading an > interrupt status register often clears it. Reading the link status > clears the latched link status. If you need to retry the read of the > interrupt status register, you cannot, the interrupt has been cleared, > you have lost it, and probably your hardware no longer works because > you don't know what interrupt to handle.... If you need to re-read the > link status, you have lost the latched version, and you have missed a > up or down event. Yes. Same thing with I2C. >> >> My advice would be to investigate the hardware in the first instance. > > I agree with Russell. Figure out why I2C is flaky. Since this is an > SFP it maybe something as trivial as the contacts need cleaning. Or > the resistors are wrong, or you have a cheap module which is out of > spec. OK, I'll try to dig into this a little more... --Sean _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] SFP I2C timeout forces link down with PHY_ERROR 2024-05-28 17:50 ` Sean Anderson 2024-05-28 17:52 ` Sean Anderson @ 2024-05-28 18:22 ` Russell King (Oracle) 1 sibling, 0 replies; 7+ messages in thread From: Russell King (Oracle) @ 2024-05-28 18:22 UTC (permalink / raw) To: Sean Anderson Cc: Andrew Lunn, Andi Shyti, netdev@vger.kernel.org, linux-i2c, Michal Simek, Heiner Kallweit, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org On Tue, May 28, 2024 at 01:50:54PM -0400, Sean Anderson wrote: > On 5/28/24 13:28, Russell King (Oracle) wrote: > > First, note that phylib's policy is if it loses comms with the PHY, > > then the link will be forced down. This is out of control of the SFP > > or phylink code. > > > > I've seen bugs with the I2C emulation on some modules resulting in > > problems with various I2C controllers. > > > > Sometimes the problem is due to a bad I2C level shifter. Some I2C > > level shifter manufacturers will swear blind that their shifter > > doesn't lock up, but strangely, one can prove with an osciloscope > > that it _does_ lock up - and in a way that the only way to recover > > was to possibly unplug the module or poewr cycle the platform. > > Well, I haven't seen any case where the bus locks up. I've been able to > recover just by doing > > ip link set net0 down > ip link set net0 up > > which suggests that this is just a transient problem. > > > My advice would be to investigate the hardware in the first instance. > > I'll try to keep this in mind, but it's pretty infrequent and I probably > won't be able to test anything until I can reproduce it better. > > > On Tue, May 28, 2024 at 12:57:25PM -0400, Sean Anderson wrote: > >> Hi, > >> > >> I saw the following warning [1] twice when testing 1000Base-T SFP > >> modules: > >> > >> [ 1481.682501] cdns-i2c ff030000.i2c: timeout waiting on completion > >> [ 1481.692010] Marvell 88E1111 i2c:sfp-ge3:16: Master/Slave resolution failed > >> [ 1481.699910] ------------[ cut here ]------------ > >> [ 1481.705459] phy_check_link_status+0x0/0xe8: returned: -67 > >> [ 1481.711448] WARNING: CPU: 2 PID: 67 at drivers/net/phy/phy.c:1233 phy_state_machine+0xac/0x2ec > >> <snip> > >> [ 1481.904544] macb ff0c0000.ethernet net1: Link is Down > >> > >> and a second time with some other errors too: > >> > >> [ 64.972751] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! > >> [ 64.979478] cdns-i2c ff030000.i2c: xfer_size reg rollover. xfer aborted! > > > > I2C driver bug? From what I can see, this occurs when there is further > > data to be read, and id->recv_count hits zero. The I2C controller is > > entirely in control of how many bytes are transferred from the remote > > device, and it should raise a NAK on the last byte before signalling a > > STOP condition during a read. > > Commit bbf967b223b3 ("i2c: cadence: Handle transfer_size rollover") > makes it seem like a hardware error. E.g. Linux thinks we're done but > the hardware thinks there's still more data. I've added Alex to CC; > maybe he can comment. See https://www.ti.com/lit/an/slva704/slva704.pdf figure 9 and the text immediately above it. On a read, the controller is entirely in control of how many bytes are transferred from the connected device, and the controller has the responsibility to generate the ACK after each byte read from the device _if_ it wants another byte, or a NAK if it doesn't. So, if the controller has been programmed to transfer e.g. 2 bytes, but decides to ACK the 2nd byte and proceed to receive a 3rd byte, that's nothing to do with the bus or the device, it's entirely down to the controller being silly when it knows we only want 2 bytes. > > Many drivers now do not check whether the PHY accesses they are > > performing succeeded or not, and rely on the failure being permanent. > > Well, this driver does, which is how the error gets propagated all the > way up to phy_state_machine. While the Marvell driver is good (probably because phylib maintainers look after it!), this isn't true of all drivers, and I don't think we should add a kind of recovery to the core without sorting out the other drivers first. Maybe it needs to be something that PHY drivers opt into. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2024-05-30 16:56 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-05-28 16:57 [BUG] SFP I2C timeout forces link down with PHY_ERROR Sean Anderson 2024-05-28 17:28 ` Russell King (Oracle) 2024-05-28 17:50 ` Sean Anderson 2024-05-28 17:52 ` Sean Anderson 2024-05-28 18:14 ` Andrew Lunn 2024-05-30 16:56 ` Sean Anderson 2024-05-28 18:22 ` Russell King (Oracle)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).