* GPY215 PHY interrupt issue @ 2022-11-25 14:44 Michael Walle 2022-11-25 15:17 ` Andrew Lunn 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2022-11-25 14:44 UTC (permalink / raw) To: Andrew Lunn, Heiner Kallweit, Russell King Cc: Horatiu Vultur, netdev, Xu Liang Hi, I've been debugging an issue with spurious interrupts and it turns out the GYP215 (possibly also other MaxLinear PHYs) has a problem with the link down interrupt. Although the datasheet mentions (and which common sense) a read of the interrupt status register will deassert the interrupt line, the PHY doesn't deassert it right away. There is a variable delay between reading the status register and the deassertion of the interrupt line. This only happens on a link down event. The actual delay depends on the firmware revision and is also somehow random. With FW 7.71 (on a GPY215B) I've meassured around 40us with FW 7.118 (GPY215B) it's about 2ms. It also varies from link down to link down. The issue is also present in the new GPY215C revision. MaxLinear confirmed the issue and is currently working on a firmware fix. But it seems that the issue cannot really be resolved. At best, the delay can be minimized. If there will be a fix, this is then only for a GPY215C and a new firmware version. Does anyone have an idea of a viable software workaround? The only one I can think of is to disable the interrupts for the GPY215 altogether depending on the firmware version. For now, I haven't described the interrupt line in the device tree. But that cannot be the solution here ;) If anyone is interested in the scope screenshots: https://walle.cc/d10/gpy215_fw7_71_link_up_mdio.png https://walle.cc/d10/gpy215_fw7_118_link_down_mdio1.png https://walle.cc/d10/gpy215_fw7_71_link_down_mdio1.png https://walle.cc/d10/gpy215c_fw8_111_link_down1.png Red is MDIO, yellow is PHY_INT#, in all cases the second transaction is the read interrupt status of the PHY which asserts the interrupt line. Please note, that this is a shared interrupt line. The first link shows the correct behavior. -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-11-25 14:44 GPY215 PHY interrupt issue Michael Walle @ 2022-11-25 15:17 ` Andrew Lunn 2022-11-27 19:01 ` Michael Walle 2022-11-28 7:41 ` Michael Walle 0 siblings, 2 replies; 11+ messages in thread From: Andrew Lunn @ 2022-11-25 15:17 UTC (permalink / raw) To: Michael Walle Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang On Fri, Nov 25, 2022 at 03:44:08PM +0100, Michael Walle wrote: > Hi, > > I've been debugging an issue with spurious interrupts and it turns out > the GYP215 (possibly also other MaxLinear PHYs) has a problem with the > link down interrupt. Although the datasheet mentions (and which common > sense) a read of the interrupt status register will deassert the > interrupt line, the PHY doesn't deassert it right away. There is a > variable delay between reading the status register and the deassertion > of the interrupt line. This only happens on a link down event. The > actual delay depends on the firmware revision and is also somehow > random. With FW 7.71 (on a GPY215B) I've meassured around 40us with > FW 7.118 (GPY215B) it's about 2ms. So you get 2ms of interrupt storm? Does the interrupt status bit clear immediately, or does that clear only once the interrupt line itself has cleared? I'm assuming it clears immediately, otherwise you would add a polling loop. > It also varies from link down to > link down. The issue is also present in the new GPY215C revision. > MaxLinear confirmed the issue and is currently working on a firmware > fix. But it seems that the issue cannot really be resolved. At best, > the delay can be minimized. If there will be a fix, this is then > only for a GPY215C and a new firmware version. > > Does anyone have an idea of a viable software workaround? Looking at the datasheet for the GPY215, the interrupt line is also GPIO 14. Could you flip it into a GPIO, force it inactive, and sleep to 2ms? Or even turn it into an input and see if you can read its state and poll it until it clears? Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-11-25 15:17 ` Andrew Lunn @ 2022-11-27 19:01 ` Michael Walle 2022-11-28 7:41 ` Michael Walle 1 sibling, 0 replies; 11+ messages in thread From: Michael Walle @ 2022-11-27 19:01 UTC (permalink / raw) To: Andrew Lunn Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang Am 2022-11-25 16:17, schrieb Andrew Lunn: > On Fri, Nov 25, 2022 at 03:44:08PM +0100, Michael Walle wrote: >> Hi, >> >> I've been debugging an issue with spurious interrupts and it turns out >> the GYP215 (possibly also other MaxLinear PHYs) has a problem with the >> link down interrupt. Although the datasheet mentions (and which common >> sense) a read of the interrupt status register will deassert the >> interrupt line, the PHY doesn't deassert it right away. There is a >> variable delay between reading the status register and the deassertion >> of the interrupt line. This only happens on a link down event. The >> actual delay depends on the firmware revision and is also somehow >> random. With FW 7.71 (on a GPY215B) I've meassured around 40us with >> FW 7.118 (GPY215B) it's about 2ms. > > So you get 2ms of interrupt storm? Yes. > Does the interrupt status bit clear > immediately, or does that clear only once the interrupt line itself > has cleared? I'm assuming it clears immediately, otherwise you would > add a polling loop. Yeah, the register clears after it's read (i.e. the second read returns zero). And yes 2ms >> It also varies from link down to >> link down. The issue is also present in the new GPY215C revision. >> MaxLinear confirmed the issue and is currently working on a firmware >> fix. But it seems that the issue cannot really be resolved. At best, >> the delay can be minimized. If there will be a fix, this is then >> only for a GPY215C and a new firmware version. >> >> Does anyone have an idea of a viable software workaround? > > Looking at the datasheet for the GPY215, the interrupt line is also > GPIO 14. Could you flip it into a GPIO, force it inactive, and sleep > to 2ms? Or even turn it into an input and see if you can read its > state and poll it until it clears? Nice idea. Let me try that. First obstacle is that it doesn't seem to be documented how to do that. The vendor PHY API has some ALTSEL and gpio functions, though. -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-11-25 15:17 ` Andrew Lunn 2022-11-27 19:01 ` Michael Walle @ 2022-11-28 7:41 ` Michael Walle 2022-11-28 13:30 ` Andrew Lunn 1 sibling, 1 reply; 11+ messages in thread From: Michael Walle @ 2022-11-28 7:41 UTC (permalink / raw) To: Andrew Lunn Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang Am 2022-11-25 16:17, schrieb Andrew Lunn: > Or even turn it into an input and see if you can read its > state and poll it until it clears? Btw, I don't think that's possible for shared interrupts. In the worst case you'd poll while another device is asserting the interrupt line. -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-11-28 7:41 ` Michael Walle @ 2022-11-28 13:30 ` Andrew Lunn 2022-12-01 10:24 ` Michael Walle 0 siblings, 1 reply; 11+ messages in thread From: Andrew Lunn @ 2022-11-28 13:30 UTC (permalink / raw) To: Michael Walle Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang On Mon, Nov 28, 2022 at 08:41:17AM +0100, Michael Walle wrote: > Am 2022-11-25 16:17, schrieb Andrew Lunn: > > Or even turn it into an input and see if you can read its > > state and poll it until it clears? > > Btw, I don't think that's possible for shared interrupts. In > the worst case you'd poll while another device is asserting the > interrupt line. Yes, i thought about that afterwards. You need a timeout of 2ms for your polling, and then assume its the other PHY. But it also seems pretty unlikely that both PHYs go down within 2ms of each other. Maybe if you are using a bond and the switch at the other end looses power, but for normal use cases, it seems unlikely. It is also a question of complexity vs gain. 802.3 says something like you have to wait 750ms before declaring link down, so adding a 2ms sleep is just a bit more noise. Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-11-28 13:30 ` Andrew Lunn @ 2022-12-01 10:24 ` Michael Walle 2022-12-01 15:54 ` Andrew Lunn 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2022-12-01 10:24 UTC (permalink / raw) To: Andrew Lunn Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang Am 2022-11-28 14:30, schrieb Andrew Lunn: > On Mon, Nov 28, 2022 at 08:41:17AM +0100, Michael Walle wrote: >> Am 2022-11-25 16:17, schrieb Andrew Lunn: >> > Or even turn it into an input and see if you can read its >> > state and poll it until it clears? >> >> Btw, I don't think that's possible for shared interrupts. In >> the worst case you'd poll while another device is asserting the >> interrupt line. > > Yes, i thought about that afterwards. You need a timeout of 2ms for > your polling, and then assume its the other PHY. But it also seems > pretty unlikely that both PHYs go down within 2ms of each other. Maybe > if you are using a bond and the switch at the other end looses power, > but for normal use cases, it seems unlikely. It is also a question of > complexity vs gain. 802.3 says something like you have to wait 750ms > before declaring link down, so adding a 2ms sleep is just a bit more > noise. There are also other PHYs connected to this interrupt line, esp. the LAN8814 which might do PHY timestamping in the future. Therefore, I'd prefer a solution which "unblocks" interrupt line. That being said, I've developed a patchset which change the MDINT to GPIO mode (which isn't that easy because you need to go through a mailbox mechanism to write to the internal bus), just to notice that the access is apparently blocked as long as the interrupt line is low :( I suspect that somehow the complete internal bus is stalled due to some event, after which also the interrupt line is released. Maybe they can't release the interrupt line exactly because of this, who knows. So, switching the line to GPIO input doesn't help here, which also means the interrupt line will be stuck the whole time. Back to your other suggestion that we can somehow poll the line. Currently I'm trying to exploit the fact that even a read is blocked. IOW, in the interrupt handler, I just read an internal register and wait until that read was successful. Should be rather easy and also circument the question "is this interrupt I'm polling on from another PHY of the same line". -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-12-01 10:24 ` Michael Walle @ 2022-12-01 15:54 ` Andrew Lunn 2022-12-01 21:31 ` Michael Walle 0 siblings, 1 reply; 11+ messages in thread From: Andrew Lunn @ 2022-12-01 15:54 UTC (permalink / raw) To: Michael Walle Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang > So, switching the line to GPIO input doesn't help here, which also > means the interrupt line will be stuck the whole time. Sounds like they totally messed up the design somehow. Since we are into horrible hack territory..... I assume you are using the Link state change interrupt? LSTC? Maybe instead use Link speed change and Duplex mode change? And disallow 10/Half. Some PHYs change to 10/Half when they loose link. They might be enough to tell you the link has changed. You can then read the BMSR to find out what actually happened. This is assuming that interrupts in general are not FUBAR. Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-12-01 15:54 ` Andrew Lunn @ 2022-12-01 21:31 ` Michael Walle 2022-12-01 22:06 ` Andrew Lunn 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2022-12-01 21:31 UTC (permalink / raw) To: Andrew Lunn Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang Am 2022-12-01 16:54, schrieb Andrew Lunn: >> So, switching the line to GPIO input doesn't help here, which also >> means the interrupt line will be stuck the whole time. > > Sounds like they totally messed up the design somehow. > > Since we are into horrible hack territory..... > > I assume you are using the Link state change interrupt? LSTC? Yes, but recently I've found it that it also happens with the speed change interrupt (during link-up). By pure luck (or bad luck really?) I discovered that when I reduce the MDIO frequency I get a similar behavior for the interrupt line at link-up with the LSPC interrupt. I don't think it has something to do with the frequency but with changed timing. Anyway, I need to do the workaround for LTSC and for LSPC... At the moment I have the following, which works: @@ -560,6 +610,8 @@ static int gpy_config_intr(struct phy_device *phydev) static irqreturn_t gpy_handle_interrupt(struct phy_device *phydev) { + bool needs_mdint_wa = phydev->drv->phy_id == PHY_ID_GPY215B || + phydev->drv->phy_id == PHY_ID_GPY215C; int reg; reg = phy_read(phydev, PHY_ISTAT); @@ -571,6 +623,23 @@ static irqreturn_t gpy_handle_interrupt(struct phy_device *phydev) if (!(reg & PHY_IMASK_MASK)) return IRQ_NONE; + /* The PHY might leave the interrupt line asserted even after PHY_ISTAT + * is read. To avoid interrupt storms, delay the interrupt handling as + * long as the PHY drives the interrupt line. An internal bus read will + * stall as long as the interrupt line is asserted, thus just read a + * random register here. + * Because we cannot access the internal bus at all while the interrupt + * is driven by the PHY, there is no way to make the interrupt line + * unstuck (e.g. by changing the pinmux to GPIO input) during that time + * frame. Therefore, polling is the best we can do and won't do any more + * harm. + * It was observed that this bug happens on link state and link speed + * changes on a GPY215B and GYP215C independent of the firmware version + * (which doesn't mean that this list is exhaustive). + */ + if (needs_mdint_wa && (reg & (PHY_IMASK_LSTC | PHY_IMASK_LSPC))) + gpy_mbox_read(phydev, REG_GPIO0_OUT); + phy_trigger_machine(phydev); return IRQ_HANDLED; > Maybe instead use Link speed change and Duplex mode change? And > disallow 10/Half. Some PHYs change to 10/Half when they loose > link. They might be enough to tell you the link has changed. You can > then read the BMSR to find out what actually happened. > > This is assuming that interrupts in general are not FUBAR. It seems like at least these two are :/ So with the code above we could avoid the interrupt storm but we can't do anything about the blocked interrupt line. I'm unsure if that is acceptable or if we'd have to disable interrupts on this PHY altogether and fallback to polling. -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-12-01 21:31 ` Michael Walle @ 2022-12-01 22:06 ` Andrew Lunn 2022-12-01 23:24 ` Michael Walle 0 siblings, 1 reply; 11+ messages in thread From: Andrew Lunn @ 2022-12-01 22:06 UTC (permalink / raw) To: Michael Walle Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang On Thu, Dec 01, 2022 at 10:31:19PM +0100, Michael Walle wrote: > Am 2022-12-01 16:54, schrieb Andrew Lunn: > > > So, switching the line to GPIO input doesn't help here, which also > > > means the interrupt line will be stuck the whole time. > > > > Sounds like they totally messed up the design somehow. > > > > Since we are into horrible hack territory..... > > > > I assume you are using the Link state change interrupt? LSTC? > > Yes, but recently I've found it that it also happens with > the speed change interrupt (during link-up). By pure luck (or > bad luck really?) I discovered that when I reduce the MDIO > frequency I get a similar behavior for the interrupt line > at link-up with the LSPC interrupt. I don't think it has > something to do with the frequency but with changed timing. So at a wild guess, there is some sort of race condition between the 2.5MHz ish MDIO clock and the rest of the system which is probably clocked at 25Mhz. We have to hope this is limited to just interrupts! Not any MDIO bus transaction. > + /* The PHY might leave the interrupt line asserted even after PHY_ISTAT > + * is read. To avoid interrupt storms, delay the interrupt handling as > + * long as the PHY drives the interrupt line. An internal bus read will > + * stall as long as the interrupt line is asserted, thus just read a > + * random register here. > + * Because we cannot access the internal bus at all while the interrupt > + * is driven by the PHY, there is no way to make the interrupt line > + * unstuck (e.g. by changing the pinmux to GPIO input) during that time > + * frame. Therefore, polling is the best we can do and won't do any more > + * harm. > + * It was observed that this bug happens on link state and link speed > + * changes on a GPY215B and GYP215C independent of the firmware version > + * (which doesn't mean that this list is exhaustive). > + */ > + if (needs_mdint_wa && (reg & (PHY_IMASK_LSTC | PHY_IMASK_LSPC))) > + gpy_mbox_read(phydev, REG_GPIO0_OUT); > + > phy_trigger_machine(phydev); > > return IRQ_HANDLED; So this delayed exiting the interrupt handler until the line actually return to normal. And during that time, the interrupt is disabled. And hence the storm is avoided. I'm assuming gpy_mbox_read() has a timeout, so if the PHY completely jams, it does exit? If that happens, maybe call phy_error() to indicate the PHY is dead. And don't return IRQ_HANDLED, but IRQ_NONE. I think the IRQ core should notice the storm for an interrupt which nobody cares about and disable the interrupt. Probably not much you can do after that, but at least the machine won't be totally dead. > It seems like at least these two are :/ So with the code above > we could avoid the interrupt storm but we can't do anything about > the blocked interrupt line. I'm unsure if that is acceptable or > if we'd have to disable interrupts on this PHY altogether and > fallback to polling. It probably depends on your system design. If this is the only PHY and the storm can be avoided, it is probably O.K. If you have other PHYs sharing the line, and those PHYs are doing time sensitive stuff like PTP, maybe polling would be better. Maybe add a kconfig symbol CONFIG_MAXLINEAR_GPHY_BROKEN_INTERRUPTS and make all the interrupt code conditional on this? Hopefully distros won't enable it, but the option is there to buy into the behaviour if you want? Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-12-01 22:06 ` Andrew Lunn @ 2022-12-01 23:24 ` Michael Walle 2022-12-02 0:05 ` Andrew Lunn 0 siblings, 1 reply; 11+ messages in thread From: Michael Walle @ 2022-12-01 23:24 UTC (permalink / raw) To: Andrew Lunn Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang Am 2022-12-01 23:06, schrieb Andrew Lunn: > On Thu, Dec 01, 2022 at 10:31:19PM +0100, Michael Walle wrote: >> Am 2022-12-01 16:54, schrieb Andrew Lunn: >> > > So, switching the line to GPIO input doesn't help here, which also >> > > means the interrupt line will be stuck the whole time. >> > >> > Sounds like they totally messed up the design somehow. >> > >> > Since we are into horrible hack territory..... >> > >> > I assume you are using the Link state change interrupt? LSTC? >> >> Yes, but recently I've found it that it also happens with >> the speed change interrupt (during link-up). By pure luck (or >> bad luck really?) I discovered that when I reduce the MDIO >> frequency I get a similar behavior for the interrupt line >> at link-up with the LSPC interrupt. I don't think it has >> something to do with the frequency but with changed timing. > > So at a wild guess, there is some sort of race condition between the > 2.5MHz ish MDIO clock and the rest of the system which is probably > clocked at 25Mhz. > > We have to hope this is limited to just interrupts! Not any MDIO bus > transaction. > >> + /* The PHY might leave the interrupt line asserted even after >> PHY_ISTAT >> + * is read. To avoid interrupt storms, delay the interrupt handling >> as >> + * long as the PHY drives the interrupt line. An internal bus read >> will >> + * stall as long as the interrupt line is asserted, thus just read a >> + * random register here. >> + * Because we cannot access the internal bus at all while the >> interrupt >> + * is driven by the PHY, there is no way to make the interrupt line >> + * unstuck (e.g. by changing the pinmux to GPIO input) during that >> time >> + * frame. Therefore, polling is the best we can do and won't do any >> more >> + * harm. >> + * It was observed that this bug happens on link state and link speed >> + * changes on a GPY215B and GYP215C independent of the firmware >> version >> + * (which doesn't mean that this list is exhaustive). >> + */ >> + if (needs_mdint_wa && (reg & (PHY_IMASK_LSTC | PHY_IMASK_LSPC))) >> + gpy_mbox_read(phydev, REG_GPIO0_OUT); >> + >> phy_trigger_machine(phydev); >> >> return IRQ_HANDLED; > > So this delayed exiting the interrupt handler until the line actually > return to normal. And during that time, the interrupt is disabled. And > hence the storm is avoided. Exactly. Almost like your idea with polling the interrupt line in GPIO mode. > I'm assuming gpy_mbox_read() has a timeout, so if the PHY completely > jams, it does exit? If that happens, maybe call phy_error() to > indicate the PHY is dead. And don't return IRQ_HANDLED, but > IRQ_NONE. I think the IRQ core should notice the storm for an > interrupt which nobody cares about and disable the interrupt. > Probably not much you can do after that, but at least the machine > won't be totally dead. Ah, yes. In the earlier code (that changed the pin to GPIO input), I had that error handling. Missed that here, I'll add it. And yes, there is an timeout of 10ms. >> It seems like at least these two are :/ So with the code above >> we could avoid the interrupt storm but we can't do anything about >> the blocked interrupt line. I'm unsure if that is acceptable or >> if we'd have to disable interrupts on this PHY altogether and >> fallback to polling. > > It probably depends on your system design. If this is the only PHY and > the storm can be avoided, it is probably O.K. If you have other PHYs > sharing the line, and those PHYs are doing time sensitive stuff like > PTP, maybe polling would be better. > > Maybe add a kconfig symbol CONFIG_MAXLINEAR_GPHY_BROKEN_INTERRUPTS and > make all the interrupt code conditional on this? Hopefully distros > won't enable it, but the option is there to buy into the behaviour if > you want? I don't even dare to ask, but wouldn't a device tree property maxlinear,use-broken-interrupts make more sense than a compile time option? I'm fine with both. -michael ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: GPY215 PHY interrupt issue 2022-12-01 23:24 ` Michael Walle @ 2022-12-02 0:05 ` Andrew Lunn 0 siblings, 0 replies; 11+ messages in thread From: Andrew Lunn @ 2022-12-02 0:05 UTC (permalink / raw) To: Michael Walle Cc: Heiner Kallweit, Russell King, Horatiu Vultur, netdev, Xu Liang > I don't even dare to ask, but wouldn't a device tree property > maxlinear,use-broken-interrupts make more sense than a compile time > option? I'm fine with both. Maybe. But it limits it to DT based systems. It gets messy making it work for PCI devices, USB devices, ACPI etc. But i guess we can have it disabled by default, and leave it to whoever wants it enabled to figure out how to enable i for whatever configuration system they use. So yes, maxlinear,use-broken-interrupts. Andrew ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2022-12-02 0:05 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-11-25 14:44 GPY215 PHY interrupt issue Michael Walle 2022-11-25 15:17 ` Andrew Lunn 2022-11-27 19:01 ` Michael Walle 2022-11-28 7:41 ` Michael Walle 2022-11-28 13:30 ` Andrew Lunn 2022-12-01 10:24 ` Michael Walle 2022-12-01 15:54 ` Andrew Lunn 2022-12-01 21:31 ` Michael Walle 2022-12-01 22:06 ` Andrew Lunn 2022-12-01 23:24 ` Michael Walle 2022-12-02 0:05 ` Andrew Lunn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).