* thunderx sgmii interface hang
@ 2017-12-12 23:05 Tim Harvey
2017-12-13 19:43 ` Andrew Lunn
0 siblings, 1 reply; 12+ messages in thread
From: Tim Harvey @ 2017-12-12 23:05 UTC (permalink / raw)
To: netdev, Sunil Goutham
Greetings,
We are experiencing an issue on a CN80XX with an SGMII interface
coupled to a TI DP83867IS phy. We have the same PHY connected to the
RGMII interface on the same board design and everything is working as
expected on that nic both before and after triggering the hang.
The nic appears to work fine (pings, TCP etc) up until a performance
test is attempted.
When an iperf bandwidth test is attempted the nic ends up in a state
where truncated-ip packets are being sent out (per a tcpdump from
another board):
2016-02-11 16:40:23.996660 IP truncated-ip - 1454 bytes missing! (tos
0x0, ttl 64, id 39570, offset 0, flags [DF], proto TCP (6), length
1500, bad cksum 172a (->7033)!)
192.168.1.5.0 > 192.168.168.0.0: tcp 1480 [bad hdr length 0 - too
short, < 20]
Prior to 'net: thunderx: Fix BGX transmit stall due to underflow',
unplugging and re-plugging the cable would resolve the issue to the
point where it could no longer be reproduced. Prior to that patch a
link status change would disable and re-enable the BGX, so perhaps that
helps shed some light on what's going on.
I'm using 4.14.4 with the following patches (although the issue
existed with 4.14.0 as well):
2615c91 net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts
763d8b3 net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts
93b2e67 net: thunderx: fix double free error
The issue persists regardless of the DP83867 PHY driver being in the
kernel. Any ideas or details that can help troubleshoot this?
Best Regards,
Tim
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: thunderx sgmii interface hang
2017-12-12 23:05 thunderx sgmii interface hang Tim Harvey
@ 2017-12-13 19:43 ` Andrew Lunn
2017-12-18 21:53 ` Tim Harvey
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2017-12-13 19:43 UTC (permalink / raw)
To: Tim Harvey; +Cc: netdev, Sunil Goutham

> The nic appears to work fine (pings, TCP etc) up until a performance
> test is attempted.
> When an iperf bandwidth test is attempted the nic ends up in a state
> where truncated-ip packets are being sent out (per a tcpdump from
> another board):

Hi Tim

Are pause frames supported? Have you tried turning them off?

Can you reproduce the issue with UDP? Or is it TCP only?

Andrew
* Re: thunderx sgmii interface hang
2017-12-13 19:43 ` Andrew Lunn
@ 2017-12-18 21:53 ` Tim Harvey
2017-12-19 9:47 ` Sunil Kovvuri
2017-12-19 20:52 ` Andrew Lunn
0 siblings, 2 replies; 12+ messages in thread
From: Tim Harvey @ 2017-12-18 21:53 UTC (permalink / raw)
To: Andrew Lunn, Sunil Goutham; +Cc: netdev

On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>> The nic appears to work fine (pings, TCP etc) up until a performance
>> test is attempted.
>> When an iperf bandwidth test is attempted the nic ends up in a state
>> where truncated-ip packets are being sent out (per a tcpdump from
>> another board):
>
> Hi Tim
>
> Are pause frames supported? Have you tried turning them off?
>
> Can you reproduce the issue with UDP? Or is it TCP only?
>

Andrew,

Pause frames don't appear to be supported yet and the issue occurs
when using UDP as well as TCP. I'm not clear what the best way to
troubleshoot this is.

Sunil, have any others reported this issue? I do not have a Cavium
CN80xx/CN81xx reference board that has SGMII.

Regards,

Tim
* Re: thunderx sgmii interface hang
2017-12-18 21:53 ` Tim Harvey
@ 2017-12-19 9:47 ` Sunil Kovvuri
2017-12-19 20:52 ` Andrew Lunn
1 sibling, 0 replies; 12+ messages in thread
From: Sunil Kovvuri @ 2017-12-19 9:47 UTC (permalink / raw)
To: Tim Harvey; +Cc: Andrew Lunn, Sunil Goutham, netdev

On Tue, Dec 19, 2017 at 3:23 AM, Tim Harvey <tharvey@gateworks.com> wrote:
<snip>
> Pause frames don't appear to be supported yet and the issue occurs
> when using UDP as well as TCP. I'm not clear what the best way to
> troubleshoot this is.
>
> Sunil, have any others reported this issue? I do not have a Cavium
> CN80xx/CN81xx reference board that has SGMII.

Cavium's CN80xx/CN81xx reference board has a marvell,88e1240 PHY, and I
have never faced such issues personally, nor have I heard of anyone else
having them.

Thanks,
Sunil.
* Re: thunderx sgmii interface hang
2017-12-18 21:53 ` Tim Harvey
2017-12-19 9:47 ` Sunil Kovvuri
@ 2017-12-19 20:52 ` Andrew Lunn
2017-12-22 22:19 ` Tim Harvey
1 sibling, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2017-12-19 20:52 UTC (permalink / raw)
To: Tim Harvey; +Cc: Sunil Goutham, netdev

On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
<snip>
> Andrew,
>
> Pause frames don't appear to be supported yet and the issue occurs
> when using UDP as well as TCP. I'm not clear what the best way to
> troubleshoot this is.

Hi Tim

Is pause being negotiated? In theory, it should not be. The PHY should
not offer it if the MAC has not enabled it. But some PHY drivers are
probably broken and offer pause when they should not.

Also, can you trigger the issue using UDP at, say, 75% of the maximum
bandwidth? That should be low enough that the peer never even tries to
use pause.

All this pause stuff is just a stab in the dark. Something else to try
is to turn off various forms of acceleration, ethtool -K, and see if
that makes a difference.

Andrew
* Re: thunderx sgmii interface hang
2017-12-19 20:52 ` Andrew Lunn
@ 2017-12-22 22:19 ` Tim Harvey
2017-12-22 22:45 ` Andrew Lunn
2017-12-22 23:00 ` David Daney
0 siblings, 2 replies; 12+ messages in thread
From: Tim Harvey @ 2017-12-22 22:19 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Sunil Goutham, netdev

On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <andrew@lunn.ch> wrote:
<snip>
> Hi Tim
>
> Is pause being negotiated? In theory, it should not be. The PHY should
> not offer it if the MAC has not enabled it. But some PHY drivers are
> probably broken and offer pause when they should not.
>
> Also, can you trigger the issue using UDP at, say, 75% of the maximum
> bandwidth? That should be low enough that the peer never even tries to
> use pause.
>
> All this pause stuff is just a stab in the dark. Something else to try
> is to turn off various forms of acceleration, ethtool -K, and see if
> that makes a difference.
>

Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (i.e. 950mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN], and once I encounter
the issue and recover with that, I can never trigger the issue again.
If I toggle that register bit upon power-up, before the issue occurs,
it will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)

I'm told that the particular Cavium reference board with an SGMII phy
doesn't show this issue (I don't have that specific board to do my own
testing or comparisons against our board) so I'm inclined to think it
has something to do with an interaction with the DP83867 PHY. I would
like to start poking at PHY registers to see if I can find anything
unusual. The best way to do that from userspace is via
SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
support ioctl's so I guess I'll have to add that support unless
there's a way to get at phy registers from userspace through a phy
driver?

Regards,

Tim
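The clear-then-set recovery Tim describes can be pictured with a toy software model. This is only a sketch under stated assumptions: the bit position `MODEL_EN`, the `lmac_model` struct and both functions are hypothetical names invented for illustration, and the "reset on clear" behaviour is modelled solely from the manual text quoted above, not from the hardware or the driver.

```c
#include <stdint.h>

/* Placeholder bit position for the enable bit; an assumption, not taken
 * from the reference manual. */
#define MODEL_EN (1ULL << 15)

/* Toy stand-in for one LMAC: a config register plus some context state. */
struct lmac_model {
	uint64_t cfg;            /* models BGX_CMRX_CFG */
	unsigned int fifo_level; /* stands in for per-LMAC FIFO/counter state */
	int wedged;              /* stands in for the stuck transmit state */
};

/* Write the config register. A 1 -> 0 transition of the enable bit resets
 * the dedicated per-LMAC context, as the quoted manual text describes. */
void lmac_cfg_write(struct lmac_model *m, uint64_t val)
{
	if ((m->cfg & MODEL_EN) && !(val & MODEL_EN)) {
		m->fifo_level = 0;
		m->wedged = 0;
	}
	m->cfg = val;
}

/* Clear the enable bit, then set it again, preserving all other bits. */
void lmac_toggle_enable(struct lmac_model *m)
{
	uint64_t val = m->cfg;

	lmac_cfg_write(m, val & ~MODEL_EN);
	lmac_cfg_write(m, val | MODEL_EN);
}
```

In this model the toggle leaves every other configuration bit untouched and only discards the transient state, which matches the observation that recovery works without any further reconfiguration.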
* Re: thunderx sgmii interface hang
2017-12-22 22:19 ` Tim Harvey
@ 2017-12-22 22:45 ` Andrew Lunn
2017-12-23 0:22 ` Tim Harvey
2017-12-22 23:00 ` David Daney
1 sibling, 1 reply; 12+ messages in thread
From: Andrew Lunn @ 2017-12-22 22:45 UTC (permalink / raw)
To: Tim Harvey; +Cc: Sunil Goutham, netdev

> Currently I'm not using the DP83867_PHY driver (after verifying the
> issue occurs with or without that driver).
>
> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
> and the issue still occurs.

> I'm told that the particular Cavium reference board with an SGMII phy
> doesn't show this issue (I don't have that specific board to do my own
> testing or comparisons against our board) so I'm inclined to think it
> has something to do with an interaction with the DP83867 PHY. I would
> like to start poking at PHY registers to see if I can find anything
> unusual. The best way to do that from userspace is via
> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
> support ioctl's so I guess I'll have to add that support unless
> there's a way to get at phy registers from userspace through a phy
> driver?

phy_mii_ioctl() does what you need, and is simple to use.

mii-tool will then give you access to the standard PHY registers.

Andrew
* Re: thunderx sgmii interface hang
2017-12-22 22:45 ` Andrew Lunn
@ 2017-12-23 0:22 ` Tim Harvey
0 siblings, 0 replies; 12+ messages in thread
From: Tim Harvey @ 2017-12-23 0:22 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Sunil Goutham, netdev

On Fri, Dec 22, 2017 at 2:45 PM, Andrew Lunn <andrew@lunn.ch> wrote:
<snip>
> phy_mii_ioctl() does what you need, and is simple to use.
>
> mii-tool will then give you access to the standard PHY registers.
>
> Andrew

I didn't think mii-tool or ethtool (its replacement right?) could
read/write specific phy registers via cmdline? I wrote my own tool some
time ago to do that [1].

Tim

[1] http://trac.gateworks.com/attachment/wiki/conformance_testing/mii-reg.c
* Re: thunderx sgmii interface hang
2017-12-22 22:19 ` Tim Harvey
2017-12-22 22:45 ` Andrew Lunn
@ 2017-12-22 23:00 ` David Daney
2017-12-23 0:22 ` Tim Harvey
1 sibling, 1 reply; 12+ messages in thread
From: David Daney @ 2017-12-22 23:00 UTC (permalink / raw)
To: Tim Harvey, Andrew Lunn; +Cc: Sunil Goutham, netdev

On 12/22/2017 02:19 PM, Tim Harvey wrote:
<snip>
> I have found that once the issue occurs I can recover to a working
> state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
> the issue and recover with that, I can never trigger the issue again.
> If I toggle that register bit upon power-up before the issue occurs it
> will still occur.
>
> The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
> - when cleared all dedicated BGX context state for LMAC (state
> machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
> resources (data path, serdes lanes) is disabled
> - when set LMAC operation is enabled (link bring-up, sync, and tx/rx
> of idles and fault sequences)

You could try looking at
BGXX_GMP_PCS_INTX
BGXX_GMP_GMI_RXX_INT
BGXX_GMP_GMI_TXX_INT

Those are all W1C registers that should contain all zeros. If they
don't, just write back to them to clear before running a test.

If there are bits asserting in these when the thing gets wedged up, it
might point to a possible cause.

You could also look at these RO registers:
BGXX_GMP_PCS_TXX_STATES
BGXX_GMP_PCS_RXX_STATES

<snip>
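The W1C (write-1-to-clear) behaviour David relies on can be illustrated with a small software model. This is a sketch for intuition only: `struct w1c_reg` and the `w1c_*` functions are hypothetical names, and nothing here touches real BGX registers.

```c
#include <stdint.h>

/* Hypothetical software model of a write-1-to-clear register. */
struct w1c_reg {
	uint64_t bits;
};

/* Hardware side: latch one or more event bits. */
void w1c_latch(struct w1c_reg *r, uint64_t events)
{
	r->bits |= events;
}

/* Software side: a read returns the latched bits without changing them. */
uint64_t w1c_read(const struct w1c_reg *r)
{
	return r->bits;
}

/* Software side: each 1 written clears that bit, each 0 leaves it alone. */
void w1c_write(struct w1c_reg *r, uint64_t val)
{
	r->bits &= ~val;
}

/* "Write back to clear": read the status, then write the same value.
 * Returns the bits that were pending before the clear. */
uint64_t w1c_ack_all(struct w1c_reg *r)
{
	uint64_t pending = w1c_read(r);

	w1c_write(r, pending);
	return pending;
}
```

Writing back exactly the value that was read clears only the bits that were observed, so an event that latches between the read and the write is not lost. That is why the procedure is "write back what you read" rather than "write all ones".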
* Re: thunderx sgmii interface hang
2017-12-22 23:00 ` David Daney
@ 2017-12-23 0:22 ` Tim Harvey
2017-12-23 0:30 ` David Daney
0 siblings, 1 reply; 12+ messages in thread
From: Tim Harvey @ 2017-12-23 0:22 UTC (permalink / raw)
To: David Daney; +Cc: Andrew Lunn, Sunil Goutham, netdev

On Fri, Dec 22, 2017 at 3:00 PM, David Daney <ddaney@caviumnetworks.com> wrote:
<snip>
> You could try looking at
> BGXX_GMP_PCS_INTX
> BGXX_GMP_GMI_RXX_INT
> BGXX_GMP_GMI_TXX_INT
>
> Those are all W1C registers that should contain all zeros. If they don't,
> just write back to them to clear before running a test.
>
> If there are bits asserting in these when the thing gets wedged up, it might
> point to a possible cause.

David,

BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
triggered. From CN80XX-HM-1.2P this is caused by:

"In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
the SGMII/1000BASE-X/QSGMII packet transfer will underflow. This
should be detected by the receiving device as an FCS error.
Internally, the packet is drained and lost"

Perhaps this needs to be caught and handled in some way. There are some
interrupt handlers in nicvf_main.c, yet I'm not clear where to hook up
this one.

>
> You could also look at these RO registers:
> BGXX_GMP_PCS_TXX_STATES
> BGXX_GMP_PCS_RXX_STATES
>

These show the same before/after triggering the issue and
RX_BAD/TX_BAD are still 0.

Tim
* Re: thunderx sgmii interface hang
2017-12-23 0:22 ` Tim Harvey
@ 2017-12-23 0:30 ` David Daney
2018-01-02 19:18 ` Tim Harvey
0 siblings, 1 reply; 12+ messages in thread
From: David Daney @ 2017-12-23 0:30 UTC (permalink / raw)
To: Tim Harvey; +Cc: Andrew Lunn, Sunil Goutham, netdev

On 12/22/2017 04:22 PM, Tim Harvey wrote:
<snip>
> David,
>
> BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
> triggered. From CN80XX-HM-1.2P this is caused by:
>
> "In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
> the SGMII/1000BASE-X/QSGMII packet transfer will underflow. This
> should be detected by the receiving device as an FCS error.
> Internally, the packet is drained and lost"
>

Yikes!

> Perhaps this needs to be caught and handled in some way. There's some
> interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
> this one.

This would be an interrupt generated by the BGX device, not the NIC
device. It will have an MSI-X index of (6 + LMAC * 7). See
BGX_INT_VEC_E in the HRM.

Note that I am telling you which interrupt it is, but not recommending
that catching it and doing something is necessarily the best thing to
do.

<snip>
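David's formula for the MSI-X index, (6 + LMAC * 7), can be written out as a small helper. This is a sketch only: the constant and function names are hypothetical, and the reading that each LMAC owns a block of 7 vectors with the GMI TX interrupt at offset 6 is inferred from the formula itself; BGX_INT_VEC_E in the HRM remains the authority.

```c
/* Hypothetical names; values taken from the formula (6 + LMAC * 7). */
enum {
	BGX_VECS_PER_LMAC_ASSUMED = 7, /* stride between consecutive LMACs */
	BGX_GMI_TX_VEC_OFFSET = 6      /* offset of the GMI TX vector in a block */
};

/* MSI-X vector index of the GMP GMI TX interrupt for a given LMAC. */
unsigned int bgx_gmi_tx_msix_index(unsigned int lmac)
{
	return BGX_GMI_TX_VEC_OFFSET + lmac * BGX_VECS_PER_LMAC_ASSUMED;
}
```

For LMACs 0 through 3 this yields indices 6, 13, 20 and 27, so each LMAC has its own vector for this interrupt.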
* Re: thunderx sgmii interface hang
2017-12-23 0:30 ` David Daney
@ 2018-01-02 19:18 ` Tim Harvey
0 siblings, 0 replies; 12+ messages in thread
From: Tim Harvey @ 2018-01-02 19:18 UTC (permalink / raw)
To: David Daney; +Cc: Andrew Lunn, Sunil Goutham, netdev

On Fri, Dec 22, 2017 at 4:30 PM, David Daney <ddaney@caviumnetworks.com> wrote:
> On 12/22/2017 04:22 PM, Tim Harvey wrote:
<snip>
>>
>> BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
>> triggered. From CN80XX-HM-1.2P this is caused by:
>>
>> "In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
>> the SGMII/1000BASE-X/QSGMII packet transfer will underflow. This
>> should be detected by the receiving device as an FCS error.
>> Internally, the packet is drained and lost"
>>
>
> Yikes!
>
>> Perhaps this needs to be caught and handled in some way. There's some
>> interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
>> this one.
>
>
> This would be an interrupt generated by the BGX device, not the NIC device.
> It will have an MSI-X index of (6 + LMAC * 7). See BGX_INT_VEC_E in the
> HRM.
>
> Note that I am telling you which interrupt it is, but not recommending that
> catching it and doing something is necessarily the best thing to do.
>

David,

The following patch registers the BGX_GMP_GMI_TXX interrupt, enables
the UNDFLW interrupt, and toggles the MAC/PCS enable. It does indeed
catch and resolve the issue, which then never occurs again. I would
agree this isn't a 'fix' but it works around whatever the issue is.

Any ideas what could cause an UNDFLW like this? I suspect it might
have something to do with the fact that we have a 100MHz external ref
clock for the DLM, which is configured for GSERx_LANE_MODE[LMODE] of
0x6 (R_125G_REFCLK15625_SGMII 1.25Gbd 156.25MHz ref clock SGMII). That
seems just wrong (100MHz clock yet we tell CN80XX 156.25MHz) but
according to the reference manual and BDK code this is an allowed
configuration.

Do you have access to a CN80XX reference board that has an SGMII PHY
on it to test for this issue? I've been trying to get hold of one but
have not been successful.

diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
index 2bba9d1..be9148f9 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.h
@@ -179,6 +179,15 @@
 #define BGX_GMP_GMI_TXX_BURST		0x38228
 #define BGX_GMP_GMI_TXX_MIN_PKT		0x38240
 #define BGX_GMP_GMI_TXX_SGMII_CTL	0x38300
+#define BGX_GMP_GMI_TXX_INT		0x38500
+#define BGX_GMP_GMI_TXX_INT_W1S		0x38508
+#define BGX_GMP_GMI_TXX_INT_ENA_W1C	0x38510
+#define BGX_GMP_GMI_TXX_INT_ENA_W1S	0x38518
+#define GMI_TXX_INT_PTP_LOST		BIT_ULL(4)
+#define GMI_TXX_INT_LATE_COL		BIT_ULL(3)
+#define GMI_TXX_INT_XSDEF		BIT_ULL(2)
+#define GMI_TXX_INT_XSCOL		BIT_ULL(1)
+#define GMI_TXX_INT_UNDFLW		BIT_ULL(0)

 #define BGX_MSIX_VEC_0_29_ADDR		0x400000 /* +(0..29) << 4 */
 #define BGX_MSIX_VEC_0_29_CTL		0x400008
diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
index 805c02a..0690966 100644
--- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
+++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
@@ -1344,6 +1344,54 @@ static int bgx_init_phy(struct bgx *bgx)
 	return bgx_init_of_phy(bgx);
 }

+static irqreturn_t bgx_intr_handler(int irq, void *data)
+{
+	struct bgx *bgx = (struct bgx *)data;
+	struct device *dev = &bgx->pdev->dev;
+	u64 status, val;
+	int lmac;
+
+	for (lmac = 0; lmac < bgx->lmac_count; lmac++) {
+		status = bgx_reg_read(bgx, lmac, BGX_GMP_GMI_TXX_INT);
+		if (status & GMI_TXX_INT_UNDFLW) {
+			dev_err(dev, "BGX%d lmac%d UNDFLW\n", bgx->bgx_id,
+				lmac);
+			val = bgx_reg_read(bgx, lmac, BGX_CMRX_CFG);
+			val &= ~CMR_EN;
+			bgx_reg_write(bgx, lmac, BGX_CMRX_CFG, val);
+			val |= CMR_EN;
+			bgx_reg_write(bgx, lmac, BGX_CMRX_CFG, val);
+		}
+		/* clear interrupts */
+		bgx_reg_write(bgx, lmac, BGX_GMP_GMI_TXX_INT, status);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static int bgx_register_intr(struct pci_dev *pdev)
+{
+	struct bgx *bgx = pci_get_drvdata(pdev);
+	struct device *dev = &pdev->dev;
+	int num_vec, ret;
+	char irq_name[32];
+
+	/* Enable MSI-X */
+	num_vec = pci_msix_vec_count(pdev);
+	ret = pci_alloc_irq_vectors(pdev, num_vec, num_vec, PCI_IRQ_MSIX);
+	if (ret < 0) {
+		dev_err(dev, "Req for #%d msix vectors failed\n", num_vec);
+		return 1;
+	}
+	sprintf(irq_name, "BGX%d", bgx->bgx_id);
+	ret = request_irq(pci_irq_vector(pdev, GMPX_GMI_TX_INT),
+		bgx_intr_handler, 0, irq_name, bgx);
+	if (ret)
+		return 1;
+
+	return 0;
+}
+
 static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	int err;
@@ -1414,6 +1462,8 @@ static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		xcv_init_hw(bgx->phy_mode);
 	bgx_init_hw(bgx);

+	bgx_register_intr(pdev);
+
 	/* Enable all LMACs */
 	for (lmac = 0; lmac < bgx->lmac_count; lmac++) {
 		err = bgx_lmac_enable(bgx, lmac);
@@ -1424,6 +1474,10 @@ static int bgx_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 				bgx_lmac_disable(bgx, --lmac);
 			goto err_enable;
 		}
+
+		/* enable TX FIFO Underflow interrupt */
+		bgx_reg_modify(bgx, lmac, BGX_GMP_GMI_TXX_INT_ENA_W1S,
+			       GMI_TXX_INT_UNDFLW);
 	}

 	return 0;

Regards,

Tim