* linux-3.6.11-rt30 smoke test on ARM @ 2013-03-08 3:55 Frank Rowand 2013-03-08 4:03 ` Frank Rowand 0 siblings, 1 reply; 5+ messages in thread From: Frank Rowand @ 2013-03-08 3:55 UTC (permalink / raw) To: Thomas Gleixner, linux-kernel, linux-rt-users linux-3.6.11-rt30 builds and boots on some ARM boards (PandaBoard, Realview) for: - SMP, PREEMPT_RT_FULL - SMP, PREEMPT_NONE - UP, PREEMPT_RT_FULL - UP, PREEMPT_NONE The PandaBoard frequently fails to boot with an eth0 error. I have only seen this for SMP, PREEMPT_NONE. This same error also occurs in 3.6.7 without the RT_PREEMPT patches applied, so this does not appear to be an RT_PREEMPT issue. This error also occurred in the previous smoke test of linux-3.6.7-rt18 http://permalink.gmane.org/gmane.linux.rt.user/9202 The boot error starts with the USB time out on ep0out: [ 3.264373] smsc95xx 1-1.1:1.0: usb_probe_interface [ 3.269500] smsc95xx 1-1.1:1.0: usb_probe_interface - got id [ 3.275543] smsc95xx v1.0.4 [ 8.078674] smsc95xx 1-1.1:1.0: eth0: register 'smsc95xx' at usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, 82:b9:1d:fa:67:0d [ 8.091003] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002 [ 13.509918] usb 1-1.1: swapper/0 timed out on ep0out len=0/4 [ 13.515869] smsc95xx 1-1.1:1.0: eth0: Failed to write register index 0x00000108 [ 13.523559] smsc95xx 1-1.1:1.0: eth0: Failed to write ADDRL: -110 [ 13.529998] IP-Config: Failed to open eth0 The boot error can be avoided with a workaround that simply retries the failed USB command. I will reply to this email with the patch. -Frank ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: linux-3.6.11-rt30 smoke test on ARM 2013-03-08 3:55 linux-3.6.11-rt30 smoke test on ARM Frank Rowand @ 2013-03-08 4:03 ` Frank Rowand 2013-03-11 17:34 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 5+ messages in thread From: Frank Rowand @ 2013-03-08 4:03 UTC (permalink / raw) Cc: Thomas Gleixner, linux-kernel, linux-rt-users On 03/07/13 19:55, Frank Rowand wrote: > linux-3.6.11-rt30 builds and boots on some ARM boards (PandaBoard, Realview) for: > > - SMP, PREEMPT_RT_FULL > - SMP, PREEMPT_NONE > - UP, PREEMPT_RT_FULL > - UP, PREEMPT_NONE > > > The PandaBoard frequently fails to boot with an eth0 error. I have > only seen this for SMP, PREEMPT_NONE. This same error also occurs > in 3.6.7 without the RT_PREEMPT patches applied, so this does not > appear to be an RT_PREEMPT issue. > > This error also occurred in the previous smoke test of linux-3.6.7-rt18 > > http://permalink.gmane.org/gmane.linux.rt.user/9202 > > The boot error starts with the USB time out on ep0out: > > [ 3.264373] smsc95xx 1-1.1:1.0: usb_probe_interface > [ 3.269500] smsc95xx 1-1.1:1.0: usb_probe_interface - got id > [ 3.275543] smsc95xx v1.0.4 > [ 8.078674] smsc95xx 1-1.1:1.0: eth0: register 'smsc95xx' at usb-ehci-omap.0-1.1, smsc95xx USB 2.0 Ethernet, 82:b9:1d:fa:67:0d > [ 8.091003] hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002 > [ 13.509918] usb 1-1.1: swapper/0 timed out on ep0out len=0/4 > [ 13.515869] smsc95xx 1-1.1:1.0: eth0: Failed to write register index 0x00000108 > [ 13.523559] smsc95xx 1-1.1:1.0: eth0: Failed to write ADDRL: -110 > [ 13.529998] IP-Config: Failed to open eth0 > > > The boot error can be avoided with a workaround that simply retries > the failed USB command. I will reply to this email with the patch. > > -Frank panda boot often fails due to a usb timeout, while sending a command on behalf of the smsc95xx ethernet driver. This patch is a temporary hack to force a retry when the timeout occurs. # bad boot: hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002 usb 1-1.1: swapper timed out on ep0out len=0/4 smsc95xx 1-1.1:1.0: eth0: Failed to write register index 0x00000020 smsc95xx 1-1.1:1.0: eth0: Failed to write PM_CTRL: -110 IP-Config: Failed to open eth0 # good boot: hub 1-1:1.0: state 7 ports 5 chg 0000 evt 0002 usb 1-1.1: link qh8-0001/dc0bae80 start 2 [1/0 us] IP-Config: Complete: device=eth0, addr=192.168.1.85, mask=255.255.255.0, gw=192.168.1.1 host=panda, domain=, nis-domain=(none) bootserver=192.168.1.1, rootserver=192.168.1.1, rootpath= smsc95xx 1-1.1:1.0: eth0: link up, 100Mbps, full-duplex, lpa 0x41E1 VFS: Mounted root (nfs filesystem) on device 0:11. --- drivers/net/usb/smsc95xx.c | 9 8 + 1 - 0 ! 1 file changed, 8 insertions(+), 1 deletion(-) Index: b/drivers/net/usb/smsc95xx.c =================================================================== --- a/drivers/net/usb/smsc95xx.c +++ b/drivers/net/usb/smsc95xx.c @@ -92,6 +92,7 @@ static int smsc95xx_write_reg(struct usb { u32 *buf = kmalloc(4, GFP_KERNEL); int ret; + int retry = 5; BUG_ON(!dev); @@ -101,13 +102,19 @@ static int smsc95xx_write_reg(struct usb *buf = data; cpu_to_le32s(buf); +try_again: ret = usb_control_msg(dev->udev, usb_sndctrlpipe(dev->udev, 0), USB_VENDOR_REQUEST_WRITE_REGISTER, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 00, index, buf, 4, USB_CTRL_SET_TIMEOUT); - if (unlikely(ret < 0)) + if (unlikely(ret < 0)) { netdev_warn(dev->net, "Failed to write register index 0x%08x\n", index); + if (retry-- && ret == -ETIMEDOUT) { + netdev_warn(dev->net, "failed due to timed out, retrying\n"); + goto try_again; + } + } kfree(buf); ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: linux-3.6.11-rt30 smoke test on ARM 2013-03-08 4:03 ` Frank Rowand @ 2013-03-11 17:34 ` Sebastian Andrzej Siewior 2013-03-13 0:44 ` Frank Rowand 0 siblings, 1 reply; 5+ messages in thread From: Sebastian Andrzej Siewior @ 2013-03-11 17:34 UTC (permalink / raw) To: Frank Rowand; +Cc: Thomas Gleixner, linux-kernel, linux-rt-users * Frank Rowand | 2013-03-07 20:03:18 [-0800]: >panda boot often fails due to a usb timeout, while sending a command on >behalf of the smsc95xx ethernet driver. > >This patch is a temporary hack to force a retry when the timeout occurs. It looks like you overrun the chip for some reason. Can you reproduce it on mainline? They added a few delayes on register read() it might do the trick. Sebastian ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: linux-3.6.11-rt30 smoke test on ARM 2013-03-11 17:34 ` Sebastian Andrzej Siewior @ 2013-03-13 0:44 ` Frank Rowand 2013-03-21 20:26 ` Frank Rowand 0 siblings, 1 reply; 5+ messages in thread From: Frank Rowand @ 2013-03-13 0:44 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org On 03/11/13 10:34, Sebastian Andrzej Siewior wrote: > * Frank Rowand | 2013-03-07 20:03:18 [-0800]: > >> panda boot often fails due to a usb timeout, while sending a command on >> behalf of the smsc95xx ethernet driver. >> >> This patch is a temporary hack to force a retry when the timeout occurs. > > It looks like you overrun the chip for some reason. Can you reproduce it > on mainline? They added a few delayes on register read() it might do the > trick. Yes, I can reproduce it on mainline. Here is the current state of my debugging: The problem usually occurs within three boot attempts. But it has also taken eight boot attempts to see the problem. I do not know what the maximum number of boots is required to see the problem, so I can not state with certainty that a given kernel version does not have the problem. If the boot fails then I can state with certainty that the given kernel version has the problem. Given that level of uncertainty, I know: v3.5 does not appear to have the problem v3.6-rc1 has the problem v3.6 has the problem v3.7 has the problem v3.8 does not appear to have the problem v3.9-rc1 fails to build I thought I had bisected the problem to a specific commit, but wanting to be sure of it, I did extra boots of what should have been the last good commit. On the 7th boot, that kernel version had the problem. I'll probably redo the bisect, but have not had time to do so yet. The problem manifests as a timeout from at least two different locations in drivers/net/usb/smsc95xx.c: 656 static int smsc95xx_set_mac_address(struct usbnet *dev) 657 { ... 663 ret = smsc95xx_write_reg(dev, ADDRL, addr_lo); 664 if (ret < 0) { 665 netdev_warn(dev->net, "Failed to write ADDRL: %d\n", ret); 666 return ret; 667 } 751 static int smsc95xx_reset(struct usbnet *dev) 752 { ... 783 write_buf = PM_CTL_PHY_RST_; 784 ret = smsc95xx_write_reg(dev, PM_CTRL, write_buf); 785 if (ret < 0) { 786 netdev_warn(dev->net, "Failed to write PM_CTRL: %d\n", ret); 787 return ret; 788 } Some of the other smsc95xx_write_reg() calls in smsc95xx_reset() are protected with checks for timeout, with up to 100 retries. I do not know if this one should have the same protection. -Frank ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: linux-3.6.11-rt30 smoke test on ARM 2013-03-13 0:44 ` Frank Rowand @ 2013-03-21 20:26 ` Frank Rowand 0 siblings, 0 replies; 5+ messages in thread From: Frank Rowand @ 2013-03-21 20:26 UTC (permalink / raw) Cc: Sebastian Andrzej Siewior, Thomas Gleixner, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org On 03/12/13 17:44, Frank Rowand wrote: > On 03/11/13 10:34, Sebastian Andrzej Siewior wrote: >> * Frank Rowand | 2013-03-07 20:03:18 [-0800]: >> >>> panda boot often fails due to a usb timeout, while sending a command on >>> behalf of the smsc95xx ethernet driver. >>> >>> This patch is a temporary hack to force a retry when the timeout occurs. >> >> It looks like you overrun the chip for some reason. Can you reproduce it >> on mainline? They added a few delayes on register read() it might do the >> trick. > > Yes, I can reproduce it on mainline. > > Here is the current state of my debugging: > > The problem usually occurs within three boot attempts. But it has also > taken eight boot attempts to see the problem. I do not know what the > maximum number of boots is required to see the problem, so I can not > state with certainty that a given kernel version does not have the > problem. If the boot fails then I can state with certainty that the > given kernel version has the problem. > > Given that level of uncertainty, I know: > > v3.5 does not appear to have the problem > v3.6-rc1 has the problem > v3.6 has the problem > v3.7 has the problem > v3.8 does not appear to have the problem > v3.9-rc1 fails to build > > I thought I had bisected the problem to a specific commit, but wanting > to be sure of it, I did extra boots of what should have been the last > good commit. On the 7th boot, that kernel version had the problem. > > I'll probably redo the bisect, but have not had time to do so yet. I did the bisect again, with more boot tests per bisect point, and found the commit to blame. Hopefully the problem will be resolved in the thread where I report the bisect: https://lkml.org/lkml/2013/3/20/742 > > The problem manifests as a timeout from at least two different locations > in drivers/net/usb/smsc95xx.c: > > > 656 static int smsc95xx_set_mac_address(struct usbnet *dev) > 657 { > ... > 663 ret = smsc95xx_write_reg(dev, ADDRL, addr_lo); > 664 if (ret < 0) { > 665 netdev_warn(dev->net, "Failed to write ADDRL: %d\n", ret); > 666 return ret; > 667 } > > > 751 static int smsc95xx_reset(struct usbnet *dev) > 752 { > ... > 783 write_buf = PM_CTL_PHY_RST_; > 784 ret = smsc95xx_write_reg(dev, PM_CTRL, write_buf); > 785 if (ret < 0) { > 786 netdev_warn(dev->net, "Failed to write PM_CTRL: %d\n", ret); > 787 return ret; > 788 } > > > Some of the other smsc95xx_write_reg() calls in smsc95xx_reset() are protected with > checks for timeout, with up to 100 retries. I do not know if this one should have > the same protection. > > -Frank ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-03-21 20:26 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-08 3:55 linux-3.6.11-rt30 smoke test on ARM Frank Rowand 2013-03-08 4:03 ` Frank Rowand 2013-03-11 17:34 ` Sebastian Andrzej Siewior 2013-03-13 0:44 ` Frank Rowand 2013-03-21 20:26 ` Frank Rowand
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox