* transmit lockup using smsc95xx ethernet on usb3
@ 2013-10-15 9:59 David Laight
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B738B-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2013-10-15 9:59 UTC (permalink / raw)
To: netdev, linux-usb
We are seeing complete lockups of the transmit side when using
the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
These errors are very intermittent - less than once a day, and
it isn't actually clear that they are related to traffic load.
Most of the systems are running the 3.2 kernel from Ubuntu 12.04
but I've seen the same problem when running a 3.4 kernel.
Looking at the changelog for xhci-ring.c I can see that some
'nasty' bugs were fixed between 3.2 and 3.4 (and possibly since)
but the usbmon trace I've now got doesn't seem to match any
of the changelog entries.
We are also seeing similar problems if we connect to a USB2
header.
Since we can't reproduce the problem quickly it is difficult to
do any analysis. Any suggestions for increasing the error rate
would be welcome.
Below is an annotated extract from a usbmon trace while running
a netperf test that was sending 8192 byte TCP packets (nagle off).
I've deleted the Bi entries (packets are received throughout)
and numbered all the others (modulo 10000) so it is easier to
see when the requests complete, I've also calculated the elapsed
time and the number of Setup entries between the S and C traces.
The USB ring seems to have 60 outstanding transmits in it,
each time one completes another is sent. There are a few 10000
traces of that then:
start:9870 ffff88020ea16000 293811125 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff11
done:9811:6969:60 ffff88020c7c8000 293811236 C Bo:3:003:2 0 1090 >
start:9871 ffff88020ea16a80 293811242 S Bo:3:003:2 -115 1090 =
3a340000 3a440000 22003200 00224d98
d8460002 1f0057d7 08004500 0428ff12
...
start:9929 ffff88020ea16780 293817964 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff4c
Last successful completion.
done:9870:6968:60 ffff88020ea16000 293818093 C Bo:3:003:2 0 1514 >
start:9930 ffff88020ea16000 293818099 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff4d
At this point something (untraced) seems to have gone horribly
wrong in the transmit ring. dmesg shows nothing.
Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 512 >
done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 0
done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 0
done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 0
...
done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
Something is known to be wrong...
start:9931 ffff88020ea160c0 293819037 S Co:3:003:0
s 02 01 0000 0002 0000 0
done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0
These 10 transmits never finish:
start:9932 ffff88020ea160c0 293819098 S Bo:3:003:2 -115 1090 =
3a340000 3a440000 22003200 00224d98
d8460002 1f0057d7 08004500 0428ff4e
... 9933 to 9940 deleted
start:9941 ffff88020ea16b40 293819111 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff57
All further transmits fail immediately E -12 and generate the
'xhci_hcd 0000:00:14.0: ERROR no room on ep ring' message.
(There are 1070 'E' traces and 1070 'no room' messages.)
Receives are still working.
start:9942 ffff88020ea16240 293819113 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff58
done:9942:1550:1 ffff88020ea16240 293820663 E Bo:3:003:2 -12 0
start:9943 ffff88020ea16240 293820675 S Bo:3:003:2 -115 1514 =
e2350000 e2450000 22003200 00224d98
d8460002 1f0057d7 08004500 05d0ff59
done:9943:1507:1 ffff88020ea16240 293822182 E Bo:3:003:2 -12 0
Eventually something causes a device remove and insert - everything re-initialises.
This is over 12 hours later.
done:unknown ffff88020c8570c0 3637139297 C Ii:3:001:1 0:2048 1 = 02
start:1015 ffff88020c8570c0 3637139302 S Ii:3:001:1 -115:2048 4 <
start:1016 ffff88020cbeb300 3637139323 S Ci:3:001:0
s a3 00 0000 0001 0004 4 <
ffff88020ea16240 3637139331 C Bi:3:003:1 -71 0
done:1016:9:1 ffff88020cbeb300 3637139332 C Ci:3:001:0 0 4 = 00010100
start:1017 ffff88020cbeb300 3637139334 S Co:3:001:0
s 23 01 0010 0001 0000 0
done:1017:4:1 ffff88020cbeb300 3637139338 C Co:3:001:0 0 0
done:unknown ffff88020ca9ae40 3637139423 C Ii:3:003:3 -71:1 0
ffff88020c9db540 3637139428 C Bi:3:003:1 -108 0
ffff88020c9db780 3637139430 C Bi:3:003:1 -108 0
ffff88020d8bb540 3637139431 C Bi:3:003:1 -108 0
The last 10 transmits then terminate with error -108:
done:9932:xxxx ffff88020ea160c0 3637139462 C Bo:3:003:2 -108 0
... 9933 to 9940 deleted
done:9941:xxxx ffff88020ea16b40 3637139482 C Bo:3:003:2 -108 0
done:1015:21090:3 ffff88020c8570c0 3637160392 C Ii:3:001:1 0:2048 1 = 02
start:1018 ffff88020c8570c0 3637160396 S Ii:3:001:1 -115:2048 4 <
done:unknown ffff88020cbf26c0 3637176790 C Ii:3:005:1 -108:8 0
done:unknown ffff88020c68aa80 3637622497 C Ii:3:002:1 -108:2048 0
Followed by lots of Ci/Co and eventually it all starts working again.
I've not yet tried to look up the control transfers.
These aren't the only errors we are seeing, we also see (separately):
[21549.917529] hub 3-2:1.0: port 1 disabled by hub (EMI?), re-enabling...
[ 5822.629579] NETDEV WATCHDOG: eth0 (smsc95xx): transmit queue 0 timed out
[ 7263.834404] hid-generic 0003:413C:2005.0002: can't reset device, 0000:00:1a.0-1.4.3/input0, status -71 (connected to a USB2 header).
These all cause a USB bus reset and everything recovers within a couple of seconds.
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: transmit lockup using smsc95xx ethernet on usb3
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B738B-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
@ 2013-10-17 0:07 ` Sarah Sharp
2013-10-17 14:30 ` David Laight
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Sarah Sharp @ 2013-10-17 0:07 UTC (permalink / raw)
To: David Laight
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
Xenia Ragiadakou
On Tue, Oct 15, 2013 at 10:59:18AM +0100, David Laight wrote:
> We are seeing complete lockups of the transmit side when using
> the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> These errors are very intermittent - less than once a day, and
> it isn't actually clear that they are related to traffic load.
>
> Most of the systems are running the 3.2 kernel from Ubuntu 12.04
> but I've seen the same problem when running a 3.4 kernel.
Have you tried the latest stable kernel or the latest -rc kernel?
> Looking at the changelog for xhci-ring.c I can see that some
> 'nasty' bugs were fixed between 3.2 and 3.4 (and possibly since)
> but the usbmon trace I've now got doesn't seem to match any
> of the changelog entries.
>
> We are also seeing similar problems if we connect to a USB2
> header.
Do you mean a USB 2.0 port under the xHCI host controller? What does
`lsusb -t` show as the parent host controller in that case?
> Since we can't reproduce the problem quickly it is difficult to
> do any analysis. Any suggestions for increasing the error rate
> would be welcome.
>
> Below is an annotated extract from a usbmon trace while running
> a netperf test that was sending 8192 byte TCP packets (nagle off).
> I've deleted the Bi entries (packets are received throughout)
> and numbered all the others (modulo 10000) so it is easier to
> see when the requests complete, I've also calculated the elapsed
> time and the number of Setup entries between the S and C traces.
>
> The USB ring seems to have 60 outstanding transmits in it,
> each time one completes another is sent. There are a few 10000
> traces of that then:
>
> start:9870 ffff88020ea16000 293811125 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff11
> done:9811:6969:60 ffff88020c7c8000 293811236 C Bo:3:003:2 0 1090 >
> start:9871 ffff88020ea16a80 293811242 S Bo:3:003:2 -115 1090 =
> 3a340000 3a440000 22003200 00224d98
> d8460002 1f0057d7 08004500 0428ff12
> ...
> start:9929 ffff88020ea16780 293817964 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff4c
> Last successful completion.
> done:9870:6968:60 ffff88020ea16000 293818093 C Bo:3:003:2 0 1514 >
> start:9930 ffff88020ea16000 293818099 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff4d
>
> At this point something (untraced) seems to have gone horribly
> wrong in the transmit ring. dmesg shows nothing.
> Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
> done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 512 >
> done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 0
> done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
> done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
> done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
> done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
> done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
> done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
> done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 0
> done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 0
> ...
> done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
> done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
> Something is known to be wrong...
> start:9931 ffff88020ea160c0 293819037 S Co:3:003:0
> s 02 01 0000 0002 0000 0
> done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
> done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
> done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0
>
> These 10 transmits never finish:
> start:9932 ffff88020ea160c0 293819098 S Bo:3:003:2 -115 1090 =
> 3a340000 3a440000 22003200 00224d98
> d8460002 1f0057d7 08004500 0428ff4e
> ... 9933 to 9940 deleted
> start:9941 ffff88020ea16b40 293819111 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff57
>
> All further transmits fail immediately E -12 and generate the
> 'xhci_hcd 0000:00:14.0: ERROR no room on ep ring' message.
> (There are 1070 'E' traces and 1070 'no room' messages.)
> Receives are still working.
> start:9942 ffff88020ea16240 293819113 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff58
> done:9942:1550:1 ffff88020ea16240 293820663 E Bo:3:003:2 -12 0
> start:9943 ffff88020ea16240 293820675 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff59
> done:9943:1507:1 ffff88020ea16240 293822182 E Bo:3:003:2 -12 0
>
> Eventually something causes a device remove and insert - everything re-initialises.
> This is over 12 hours later.
> done:unknown ffff88020c8570c0 3637139297 C Ii:3:001:1 0:2048 1 = 02
> start:1015 ffff88020c8570c0 3637139302 S Ii:3:001:1 -115:2048 4 <
> start:1016 ffff88020cbeb300 3637139323 S Ci:3:001:0
> s a3 00 0000 0001 0004 4 <
> ffff88020ea16240 3637139331 C Bi:3:003:1 -71 0
> done:1016:9:1 ffff88020cbeb300 3637139332 C Ci:3:001:0 0 4 = 00010100
> start:1017 ffff88020cbeb300 3637139334 S Co:3:001:0
> s 23 01 0010 0001 0000 0
> done:1017:4:1 ffff88020cbeb300 3637139338 C Co:3:001:0 0 0
> done:unknown ffff88020ca9ae40 3637139423 C Ii:3:003:3 -71:1 0
> ffff88020c9db540 3637139428 C Bi:3:003:1 -108 0
> ffff88020c9db780 3637139430 C Bi:3:003:1 -108 0
> ffff88020d8bb540 3637139431 C Bi:3:003:1 -108 0
>
> The last 10 transmits then terminate with error -108:
> done:9932:xxxx ffff88020ea160c0 3637139462 C Bo:3:003:2 -108 0
> ... 9933 to 9940 deleted
> done:9941:xxxx ffff88020ea16b40 3637139482 C Bo:3:003:2 -108 0
> done:1015:21090:3 ffff88020c8570c0 3637160392 C Ii:3:001:1 0:2048 1 = 02
> start:1018 ffff88020c8570c0 3637160396 S Ii:3:001:1 -115:2048 4 <
> done:unknown ffff88020cbf26c0 3637176790 C Ii:3:005:1 -108:8 0
> done:unknown ffff88020c68aa80 3637622497 C Ii:3:002:1 -108:2048 0
>
> Followed by lots of Ci/Co and eventually it all starts working again.
>
> I've not yet tried to look up the control transfers.
>
> These aren't the only errors we are seeing, we also see (separately):
> [21549.917529] hub 3-2:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [ 5822.629579] NETDEV WATCHDOG: eth0 (smsc95xx): transmit queue 0 timed out
> [ 7263.834404] hid-generic 0003:413C:2005.0002: can't reset device, 0000:00:1a.0-1.4.3/input0, status -71 (connected to a USB2 header).
> These all cause a USB bus reset and everything recovers within a couple of seconds.
I would suggest you try with the latest stable kernel, with
CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING enabled. If you try
the latest 3.12-rc, you will only need the CONFIG_USB_DEBUG. Or, if
that output is too much (it will spew on short packets, which may be an
issue with your ethernet adapter), with the 3.12-rc kernel, you can turn
off CONFIG_USB_DEBUG and capture command trace events through the xHCI
trace infrastructure. Xenia can help you with that if necessary (I'm
going to be at a conference starting Friday).
Without that dmesg, I really can't tell what's happening at an xHCI
level.
Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 0:07 ` Sarah Sharp
@ 2013-10-17 14:30 ` David Laight
2013-10-17 17:43 ` Sarah Sharp
2013-10-24 15:05 ` David Laight
2013-10-28 17:09 ` David Laight
2 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2013-10-17 14:30 UTC (permalink / raw)
To: Sarah Sharp; +Cc: netdev, linux-usb, Xenia Ragiadakou
> From: Sarah Sharp [mailto:sarah.a.sharp@linux.intel.com]
Thanks for taking an interest.
> On Tue, Oct 15, 2013 at 10:59:18AM +0100, David Laight wrote:
> > We are seeing complete lockups of the transmit side when using
> > the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> > These errors are very intermittent - less than once a day, and
> > it isn't actually clear that they are related to traffic load.
> >
> > Most of the systems are running the 3.2 kernel from Ubuntu 12.04
> > but I've seen the same problem when running a 3.4 kernel.
>
> Have you tried the latest stable kernel or the latest -rc kernel?
No. Although I've written a lot of device driver and comms protocol
stack code over the years I've not had to work out how to build
linux kernels - this may be the time I start!
Given the difficulty (or rather the infrequency) of reproducing the
problem I'd like to sort out the failing code path before changing
kernels so that I can then verify that a more recent kernel fixes it.
...
> > We are also seeing similar problems if we connect to a USB2
> > header.
>
> Do you mean a USB 2.0 port under the xHCI host controller? What does
> `lsusb -t` show as the parent host controller in that case?
To clarify the fail trace below is from an xhci controller, but
I'm pretty sure we've seen a tx lockup when using ohci.
The usbmon trace when the tx locks up starts with:
> > Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
> > done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 EPROTO 512 >
> > done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 EPROTO 0
> > done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
> > done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
> > done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
> > done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
> > done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
> > done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
> > done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 EPIPE 0
> > done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 ECONNRESET 0
> > ... the ring is cleared in a software loop
> > done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
> > done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
> > Something is known to be wrong...
> > start:9931 ffff88020ea160c0 293819037 S Co:3:003:0
> > s 02 01 0000 0002 0000 0
> > done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
> > done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
> > done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0
I've also seen resets that start with an interrupt from device 1.
In this case the ring is cleared with ESHUTDOWN and dmesg traces what looks
like an unplug-plug action.
Last successful ethernet transmit
ffff88020c4870c0 701760986 C Bo:3:018:2 0 1090 >
ffff88020c4870c0 701760992 S Bo:3:018:2 -115 1090
= 3a340000 3a440000 22003200 00224d98
d8460002 1f0057d7 08004500 042879ca
Interrupt - I think from the root hub.
ffff88020c8570c0 701761038 C Ii:3:001:1 0:2048 1 = 02
ffff88020c8570c0 701761042 S Ii:3:001:1 -115:2048 4 <
ffff88020ea16840 701761046 C Ii:3:018:3 -71:1 0 EPROTO
ffff88020ea16840 701761047 S Ii:3:018:3 -115:1 16 <
ffff88020c53c480 701761051 C Bi:3:018:1 -71 0
ffff88020c487180 701761054 C Bo:3:018:2 -71 1024 >
ffff880210570240 701761063 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
ffff880210570240 701761071 C Ci:3:001:0 0 4 = 00010100
ffff880210570240 701761074 S Co:3:001:0 s 23 01 0010 0001 0000 0
ffff88020c53c540 701761076 C Bi:3:018:1 -71 0
ffff880210570240 701761078 C Co:3:001:0 0 0
ffff88020c487240 701761117 C Bo:3:018:2 -71 0
ffff88020c53cd80 701761156 C Bi:3:018:1 -108 0 ESHITDOWN
ffff88020c53c9c0 701761158 C Bi:3:018:1 -108 0
ffff88020c487840 701761196 C Bo:3:018:2 -108 0
ffff88020c487900 701761201 C Bo:3:018:2 -108 0
... lots of similar lines deleted
ffff88020c487540 701761299 C Bo:3:018:2 -108 0
ffff88020c4870c0 701761300 C Bo:3:018:2 -108 0
ffff88020ea16840 701761304 C Ii:3:018:3 -71:1 0
ffff88020c8570c0 701782179 C Ii:3:001:1 0:2048 1 = 02
ffff88020c8570c0 701782183 S Ii:3:001:1 -115:2048 4 <
ffff88020c95be40 703089906 C Ii:3:020:1 -108:8 0
device 17 will be the hub on the smsc95xx chip
ffff88020c95b240 703427572 C Ii:3:017:1 -108:2048 0
ffff88020c487540 703427788 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
ffff88020c487540 703427803 C Ci:3:001:0 0 4 = 01010100
...
> I would suggest you try with the latest stable kernel, with
> CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING enabled. If you try
> the latest 3.12-rc, you will only need the CONFIG_USB_DEBUG. Or, if
> that output is too much (it will spew on short packets, which may be an
> issue with your ethernet adapter),
The only way I've got the above usbmon trace is by reading 1000000
lines (with dd) and saving the file if the last line output by dmesg
changes. I suspect the above might trace too much for that!
I think the kernel I have has the xhci_warn() compiled out - I'm not
seeing any of those reported by dmesg. I looked at the .h files and
couldn't see where xhci_warn() gets suppressed.
> with the 3.12-rc kernel, you can turn
> off CONFIG_USB_DEBUG and capture command trace events through the xHCI
> trace infrastructure. Xenia can help you with that if necessary (I'm
> going to be at a conference starting Friday).
>
> Without that dmesg, I really can't tell what's happening at an xHCI
> level.
Have you any idea which code loops are clearing the ring with ECONNRESET
and ESHUTDOWN and where they might be being called from?
I've been reading the code but couldn't see any obvious paths that would
lead to the ring being cleared with those codes.
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 14:30 ` David Laight
@ 2013-10-17 17:43 ` Sarah Sharp
2013-10-17 18:07 ` Alan Stern
2013-10-18 15:22 ` David Laight
0 siblings, 2 replies; 10+ messages in thread
From: Sarah Sharp @ 2013-10-17 17:43 UTC (permalink / raw)
To: David Laight; +Cc: netdev, linux-usb, Xenia Ragiadakou
On Thu, Oct 17, 2013 at 03:30:56PM +0100, David Laight wrote:
> > From: Sarah Sharp [mailto:sarah.a.sharp@linux.intel.com]
>
> Thanks for taking an interest.
>
> > On Tue, Oct 15, 2013 at 10:59:18AM +0100, David Laight wrote:
> > > We are seeing complete lockups of the transmit side when using
> > > the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> > > These errors are very intermittent - less than once a day, and
> > > it isn't actually clear that they are related to traffic load.
> > >
> > > Most of the systems are running the 3.2 kernel from Ubuntu 12.04
> > > but I've seen the same problem when running a 3.4 kernel.
> >
> > Have you tried the latest stable kernel or the latest -rc kernel?
>
> No. Although I've written a lot of device driver and comms protocol
> stack code over the years I've not had to work out how to build
> linux kernels - this may be the time I start!
It's not too hard. Here's some directions:
http://kernelnewbies.org/KernelBuild
Try building the latest stable tree, and let me know if you still have
issues. If so, we'll figure out a plan to add some pretty small
debugging to the xHCI driver.
> Given the difficulty (or rather the infrequency) of reproducing the
> problem I'd like to sort out the failing code path before changing
> kernels so that I can then verify that a more recent kernel fixes it.
The problem is that -ESHUTDOWN usually means there's a driver or host
controller issue. Numerous bug fixes have gone in since 3.4 to avoid
such host controller issues. It would be a waste of time for me to
attempt debug your issue, only to discover it had been fixed in a more
recent kernel. So I would really rather you test on a stable kernel,
see if the issue still occurs, and then we can work from a known good
base to figure out where the problem is.
> ...
> > > We are also seeing similar problems if we connect to a USB2
> > > header.
> >
> > Do you mean a USB 2.0 port under the xHCI host controller? What does
> > `lsusb -t` show as the parent host controller in that case?
>
> To clarify the fail trace below is from an xhci controller, but
> I'm pretty sure we've seen a tx lockup when using ohci.
Then it might not be an xHCI host specific issue.
> The usbmon trace when the tx locks up starts with:
>
> > > Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
> > > done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 EPROTO 512 >
> > > done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 EPROTO 0
> > > done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
> > > done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
> > > done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
> > > done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
> > > done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
> > > done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
> > > done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 EPIPE 0
> > > done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 ECONNRESET 0
> > > ... the ring is cleared in a software loop
> > > done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
> > > done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
> > > Something is known to be wrong...
> > > start:9931 ffff88020ea160c0 293819037 S Co:3:003:0
> > > s 02 01 0000 0002 0000 0
> > > done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
> > > done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
> > > done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0
>
> I've also seen resets that start with an interrupt from device 1.
> In this case the ring is cleared with ESHUTDOWN and dmesg traces what looks
> like an unplug-plug action.
I really care a lot more about dmesg than a usbmon trace.
> Last successful ethernet transmit
> ffff88020c4870c0 701760986 C Bo:3:018:2 0 1090 >
> ffff88020c4870c0 701760992 S Bo:3:018:2 -115 1090
> = 3a340000 3a440000 22003200 00224d98
> d8460002 1f0057d7 08004500 042879ca
> Interrupt - I think from the root hub.
> ffff88020c8570c0 701761038 C Ii:3:001:1 0:2048 1 = 02
> ffff88020c8570c0 701761042 S Ii:3:001:1 -115:2048 4 <
> ffff88020ea16840 701761046 C Ii:3:018:3 -71:1 0 EPROTO
> ffff88020ea16840 701761047 S Ii:3:018:3 -115:1 16 <
> ffff88020c53c480 701761051 C Bi:3:018:1 -71 0
> ffff88020c487180 701761054 C Bo:3:018:2 -71 1024 >
> ffff880210570240 701761063 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
> ffff880210570240 701761071 C Ci:3:001:0 0 4 = 00010100
> ffff880210570240 701761074 S Co:3:001:0 s 23 01 0010 0001 0000 0
> ffff88020c53c540 701761076 C Bi:3:018:1 -71 0
> ffff880210570240 701761078 C Co:3:001:0 0 0
> ffff88020c487240 701761117 C Bo:3:018:2 -71 0
> ffff88020c53cd80 701761156 C Bi:3:018:1 -108 0 ESHITDOWN
I assume you meant "ESHUTDOWN" instead of "ESHITDOWN". :)
> ffff88020c53c9c0 701761158 C Bi:3:018:1 -108 0
> ffff88020c487840 701761196 C Bo:3:018:2 -108 0
> ffff88020c487900 701761201 C Bo:3:018:2 -108 0
> ... lots of similar lines deleted
> ffff88020c487540 701761299 C Bo:3:018:2 -108 0
> ffff88020c4870c0 701761300 C Bo:3:018:2 -108 0
> ffff88020ea16840 701761304 C Ii:3:018:3 -71:1 0
> ffff88020c8570c0 701782179 C Ii:3:001:1 0:2048 1 = 02
> ffff88020c8570c0 701782183 S Ii:3:001:1 -115:2048 4 <
> ffff88020c95be40 703089906 C Ii:3:020:1 -108:8 0
> device 17 will be the hub on the smsc95xx chip
> ffff88020c95b240 703427572 C Ii:3:017:1 -108:2048 0
> ffff88020c487540 703427788 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
> ffff88020c487540 703427803 C Ci:3:001:0 0 4 = 01010100
>
> ...
> > I would suggest you try with the latest stable kernel, with
> > CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING enabled. If you try
> > the latest 3.12-rc, you will only need the CONFIG_USB_DEBUG. Or, if
> > that output is too much (it will spew on short packets, which may be an
> > issue with your ethernet adapter),
>
> The only way I've got the above usbmon trace is by reading 1000000
> lines (with dd) and saving the file if the last line output by dmesg
> changes. I suspect the above might trace too much for that!
> I think the kernel I have has the xhci_warn() compiled out - I'm not
> seeing any of those reported by dmesg. I looked at the .h files and
> couldn't see where xhci_warn() gets suppressed.
If CONFIG_USB_DEBUG isn't set, then the calls to dev_dbg and dev_warn
are suppressed.
> > with the 3.12-rc kernel, you can turn
> > off CONFIG_USB_DEBUG and capture command trace events through the xHCI
> > trace infrastructure. Xenia can help you with that if necessary (I'm
> > going to be at a conference starting Friday).
> >
> > Without that dmesg, I really can't tell what's happening at an xHCI
> > level.
>
> Have you any idea which code loops are clearing the ring with ECONNRESET
> and ESHUTDOWN and where they might be being called from?
> I've been reading the code but couldn't see any obvious paths that would
> lead to the ring being cleared with those codes.
There's a lot of cases in the host controller drivers in
drivers/usb/host/ where URBs are returned with ESHUTDOWN.
If you look in the xHCI driver, ESHUTDOWN occurs when the host "dies",
usually because it stopped responding to commands. There were several
fixes that went into the kernel relating to command handling that could
help you. So, as I said, I would like you to test from a known good
base (the latest stable kernel) and we can debug from there.
Sarah Sharp
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 17:43 ` Sarah Sharp
@ 2013-10-17 18:07 ` Alan Stern
2013-10-18 15:22 ` David Laight
1 sibling, 0 replies; 10+ messages in thread
From: Alan Stern @ 2013-10-17 18:07 UTC (permalink / raw)
To: Sarah Sharp
Cc: David Laight, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-usb-u79uwXL29TY76Z2rM5mHXA, Xenia Ragiadakou
On Thu, 17 Oct 2013, Sarah Sharp wrote:
> > Given the difficulty (or rather the infrequency) of reproducing the
> > problem I'd like to sort out the failing code path before changing
> > kernels so that I can then verify that a more recent kernel fixes it.
>
> The problem is that -ESHUTDOWN usually means there's a driver or host
> controller issue. Numerous bug fixes have gone in since 3.4 to avoid
> such host controller issues. It would be a waste of time for me to
> attempt debug your issue, only to discover it had been fixed in a more
> recent kernel. So I would really rather you test on a stable kernel,
> see if the issue still occurs, and then we can work from a known good
> base to figure out where the problem is.
-ESHUTDOWN really indicates either that the system believes the device
has been disconnected from the USB bus or that the host controller
itself has stopped working.
> > To clarify the fail trace below is from an xhci controller, but
> > I'm pretty sure we've seen a tx lockup when using ohci.
>
> Then it might not be an xHCI host specific issue.
Undoubtedly not.
> > The usbmon trace when the tx locks up starts with:
> >
> > > > Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
> > > > done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 EPROTO 512 >
> > > > done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 EPROTO 0
Those -71 errors indicate a low-level problem. It generally means that
the device has stopped sending packets. Perhaps its firmware has
crashed, or perhaps it has disconnected itself electrically from the
bus.
> > Last successful ethernet transmit
> > ffff88020c4870c0 701760986 C Bo:3:018:2 0 1090 >
> > ffff88020c4870c0 701760992 S Bo:3:018:2 -115 1090
> > = 3a340000 3a440000 22003200 00224d98
> > d8460002 1f0057d7 08004500 042879ca
> > Interrupt - I think from the root hub.
> > ffff88020c8570c0 701761038 C Ii:3:001:1 0:2048 1 = 02
> > ffff88020c8570c0 701761042 S Ii:3:001:1 -115:2048 4 <
> > ffff88020ea16840 701761046 C Ii:3:018:3 -71:1 0 EPROTO
> > ffff88020ea16840 701761047 S Ii:3:018:3 -115:1 16 <
> > ffff88020c53c480 701761051 C Bi:3:018:1 -71 0
> > ffff88020c487180 701761054 C Bo:3:018:2 -71 1024 >
> > ffff880210570240 701761063 S Ci:3:001:0 s a3 00 0000 0001 0004 4 <
> > ffff880210570240 701761071 C Ci:3:001:0 0 4 = 00010100
These last two lines show the host controller telling the system that
the device has disconnected. Once that happens, any future
communication with the device is hopeless.
Alan Stern
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 17:43 ` Sarah Sharp
2013-10-17 18:07 ` Alan Stern
@ 2013-10-18 15:22 ` David Laight
2013-10-18 15:50 ` Greg KH
1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2013-10-18 15:22 UTC (permalink / raw)
To: Sarah Sharp
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
Xenia Ragiadakou
> It's not too hard. Here's some directions:
> http://kernelnewbies.org/KernelBuild
I can build it easily enough but ...
Is there a .config file with a sane list of modules anywhere?
(and builtin drivers for disk boot)
The defconfig one doesn't seem to contain enough to be useful,
and the allmodconfig gives me 800MB+ to copy from the build
system to the test one (I can't seem to see an obvious way of
doing this either - apart from installing them to the 'wrong'
place on the build system and then copying everything over).
David
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: transmit lockup using smsc95xx ethernet on usb3
2013-10-18 15:22 ` David Laight
@ 2013-10-18 15:50 ` Greg KH
0 siblings, 0 replies; 10+ messages in thread
From: Greg KH @ 2013-10-18 15:50 UTC (permalink / raw)
To: David Laight; +Cc: Sarah Sharp, netdev, linux-usb, Xenia Ragiadakou
On Fri, Oct 18, 2013 at 04:22:02PM +0100, David Laight wrote:
> > It's not too hard. Here's some directions:
> > http://kernelnewbies.org/KernelBuild
>
> I can build it easily enough but ...
>
> Is there a .config file with a sane list of modules anywhere?
> (and builtin drivers for disk boot)
>
> The defconfig one doesn't seem to contain enough to be useful,
> and the allmodconfig gives me 800MB+ to copy from the build
> system to the test one (I can't seem to see an obvious way of
> doing this either - apart from installing them to the 'wrong'
> place on the build system and then copying everything over).
The easiest thing to do is take a working .config file (like created by
your distro, for the kernel it is booting), and creating a smaller
config for just the hardware plugged into the machine at that moment in
time.
You usually do this automatically by running:
make localmodconfig
which knows how to root around in /proc/config.gz or /boot/ to try to
find the current config file and go from there.
But remember to go back and enable anything specific for any devices you
don't have plugged in at the moment.
Hope this helps,
greg k-h
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 0:07 ` Sarah Sharp
2013-10-17 14:30 ` David Laight
@ 2013-10-24 15:05 ` David Laight
2013-10-28 17:09 ` David Laight
2 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2013-10-24 15:05 UTC (permalink / raw)
To: Sarah Sharp; +Cc: netdev, linux-usb, Xenia Ragiadakou
> Have you tried the latest stable kernel or the latest -rc kernel?
I've built a kernel based on Linus's tree from last Friday, 3.12-rc6 ish.
Commented out the trace for short reads - happens all the time.
I've not seen an error on a Bo yet, the failure rate is depressingly low.
I have had an error that started with an interrupt completion (I think
from the root), that leads to a full unplug-replug sequence that
overfilled the kernel message buffer (rebuilt with a much bigger buffer).
I've just got a error -71 from a Bi. This generates:
(There are actually two separated by a Bo completion and setup.
[175908.563068] xhci_hcd 0000:00:14.0: Transfer error on endpoint
[175908.563070] xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
[175908.563072] xhci_hcd 0000:00:14.0: Finding segment containing stopped TRB.
[175908.563074] xhci_hcd 0000:00:14.0: Finding endpoint context
[175908.563076] xhci_hcd 0000:00:14.0: Finding segment containing last TRB in TD.
[175908.563078] xhci_hcd 0000:00:14.0: Cycle state = 0x1
[175908.563081] xhci_hcd 0000:00:14.0: New dequeue segment = ffff8800d6a817e0 (virtual)
[175908.563089] xhci_hcd 0000:00:14.0: New dequeue pointer = 0x2137a4b90 (DMA)
[175908.563096] xhci_hcd 0000:00:14.0: Queueing new dequeue state
[175908.563104] xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = ffff8800d6a817e0 (0x2137a4800 dma), new deq ptr = ffff8802137a4b90 (0x2137a4b90 dma), new cycle = 1
[175908.563110] xhci_hcd 0000:00:14.0: // Ding dong!
[175908.563119] xhci_hcd 0000:00:14.0: Giveback URB ffff8802114fd9c0, len = 0, expected = 18944, status = -71
[175908.563122] xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
[175908.563125] xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @2137a4b91
What is interesting is that the usbmon trace then shows only 2 Bi URB being used
(rather than 4). After 125ms (assuming the usbmon timestamps are us)
another 2 URB are added.
Since the URB are usually recycled (or at least freed and immediately realloced
so getting the same address) I wonder if this is a memory leak?
In any case waiting that long before adding the URB back doesn't seem right.
I don't think I've seen an error that only affected the Bi side before, so
don't know whether the recover behaviour has changed.
FWIW we have identified something 'sub-optimal' with the pcb layout that might
be responsible for noise on the USB data/clock source. However the xhci driver
should recover from such errors.
If the hw guys fix the pcb I'm going to need to keep a failing system!
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: transmit lockup using smsc95xx ethernet on usb3
2013-10-17 0:07 ` Sarah Sharp
2013-10-17 14:30 ` David Laight
2013-10-24 15:05 ` David Laight
@ 2013-10-28 17:09 ` David Laight
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B73BA-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
2 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2013-10-28 17:09 UTC (permalink / raw)
To: Sarah Sharp
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
Xenia Ragiadakou
> > We are seeing complete lockups of the transmit side when using
> > the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> > These errors are very intermittent - less than once a day, and
> > it isn't actually clear that they are related to traffic load.
...
> I would suggest you try with the latest stable kernel, with
> CONFIG_USB_DEBUG enabled.
I've finally got the same failure with a 3.12-0-rc5 kernel.
The only change I made was to comment out the 'short packet' trace.
The kernel trace contains:
(I've shortened the lines by removing the fixed part of the header.
I've got the full trace, and the matching usbmon trace, but they
are a bit large to post to the mailing lists!
All from xhci_hcd 0000:00:14.0
[38704.116936] Transfer error on endpoint
[38704.116941] Cleaning up stalled endpoint ring
[38704.116944] Finding segment containing stopped TRB.
[38704.116946] Finding endpoint context
[38704.116948] Finding segment containing last TRB in TD.
[38704.116950] Cycle state = 0x0
[38704.116952] New dequeue segment = ffff880214565660 (virtual)
[38704.116954] New dequeue pointer = 0x210cf4ad0 (DMA)
[38704.116956] Queueing new dequeue state
[38704.116959] Set TR Deq Ptr cmd, new deq seg = ffff880214565660 (0x210cf4800 dma), new deq ptr = ffff880210cf4ad0 (0x210cf4ad0 dma), new cycle = 0
[38704.116962] // Ding dong!
[38704.116966] Giveback URB ffff8800d6841d80, len = 0, expected = 18944, status = -71
[38704.116970] Transfer error on endpoint
[38704.116973] Cleaning up stalled endpoint ring
[38704.116974] Finding segment containing stopped TRB.
[38704.116976] Finding endpoint context
[38704.116978] Finding segment containing last TRB in TD.
[38704.116980] Cycle state = 0x1
[38704.116982] New dequeue segment = ffff880214565680 (virtual)
[38704.116984] New dequeue pointer = 0x210cf47e0 (DMA)
[38704.116986] Queueing new dequeue state
[38704.116989] Set TR Deq Ptr cmd, new deq seg = ffff880214565680 (0x210cf4400 dma), new deq ptr = ffff880210cf47e0 (0x210cf47e0 dma), new cycle = 1
[38704.116991] // Ding dong!
[38704.116995] Giveback URB ffff8802132169c0, len = 1024, expected = 1526, status = -71
[38704.116998] Ignoring reset ep completion code of 1
[38704.117001] Successful Set TR Deq Ptr cmd, deq = @210cf4ad0
[38704.117004] Ignoring reset ep completion code of 1
[38704.117006] Successful Set TR Deq Ptr cmd, deq = @210cf47e1
[38704.240067] Stalled endpoint
[38704.240075] Giveback URB ffff8800d4438900, len = 0, expected = 1526, status = -32
[38704.240112] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4030
[38704.240118] // Ding dong!
[38704.240124] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4040
[38704.240129] Removing canceled TD starting at 0x210cf4030 (dma).
[38704.240132] TRB to noop at offset 0x210cf4030
[38704.240134] Removing canceled TD starting at 0x210cf4040 (dma).
[38704.240136] TRB to noop at offset 0x210cf4040
[38704.240155] WARN halted endpoint, queueing URB anyway.
[38704.240163] WARN halted endpoint, queueing URB anyway.
[38704.240170] WARN halted endpoint, queueing URB anyway.
[38704.240177] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf4050
[38704.240179] // Ding dong!
[38704.240184] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf4060
[38704.240189] Removing canceled TD starting at 0x210cf4050 (dma).
[38704.240191] TRB to noop at offset 0x210cf4050
[38704.240193] Removing canceled TD starting at 0x210cf4060 (dma).
[38704.240195] TRB to noop at offset 0x210cf4060
[38704.240205] WARN halted endpoint, queueing URB anyway.
[38704.240210] WARN halted endpoint, queueing URB anyway.
[38704.240216] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4070
[38704.240219] // Ding dong!
[38704.240224] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4080
[38704.240228] Removing canceled TD starting at 0x210cf4070 (dma).
[38704.240230] TRB to noop at offset 0x210cf4070
[38704.240232] Removing canceled TD starting at 0x210cf4080 (dma).
[38704.240234] TRB to noop at offset 0x210cf4080
[38704.240244] WARN halted endpoint, queueing URB anyway.
[38704.240249] WARN halted endpoint, queueing URB anyway.
[38704.240255] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4090
[38704.240257] // Ding dong!
[38704.240262] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf40a0
[38704.240266] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf40b0
[38704.240271] Removing canceled TD starting at 0x210cf4090 (dma).
[38704.240273] TRB to noop at offset 0x210cf4090
[38704.240275] Removing canceled TD starting at 0x210cf40a0 (dma).
[38704.240277] TRB to noop at offset 0x210cf40a0
[38704.240279] Removing canceled TD starting at 0x210cf40b0 (dma).
[38704.240281] TRB to noop at offset 0x210cf40b0
[38704.240292] WARN halted endpoint, queueing URB anyway.
[38704.240296] WARN halted endpoint, queueing URB anyway.
[38704.240301] WARN halted endpoint, queueing URB anyway.
[38704.240307] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf40c0
[38704.240309] // Ding dong!
[38704.240314] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf40d0
[38704.240319] Removing canceled TD starting at 0x210cf40c0 (dma).
[38704.240321] TRB to noop at offset 0x210cf40c0
[38704.240323] Removing canceled TD starting at 0x210cf40d0 (dma).
[38704.240325] TRB to noop at offset 0x210cf40d0
[38704.240335] WARN halted endpoint, queueing URB anyway.
[38704.240340] WARN halted endpoint, queueing URB anyway.
[38704.240346] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf40e0
[38704.240348] // Ding dong!
[38704.240353] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf40f0
[38704.240357] Removing canceled TD starting at 0x210cf40e0 (dma).
[38704.240359] TRB to noop at offset 0x210cf40e0
[38704.240361] Removing canceled TD starting at 0x210cf40f0 (dma).
[38704.240363] TRB to noop at offset 0x210cf40f0
[38704.240373] WARN halted endpoint, queueing URB anyway.
[38704.240378] WARN halted endpoint, queueing URB anyway.
[38704.240384] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4100
[38704.240386] // Ding dong!
[38704.240391] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4110
[38704.240395] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf4120
[38704.240400] Removing canceled TD starting at 0x210cf4100 (dma).
[38704.240402] TRB to noop at offset 0x210cf4100
[38704.240404] Removing canceled TD starting at 0x210cf4110 (dma).
[38704.240406] TRB to noop at offset 0x210cf4110
[38704.240408] Removing canceled TD starting at 0x210cf4120 (dma).
[38704.240410] TRB to noop at offset 0x210cf4120
[38704.240420] WARN halted endpoint, queueing URB anyway.
[38704.240425] WARN halted endpoint, queueing URB anyway.
[38704.240430] WARN halted endpoint, queueing URB anyway.
[38704.240436] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf4130
[38704.240438] // Ding dong!
[38704.240443] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4140
[38704.240447] Removing canceled TD starting at 0x210cf4130 (dma).
[38704.240449] TRB to noop at offset 0x210cf4130
[38704.240451] Removing canceled TD starting at 0x210cf4140 (dma).
[38704.240453] TRB to noop at offset 0x210cf4140
[38704.240463] WARN halted endpoint, queueing URB anyway.
[38704.240468] WARN halted endpoint, queueing URB anyway.
[38704.240474] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4150
[38704.240476] // Ding dong!
[38704.240481] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf4160
[38704.240486] Removing canceled TD starting at 0x210cf4150 (dma).
[38704.240488] TRB to noop at offset 0x210cf4150
[38704.240490] Removing canceled TD starting at 0x210cf4160 (dma).
[38704.240492] TRB to noop at offset 0x210cf4160
[38704.240501] WARN halted endpoint, queueing URB anyway.
[38704.240506] WARN halted endpoint, queueing URB anyway.
[38704.240512] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf4170
[38704.240514] // Ding dong!
[38704.240519] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf4180
[38704.240524] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4190
[38704.240528] Removing canceled TD starting at 0x210cf4170 (dma).
[38704.240530] TRB to noop at offset 0x210cf4170
[38704.240532] Removing canceled TD starting at 0x210cf4180 (dma).
[38704.240535] TRB to noop at offset 0x210cf4180
[38704.240537] Removing canceled TD starting at 0x210cf4190 (dma).
[38704.240539] TRB to noop at offset 0x210cf4190
[38704.240549] WARN halted endpoint, queueing URB anyway.
[38704.240554] WARN halted endpoint, queueing URB anyway.
[38704.240559] WARN halted endpoint, queueing URB anyway.
[38704.240565] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf41a0
[38704.240567] // Ding dong!
[38704.240572] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf41b0
[38704.240577] Removing canceled TD starting at 0x210cf41a0 (dma).
[38704.240579] TRB to noop at offset 0x210cf41a0
[38704.240581] Removing canceled TD starting at 0x210cf41b0 (dma).
[38704.240583] TRB to noop at offset 0x210cf41b0
[38704.240592] WARN halted endpoint, queueing URB anyway.
[38704.240598] WARN halted endpoint, queueing URB anyway.
[38704.240603] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf41c0
[38704.240605] // Ding dong!
[38704.240610] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf41d0
[38704.240614] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf41e0
[38704.240618] Removing canceled TD starting at 0x210cf41c0 (dma).
[38704.240620] TRB to noop at offset 0x210cf41c0
[38704.240623] Removing canceled TD starting at 0x210cf41d0 (dma).
[38704.240625] TRB to noop at offset 0x210cf41d0
[38704.240627] Removing canceled TD starting at 0x210cf41e0 (dma).
[38704.240629] TRB to noop at offset 0x210cf41e0
[38704.240639] WARN halted endpoint, queueing URB anyway.
[38704.240644] WARN halted endpoint, queueing URB anyway.
[38704.240648] WARN halted endpoint, queueing URB anyway.
[38704.240654] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf41f0
[38704.240656] // Ding dong!
[38704.240661] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4200
[38704.240665] Removing canceled TD starting at 0x210cf41f0 (dma).
[38704.240667] TRB to noop at offset 0x210cf41f0
[38704.240669] Removing canceled TD starting at 0x210cf4200 (dma).
[38704.240671] TRB to noop at offset 0x210cf4200
[38704.240681] WARN halted endpoint, queueing URB anyway.
[38704.240687] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf4210
[38704.240689] // Ding dong!
[38704.240694] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf4220
[38704.240699] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf4230
[38704.240703] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4240
[38704.240707] Removing canceled TD starting at 0x210cf4210 (dma).
[38704.240709] TRB to noop at offset 0x210cf4210
[38704.240711] Removing canceled TD starting at 0x210cf4220 (dma).
[38704.240713] TRB to noop at offset 0x210cf4220
[38704.240716] Removing canceled TD starting at 0x210cf4230 (dma).
[38704.240718] TRB to noop at offset 0x210cf4230
[38704.240720] Removing canceled TD starting at 0x210cf4240 (dma).
[38704.240722] TRB to noop at offset 0x210cf4240
[38704.240733] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4250
[38704.240735] // Ding dong!
[38704.240740] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf4260
[38704.240745] Removing canceled TD starting at 0x210cf4250 (dma).
[38704.240747] TRB to noop at offset 0x210cf4250
[38704.240749] Removing canceled TD starting at 0x210cf4260 (dma).
[38704.240751] TRB to noop at offset 0x210cf4260
[38704.240764] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf4270
[38704.240766] // Ding dong!
[38704.240771] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4280
[38704.240776] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4290
[38704.240787] Removing canceled TD starting at 0x210cf4270 (dma).
[38704.240789] TRB to noop at offset 0x210cf4270
[38704.240791] Removing canceled TD starting at 0x210cf4280 (dma).
[38704.240793] TRB to noop at offset 0x210cf4280
[38704.240796] Removing canceled TD starting at 0x210cf4290 (dma).
[38704.240798] TRB to noop at offset 0x210cf4290
[38704.240812] Queueing reset endpoint command
[38704.240814] Cleaning up stalled endpoint ring
[38704.240816] Finding segment containing stopped TRB.
[38704.240818] Finding endpoint context
[38704.240820] Finding segment containing last TRB in TD.
[38704.240822] Cycle state = 0x0
[38704.240825] New dequeue segment = ffff8802145657a0 (virtual)
[38704.240827] New dequeue pointer = 0x210cf4030 (DMA)
[38704.240829] Queueing new dequeue state
[38704.240832] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf4030 (0x210cf4030 dma), new cycle = 0
[38704.240834] // Ding dong!
[38704.240843] Ignoring reset ep completion code of 1
[38704.240846] Successful Set TR Deq Ptr cmd, deq = @210cf4030
[38704.332072] Stalled endpoint
[38704.332080] Giveback URB ffff8800d4438480, len = 0, expected = 1526, status = -32
[38704.332135] Queueing reset endpoint command
[38704.332138] Cleaning up stalled endpoint ring
[38704.332140] Finding segment containing stopped TRB.
[38704.332142] Finding endpoint context
[38704.332144] Finding segment containing last TRB in TD.
[38704.332146] Cycle state = 0x0
[38704.332149] New dequeue segment = ffff8802145657a0 (virtual)
[38704.332151] New dequeue pointer = 0x210cf42b0 (DMA)
[38704.332153] Queueing new dequeue state
[38704.332156] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf42b0 (0x210cf42b0 dma), new cycle = 0
[38704.332158] // Ding dong!
[38704.332170] Ignoring reset ep completion code of 1
[38704.332175] Successful Set TR Deq Ptr cmd, deq = @210cf42b0
>From then on this gets traced every few seconds - probably whenever
dhclient tries to send a packet.
[38704.747916] Stalled endpoint
[38704.747924] Giveback URB ffff8800d4438480, len = 0, expected = 1526, status = -32
[38704.747972] Queueing reset endpoint command
[38704.747982] Cleaning up stalled endpoint ring
[38704.747984] Finding segment containing stopped TRB.
[38704.747986] Finding endpoint context
[38704.747988] Finding segment containing last TRB in TD.
[38704.747990] Cycle state = 0x0
[38704.747992] New dequeue segment = ffff8802145657a0 (virtual)
[38704.747994] New dequeue pointer = 0x210cf42c0 (DMA)
[38704.747996] Queueing new dequeue state
[38704.747999] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf42c0 (0x210cf42c0 dma), new cycle = 0
[38704.748002] // Ding dong!
[38704.748010] Ignoring reset ep completion code of 1
[38704.748018] Successful Set TR Deq Ptr cmd, deq = @210cf42c0
The only way to recover is to unplug and replug.
David
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: transmit lockup using smsc95xx ethernet on usb3
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B73BA-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
@ 2013-11-08 0:45 ` Sarah Sharp
0 siblings, 0 replies; 10+ messages in thread
From: Sarah Sharp @ 2013-11-08 0:45 UTC (permalink / raw)
To: David Laight
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-usb-u79uwXL29TY76Z2rM5mHXA,
Xenia Ragiadakou
On Mon, Oct 28, 2013 at 05:09:25PM -0000, David Laight wrote:
> > > We are seeing complete lockups of the transmit side when using
> > > the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> > > These errors are very intermittent - less than once a day, and
> > > it isn't actually clear that they are related to traffic load.
> ...
> > I would suggest you try with the latest stable kernel, with
> > CONFIG_USB_DEBUG enabled.
>
> I've finally got the same failure with a 3.12-0-rc5 kernel.
> The only change I made was to comment out the 'short packet' trace.
>
> The kernel trace contains:
> (I've shortened the lines by removing the fixed part of the header.
> I've got the full trace, and the matching usbmon trace, but they
> are a bit large to post to the mailing lists!
The log was good enough, analysis below:
> All from xhci_hcd 0000:00:14.0
> [38704.116936] Transfer error on endpoint
> [38704.116941] Cleaning up stalled endpoint ring
> [38704.116944] Finding segment containing stopped TRB.
> [38704.116946] Finding endpoint context
> [38704.116948] Finding segment containing last TRB in TD.
> [38704.116950] Cycle state = 0x0
> [38704.116952] New dequeue segment = ffff880214565660 (virtual)
> [38704.116954] New dequeue pointer = 0x210cf4ad0 (DMA)
> [38704.116956] Queueing new dequeue state
> [38704.116959] Set TR Deq Ptr cmd, new deq seg = ffff880214565660 (0x210cf4800 dma), new deq ptr = ffff880210cf4ad0 (0x210cf4ad0 dma), new cycle = 0
> [38704.116962] // Ding dong!
> [38704.116966] Giveback URB ffff8800d6841d80, len = 0, expected = 18944, status = -71
> [38704.116970] Transfer error on endpoint
> [38704.116973] Cleaning up stalled endpoint ring
> [38704.116974] Finding segment containing stopped TRB.
> [38704.116976] Finding endpoint context
> [38704.116978] Finding segment containing last TRB in TD.
> [38704.116980] Cycle state = 0x1
> [38704.116982] New dequeue segment = ffff880214565680 (virtual)
> [38704.116984] New dequeue pointer = 0x210cf47e0 (DMA)
> [38704.116986] Queueing new dequeue state
> [38704.116989] Set TR Deq Ptr cmd, new deq seg = ffff880214565680 (0x210cf4400 dma), new deq ptr = ffff880210cf47e0 (0x210cf47e0 dma), new cycle = 1
> [38704.116991] // Ding dong!
> [38704.116995] Giveback URB ffff8802132169c0, len = 1024, expected = 1526, status = -71
> [38704.116998] Ignoring reset ep completion code of 1
> [38704.117001] Successful Set TR Deq Ptr cmd, deq = @210cf4ad0
> [38704.117004] Ignoring reset ep completion code of 1
> [38704.117006] Successful Set TR Deq Ptr cmd, deq = @210cf47e1
> [38704.240067] Stalled endpoint
> [38704.240075] Giveback URB ffff8800d4438900, len = 0, expected = 1526, status = -32
So it looks like the endpoint stalls, and then the USB device driver
never clears the halted condition. If the endpoint was anything other
than a control endpoint, that's a driver bug.
> [38704.240112] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4030
> [38704.240118] // Ding dong!
> [38704.240124] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4040
> [38704.240129] Removing canceled TD starting at 0x210cf4030 (dma).
> [38704.240132] TRB to noop at offset 0x210cf4030
> [38704.240134] Removing canceled TD starting at 0x210cf4040 (dma).
> [38704.240136] TRB to noop at offset 0x210cf4040
If these transfers were queued for the same endpoint that halted, the
endpoint is still stopped, and it would make sense that they would need
to be canceled. No transfers will occur until the halt condition is
cleared through the Reset Endpoint command, and the doorbell ring is
rung for the endpoint after the Set TR Dequeue Pointer command.
> [38704.240155] WARN halted endpoint, queueing URB anyway.
> [38704.240163] WARN halted endpoint, queueing URB anyway.
> [38704.240170] WARN halted endpoint, queueing URB anyway.
These warnings are consistent with the USB device driver not clearing
the stall.
> [38704.240177] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf4050
> [38704.240179] // Ding dong!
> [38704.240184] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf4060
> [38704.240189] Removing canceled TD starting at 0x210cf4050 (dma).
> [38704.240191] TRB to noop at offset 0x210cf4050
> [38704.240193] Removing canceled TD starting at 0x210cf4060 (dma).
> [38704.240195] TRB to noop at offset 0x210cf4060
> [38704.240205] WARN halted endpoint, queueing URB anyway.
> [38704.240210] WARN halted endpoint, queueing URB anyway.
> [38704.240216] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4070
> [38704.240219] // Ding dong!
> [38704.240224] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4080
> [38704.240228] Removing canceled TD starting at 0x210cf4070 (dma).
> [38704.240230] TRB to noop at offset 0x210cf4070
> [38704.240232] Removing canceled TD starting at 0x210cf4080 (dma).
> [38704.240234] TRB to noop at offset 0x210cf4080
> [38704.240244] WARN halted endpoint, queueing URB anyway.
> [38704.240249] WARN halted endpoint, queueing URB anyway.
...and the cycle continues endlessly, as the USB device driver tries to
submit URBs to a stalled endpoint.
> [38704.240255] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4090
> [38704.240257] // Ding dong!
> [38704.240262] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf40a0
> [38704.240266] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf40b0
> [38704.240271] Removing canceled TD starting at 0x210cf4090 (dma).
> [38704.240273] TRB to noop at offset 0x210cf4090
> [38704.240275] Removing canceled TD starting at 0x210cf40a0 (dma).
> [38704.240277] TRB to noop at offset 0x210cf40a0
> [38704.240279] Removing canceled TD starting at 0x210cf40b0 (dma).
> [38704.240281] TRB to noop at offset 0x210cf40b0
> [38704.240292] WARN halted endpoint, queueing URB anyway.
> [38704.240296] WARN halted endpoint, queueing URB anyway.
> [38704.240301] WARN halted endpoint, queueing URB anyway.
> [38704.240307] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf40c0
> [38704.240309] // Ding dong!
> [38704.240314] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf40d0
> [38704.240319] Removing canceled TD starting at 0x210cf40c0 (dma).
> [38704.240321] TRB to noop at offset 0x210cf40c0
> [38704.240323] Removing canceled TD starting at 0x210cf40d0 (dma).
> [38704.240325] TRB to noop at offset 0x210cf40d0
> [38704.240335] WARN halted endpoint, queueing URB anyway.
> [38704.240340] WARN halted endpoint, queueing URB anyway.
> [38704.240346] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf40e0
> [38704.240348] // Ding dong!
> [38704.240353] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf40f0
> [38704.240357] Removing canceled TD starting at 0x210cf40e0 (dma).
> [38704.240359] TRB to noop at offset 0x210cf40e0
> [38704.240361] Removing canceled TD starting at 0x210cf40f0 (dma).
> [38704.240363] TRB to noop at offset 0x210cf40f0
> [38704.240373] WARN halted endpoint, queueing URB anyway.
> [38704.240378] WARN halted endpoint, queueing URB anyway.
> [38704.240384] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4100
> [38704.240386] // Ding dong!
> [38704.240391] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4110
> [38704.240395] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf4120
> [38704.240400] Removing canceled TD starting at 0x210cf4100 (dma).
> [38704.240402] TRB to noop at offset 0x210cf4100
> [38704.240404] Removing canceled TD starting at 0x210cf4110 (dma).
> [38704.240406] TRB to noop at offset 0x210cf4110
> [38704.240408] Removing canceled TD starting at 0x210cf4120 (dma).
> [38704.240410] TRB to noop at offset 0x210cf4120
> [38704.240420] WARN halted endpoint, queueing URB anyway.
> [38704.240425] WARN halted endpoint, queueing URB anyway.
> [38704.240430] WARN halted endpoint, queueing URB anyway.
> [38704.240436] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf4130
> [38704.240438] // Ding dong!
> [38704.240443] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4140
> [38704.240447] Removing canceled TD starting at 0x210cf4130 (dma).
> [38704.240449] TRB to noop at offset 0x210cf4130
> [38704.240451] Removing canceled TD starting at 0x210cf4140 (dma).
> [38704.240453] TRB to noop at offset 0x210cf4140
> [38704.240463] WARN halted endpoint, queueing URB anyway.
> [38704.240468] WARN halted endpoint, queueing URB anyway.
> [38704.240474] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4150
> [38704.240476] // Ding dong!
> [38704.240481] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf4160
> [38704.240486] Removing canceled TD starting at 0x210cf4150 (dma).
> [38704.240488] TRB to noop at offset 0x210cf4150
> [38704.240490] Removing canceled TD starting at 0x210cf4160 (dma).
> [38704.240492] TRB to noop at offset 0x210cf4160
> [38704.240501] WARN halted endpoint, queueing URB anyway.
> [38704.240506] WARN halted endpoint, queueing URB anyway.
> [38704.240512] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf4170
> [38704.240514] // Ding dong!
> [38704.240519] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf4180
> [38704.240524] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4190
> [38704.240528] Removing canceled TD starting at 0x210cf4170 (dma).
> [38704.240530] TRB to noop at offset 0x210cf4170
> [38704.240532] Removing canceled TD starting at 0x210cf4180 (dma).
> [38704.240535] TRB to noop at offset 0x210cf4180
> [38704.240537] Removing canceled TD starting at 0x210cf4190 (dma).
> [38704.240539] TRB to noop at offset 0x210cf4190
> [38704.240549] WARN halted endpoint, queueing URB anyway.
> [38704.240554] WARN halted endpoint, queueing URB anyway.
> [38704.240559] WARN halted endpoint, queueing URB anyway.
> [38704.240565] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf41a0
> [38704.240567] // Ding dong!
> [38704.240572] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf41b0
> [38704.240577] Removing canceled TD starting at 0x210cf41a0 (dma).
> [38704.240579] TRB to noop at offset 0x210cf41a0
> [38704.240581] Removing canceled TD starting at 0x210cf41b0 (dma).
> [38704.240583] TRB to noop at offset 0x210cf41b0
> [38704.240592] WARN halted endpoint, queueing URB anyway.
> [38704.240598] WARN halted endpoint, queueing URB anyway.
> [38704.240603] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf41c0
> [38704.240605] // Ding dong!
> [38704.240610] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf41d0
> [38704.240614] Cancel URB ffff8800d44389c0, dev 1.1, ep 0x2, starting at offset 0x210cf41e0
> [38704.240618] Removing canceled TD starting at 0x210cf41c0 (dma).
> [38704.240620] TRB to noop at offset 0x210cf41c0
> [38704.240623] Removing canceled TD starting at 0x210cf41d0 (dma).
> [38704.240625] TRB to noop at offset 0x210cf41d0
> [38704.240627] Removing canceled TD starting at 0x210cf41e0 (dma).
> [38704.240629] TRB to noop at offset 0x210cf41e0
> [38704.240639] WARN halted endpoint, queueing URB anyway.
> [38704.240644] WARN halted endpoint, queueing URB anyway.
> [38704.240648] WARN halted endpoint, queueing URB anyway.
> [38704.240654] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf41f0
> [38704.240656] // Ding dong!
> [38704.240661] Cancel URB ffff8800d4438780, dev 1.1, ep 0x2, starting at offset 0x210cf4200
> [38704.240665] Removing canceled TD starting at 0x210cf41f0 (dma).
> [38704.240667] TRB to noop at offset 0x210cf41f0
> [38704.240669] Removing canceled TD starting at 0x210cf4200 (dma).
> [38704.240671] TRB to noop at offset 0x210cf4200
> [38704.240681] WARN halted endpoint, queueing URB anyway.
> [38704.240687] Cancel URB ffff8800d4438f00, dev 1.1, ep 0x2, starting at offset 0x210cf4210
> [38704.240689] // Ding dong!
> [38704.240694] Cancel URB ffff8800d44386c0, dev 1.1, ep 0x2, starting at offset 0x210cf4220
> [38704.240699] Cancel URB ffff8800d4438900, dev 1.1, ep 0x2, starting at offset 0x210cf4230
> [38704.240703] Cancel URB ffff8800d4438c00, dev 1.1, ep 0x2, starting at offset 0x210cf4240
> [38704.240707] Removing canceled TD starting at 0x210cf4210 (dma).
> [38704.240709] TRB to noop at offset 0x210cf4210
> [38704.240711] Removing canceled TD starting at 0x210cf4220 (dma).
> [38704.240713] TRB to noop at offset 0x210cf4220
> [38704.240716] Removing canceled TD starting at 0x210cf4230 (dma).
> [38704.240718] TRB to noop at offset 0x210cf4230
> [38704.240720] Removing canceled TD starting at 0x210cf4240 (dma).
> [38704.240722] TRB to noop at offset 0x210cf4240
> [38704.240733] Cancel URB ffff8800d4438b40, dev 1.1, ep 0x2, starting at offset 0x210cf4250
> [38704.240735] // Ding dong!
> [38704.240740] Cancel URB ffff8800d4438480, dev 1.1, ep 0x2, starting at offset 0x210cf4260
> [38704.240745] Removing canceled TD starting at 0x210cf4250 (dma).
> [38704.240747] TRB to noop at offset 0x210cf4250
> [38704.240749] Removing canceled TD starting at 0x210cf4260 (dma).
> [38704.240751] TRB to noop at offset 0x210cf4260
> [38704.240764] Cancel URB ffff8800d4438840, dev 1.1, ep 0x2, starting at offset 0x210cf4270
> [38704.240766] // Ding dong!
> [38704.240771] Cancel URB ffff8800d4438cc0, dev 1.1, ep 0x2, starting at offset 0x210cf4280
> [38704.240776] Cancel URB ffff8800d4438a80, dev 1.1, ep 0x2, starting at offset 0x210cf4290
> [38704.240787] Removing canceled TD starting at 0x210cf4270 (dma).
> [38704.240789] TRB to noop at offset 0x210cf4270
> [38704.240791] Removing canceled TD starting at 0x210cf4280 (dma).
> [38704.240793] TRB to noop at offset 0x210cf4280
> [38704.240796] Removing canceled TD starting at 0x210cf4290 (dma).
> [38704.240798] TRB to noop at offset 0x210cf4290
Finally, something changes, and the driver clears the halted condition:
> [38704.240812] Queueing reset endpoint command
> [38704.240814] Cleaning up stalled endpoint ring
> [38704.240816] Finding segment containing stopped TRB.
> [38704.240818] Finding endpoint context
> [38704.240820] Finding segment containing last TRB in TD.
> [38704.240822] Cycle state = 0x0
> [38704.240825] New dequeue segment = ffff8802145657a0 (virtual)
> [38704.240827] New dequeue pointer = 0x210cf4030 (DMA)
> [38704.240829] Queueing new dequeue state
> [38704.240832] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf4030 (0x210cf4030 dma), new cycle = 0
> [38704.240834] // Ding dong!
> [38704.240843] Ignoring reset ep completion code of 1
> [38704.240846] Successful Set TR Deq Ptr cmd, deq = @210cf4030
The next time it finds a stalled endpoint, it clears the halted
condition right away:
> [38704.332072] Stalled endpoint
> [38704.332080] Giveback URB ffff8800d4438480, len = 0, expected = 1526, status = -32
> [38704.332135] Queueing reset endpoint command
> [38704.332138] Cleaning up stalled endpoint ring
> [38704.332140] Finding segment containing stopped TRB.
> [38704.332142] Finding endpoint context
> [38704.332144] Finding segment containing last TRB in TD.
> [38704.332146] Cycle state = 0x0
> [38704.332149] New dequeue segment = ffff8802145657a0 (virtual)
> [38704.332151] New dequeue pointer = 0x210cf42b0 (DMA)
> [38704.332153] Queueing new dequeue state
> [38704.332156] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf42b0 (0x210cf42b0 dma), new cycle = 0
> [38704.332158] // Ding dong!
> [38704.332170] Ignoring reset ep completion code of 1
> [38704.332175] Successful Set TR Deq Ptr cmd, deq = @210cf42b0
You should take a close look at the code that clears the halted
conditions when an URB is returned with -EPIPE. I suspect there's some
sort of race condition that means URBs are submitted to endpoints that
are stalled, and the stall never gets cleared.
> From then on this gets traced every few seconds - probably whenever
> dhclient tries to send a packet.
>
> [38704.747916] Stalled endpoint
> [38704.747924] Giveback URB ffff8800d4438480, len = 0, expected = 1526, status = -32
> [38704.747972] Queueing reset endpoint command
> [38704.747982] Cleaning up stalled endpoint ring
> [38704.747984] Finding segment containing stopped TRB.
> [38704.747986] Finding endpoint context
> [38704.747988] Finding segment containing last TRB in TD.
> [38704.747990] Cycle state = 0x0
> [38704.747992] New dequeue segment = ffff8802145657a0 (virtual)
> [38704.747994] New dequeue pointer = 0x210cf42c0 (DMA)
> [38704.747996] Queueing new dequeue state
> [38704.747999] Set TR Deq Ptr cmd, new deq seg = ffff8802145657a0 (0x210cf4000 dma), new deq ptr = ffff880210cf42c0 (0x210cf42c0 dma), new cycle = 0
> [38704.748002] // Ding dong!
> [38704.748010] Ignoring reset ep completion code of 1
> [38704.748018] Successful Set TR Deq Ptr cmd, deq = @210cf42c0
>
> The only way to recover is to unplug and replug.
Are you saying that after the driver doesn't clear the stall on the
endpoint, the device always responds with a stalled packet?
Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2013-11-08 0:45 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-15 9:59 transmit lockup using smsc95xx ethernet on usb3 David Laight
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B738B-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
2013-10-17 0:07 ` Sarah Sharp
2013-10-17 14:30 ` David Laight
2013-10-17 17:43 ` Sarah Sharp
2013-10-17 18:07 ` Alan Stern
2013-10-18 15:22 ` David Laight
2013-10-18 15:50 ` Greg KH
2013-10-24 15:05 ` David Laight
2013-10-28 17:09 ` David Laight
[not found] ` <AE90C24D6B3A694183C094C60CF0A2F6026B73BA-CgBM+Bx2aUAnGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
2013-11-08 0:45 ` Sarah Sharp
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).