Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"

linux-tegra.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
@ 2013-03-18 20:06 Stephen Warren
       [not found] ` <514773BD.6010803-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Warren @ 2013-03-18 20:06 UTC (permalink / raw)
  To: Alan Stern, linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	USB list
  Cc: Greg Kroah-Hartman

Alan,

In v3.9-rc3, I find that "reboot" and "shutdown" hang on Tegra. I
bisected it to commit 6402c79 "USB: EHCI: work around silicon bug in
Intel's EHCI controllers", and confirmed that running v3.9-rc3 with just
that one patch reverted does solve the problem.

Do you have any idea what the problem might be, or is there anything I
can do to help debug this? Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <514773BD.6010803-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found] ` <514773BD.6010803-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
@ 2013-03-19 18:34   ` Alan Stern
       [not found]     ` <Pine.LNX.4.44L0.1303191420121.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Stern @ 2013-03-19 18:34 UTC (permalink / raw)
  To: Stephen Warren
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On Mon, 18 Mar 2013, Stephen Warren wrote:

> Alan,
> 
> In v3.9-rc3, I find that "reboot" and "shutdown" hang on Tegra. I
> bisected it to commit 6402c79 "USB: EHCI: work around silicon bug in
> Intel's EHCI controllers", and confirmed that running v3.9-rc3 with just
> that one patch reverted does solve the problem.
> 
> Do you have any idea what the problem might be, or is there anything I
> can do to help debug this? Thanks.

Is there any way to find out where the hangup occurs?  For example, can
you get anything from Alt-SysRq-W?

Failing that, can you put some printk statements in
drivers/usb/host/ehci-hcd.c, in the ehci_halt(),
ehci_silence_controller(), and ehci_shutdown() routines?  (Although I 
suspect this may not help because the hangup occurs somewhere else...)

Alan Stern

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <Pine.LNX.4.44L0.1303191420121.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]     ` <Pine.LNX.4.44L0.1303191420121.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2013-03-19 18:49       ` Stephen Warren
       [not found]         ` <5148B338.6070001-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Warren @ 2013-03-19 18:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On 03/19/2013 12:34 PM, Alan Stern wrote:
> On Mon, 18 Mar 2013, Stephen Warren wrote:
> 
>> Alan,
>>
>> In v3.9-rc3, I find that "reboot" and "shutdown" hang on Tegra. I
>> bisected it to commit 6402c79 "USB: EHCI: work around silicon bug in
>> Intel's EHCI controllers", and confirmed that running v3.9-rc3 with just
>> that one patch reverted does solve the problem.
>>
>> Do you have any idea what the problem might be, or is there anything I
>> can do to help debug this? Thanks.
> 
> Is there any way to find out where the hangup occurs?  For example, can
> you get anything from Alt-SysRq-W?
> 
> Failing that, can you put some printk statements in
> drivers/usb/host/ehci-hcd.c, in the ehci_halt(),
> ehci_silence_controller(), and ehci_shutdown() routines?  (Although I 
> suspect this may not help because the hangup occurs somewhere else...)

Yes, sysrq seems to work over the serial console:-)

>  * Will now restart
> 
> *** break sent ***
> [  180.765213] SysRq : Show Blocked State
> [  180.768963]   task                PC stack   pid father
> [  180.774182] khubd           D c0517678     0    21      2 0x00000000
> [  180.780559] [<c0517678>] (__schedule+0x348/0x6dc) from [<c02ebc90>] (usb_kill_urb+0x88/0xd4)
> [  180.788984] [<c02ebc90>] (usb_kill_urb+0x88/0xd4) from [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c)
> [  180.798012] [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c) from [<c02ecd98>] (usb_control_msg+0x98/0xcc)
> [  180.807301] [<c02ecd98>] (usb_control_msg+0x98/0xcc) from [<c02e4394>] (hub_port_init+0x5e0/0x9d0)
> [  180.816242] [<c02e4394>] (hub_port_init+0x5e0/0x9d0) from [<c02e6640>] (hub_port_connect_change+0x1f4/0x970)
> [  180.826050] [<c02e6640>] (hub_port_connect_change+0x1f4/0x970) from [<c02e70fc>] (hub_events+0x340/0x8c4)
> [  180.835598] [<c02e70fc>] (hub_events+0x340/0x8c4) from [<c02e76a8>] (hub_thread+0x28/0x1b8)
> [  180.843941] [<c02e76a8>] (hub_thread+0x28/0x1b8) from [<c0041268>] (kthread+0xa4/0xb0)
> [  180.851851] [<c0041268>] (kthread+0xa4/0xb0) from [<c000ed98>] (ret_from_fork+0x14/0x3c)
> [  180.859929] udevd           D c0517678     0   167      1 0x00000001
> [  180.866285] [<c0517678>] (__schedule+0x348/0x6dc) from [<c0517df0>] (schedule_preempt_disabled+0x24/0x34)
> [  180.875834] [<c0517df0>] (schedule_preempt_disabled+0x24/0x34) from [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354)
> [  180.886336] [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354) from [<c05169fc>] (mutex_lock+0xc/0x24)
> [  180.895455] [<c05169fc>] (mutex_lock+0xc/0x24) from [<c02f1fd0>] (show_manufacturer+0x18/0x40)
> [  180.904059] [<c02f1fd0>] (show_manufacturer+0x18/0x40) from [<c027fe18>] (dev_attr_show+0x1c/0x48)
> [  180.913007] [<c027fe18>] (dev_attr_show+0x1c/0x48) from [<c01191c0>] (sysfs_read_file+0x90/0x16c)
> [  180.921870] [<c01191c0>] (sysfs_read_file+0x90/0x16c) from [<c00c8418>] (vfs_read+0x98/0x13c)
> [  180.930379] [<c00c8418>] (vfs_read+0x98/0x13c) from [<c00c84f8>] (SyS_read+0x3c/0x70)
> [  180.938196] [<c00c84f8>] (SyS_read+0x3c/0x70) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)
> [  180.946527] reboot          D c0517678     0   877    876 0x00000000
> [  180.952883] [<c0517678>] (__schedule+0x348/0x6dc) from [<c0517df0>] (schedule_preempt_disabled+0x24/0x34)
> [  180.962432] [<c0517df0>] (schedule_preempt_disabled+0x24/0x34) from [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354)
> [  180.972934] [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354) from [<c05169fc>] (mutex_lock+0xc/0x24)
> [  180.982050] [<c05169fc>] (mutex_lock+0xc/0x24) from [<c02813c4>] (device_shutdown+0xc8/0x188)
> [  180.990566] [<c02813c4>] (device_shutdown+0xc8/0x188) from [<c00364b4>] (kernel_restart_prepare+0x34/0x3c)
> [  181.000202] [<c00364b4>] (kernel_restart_prepare+0x34/0x3c) from [<c00364c8>] (kernel_restart+0xc/0x4c)
> [  181.009578] [<c00364c8>] (kernel_restart+0xc/0x4c) from [<c00366bc>] (SyS_reboot+0x1ac/0x1d8)
> [  181.018086] [<c00366bc>] (SyS_reboot+0x1ac/0x1d8) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)

I assume you only want the back-traces, not the reset of the sysrq spew?

BTW, I have also occasionally noticed some either hung task or RCU spew
related to usb_kill_urb at other times (i.e. when not rebooting), IIRC
(which I may not; I may be remembering what happens if I just leave the
reboot hung for a few minutes, as shown below):

> [  240.670335] INFO: task khubd:21 blocked for more than 120 seconds.
> [  240.676516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  240.684336] khubd           D c0517678     0    21      2 0x00000000
> [  240.690706] [<c0517678>] (__schedule+0x348/0x6dc) from [<c02ebc90>] (usb_kill_urb+0x88/0xd4)
> [  240.699140] [<c02ebc90>] (usb_kill_urb+0x88/0xd4) from [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c)
> [  240.708181] [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c) from [<c02ecd98>] (usb_control_msg+0x98/0xcc)
> [  240.717481] [<c02ecd98>] (usb_control_msg+0x98/0xcc) from [<c02e4394>] (hub_port_init+0x5e0/0x9d0)
> [  240.726433] [<c02e4394>] (hub_port_init+0x5e0/0x9d0) from [<c02e6640>] (hub_port_connect_change+0x1f4/0x970)
> [  240.736251] [<c02e6640>] (hub_port_connect_change+0x1f4/0x970) from [<c02e70fc>] (hub_events+0x340/0x8c4)
> [  240.745809] [<c02e70fc>] (hub_events+0x340/0x8c4) from [<c02e76a8>] (hub_thread+0x28/0x1b8)
> [  240.754160] [<c02e76a8>] (hub_thread+0x28/0x1b8) from [<c0041268>] (kthread+0xa4/0xb0)
> [  240.762067] [<c0041268>] (kthread+0xa4/0xb0) from [<c000ed98>] (ret_from_fork+0x14/0x3c)
> [  240.770153] INFO: task udevd:167 blocked for more than 120 seconds.
> [  240.776410] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  240.784228] udevd           D c0517678     0   167      1 0x00000001
> [  240.790586] [<c0517678>] (__schedule+0x348/0x6dc) from [<c0517df0>] (schedule_preempt_disabled+0x24/0x34)
> [  240.800145] [<c0517df0>] (schedule_preempt_disabled+0x24/0x34) from [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354)
> [  240.810659] [<c05167f8>] (__mutex_lock_slowpath+0x15c/0x354) from [<c05169fc>] (mutex_lock+0xc/0x24)
> [  240.819787] [<c05169fc>] (mutex_lock+0xc/0x24) from [<c02f1fd0>] (show_manufacturer+0x18/0x40)
> [  240.828397] [<c02f1fd0>] (show_manufacturer+0x18/0x40) from [<c027fe18>] (dev_attr_show+0x1c/0x48)
> [  240.837353] [<c027fe18>] (dev_attr_show+0x1c/0x48) from [<c01191c0>] (sysfs_read_file+0x90/0x16c)
> [  240.846227] [<c01191c0>] (sysfs_read_file+0x90/0x16c) from [<c00c8418>] (vfs_read+0x98/0x13c)
> [  240.854747] [<c00c8418>] (vfs_read+0x98/0x13c) from [<c00c84f8>] (SyS_read+0x3c/0x70)
> [  240.862565] [<c00c84f8>] (SyS_read+0x3c/0x70) from [<c000ed00>] (ret_fast_syscall+0x0/0x30)

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <5148B338.6070001-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]         ` <5148B338.6070001-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
@ 2013-03-19 20:07           ` Alan Stern
       [not found]             ` <Pine.LNX.4.44L0.1303191604530.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Stern @ 2013-03-19 20:07 UTC (permalink / raw)
  To: Stephen Warren
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On Tue, 19 Mar 2013, Stephen Warren wrote:

> Yes, sysrq seems to work over the serial console:-)
> 
> >  * Will now restart
> > 
> > *** break sent ***
> > [  180.765213] SysRq : Show Blocked State
> > [  180.768963]   task                PC stack   pid father
> > [  180.774182] khubd           D c0517678     0    21      2 0x00000000
> > [  180.780559] [<c0517678>] (__schedule+0x348/0x6dc) from [<c02ebc90>] (usb_kill_urb+0x88/0xd4)
> > [  180.788984] [<c02ebc90>] (usb_kill_urb+0x88/0xd4) from [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c)
> > [  180.798012] [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c) from [<c02ecd98>] (usb_control_msg+0x98/0xcc)
> > [  180.807301] [<c02ecd98>] (usb_control_msg+0x98/0xcc) from [<c02e4394>] (hub_port_init+0x5e0/0x9d0)
> > [  180.816242] [<c02e4394>] (hub_port_init+0x5e0/0x9d0) from [<c02e6640>] (hub_port_connect_change+0x1f4/0x970)
> > [  180.826050] [<c02e6640>] (hub_port_connect_change+0x1f4/0x970) from [<c02e70fc>] (hub_events+0x340/0x8c4)
> > [  180.835598] [<c02e70fc>] (hub_events+0x340/0x8c4) from [<c02e76a8>] (hub_thread+0x28/0x1b8)
> > [  180.843941] [<c02e76a8>] (hub_thread+0x28/0x1b8) from [<c0041268>] (kthread+0xa4/0xb0)

> I assume you only want the back-traces, not the reset of the sysrq spew?

Yes.  It looks like khubd is your problem.

> BTW, I have also occasionally noticed some either hung task or RCU spew
> related to usb_kill_urb at other times (i.e. when not rebooting), IIRC
> (which I may not; I may be remembering what happens if I just leave the
> reboot hung for a few minutes, as shown below):
> 
> > [  240.670335] INFO: task khubd:21 blocked for more than 120 seconds.
> > [  240.676516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  240.684336] khubd           D c0517678     0    21      2 0x00000000

Right.  It's another symptom of the same thing.  This problem occurs 
well before shutdown -- something is causing khubd to hang and that 
messes up everything else.

A dmesg log with CONFIG_USB_DEBUG enabled would be helpful.  We ought
to be able to tell where khubd is getting stuck.

Alan Stern

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <Pine.LNX.4.44L0.1303191604530.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]             ` <Pine.LNX.4.44L0.1303191604530.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2013-03-19 22:48               ` Stephen Warren
       [not found]                 ` <5148EB2C.1080402-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Warren @ 2013-03-19 22:48 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On 03/19/2013 02:07 PM, Alan Stern wrote:
> On Tue, 19 Mar 2013, Stephen Warren wrote:
> 
>> Yes, sysrq seems to work over the serial console:-)
>>
>>>  * Will now restart
>>>
>>> *** break sent ***
>>> [  180.765213] SysRq : Show Blocked State
>>> [  180.768963]   task                PC stack   pid father
>>> [  180.774182] khubd           D c0517678     0    21      2 0x00000000
>>> [  180.780559] [<c0517678>] (__schedule+0x348/0x6dc) from [<c02ebc90>] (usb_kill_urb+0x88/0xd4)
>>> [  180.788984] [<c02ebc90>] (usb_kill_urb+0x88/0xd4) from [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c)
>>> [  180.798012] [<c02ecb70>] (usb_start_wait_urb+0xa4/0x13c) from [<c02ecd98>] (usb_control_msg+0x98/0xcc)
>>> [  180.807301] [<c02ecd98>] (usb_control_msg+0x98/0xcc) from [<c02e4394>] (hub_port_init+0x5e0/0x9d0)
>>> [  180.816242] [<c02e4394>] (hub_port_init+0x5e0/0x9d0) from [<c02e6640>] (hub_port_connect_change+0x1f4/0x970)
>>> [  180.826050] [<c02e6640>] (hub_port_connect_change+0x1f4/0x970) from [<c02e70fc>] (hub_events+0x340/0x8c4)
>>> [  180.835598] [<c02e70fc>] (hub_events+0x340/0x8c4) from [<c02e76a8>] (hub_thread+0x28/0x1b8)
>>> [  180.843941] [<c02e76a8>] (hub_thread+0x28/0x1b8) from [<c0041268>] (kthread+0xa4/0xb0)
> 
>> I assume you only want the back-traces, not the reset of the sysrq spew?
> 
> Yes.  It looks like khubd is your problem.
> 
>> BTW, I have also occasionally noticed some either hung task or RCU spew
>> related to usb_kill_urb at other times (i.e. when not rebooting), IIRC
>> (which I may not; I may be remembering what happens if I just leave the
>> reboot hung for a few minutes, as shown below):
>>
>>> [  240.670335] INFO: task khubd:21 blocked for more than 120 seconds.
>>> [  240.676516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [  240.684336] khubd           D c0517678     0    21      2 0x00000000
> 
> Right.  It's another symptom of the same thing.  This problem occurs 
> well before shutdown -- something is causing khubd to hang and that 
> messes up everything else.
> 
> A dmesg log with CONFIG_USB_DEBUG enabled would be helpful.  We ought
> to be able to tell where khubd is getting stuck.

Hmmm. Enabling CONFIG_USB_DEBUG appears to mask the problem. I assume
this is some kind of timing/race condition, unless there's some code
with required side-effects hiding under ifdef CONFIG_USB_DEBUG somehow?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <5148EB2C.1080402-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]                 ` <5148EB2C.1080402-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
@ 2013-03-19 23:41                   ` Stephen Warren
       [not found]                     ` <5148F7A1.2090901-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Warren @ 2013-03-19 23:41 UTC (permalink / raw)
  To: Alan Stern
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On 03/19/2013 04:48 PM, Stephen Warren wrote:
> On 03/19/2013 02:07 PM, Alan Stern wrote:
...
>> A dmesg log with CONFIG_USB_DEBUG enabled would be helpful.  We ought
>> to be able to tell where khubd is getting stuck.
> 
> Hmmm. Enabling CONFIG_USB_DEBUG appears to mask the problem. I assume
> this is some kind of timing/race condition, unless there's some code
> with required side-effects hiding under ifdef CONFIG_USB_DEBUG somehow?

Some further information: I added some printks, which are hopefully
obvious from the text below, and in the failing case, I see:

> [    1.291277] single_unlink_async: qh ee31bc40 qh_state set to UNLINK_WAIT
> [    1.297960] start_iaa_cycle: qh ee31bc40 qh_state changing UNLINK_WAIT -> UNLINK
...
> [    6.452331] ehci_urb_dequeue: qh ee31bc40 attempting dequeue (qh_stated 2)

This is with CONFIG_USB_DEBUG disabled. That seems to happen to the very
first (and only) URB(?) ever issued.

If I enable CONFIG_USB_DEBUG, then I see the more expected:

> [    1.540410] single_unlink_async: qh ee0c0300 qh_state set to UNLINK_WAIT
> [    1.547094] start_iaa_cycle: qh ee0c0300 qh_state changing UNLINK_WAIT -> UNLINK
> [    1.554487] start_iaa_cycle: qh ee0c0300 qh_state was UNLINK; processing

followed by a whole slew of subsequent URBs being submitted and processed.

Perhaps the issue is that start_iaa_cycle() (or whatever triggers it)
only happens when there's an URB in state UNLINK, but not when there's
only one in state UNLINK_WAIT, so that it only happens once rather than
the required twice? I'm not sure why a timing difference would affect
this though, unless there's some loop in the IAA processing that happens
to do both the UNLINK_WAIT->UNLINK change, and the processing of the URB
in the UNLINK state in one invocation sometimes, but not others?

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <5148F7A1.2090901-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]                     ` <5148F7A1.2090901-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
@ 2013-03-20 14:40                       ` Alan Stern
       [not found]                         ` <Pine.LNX.4.44L0.1303201026310.1701-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Alan Stern @ 2013-03-20 14:40 UTC (permalink / raw)
  To: Stephen Warren, Andreas Bombe, Sven Joachim
  Cc: linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On Tue, 19 Mar 2013, Stephen Warren wrote:

> On 03/19/2013 04:48 PM, Stephen Warren wrote:
> > On 03/19/2013 02:07 PM, Alan Stern wrote:
> ...
> >> A dmesg log with CONFIG_USB_DEBUG enabled would be helpful.  We ought
> >> to be able to tell where khubd is getting stuck.
> > 
> > Hmmm. Enabling CONFIG_USB_DEBUG appears to mask the problem. I assume
> > this is some kind of timing/race condition, unless there's some code
> > with required side-effects hiding under ifdef CONFIG_USB_DEBUG somehow?
> 
> Some further information: I added some printks, which are hopefully
> obvious from the text below, and in the failing case, I see:
> 
> > [    1.291277] single_unlink_async: qh ee31bc40 qh_state set to UNLINK_WAIT
> > [    1.297960] start_iaa_cycle: qh ee31bc40 qh_state changing UNLINK_WAIT -> UNLINK
> ...
> > [    6.452331] ehci_urb_dequeue: qh ee31bc40 attempting dequeue (qh_stated 2)
> 
> This is with CONFIG_USB_DEBUG disabled. That seems to happen to the very
> first (and only) URB(?) ever issued.

There should be a call to ehci_end_async_unlink shortly after 
start_iaa_cycle.

> If I enable CONFIG_USB_DEBUG, then I see the more expected:
> 
> > [    1.540410] single_unlink_async: qh ee0c0300 qh_state set to UNLINK_WAIT
> > [    1.547094] start_iaa_cycle: qh ee0c0300 qh_state changing UNLINK_WAIT -> UNLINK
> > [    1.554487] start_iaa_cycle: qh ee0c0300 qh_state was UNLINK; processing
> 
> followed by a whole slew of subsequent URBs being submitted and processed.
> 
> Perhaps the issue is that start_iaa_cycle() (or whatever triggers it)
> only happens when there's an URB in state UNLINK, but not when there's
> only one in state UNLINK_WAIT, so that it only happens once rather than
> the required twice? I'm not sure why a timing difference would affect
> this though, unless there's some loop in the IAA processing that happens
> to do both the UNLINK_WAIT->UNLINK change, and the processing of the URB
> in the UNLINK state in one invocation sometimes, but not others?

A possible explanation for a timing difference is that the IAA
mechanism simply isn't working right on your controller, so the IAA
watchdog timer expires.  It's worth pointing out that your second log
excerpt above has messages spaced at intervals of 7 ms, which is
strange considering that the second line should get printed almost
immediately after the first.  Excessively long delays like this do
point in the direction of a timer expiration.

Anyway, there _is_ a mistake in the "work around silicon bug" patch; I
forgot to update the timeout handler.  Does the patch below help?  You 
should build it without CONFIG_USB_DEBUG; I changed the ehci_vdbg to 
ehci_info so that it would show up in the log regardless and we can see 
if it gets triggered.

(Sven, this patch should replace the one I sent you earlier.)

Alan Stern



Index: 3.8/drivers/usb/host/ehci-timer.c
===================================================================
--- 3.8.orig/drivers/usb/host/ehci-timer.c
+++ 3.8/drivers/usb/host/ehci-timer.c
@@ -305,7 +305,8 @@ static void ehci_iaa_watchdog(struct ehc
 	 * (a) SMP races against real IAA firing and retriggering, and
 	 * (b) clean HC shutdown, when IAA watchdog was pending.
 	 */
-	if (ehci->async_iaa) {
+//	if (ehci->async_iaa) {
+	if (1) {
 		u32 cmd, status;
 
 		/* If we get here, IAA is *REALLY* late.  It's barely
@@ -329,7 +330,7 @@ static void ehci_iaa_watchdog(struct ehc
 			ehci_writel(ehci, STS_IAA, &ehci->regs->status);
 		}
 
-		ehci_vdbg(ehci, "IAA watchdog: status %x cmd %x\n",
+		ehci_info(ehci, "IAA watchdog: status %x cmd %x\n",
 				status, cmd);
 		end_unlink_async(ehci);
 	}

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <Pine.LNX.4.44L0.1303201026310.1701-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]                         ` <Pine.LNX.4.44L0.1303201026310.1701-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2013-03-20 16:54                           ` Sven Joachim
  2013-03-20 17:19                           ` Stephen Warren
  1 sibling, 0 replies; 10+ messages in thread
From: Sven Joachim @ 2013-03-20 16:54 UTC (permalink / raw)
  To: Alan Stern
  Cc: Stephen Warren, Andreas Bombe, linux-tegra@vger.kernel.org,
	USB list, Greg Kroah-Hartman

On 2013-03-20 15:40 +0100, Alan Stern wrote:

> Anyway, there _is_ a mistake in the "work around silicon bug" patch; I
> forgot to update the timeout handler.  Does the patch below help?  You 
> should build it without CONFIG_USB_DEBUG; I changed the ehci_vdbg to 
> ehci_info so that it would show up in the log regardless and we can see 
> if it gets triggered.
>
> (Sven, this patch should replace the one I sent you earlier.)

Works fine for me.

Cheers,
       Sven

> Index: 3.8/drivers/usb/host/ehci-timer.c
> ===================================================================
> --- 3.8.orig/drivers/usb/host/ehci-timer.c
> +++ 3.8/drivers/usb/host/ehci-timer.c
> @@ -305,7 +305,8 @@ static void ehci_iaa_watchdog(struct ehc
>  	 * (a) SMP races against real IAA firing and retriggering, and
>  	 * (b) clean HC shutdown, when IAA watchdog was pending.
>  	 */
> -	if (ehci->async_iaa) {
> +//	if (ehci->async_iaa) {
> +	if (1) {
>  		u32 cmd, status;
>  
>  		/* If we get here, IAA is *REALLY* late.  It's barely
> @@ -329,7 +330,7 @@ static void ehci_iaa_watchdog(struct ehc
>  			ehci_writel(ehci, STS_IAA, &ehci->regs->status);
>  		}
>  
> -		ehci_vdbg(ehci, "IAA watchdog: status %x cmd %x\n",
> +		ehci_info(ehci, "IAA watchdog: status %x cmd %x\n",
>  				status, cmd);
>  		end_unlink_async(ehci);
>  	}
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]                         ` <Pine.LNX.4.44L0.1303201026310.1701-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  2013-03-20 16:54                           ` Sven Joachim
@ 2013-03-20 17:19                           ` Stephen Warren
       [not found]                             ` <5149EFA7.8050307-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen Warren @ 2013-03-20 17:19 UTC (permalink / raw)
  To: Alan Stern
  Cc: Andreas Bombe, Sven Joachim,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On 03/20/2013 08:40 AM, Alan Stern wrote:
> On Tue, 19 Mar 2013, Stephen Warren wrote:
> 
>> On 03/19/2013 04:48 PM, Stephen Warren wrote:
>>> On 03/19/2013 02:07 PM, Alan Stern wrote:
>> ...
>>>> A dmesg log with CONFIG_USB_DEBUG enabled would be helpful.  We ought
>>>> to be able to tell where khubd is getting stuck.
>>>
>>> Hmmm. Enabling CONFIG_USB_DEBUG appears to mask the problem. I assume
>>> this is some kind of timing/race condition, unless there's some code
>>> with required side-effects hiding under ifdef CONFIG_USB_DEBUG somehow?
>>
>> Some further information: I added some printks, which are hopefully
>> obvious from the text below, and in the failing case, I see:
>>
>>> [    1.291277] single_unlink_async: qh ee31bc40 qh_state set to UNLINK_WAIT
>>> [    1.297960] start_iaa_cycle: qh ee31bc40 qh_state changing UNLINK_WAIT -> UNLINK
>> ...
>>> [    6.452331] ehci_urb_dequeue: qh ee31bc40 attempting dequeue (qh_stated 2)
>>
>> This is with CONFIG_USB_DEBUG disabled. That seems to happen to the very
>> first (and only) URB(?) ever issued.
> 
> There should be a call to ehci_end_async_unlink shortly after 
> start_iaa_cycle.
> 
>> If I enable CONFIG_USB_DEBUG, then I see the more expected:
>>
>>> [    1.540410] single_unlink_async: qh ee0c0300 qh_state set to UNLINK_WAIT
>>> [    1.547094] start_iaa_cycle: qh ee0c0300 qh_state changing UNLINK_WAIT -> UNLINK
>>> [    1.554487] start_iaa_cycle: qh ee0c0300 qh_state was UNLINK; processing
>>
>> followed by a whole slew of subsequent URBs being submitted and processed.
>>
>> Perhaps the issue is that start_iaa_cycle() (or whatever triggers it)
>> only happens when there's an URB in state UNLINK, but not when there's
>> only one in state UNLINK_WAIT, so that it only happens once rather than
>> the required twice? I'm not sure why a timing difference would affect
>> this though, unless there's some loop in the IAA processing that happens
>> to do both the UNLINK_WAIT->UNLINK change, and the processing of the URB
>> in the UNLINK state in one invocation sometimes, but not others?
> 
> A possible explanation for a timing difference is that the IAA
> mechanism simply isn't working right on your controller, so the IAA
> watchdog timer expires.  It's worth pointing out that your second log
> excerpt above has messages spaced at intervals of 7 ms, which is
> strange considering that the second line should get printed almost
> immediately after the first.  Excessively long delays like this do
> point in the direction of a timer expiration.

Bugs probably wouldn't surprise me:-)

> Anyway, there _is_ a mistake in the "work around silicon bug" patch; I
> forgot to update the timeout handler.  Does the patch below help?  You 
> should build it without CONFIG_USB_DEBUG; I changed the ehci_vdbg to 
> ehci_info so that it would show up in the log regardless and we can see 
> if it gets triggered.

Yes, with that applied, I see:

> root@localhost:~# dmesg|grep IAA
> [    1.310994] tegra-ehci tegra-ehci.0: IAA watchdog: status a0a0 cmd 10025
> [    1.322006] tegra-ehci tegra-ehci.0: IAA watchdog: status 80a0 cmd 10025

and reboot and shutdown both work fine.

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <5149EFA7.8050307-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>]

* Re: Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI"
       [not found]                             ` <5149EFA7.8050307-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
@ 2013-03-20 18:53                               ` Alan Stern
  0 siblings, 0 replies; 10+ messages in thread
From: Alan Stern @ 2013-03-20 18:53 UTC (permalink / raw)
  To: Stephen Warren
  Cc: Andreas Bombe, Sven Joachim,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, USB list,
	Greg Kroah-Hartman

On Wed, 20 Mar 2013, Stephen Warren wrote:

> > Anyway, there _is_ a mistake in the "work around silicon bug" patch; I
> > forgot to update the timeout handler.  Does the patch below help?  You 
> > should build it without CONFIG_USB_DEBUG; I changed the ehci_vdbg to 
> > ehci_info so that it would show up in the log regardless and we can see 
> > if it gets triggered.
> 
> Yes, with that applied, I see:
> 
> > root@localhost:~# dmesg|grep IAA
> > [    1.310994] tegra-ehci tegra-ehci.0: IAA watchdog: status a0a0 cmd 10025
> > [    1.322006] tegra-ehci tegra-ehci.0: IAA watchdog: status 80a0 cmd 10025

Sure enough, the 0x20 bit in the status register and lack of the 0x40 
bit in the command register indicate that the IAA cycle ended, but the 
watchdog timer expired before an interrupt was received.  Or maybe the 
IRQ was lost entirely.

> and reboot and shutdown both work fine.

Great!  Okay, I will submit the patch (without the ehci_info change) 
for -current and -stable.

Alan Stern

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-03-20 18:53 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-18 20:06 Reboot/shutdown failure due to "USB: EHCI: work around silicon bug in Intel's EHCI" Stephen Warren
     [not found] ` <514773BD.6010803-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
2013-03-19 18:34   ` Alan Stern
     [not found]     ` <Pine.LNX.4.44L0.1303191420121.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2013-03-19 18:49       ` Stephen Warren
     [not found]         ` <5148B338.6070001-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
2013-03-19 20:07           ` Alan Stern
     [not found]             ` <Pine.LNX.4.44L0.1303191604530.1302-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2013-03-19 22:48               ` Stephen Warren
     [not found]                 ` <5148EB2C.1080402-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
2013-03-19 23:41                   ` Stephen Warren
     [not found]                     ` <5148F7A1.2090901-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
2013-03-20 14:40                       ` Alan Stern
     [not found]                         ` <Pine.LNX.4.44L0.1303201026310.1701-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2013-03-20 16:54                           ` Sven Joachim
2013-03-20 17:19                           ` Stephen Warren
     [not found]                             ` <5149EFA7.8050307-3lzwWm7+Weoh9ZMKESR00Q@public.gmane.org>
2013-03-20 18:53                               ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).