* USB: workqueues stuck in 'D' state?
From: Linus Torvalds @ 2024-12-06 23:07 UTC
  To: Mathias Nyman, Greg Kroah-Hartman; +Cc: USB subsystem

So I'm not sure if this is new or not, but I *think* I would have
noticed it earlier.

On my Ampere Altra (128-core arm64 system), I started seeing 'top'
claiming a load average of roughly 2.3 even when idle, and it seems to
be all due to this:

  $ ps ax | grep ' [DR] '
    869 ?        D      0:00 [kworker/24:1+usb_hub_wq]
   1900 ?        D      0:00 [kworker/24:7+pm]

where sometimes there are multiple of those 'pm' workers.
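
('D' state is what drives the load average up here: uninterruptible
tasks count toward it even when they use no CPU. A quick, generic way
to watch for them, nothing specific to this machine:

  $ ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'

which also shows the kernel function each one is sleeping in.)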

Doing a sysrq-w, I get this:

  task:kworker/24:3    state:D stack:0     pid:1308  tgid:1308  ppid:2
     flags:0x00000008
  Workqueue: pm pm_runtime_work
  Call trace:
   __switch_to+0xf4/0x168 (T)
   __schedule+0x248/0x648
   schedule+0x3c/0xe0
   usleep_range_state+0x118/0x150
   xhci_hub_control+0xe80/0x1090
   rh_call_control+0x274/0x7a0
   usb_hcd_submit_urb+0x13c/0x3a0
   usb_submit_urb+0x1c8/0x600
   usb_start_wait_urb+0x7c/0x180
   usb_control_msg+0xcc/0x150
   usb_port_suspend+0x414/0x510
   usb_generic_driver_suspend+0x68/0x90
   usb_suspend_both+0x1c8/0x290
   usb_runtime_suspend+0x3c/0xb0
   __rpm_callback+0x50/0x1f0
   rpm_callback+0x70/0x88
   rpm_suspend+0xe8/0x5a8
   __pm_runtime_suspend+0x4c/0x130
   usb_runtime_idle+0x48/0x68
   rpm_idle+0xa4/0x358
   pm_runtime_work+0xb0/0xe0

  task:kworker/24:7    state:D stack:0     pid:1900  tgid:1900  ppid:2
     flags:0x00000208
  Workqueue: pm pm_runtime_work
  Call trace:
   __switch_to+0xf4/0x168 (T)
   __schedule+0x248/0x648
   schedule+0x3c/0xe0
   usleep_range_state+0x118/0x150
   xhci_hub_control+0xe80/0x1090
   rh_call_control+0x274/0x7a0
   usb_hcd_submit_urb+0x13c/0x3a0
   usb_submit_urb+0x1c8/0x600
   usb_start_wait_urb+0x7c/0x180
   usb_control_msg+0xcc/0x150
   usb_port_suspend+0x414/0x510
   usb_generic_driver_suspend+0x68/0x90
   usb_suspend_both+0x1c8/0x290
   usb_runtime_suspend+0x3c/0xb0
   __rpm_callback+0x50/0x1f0
   rpm_callback+0x70/0x88
   rpm_suspend+0xe8/0x5a8
   __pm_runtime_suspend+0x4c/0x130

so it seems to be all in that xhci_hub_control() path. I'm not seeing
anything that has changed in the xhci driver in this merge window, so
maybe this goes back further, and I just haven't noticed this odd load
average issue before.
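
(As a sanity check, assuming v6.12 as the previous baseline, the xhci
changes this cycle can be listed with something like:

  $ git log --oneline v6.12.. -- drivers/usb/host/xhci*

and nothing in there obviously touches this suspend path.)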

The call trace for the usb_hub_wq seems a lot less stable, but I've
seen backtraces like

  task:kworker/24:1    state:D stack:0     pid:869   tgid:869   ppid:2
     flags:0x00000008
  Workqueue: usb_hub_wq hub_event
  Call trace:
   __switch_to+0xf4/0x168 (T)
   __schedule+0x248/0x648
   schedule+0x3c/0xe0
   schedule_preempt_disabled+0x2c/0x50
   __mutex_lock.constprop.0+0x478/0x968
   __mutex_lock_slowpath+0x1c/0x38
   mutex_lock+0x6c/0x88
   hub_event+0x144/0x4a0
   process_one_work+0x170/0x408
   worker_thread+0x2cc/0x400
   kthread+0xf4/0x108
   ret_from_fork+0x10/0x20

But also just

  Workqueue: usb_hub_wq hub_event
  Call trace:
   __switch_to+0xf4/0x168 (T)
   usb_control_msg+0xcc/0x150

or

  Workqueue: usb_hub_wq hub_event
  Call trace:
   __switch_to+0xf4/0x168 (T)
   __schedule+0x248/0x648
   schedule+0x3c/0xe0
   schedule_timeout+0x94/0x120
   msleep+0x30/0x50

so at a guess it's just some interaction with that 'pm' workqueue.

I did a reboot just to verify that yes, it happened again after a
fresh boot. So it is at least *somewhat* consistently repeatable,
although I wouldn't be surprised if it's some kind of timing-dependent
race condition that just happens to trigger on this machine.

I could try to see if it's so consistent that I could bisect it, but
before I start doing that, maybe just the backtraces will make somebody
go "Ahh, that smells like XYZ"..

Hmm?

             Linus


* Re: USB: workqueues stuck in 'D' state?
From: Alan Stern @ 2024-12-07  2:02 UTC
  To: Linus Torvalds; +Cc: Mathias Nyman, Greg Kroah-Hartman, USB subsystem

On Fri, Dec 06, 2024 at 03:07:10PM -0800, Linus Torvalds wrote:
> So I'm not sure if this is new or not, but I *think* I would have
> noticed it earlier.
> 
> On my Ampere Altra (128-core arm64 system), I started seeing 'top'
> claiming a load average of roughly 2.3 even when idle, and it seems to
> be all due to this:
> 
>   $ ps ax | grep ' [DR] '
>     869 ?        D      0:00 [kworker/24:1+usb_hub_wq]
>    1900 ?        D      0:00 [kworker/24:7+pm]
> 
> where sometimes there are multiple of those 'pm' workers.
> 
> Doing a sysrq-w, I get this:
> 
>   task:kworker/24:3    state:D stack:0     pid:1308  tgid:1308  ppid:2
>      flags:0x00000008
>   Workqueue: pm pm_runtime_work
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    usleep_range_state+0x118/0x150
>    xhci_hub_control+0xe80/0x1090
>    rh_call_control+0x274/0x7a0
>    usb_hcd_submit_urb+0x13c/0x3a0
>    usb_submit_urb+0x1c8/0x600
>    usb_start_wait_urb+0x7c/0x180
>    usb_control_msg+0xcc/0x150
>    usb_port_suspend+0x414/0x510
>    usb_generic_driver_suspend+0x68/0x90
>    usb_suspend_both+0x1c8/0x290
>    usb_runtime_suspend+0x3c/0xb0
>    __rpm_callback+0x50/0x1f0
>    rpm_callback+0x70/0x88
>    rpm_suspend+0xe8/0x5a8
>    __pm_runtime_suspend+0x4c/0x130
>    usb_runtime_idle+0x48/0x68
>    rpm_idle+0xa4/0x358
>    pm_runtime_work+0xb0/0xe0
> 
>   task:kworker/24:7    state:D stack:0     pid:1900  tgid:1900  ppid:2
>      flags:0x00000208
>   Workqueue: pm pm_runtime_work
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    usleep_range_state+0x118/0x150
>    xhci_hub_control+0xe80/0x1090
>    rh_call_control+0x274/0x7a0
>    usb_hcd_submit_urb+0x13c/0x3a0
>    usb_submit_urb+0x1c8/0x600
>    usb_start_wait_urb+0x7c/0x180
>    usb_control_msg+0xcc/0x150
>    usb_port_suspend+0x414/0x510
>    usb_generic_driver_suspend+0x68/0x90
>    usb_suspend_both+0x1c8/0x290
>    usb_runtime_suspend+0x3c/0xb0
>    __rpm_callback+0x50/0x1f0
>    rpm_callback+0x70/0x88
>    rpm_suspend+0xe8/0x5a8
>    __pm_runtime_suspend+0x4c/0x130
> 
> so it seems to be all in that xhci_hub_control() path. I'm not seeing
> anything that has changed in the xhci driver in this merge window, so
> maybe this goes back further, and I just haven't noticed this odd load
> average issue before.
> 
> The call trace for the usb_hub_wq seems a lot less stable, but I've
> seen backtraces like
> 
>   task:kworker/24:1    state:D stack:0     pid:869   tgid:869   ppid:2
>      flags:0x00000008
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    schedule_preempt_disabled+0x2c/0x50
>    __mutex_lock.constprop.0+0x478/0x968
>    __mutex_lock_slowpath+0x1c/0x38
>    mutex_lock+0x6c/0x88
>    hub_event+0x144/0x4a0
>    process_one_work+0x170/0x408
>    worker_thread+0x2cc/0x400
>    kthread+0xf4/0x108
>    ret_from_fork+0x10/0x20
> 
> But also just
> 
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    usb_control_msg+0xcc/0x150
> 
> or
> 
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    schedule_timeout+0x94/0x120
>    msleep+0x30/0x50
> 
> so at a guess it's just some interaction with that 'pm' workqueue.
> 
> I did a reboot just to verify that yes, it happened again after a
> fresh boot. So it is at least *somewhat* consistently repeatable,
> although I wouldn't be surprised if it's some kind of timing-dependent
> race condition that just happens to trigger on this machine.
> 
> I could try to see if it's so consistent that I could bisect it, but
> before I start doing that, maybe just the backtraces will make somebody
> go "Ahh, that smells like XYZ"..
> 
> Hmm?

In the past (not recently) I have seen the system get into a loop, 
repeatedly suspending and resuming some device over and over.

What do you get if you enable dynamic debugging for usbcore?
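
Something like this should turn them all on (assuming
CONFIG_DYNAMIC_DEBUG is enabled and debugfs is mounted in the usual
place):

  # echo 'module usbcore +p' > /sys/kernel/debug/dynamic_debug/control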

Alan Stern


* Re: USB: workqueues stuck in 'D' state?
From: Greg Kroah-Hartman @ 2024-12-08 15:51 UTC
  To: Linus Torvalds; +Cc: Mathias Nyman, USB subsystem

On Fri, Dec 06, 2024 at 03:07:10PM -0800, Linus Torvalds wrote:
> So I'm not sure if this is new or not, but I *think* I would have
> noticed it earlier.
> 
> On my Ampere Altra (128-core arm64 system), I started seeing 'top'
> claiming a load average of roughly 2.3 even when idle, and it seems to
> be all due to this:
> 
>   $ ps ax | grep ' [DR] '
>     869 ?        D      0:00 [kworker/24:1+usb_hub_wq]
>    1900 ?        D      0:00 [kworker/24:7+pm]
> 
> where sometimes there are multiple of those 'pm' workers.
> 
> Doing a sysrq-w, I get this:
> 
>   task:kworker/24:3    state:D stack:0     pid:1308  tgid:1308  ppid:2
>      flags:0x00000008
>   Workqueue: pm pm_runtime_work
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    usleep_range_state+0x118/0x150
>    xhci_hub_control+0xe80/0x1090
>    rh_call_control+0x274/0x7a0
>    usb_hcd_submit_urb+0x13c/0x3a0
>    usb_submit_urb+0x1c8/0x600
>    usb_start_wait_urb+0x7c/0x180
>    usb_control_msg+0xcc/0x150
>    usb_port_suspend+0x414/0x510
>    usb_generic_driver_suspend+0x68/0x90
>    usb_suspend_both+0x1c8/0x290
>    usb_runtime_suspend+0x3c/0xb0
>    __rpm_callback+0x50/0x1f0
>    rpm_callback+0x70/0x88
>    rpm_suspend+0xe8/0x5a8
>    __pm_runtime_suspend+0x4c/0x130
>    usb_runtime_idle+0x48/0x68
>    rpm_idle+0xa4/0x358
>    pm_runtime_work+0xb0/0xe0
> 
>   task:kworker/24:7    state:D stack:0     pid:1900  tgid:1900  ppid:2
>      flags:0x00000208
>   Workqueue: pm pm_runtime_work
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    usleep_range_state+0x118/0x150
>    xhci_hub_control+0xe80/0x1090
>    rh_call_control+0x274/0x7a0
>    usb_hcd_submit_urb+0x13c/0x3a0
>    usb_submit_urb+0x1c8/0x600
>    usb_start_wait_urb+0x7c/0x180
>    usb_control_msg+0xcc/0x150
>    usb_port_suspend+0x414/0x510
>    usb_generic_driver_suspend+0x68/0x90
>    usb_suspend_both+0x1c8/0x290
>    usb_runtime_suspend+0x3c/0xb0
>    __rpm_callback+0x50/0x1f0
>    rpm_callback+0x70/0x88
>    rpm_suspend+0xe8/0x5a8
>    __pm_runtime_suspend+0x4c/0x130
> 
> so it seems to be all in that xhci_hub_control() path. I'm not seeing
> anything that has changed in the xhci driver in this merge window, so
> maybe this goes back further, and I just haven't noticed this odd load
> average issue before.
> 
> The call trace for the usb_hub_wq seems a lot less stable, but I've
> seen backtraces like
> 
>   task:kworker/24:1    state:D stack:0     pid:869   tgid:869   ppid:2
>      flags:0x00000008
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    schedule_preempt_disabled+0x2c/0x50
>    __mutex_lock.constprop.0+0x478/0x968
>    __mutex_lock_slowpath+0x1c/0x38
>    mutex_lock+0x6c/0x88
>    hub_event+0x144/0x4a0
>    process_one_work+0x170/0x408
>    worker_thread+0x2cc/0x400
>    kthread+0xf4/0x108
>    ret_from_fork+0x10/0x20
> 
> But also just
> 
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    usb_control_msg+0xcc/0x150
> 
> or
> 
>   Workqueue: usb_hub_wq hub_event
>   Call trace:
>    __switch_to+0xf4/0x168 (T)
>    __schedule+0x248/0x648
>    schedule+0x3c/0xe0
>    schedule_timeout+0x94/0x120
>    msleep+0x30/0x50
> 
> so at a guess it's just some interaction with that 'pm' workqueue.
> 
> I did a reboot just to verify that yes, it happened again after a
> fresh boot. So it is at least *somewhat* consistently repeatable,
> although I wouldn't be surprised if it's some kind of timing-dependent
> race condition that just happens to trigger on this machine.
> 
> I could try to see if it's so consistent that I could bisect it, but
> before I start doing that, maybe just the backtraces will make somebody
> go "Ahh, that smells like XYZ"..

I can't duplicate this here running on your latest tree at all (it's a
"smaller" x86 box with only 64 cores), so I don't know what to suggest.
Are you using any USB devices on this thing, or is that just for
internal connections?

thanks,

greg k-h


* Re: USB: workqueues stuck in 'D' state?
From: Linus Torvalds @ 2024-12-08 19:14 UTC
  To: Greg Kroah-Hartman; +Cc: Mathias Nyman, USB subsystem

On Sun, 8 Dec 2024 at 07:51, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> I can't duplicate this here running on your latest tree at all (it's a
> "smaller" x86 box with only 64 cores), so I don't know what to suggest.

Yeah, I suspect it's an older issue that's triggered by some timing change.

I guess I will try to see if I can bisect it, but I'm not all that optimistic.
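
Roughly, assuming the last kernel I ran without noticing this (say,
v6.12) was actually good:

  $ git bisect start
  $ git bisect bad HEAD
  $ git bisect good v6.12

then build, boot, check 'ps ax | grep " [DR] "' at each step and mark
it good or bad. A timing-dependent race may not bisect cleanly, though.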

> Are you using any USB devices on this thing, or is that just for
> internal connections?

All internal motherboard devices. Mostly the usual random hubs, but
also some "ASRR USB Ethernet" and a "Multifunction Composite Gadget".

I assume that's for the whole management side.

             Linus


* Re: USB: workqueues stuck in 'D' state?
From: Michał Pecio @ 2024-12-08 20:17 UTC
  To: torvalds; +Cc: gregkh, linux-usb, mathias.nyman

Hi,

> > Are you using any USB devices on this thing, or is that just for
> > internal connections?
>
> All internal motherboard devices. Mostly the usual random hubs, but
> also some "ASRR USB Ethernet" and a "Multifunction Composite Gadget".

Dynamic debug has been suggested; it would tell which ports are being
suspended and how often. Either all of usbcore or just this should do:

func xhci_set_link_state +p
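
To have it on from boot, the per-module form should also work, assuming
the driver is built as the usual xhci_hcd module:

  xhci_hcd.dyndbg="func xhci_set_link_state +p"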

The stack trace looks like this function has just returned and the
driver is waiting for the HW to confirm the new link state. (Which
confirmation, or lack thereof, seems to be subsequently ignored, by
the way.)

Regards,
Michal


* Re: USB: workqueues stuck in 'D' state?
From: Michał Pecio @ 2024-12-09 23:27 UTC
  To: torvalds; +Cc: gregkh, linux-usb, mathias.nyman

OK, I have no idea if it's the same bug or merely similar, but:

Linux v6.13-rc1. My hardware currently seems to insist that there is
overcurrent on all ports. It's an ASMedia chip, so nothing shocking.
This drives Linux nuts and it repeatedly hammers on hub_suspend() and
hub_resume(), apparently in a fairly tight loop, with no end in sight.

I have also seen short bursts of the same a few times while unplugging
some device from two ASMedia controllers (but not from one other HC),
though they stopped after a second. Maybe the HW came back to sanity?

It looks like something is very eager to retry things.

This shows up upon reloading xhci_hcd:

[613292.598841] usb usb8-port1: over-current condition
[613292.726843] usb usb8-port2: over-current condition
[613292.761832] usb usb9-port1: over-current condition
[613292.969830] usb usb9-port2: over-current condition

One core at 100% and junk like this repeats forever with dynamic debug:

[  +0.000001] hub 8-0:1.0: hub_resume
[  +0.000007] xhci_hcd 0000:02:00.0: Get port status 8-2 read: 0x288, return 0x108
[  +0.000005] hub 8-0:1.0: state 7 ports 2 chg 0006 evt 0000
[  +0.000006] usb usb8-port1: status 0108, change 0000, 12 Mb/s
[  +0.000003] xhci_hcd 0000:02:00.0: Get port status 8-2 read: 0x288, return 0x108
[  +0.000004] usb usb8-port2: status 0108, change 0000, 12 Mb/s
[  +0.000002] hub 8-0:1.0: hub_suspend
[  +0.000003] usb usb8: bus auto-suspend, wakeup 1
[  +0.000003] xhci_hcd 0000:02:00.0: Bus suspend bailout, port over-current detected
[  +0.000001] usb usb8: bus suspend fail, err -16
[  +0.000008] xhci_hcd 0000:02:00.0: Get port status 8-2 read: 0x288, return 0x108
[  +0.000004] hub 8-0:1.0: state 7 ports 2 chg 0006 evt 0000
[  +0.000003] xhci_hcd 0000:02:00.0: Get port status 8-1 read: 0x288, return 0x108
[  +0.000004] usb usb8-port1: status 0108, change 0000, 12 Mb/s
[  +0.000007] usb usb8-port2: status 0108, change 0000, 12 Mb/s
[  +0.000002] hub 8-0:1.0: hub_suspend
[  +0.000006] xhci_hcd 0000:02:00.0: Bus suspend bailout, port over-current detected
[  +0.000004] xhci_hcd 0000:02:00.0: Get port status 8-1 read: 0x288, return 0x108
[  +0.000009] hub 8-0:1.0: state 7 ports 2 chg 0006 evt 0000
[  +0.000003] xhci_hcd 0000:02:00.0: Get port status 8-1 read: 0x288, return 0x108
[  +0.000004] usb usb8-port1: status 0108, change 0000, 12 Mb/s
[  +0.000006] usb usb8-port2: status 0108, change 0000, 12 Mb/s
[  +0.000003] hub 8-0:1.0: hub_suspend
[  +0.000003] usb usb8: bus auto-suspend, wakeup 1
[  +0.000002] xhci_hcd 0000:02:00.0: Bus suspend bailout, port over-current detected
[  +0.000002] usb usb8: bus suspend fail, err -16
[  +0.000001] hub 8-0:1.0: hub_resume
[  +0.000007] xhci_hcd 0000:02:00.0: Get port status 8-2 read: 0x288, return 0x108
[  +0.000004] hub 8-0:1.0: state 7 ports 2 chg 0006 evt 0000
[  +0.000007] usb usb8-port1: status 0108, change 0000, 12 Mb/s
[  +0.000003] xhci_hcd 0000:02:00.0: Get port status 8-2 read: 0x288, return 0x108
[  +0.000003] usb usb8-port2: status 0108, change 0000, 12 Mb/s
[  +0.000003] hub 8-0:1.0: hub_suspend

Stack traces are similar: pm workqueue -> usbcore -> xhci.
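
As a stopgap while debugging, forbidding runtime PM on the affected
root hub should break the suspend/resume loop (assuming usb8 is the
right bus, as in the log above):

  # echo on > /sys/bus/usb/devices/usb8/power/control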

Regards,
Michal

