public inbox for linux-kernel@vger.kernel.org
From: Michal Pecio <michal.pecio@gmail.com>
To: Desnes Nunes <desnesn@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org,
	gregkh@linuxfoundation.org, mathias.nyman@intel.com,
	stable@vger.kernel.org
Subject: Re: [PATCH] usb: xhci: bound wait command completion to avoid kdump deadlock
Date: Thu, 30 Apr 2026 10:48:50 +0200
Message-ID: <20260430104850.352bd946.michal.pecio@gmail.com>
In-Reply-To: <20260430014817.2006885-1-desnesn@redhat.com>

On Wed, 29 Apr 2026 22:48:17 -0300, Desnes Nunes wrote:
> The following deadlock in the usb subsystem can be triggered during kdump:
> 
> systemd-udevd[402]: usb3: Worker [419] processing SEQNUM=2194 is taking a long time
> dracut-initqueue[432]: Timed out while waiting for udev queue to empty.
> systemd-udevd[402]: usb3: Worker [419] processing SEQNUM=2194 killed
> systemd-udevd[402]: usb3: Worker [419] terminated by signal 9 (KILL).
> ...
> kdump[720]: saving vmcore complete
> ...
> systemd-shutdown[1]: Rebooting.
> INFO: task kworker/0:6:76 blocked for more than 122 seconds.

That's suspiciously long indeed.

>       Not tainted 6.12.0-223.2443_2475543665.el10.x86_64 #1

Pretty old kernel, and a distribution kernel to boot.
Have you tried 7.x? Does the bug still exist there?

> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/0:6     state:D stack:0     pid:76    tgid:76    ppid:2      task_flags:0x4208060 flags:0x00004000
> Workqueue: usb_hub_wq hub_event
> Call Trace:
>  <TASK>
>  __schedule+0x2a5/0x630
>  schedule+0x27/0x80
>  schedule_timeout+0xbf/0x100
>  __wait_for_common+0x95/0x1b0
>  ? __pfx_schedule_timeout+0x10/0x10
>  xhci_alloc_dev+0x9e/0x290
>  usb_alloc_dev+0x77/0x3a0
>  hub_port_connect+0x293/0x9a0
>  hub_port_connect_change+0x94/0x260
>  port_event+0x4d1/0x7f0
>  hub_event+0x16f/0x480
>  process_one_work+0x174/0x330
>  worker_thread+0x256/0x3a0
>  ? __pfx_worker_thread+0x10/0x10
>  kthread+0xfa/0x240
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x31/0x50
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>
> INFO: task systemd-shutdow:1 blocked for more than 122 seconds.
>       Not tainted 6.12.0-223.2443_2475543665.el10.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:systemd-shutdow state:D stack:0     pid:1     tgid:1     ppid:0      task_flags:0x400100 flags:0x00000002
> Call Trace:
>  <TASK>
>  __schedule+0x2a5/0x630
>  schedule+0x27/0x80
>  schedule_preempt_disabled+0x15/0x30
>  __mutex_lock.constprop.0+0x497/0x860
>  device_shutdown+0xac/0x190
>  kernel_restart+0x3a/0x70
>  __do_sys_reboot+0x146/0x240
>  do_syscall_64+0x7d/0x160
>  ? devkmsg_write.cold+0x24/0x4a
>  ? update_load_avg+0x7f/0x730
>  ? __dequeue_entity+0x3ec/0x4a0
>  ? update_load_avg+0x7f/0x730
>  ? pick_next_task_fair+0x1e6/0x330
>  ? finish_task_switch.isra.0+0x97/0x2a0
>  ? rseq_get_rseq_cs+0x1d/0x220
>  ? rseq_ip_fixup+0x8d/0x1d0
>  ? arch_exit_to_user_mode_prepare.isra.0+0xa5/0xd0
>  ? syscall_exit_to_user_mode+0x32/0x190
>  ? do_syscall_64+0x89/0x160
>  ? handle_mm_fault+0x110/0x370
>  ? do_user_addr_fault+0x606/0x830
>  ? exc_page_fault+0x7f/0x150
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f32517d9917
> RSP: 002b:00007ffc018d4fb8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f32517d9917
> RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
> RBP: 00007ffc018d5130 R08: 0000000000000069 R09: 00000000ffffffff
> R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
> R13: 0000000000000000 R14: 00007ffc018d5258 R15: 0000000000000000
>  </TASK>
> 
> During crashkernel's boot, hub_event() takes usb_lock_device(hdev) of the
> root hub and keeps it for the whole hub processing loop, since it calls
> hub_port_connect() -> usb_alloc_dev() -> xhci_alloc_dev(). If during kdump
> another device (e.g., a mis-initialized dGPU) hogs interrupts or DMAs, the
> TRB_ENABLE_SLOT command will be blocked from completion in time, moving
> the HC to an unstable condition (e.g., HSE in USBSTS).

What specifically have you seen?

If you have actually observed HSE (how?), maybe xhci-hcd could detect
it automatically by the same means and clean up immediately.
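
To make that suggestion a bit more concrete, here is a minimal userspace
sketch of such a check. It assumes the USBSTS layout from the xHCI
specification (HSE is bit 2, HCE is bit 12) and that a read from a vanished
PCI device returns all ones. The enum and function names are made up for
illustration; this is not xhci-hcd code.

```c
#include <assert.h>
#include <stdint.h>

/* USBSTS bits, assumed from the xHCI specification. */
#define USBSTS_HSE (UINT32_C(1) << 2)   /* Host System Error */
#define USBSTS_HCE (UINT32_C(1) << 12)  /* Host Controller Error */

static_assert((USBSTS_HSE & USBSTS_HCE) == 0, "HSE and HCE are distinct bits");

/* Hypothetical classification, not a kernel type. */
enum hc_health {
	HC_OK,           /* no fatal bit set */
	HC_FATAL_ERROR,  /* HSE or HCE set, commands will not complete */
	HC_GONE,         /* register read returned all ones */
};

/* Classify a raw USBSTS value read from the controller. */
static enum hc_health classify_usbsts(uint32_t usbsts)
{
	if (usbsts == UINT32_MAX)
		return HC_GONE;
	if (usbsts & (USBSTS_HSE | USBSTS_HCE))
		return HC_FATAL_ERROR;
	return HC_OK;
}
```

A command timeout handler could run a check like this first and treat
anything other than HC_OK as a reason to give up on the host right away
instead of waiting for an abort.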

> After vmcore gets captured, init calls device_shutdown() trying to
> shut down the hub device, by also trying to take the same lock still
> held by the hub kworker task.
> 
> Avoid the deadlock by adding a 2x timeout for command completion

nit: it's not a deadlock if X waits for Y and Y is simply stuck on its own.

> before calling xhci_hc_died(). This gives enough time before marking
> the host unstable, dying and calling xhci_cleanup_command_queue();
> which unblocks the hub worker into releasing the lock, allowing
> device_shutdown() to proceed.

Many functions wait for command completion without timeouts.
Patch this one and you will get stuck in the next.

This shouldn't be happening in the first place. If a command doesn't
complete normally in time, then xhci_handle_command_timeout() should
abort it, and if the abort times out too, then xhci_hc_died() should
be called.
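
The escalation described above can be sketched as a small userspace model.
This is only an illustration under stated assumptions: host_model, queue_cmd(),
finish_head() and on_command_timeout() are hypothetical names, and the real
driver logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CMDS 4

enum cmd_status { CMD_PENDING, CMD_ABORTED };

/* Hypothetical model: each queue entry points at a waiter's status word. */
struct host_model {
	bool dead;
	enum cmd_status *queue[MAX_CMDS];  /* queue[0] is the running command */
	size_t len;
};

/* Queue a command; refused once the host is dead or the queue is full. */
static bool queue_cmd(struct host_model *h, enum cmd_status *st)
{
	if (h->dead || h->len == MAX_CMDS)
		return false;
	*st = CMD_PENDING;
	h->queue[h->len++] = st;
	return true;
}

/* Release the waiter of the oldest command with the given result. */
static void finish_head(struct host_model *h, enum cmd_status result)
{
	assert(h->len > 0);
	*h->queue[0] = result;
	for (size_t i = 1; i < h->len; i++)
		h->queue[i - 1] = h->queue[i];
	h->len--;
}

/* Timeout path: abort the running command, or kill the host if that fails. */
static void on_command_timeout(struct host_model *h, bool abort_succeeded)
{
	if (h->len == 0)
		return;
	if (abort_succeeded) {
		finish_head(h, CMD_ABORTED);
		return;
	}
	/* The abort timed out too: declare the host dead, release every waiter. */
	h->dead = true;
	while (h->len > 0)
		finish_head(h, CMD_ABORTED);
}
```

In this model no waiter stays pending once the timeout path has run to the
end, which is why an unbounded wait in the caller is not expected to hang
forever.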

So I'm not sure why this hasn't happened here.
Is it reproducible? Can you try again with debug logs?

echo 'module xhci_hcd +p' >/proc/dynamic_debug/control

Regards,
Michal
