xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend

* xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
@ 2026-03-29 21:52 martinalderson
  2026-03-30  0:07 ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: martinalderson @ 2026-03-29 21:52 UTC (permalink / raw)
  To: linux-usb

[BUG] xhci_hcd 0000:0f:00.0: controller declared dead on resume from suspend

Hardware:
  CPU: AMD Ryzen 9 7900 12-Core Processor
  Board: ASUS PRIME B650-PLUS
  Controller: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
  Subsystem: ASUSTeK Computer Inc. [1043:8877]
  PCI: 0000:0f:00.0 (IOMMU group 30)

Software:
  Kernel: 7.0.0-rc5 (commit be762d8b, built 2026-03-28)
  Distro: Fedora 43 (Workstation)
  Desktop: GNOME on Wayland

Description:
  On the first suspend/resume cycle after boot, the xHCI controller at
  0000:0f:00.0 (AMD Raphael/Granite Ridge USB 2.0) fails to resume and
  is declared dead. A Logitech Unifying Receiver (046d:c52b) on this
  controller is disconnected and the mouse (Logitech M720 Triathlon)
  stops functioning.

  A second xHCI controller on the same system (0000:0c:00.0, AMD 600
  Series Chipset USB 3.2 [1022:43f7]) also errors on resume (USBSTS
  0x401) but successfully recovers via reinit. The 0f:00.0 controller
  does not recover.

  Regression from rc4: suspend/resume worked correctly on 7.0-rc4 and
  earlier kernels on the same hardware.

Reproduce:
  1. Boot with USB device attached to a port on the 0000:0f:00.0 controller
  2. Suspend (systemd suspend)
  3. Resume

dmesg on resume:
  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
  xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
  xhci_hcd 0000:0f:00.0: HC died; cleaning up
  xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
  usb usb1: root hub lost power or was reset
  usb usb2: root hub lost power or was reset
  usb 1-7: WARN: invalid context state for evaluate context command.
  usb 1-10: WARN: invalid context state for evaluate context command.
  usb 7-1: USB disconnect, device number 2

Workaround:
  PCI remove + rescan recovers the controller:
    echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
    echo 1 > /sys/bus/pci/rescan

  A simple PCI device reset (echo 1 > .../reset) was insufficient -- the
  controller came back but did not re-enumerate the attached device.

Notes:
  - The 0f:00.0 controller is USB 2.0 only (USB3 root hub has no ports)
  - hci version 0x120, hcc params 0x0110ffc5, quirks 0x0000000200000010

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-03-29 21:52 xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend martinalderson
@ 2026-03-30  0:07 ` Michal Pecio
  2026-04-04 12:04   ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-03-30  0:07 UTC (permalink / raw)
  To: martinalderson; +Cc: linux-usb

On Sun, 29 Mar 2026 17:52:39 -0400, martinalderson@gmail.com wrote:
> [BUG] xhci_hcd 0000:0f:00.0: controller declared dead on resume from
> suspend
> 
> Hardware:
>   CPU: AMD Ryzen 9 7900 12-Core Processor
>   Board: ASUS PRIME B650-PLUS
>   Controller: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
>   Subsystem: ASUSTeK Computer Inc. [1043:8877]
>   PCI: 0000:0f:00.0 (IOMMU group 30)
> 
> Software:
>   Kernel: 7.0.0-rc5 (commit be762d8b, built 2026-03-28)
>   Distro: Fedora 43 (Workstation)
>   Desktop: GNOME on Wayland
> 
> Description:
>   On the first suspend/resume cycle after boot, the xHCI controller at
>   0000:0f:00.0 (AMD Raphael/Granite Ridge USB 2.0) fails to resume and
>   is declared dead. A Logitech Unifying Receiver (046d:c52b) on this
>   controller is disconnected and the mouse (Logitech M720 Triathlon)
>   stops functioning.
> 
>   A second xHCI controller on the same system (0000:0c:00.0, AMD 600
>   Series Chipset USB 3.2 [1022:43f7]) also errors on resume (USBSTS
>   0x401) but successfully recovers via reinit. The 0f:00.0 controller
>   does not recover.
> 
>   Regression from rc4: suspend/resume worked correctly on 7.0-rc4 and
>   earlier kernels on the same hardware.

That's interesting because there were no USB subsystem changes
between 7.0-rc4 and 7.0-rc5.

Any chance you could git-bisect this?
Are both kernels built with the same .config?

> Reproduce:
>   1. Boot with USB device attached to a port on the 0000:0f:00.0
>      controller
>   2. Suspend (systemd suspend)
>   3. Resume

By the way, are you using this affected controller to resume
(with a keyboard or something like that)?
 
> dmesg on resume:
>   xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>   xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>   xhci_hcd 0000:0f:00.0: HC died; cleaning up
>   xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
>   usb usb1: root hub lost power or was reset
>   usb usb2: root hub lost power or was reset
>   usb 1-7: WARN: invalid context state for evaluate context command.
>   usb 1-10: WARN: invalid context state for evaluate context command.
>   usb 7-1: USB disconnect, device number 2
> 
> Workaround:
>   PCI remove + rescan recovers the controller:
>     echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
>     echo 1 > /sys/bus/pci/rescan
> 
>   A simple PCI device reset (echo 1 > .../reset) was insufficient -- the
>   controller came back but did not re-enumerate the attached device.

What about the unbind/bind procedure described here?
https://bugzilla.kernel.org/show_bug.cgi?id=221073
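For reference, a generic driver unbind/bind cycle through sysfs might look like the sketch below. It is not copied from the bugzilla entry, whose exact steps may differ. The function name `xhci_rebind` and the `SYSFS_ROOT` override (handy for dry runs against a fake tree) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: unbind a PCI device from its current driver, then bind it again.
# SYSFS_ROOT is an illustrative override; on a real system it defaults to /sys.
xhci_rebind() {
    bdf="$1"
    root="${SYSFS_ROOT:-/sys}"
    # Resolve the driver directory from the device's "driver" symlink.
    drv=$(readlink -f "$root/bus/pci/devices/$bdf/driver") || return 1
    [ -d "$drv" ] || return 1
    echo "$bdf" > "$drv/unbind"
    echo "$bdf" > "$drv/bind"
}
# Usage (as root): xhci_rebind 0000:0f:00.0
```

Unlike remove + rescan, this keeps the PCI device itself in place and only re-runs the driver's probe path.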

> Notes:
>   - The 0f:00.0 controller is USB 2.0 only (USB3 root hub has no ports)
>   - hci version 0x120, hcc params 0x0110ffc5, quirks 0x0000000200000010

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-03-30  0:07 ` Michal Pecio
@ 2026-04-04 12:04   ` Martin Alderson
  2026-04-04 13:24     ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-04-04 12:04 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi,

Just for clarity this never happened to me with the 6.19 kernel I was
on before (suspend/resumed many times on that kernel with no issues).
It's happened twice now (once with rc5, now with rc6) in a short space
of time. It may just be random luck rather than a specific regression,
though; sorry if I confused things there.

Not sure I'm able to do a bisect: it's very intermittent, so each step
would take an age to reproduce, sorry.

Previously I was on the Fedora 43 default kernel series, now I
switched to the COPR for 7.x (to try and fix something else).

Thanks for the bugzilla, I'll look at some of those workarounds.


* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-04-04 12:04   ` Martin Alderson
@ 2026-04-04 13:24     ` Michal Pecio
  2026-05-09 14:51       ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-04-04 13:24 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb

On Sat, 4 Apr 2026 13:04:02 +0100, Martin Alderson wrote:
> Just for clarity this never happened to me with the 6.19 kernel I was
> on before (suspend/resumed many times on that kernel with no issues).
> It's happened twice now (once with rc5, now with rc6) in a short space
> of time.

So apparently about once per week. That's not very easy to debug.
One trick I have seen people use to accelerate such tests is running
"rtcwake -s 5 -m freeze" in a loop. This puts the system in s2idle and
resumes automatically after 5 seconds.
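A minimal sketch of such a loop, assuming it should stop as soon as the kernel log reports the controller dead. The function name `suspend_loop` and the `RTCWAKE` / `KLOG` overrides (for dry runs) are hypothetical; only the `rtcwake -s 5 -m freeze` invocation comes from the suggestion above.

```shell
#!/bin/sh
# Hypothetical helper: repeat short suspend cycles until the controller dies
# or the cycle limit is reached. RTCWAKE and KLOG are illustrative overrides.
suspend_loop() {
    max="$1"
    bdf="$2"
    mode="${3:-freeze}"     # "freeze" is s2idle, as suggested above
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        ${RTCWAKE:-rtcwake} -s 5 -m "$mode" || return 2
        # The check scans the whole kernel log, so start from a boot with no earlier death.
        if ${KLOG:-dmesg} | grep -qF "xhci_hcd $bdf: HC died"; then
            echo "died after $i cycles"
            return 1
        fi
    done
    echo "survived $max cycles"
}
# Usage (as root): suspend_loop 50 0000:0f:00.0
```

Passing a different third argument selects another `rtcwake` mode if s2idle turns out not to reproduce the failure.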

Do you have more complete dmesg from those failures, with timestamps?
From suspend until everything has calmed down after resume, ideally
also including whatever you did later to restore operation.

> Previously I was on the Fedora 43 default kernel series, now I
> switched to the COPR for 7.x (to try and fix something else).

Not sure what COPR is, but I gather it went like this:
1. Fedora 6.19 kernel was OK for a long time
2. Some other kernel, possibly other config, 7.0-rc4 still worked, but
   only used for a short time. What about 7.0-rc1 to -rc3? 
3. After updating to -rc5 it's definitely broken.

> Thanks for the bugzilla, I'll look at some of those workarounds.

Particularly, collecting dynamic debug and debugfs could tell if it's
the same problem with missing IRQ after resume or something else.
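As a rough sketch of what collecting that data could involve, assuming a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at the usual place: the helper names and the `DEBUGFS_ROOT` override are hypothetical, and the bugzilla entry may ask for more than this.

```shell
#!/bin/sh
# Hypothetical helpers for collecting xhci debug data; run as root.
# DEBUGFS_ROOT is an illustrative override; the usual mount point is /sys/kernel/debug.
xhci_dyndbg_on() {
    root="${DEBUGFS_ROOT:-/sys/kernel/debug}"
    # Enable the debug messages of the xhci_hcd module via dynamic debug.
    echo 'module xhci_hcd +p' > "$root/dynamic_debug/control"
}

xhci_debugfs_dump() {
    bdf="$1"
    dir="${DEBUGFS_ROOT:-/sys/kernel/debug}/usb/xhci/$bdf"
    [ -d "$dir" ] || { echo "no xhci debugfs directory for $bdf" >&2; return 1; }
    # Print every readable file as "path:content".
    grep -r . "$dir" 2>/dev/null
}
# Usage: xhci_dyndbg_on; then after a failure: xhci_debugfs_dump 0000:0f:00.0 > xhci-debugfs.txt
```

The dump is best taken after the failure and before any remove/rescan, since removing the device also removes its debugfs directory.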

Regards,
Michal

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-04-04 13:24     ` Michal Pecio
@ 2026-05-09 14:51       ` Martin Alderson
  2026-05-09 16:06         ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-05-09 14:51 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi, still experiencing this on 7.0.2. I tried to pull the logs
together to get to the bottom of this (I've tried a few different
kernels)

Kernel                          Suspends   xHCI 0f:00.0 deaths   Rate
------------------------------  --------   -------------------   -----
6.17.1-300.fc43 (March)             ~12             0             0%
6.18.16-200.fc43                     10             0             0%
6.19.7/8-200.fc43                     5             0             0%
7.0-rc4   (build 260320)             13             0             0%
7.0-rc5   (build 260328)              7             2            ~28%
7.0-rc6   (build 260401)             10             4             40%
7.0-rc7   (build 260409)              7             2            ~28%
7.0.0-261.vanilla.fc43                7             2            ~28%
6.17.1-300.fc43 (April, retry)       10             2             20%   <-- same bug, stable kernel
7.0.1-262.vanilla.fc43                7             2            ~28%
7.0.2-300.vanilla.fc44                6             4            ~66%
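
Counts like these can be pulled from kernel logs with something along the lines of the sketch below. It is not the method actually used for the table: the function name `suspend_death_stats` is hypothetical, and it assumes each completed cycle logs one "PM: suspend exit" line and each failure one "HC died" line for the controller.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print how many suspend
# cycles completed and how many times the given controller was declared dead.
suspend_death_stats() {
    bdf="$1"
    awk -v bdf="$bdf" '
        /PM: suspend exit/                       { suspends++ }
        index($0, "xhci_hcd " bdf ": HC died")   { deaths++ }
        END {
            # Round to the nearest whole percent; report 0 if no cycles were seen.
            rate = suspends ? int(100 * deaths / suspends + 0.5) : 0
            printf "suspends=%d deaths=%d rate=%d%%\n", suspends, deaths, rate
        }'
}
# Usage: journalctl -k -b --no-pager | suspend_death_stats 0000:0f:00.0
```

Running it once per boot (`-b -1`, `-b -2`, ...) and grouping by kernel version would give per-kernel rows.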

May 09 15:29:37 fedora kernel: Freezing user space processes completed (elapsed 0.001 seconds)
May 09 15:29:37 fedora kernel: OOM killer disabled.
May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks
May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
May 09 15:29:37 fedora kernel: printk: Suspending console(s) (use no_console_suspend to debug)
May 09 15:29:37 fedora kernel: sd 6:0:0:0: [sdb] Synchronizing SCSI cache
May 09 15:29:37 fedora kernel: serial 00:01: disabled
May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
May 09 15:29:37 fedora kernel: ata1.00: Entering standby power mode
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: HC died; cleaning up
May 09 15:29:37 fedora kernel: PM: suspend devices took 5.758 seconds
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: MODE1 reset
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: GPU mode1 reset
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: GPU smu mode1 reset
May 09 15:29:37 fedora kernel: ACPI: PM: Preparing to enter system sleep state S3
May 09 15:29:37 fedora kernel: ACPI: PM: Saving platform NVS memory
May 09 15:29:37 fedora kernel: Disabling non-boot CPUs ...
May 09 15:29:37 fedora kernel: smpboot: CPU 23 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 22 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 21 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 20 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 19 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 18 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 17 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 16 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 15 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 14 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 13 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 12 is now offline
May 09 15:29:37 fedora kernel: Spectre V2 : Update user space SMT mitigation: STIBP off
May 09 15:29:37 fedora kernel: smpboot: CPU 11 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 10 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 9 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 8 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 7 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 6 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 5 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 4 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 3 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 2 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 1 is now offline
May 09 15:29:37 fedora kernel: ACPI: PM: Low-level resume complete
May 09 15:29:37 fedora kernel: ACPI: PM: Restoring platform NVS memory
May 09 15:29:37 fedora kernel: AMD-Vi: Virtual APIC enabled
May 09 15:29:37 fedora kernel: AMD-Vi: Virtual APIC enabled
May 09 15:29:37 fedora kernel: LVT offset 0 assigned for vector 0x400
May 09 15:29:37 fedora kernel: Enabling non-boot CPUs ...
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
May 09 15:29:37 fedora kernel: CPU1 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
May 09 15:29:37 fedora kernel: CPU2 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 3 APIC 0x6
May 09 15:29:37 fedora kernel: CPU3 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 4 APIC 0x8
May 09 15:29:37 fedora kernel: CPU4 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 5 APIC 0xa
May 09 15:29:37 fedora kernel: CPU5 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 6 APIC 0x10
May 09 15:29:37 fedora kernel: CPU6 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 7 APIC 0x12
May 09 15:29:37 fedora kernel: CPU7 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 8 APIC 0x14
May 09 15:29:37 fedora kernel: CPU8 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 9 APIC 0x16
May 09 15:29:37 fedora kernel: CPU9 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 10 APIC 0x18
May 09 15:29:37 fedora kernel: CPU10 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 11 APIC 0x1a
May 09 15:29:37 fedora kernel: CPU11 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 12 APIC 0x1
May 09 15:29:37 fedora kernel: Spectre V2 : Update user space SMT mitigation: STIBP always-on
May 09 15:29:37 fedora kernel: CPU12 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 13 APIC 0x3
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#13, should never happen.
May 09 15:29:37 fedora kernel: CPU13 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 14 APIC 0x5
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#14, should never happen.
May 09 15:29:37 fedora kernel: CPU14 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 15 APIC 0x7
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#15, should never happen.
May 09 15:29:37 fedora kernel: CPU15 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 16 APIC 0x9
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#16, should never happen.
May 09 15:29:37 fedora kernel: CPU16 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 17 APIC 0xb
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#17, should never happen.
May 09 15:29:37 fedora kernel: CPU17 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 18 APIC 0x11
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#18, should never happen.
May 09 15:29:37 fedora kernel: CPU18 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 19 APIC 0x13
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#19, should never happen.
May 09 15:29:37 fedora kernel: CPU19 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 20 APIC 0x15
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#20, should never happen.
May 09 15:29:37 fedora kernel: CPU20 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 21 APIC 0x17
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#21, should never happen.
May 09 15:29:37 fedora kernel: CPU21 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 22 APIC 0x19
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#22, should never happen.
May 09 15:29:37 fedora kernel: CPU22 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 23 APIC 0x1b
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#23, should never happen.
May 09 15:29:37 fedora kernel: CPU23 is up
May 09 15:29:37 fedora kernel: ACPI: PM: Waking up from system sleep state S3
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: [drm] PCIE GART of 512M enabled (table at 0x00000083DAB00000).
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: PSP is resuming...
May 09 15:29:37 fedora kernel: nvme nvme0: D3 entry latency set to 10 seconds
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
May 09 15:29:37 fedora kernel: usb usb1: root hub lost power or was reset
May 09 15:29:37 fedora kernel: usb usb2: root hub lost power or was reset
May 09 15:29:37 fedora kernel: serial 00:01: activated
May 09 15:29:37 fedora kernel: nvme nvme0: 24/0/0 default/read/poll queues
May 09 15:29:37 fedora kernel: usb 5-2: reset full-speed USB device number 2 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-7: WARN: invalid context state for evaluate context command.
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SMU is resuming...
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: smu driver if version = 0x0000002e, smu fw if version = 0x00000033, smu fw program = 0, smu fw version = 0x00684c00 (104.76.0)
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SMU is resumed successfully!
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: program CP_MES_CNTL : 0x4000000
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: program CP_MES_CNTL : 0xc000000
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: [drm] DMUB hardware initialized: version=0x0A003500
May 09 15:29:37 fedora kernel: ata4: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: ata3: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: usb 1-7: reset full-speed USB device number 3 using xhci_hcd
May 09 15:29:37 fedora kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Starting disk
May 09 15:29:37 fedora kernel: ata1.00: configured for UDMA/133
May 09 15:29:37 fedora kernel: ahci 0000:0d:00.0: port does not support device sleep
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 9 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 10 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 1 on hub 8
May 09 15:29:37 fedora kernel: usb 1-10: WARN: invalid context state for evaluate context command.
May 09 15:29:37 fedora kernel: usb 1-10: reset full-speed USB device number 6 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-1.4: reset high-speed USB device number 4 using xhci_hcd
May 09 15:29:37 fedora kernel: PM: resume devices took 2.046 seconds
May 09 15:29:37 fedora kernel: OOM killer enabled.
May 09 15:29:37 fedora kernel: Restarting tasks: Starting
May 09 15:29:37 fedora kernel: usb 7-1: USB disconnect, device number 2
May 09 15:29:37 fedora kernel: Restarting tasks: Done
May 09 15:29:37 fedora kernel: efivarfs: resyncing variable state
May 09 15:29:37 fedora kernel: efivarfs: finished resyncing variable state
May 09 15:29:37 fedora kernel: random: crng reseeded on system resumption
May 09 15:29:37 fedora kernel: PM: suspend exit
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: examining hci_ver=0a hci_rev=000b lmp_ver=0a lmp_subver=8761
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: rom_version status=0 version=1
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: btrtl_initialize: key id 0
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: loading rtl_bt/rtl8761bu_fw.bin
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: loading rtl_bt/rtl8761bu_config.bin
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: cfg_sz 6, total sz 30210
May 09 15:29:37 fedora kernel: Realtek Internal NBASE-T PHY r8169-0-a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-a00:00, irq=MAC)
May 09 15:29:37 fedora kernel: r8169 0000:0a:00.0 eno1: Link is Down
May 09 15:29:38 fedora kernel: Bluetooth: hci0: RTL: fw version 0xdfc6d922
May 09 15:29:38 fedora kernel: Bluetooth: MGMT ver 1.23
May 09 15:29:41 fedora kernel: r8169 0000:0a:00.0 eno1: Link is Up - 2.5Gbps/Full - flow control rx/tx
May 09 15:30:00 fedora kernel: input: soundcore Space One (AVRCP) as /devices/virtual/input/input32
May 09 15:31:40 fedora kernel: xhci_hcd 0000:0f:00.0: remove, state 1
May 09 15:31:40 fedora kernel: usb usb7: USB disconnect, device number 1
May 09 15:31:40 fedora kernel: xhci_hcd 0000:0f:00.0: USB bus 7 deregistered
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: [1022:15b8] type 00 class 0x0c0330 PCIe Endpoint
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: BAR 0 [mem 0xf6e00000-0xf6efffff 64bit]
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: PME# supported from D0 D3hot D3cold
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: Adding to iommu group 30
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: BAR 0 [mem 0xf6e00000-0xf6efffff 64bit]: assigned
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI Host Controller
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: new USB bus registered, assigned bus number 7
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: USB3 root hub has no ports
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000200000010
May 09 15:31:41 fedora kernel: usb usb7: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 7.00
May 09 15:31:41 fedora kernel: usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
May 09 15:31:41 fedora kernel: usb usb7: Product: xHCI Host Controller
May 09 15:31:41 fedora kernel: usb usb7: Manufacturer: Linux 7.0.2-300.vanilla.fc44.x86_64 xhci-hcd
May 09 15:31:41 fedora kernel: usb usb7: SerialNumber: 0000:0f:00.0
May 09 15:31:41 fedora kernel: hub 7-0:1.0: USB hub found
May 09 15:31:41 fedora kernel: hub 7-0:1.0: 1 port detected
May 09 15:31:41 fedora kernel: usb 7-1: new full-speed USB device number 2 using xhci_hcd
May 09 15:31:42 fedora kernel: usb 7-1: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.11
May 09 15:31:42 fedora kernel: usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 09 15:31:42 fedora kernel: usb 7-1: Product: USB Receiver
May 09 15:31:42 fedora kernel: usb 7-1: Manufacturer: Logitech
May 09 15:31:42 fedora kernel: logitech-djreceiver 0003:046D:C52B.000C: hiddev96,hidraw1: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:0f:00.0-1/input2
May 09 15:31:42 fedora kernel: input: Logitech M720 Triathlon as /devices/pci0000:00/0000:00:08.3/0000:0f:00.0/usb7/7-1/7-1:1.2/0003:046D:C52B.000C/0003:046D:405E.000D/input/input33
May 09 15:31:42 fedora kernel: logitech-hidpp-device 0003:046D:405E.000D: input,hidraw3: USB HID v1.11 Keyboard [Logitech M720 Triathlon] on usb-0000:0f:00.0-1/input2:1
May 09 15:31:42 fedora kernel: logitech-hidpp-device 0003:046D:405E.000D: HID++ 4.5 device connected.

At 15:31 you can see the result of me running
  echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
  echo 1 > /sys/bus/pci/rescan
This restores the controller 100% of the time, and I have now wired it
up to a systemd unit. Let me know if I can provide anything else that
would help.
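
The systemd wiring itself is not shown in this thread. One hypothetical way to automate the remove + rescan is a system-sleep hook, which systemd-sleep runs from /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument. The sketch below is an assumption throughout: the `sleep_hook` name, the one-second settle delay, and the `SYSFS_ROOT` / `SETTLE_SECS` overrides are illustrative, and it reprobes on every resume rather than only after a failure.

```shell
#!/bin/sh
# Hypothetical system-sleep hook: after resume, remove and rescan the controller.
# systemd-sleep calls hooks with $1 = pre|post and $2 = the sleep action.
BDF="${BDF:-0000:0f:00.0}"            # controller from this report
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"      # illustrative override for dry runs

sleep_hook() {
    # Only act after resume.
    [ "${1:-}" = "post" ] || return 0
    dev="$SYSFS_ROOT/bus/pci/devices/$BDF"
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove"
        sleep "${SETTLE_SECS:-1}"     # assumed settle time, not measured
    fi
    # Rescan even if the device was already gone, so it gets rediscovered.
    echo 1 > "$SYSFS_ROOT/bus/pci/rescan"
}

sleep_hook "$@"
```

Reprobing unconditionally briefly disconnects the receiver even after a clean cycle; a check for the "HC died" message could be added to avoid that.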

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-09 14:51       ` Martin Alderson
@ 2026-05-09 16:06         ` Michal Pecio
  2026-05-10 16:29           ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-05-09 16:06 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb

On Sat, 9 May 2026 15:51:03 +0100, Martin Alderson wrote:
> Hi, still experiencing this on 7.0.2. I tried to pull the logs
> together to get to the bottom of this (I've tried a few different
> kernels)
> 
> Kernel                          Suspends   xHCI 0f:00.0 deaths   Rate
> ------------------------------  --------   -------------------   -----
> 6.17.1-300.fc43 (March)             ~12             0             0%
> 6.18.16-200.fc43                     10             0             0%
> 6.19.7/8-200.fc43                     5             0             0%
> 7.0-rc4   (build 260320)             13             0             0%
> 7.0-rc5   (build 260328)              7             2            ~28%
> 7.0-rc6   (build 260401)             10             4             40%
> 7.0-rc7   (build 260409)              7             2            ~28%
> 7.0.0-261.vanilla.fc43                7             2            ~28%
> 6.17.1-300.fc43 (April, retry)       10             2             20%   <-- same bug, stable kernel

Looks like it's not a regression then, but not sure what else may have
caused it.

Any new USB device that wasn't connected before?
Perhaps a BIOS upgrade?

> 7.0.1-262.vanilla.fc43                7             2            ~28%
> 7.0.2-300.vanilla.fc44                6             4            ~66%
> 
> 
> May 09 15:29:37 fedora kernel: Freezing user space processes completed (elapsed 0.001 seconds)
> May 09 15:29:37 fedora kernel: OOM killer disabled.
> May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks
> May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> May 09 15:29:37 fedora kernel: printk: Suspending console(s) (use no_console_suspend to debug)
> May 09 15:29:37 fedora kernel: sd 6:0:0:0: [sdb] Synchronizing SCSI cache
> May 09 15:29:37 fedora kernel: serial 00:01: disabled
> May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
> May 09 15:29:37 fedora kernel: ata1.00: Entering standby power mode
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: HC died; cleaning up
> May 09 15:29:37 fedora kernel: PM: suspend devices took 5.758 seconds

That's not resume, it's during suspend. Are other logs also like that?

Regards,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-09 16:06         ` Michal Pecio
@ 2026-05-10 16:29           ` Martin Alderson
  2026-05-12 10:03             ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-05-10 16:29 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi,

Two answers, plus a hypothesis:

1. The timing is during suspend in every single failure I have logs for.
I went back through 7 weeks of persistent journals and pulled the
context around every "HC died" event. All 9 failures show the same
sequence:

  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
  xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
  xhci_hcd 0000:0f:00.0: HC died; cleaning up
  PM: suspend devices took 5.5--6.1 seconds      <-- elevated
  amdgpu 0000:03:00.0: MODE1 reset
  ACPI: PM: Preparing to enter system sleep state S3

So it's reliably during suspend, before S3 entry, and the elevated
"suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
clean suspend on the same boot takes ~0.46s.

2. No BIOS upgrade. ASUS PRIME B650-PLUS BIOS version 3263 dated
2025-06-09 across every boot from 2026-03-02 to 2026-05-08 (42 boots).

3. Re "any new USB device": yes, and it correlates exactly. A 4-port
USB hub appeared on bus 1 (controller 0c:00.0, AMD 600 Series USB 3.2)
on 2026-03-16, with a USB mass-storage device behind it on port 4. It's
the hub built into a new monitor I added around then. Per-boot
presence:

  2026-03-02 to 2026-03-16: NO hub, NO flash drive, ~12 6.17.1
                            suspends, 0 failures
  2026-03-16+:              hub + flash drive present
  2026-03-22 to 2026-03-28: 7.0-rc4 with hub present, 13 suspends,
                            0 failures
  2026-03-28+:              7.0-rc5+, failures begin
  2026-04-18 to 2026-04-25: 6.17.1 (retry, with hub still present),
                            10 suspends, 2 failures -- same kernel
                            that was clean in March

The hub is on a different xHCI from the one that dies (0c:00.0 vs
0f:00.0), but they're sibling controllers on the same AMD SoC, so
shared power/ACPI domains seem plausible.

Even with the hub identified as the trigger, I think there are still
two kernel-side issues worth flagging:

(a) Recovery: when the stop-endpoint timeout hits, 0f:00.0 is marked
"HC died" and never comes back without a manual PCI remove+rescan.
The other controller on the same machine recovers itself on resume:

  xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit

There doesn't seem to be an equivalent recovery path for the
suspend-time stop-endpoint timeout on 0f:00.0.
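
For anyone decoding the status values quoted in this thread, here is a
minimal sketch of a USBSTS decoder. The bit positions are assumed from
the USBSTS register layout in the xHCI specification, and the function
name decode_usbsts is made up for illustration; it is not part of any
existing tool.

```shell
#!/bin/sh
# decode_usbsts VALUE: print the names of the USBSTS bits set in VALUE.
# Hypothetical helper; bit positions assumed from the xHCI USBSTS layout.
decode_usbsts() {
    usbsts_val=$(( $1 ))
    usbsts_out=""
    # bit:name pairs; reserved bits are not listed
    for usbsts_pair in 0:HCH 2:HSE 3:EINT 4:PCD 8:SSS 9:RSS 10:SRE 11:CNR 12:HCE; do
        usbsts_bit=${usbsts_pair%%:*}
        usbsts_name=${usbsts_pair#*:}
        if [ $(( (usbsts_val >> usbsts_bit) & 1 )) -eq 1 ]; then
            usbsts_out="$usbsts_out $usbsts_name"
        fi
    done
    usbsts_out=${usbsts_out# }
    echo "${usbsts_out:-none}"
}

decode_usbsts 0x401   # prints "HCH SRE"
```

On this reading, 0x401 is HCHalted plus Save/Restore Error, which would
fit a controller whose saved state could not be restored and which the
driver then reinitializes.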

Regards,
Martin



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-10 16:29           ` Martin Alderson
@ 2026-05-12 10:03             ` Michal Pecio
  2026-05-12 14:01               ` Mathias Nyman
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-05-12 10:03 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb, Mathias Nyman

On Sun, 10 May 2026 17:29:26 +0100, Martin Alderson wrote:
> 1. The timing is during suspend in every single failure I have logs
> for. I went back through 7 weeks of persistent journals and pulled the
> context around every "HC died" event. All 9 failures show the same
> sequence:
> 
>   xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>   xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>   xhci_hcd 0000:0f:00.0: HC died; cleaning up
>   PM: suspend devices took 5.5--6.1 seconds      <-- elevated
>   amdgpu 0000:03:00.0: MODE1 reset
>   ACPI: PM: Preparing to enter system sleep state S3
> 
> So it's reliably during suspend, before S3 entry, and the elevated
> "suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
> clean suspend on the same boot takes ~0.46s.

The S3 state probably doesn't matter, chances are that it would also
happen with s2idle or hibernation.

Could you enable dynamic debug before every suspend (or permanently
on every boot) and collect a dmesg log of this happening again?
And maybe also a snapshot of the debugfs directory after resume but
before unbinding xhci_hcd. These may contain clues about what triggered it.

echo 'module xhci_hcd +p' >/proc/dynamic_debug/control
zip -r debugfs.zip /sys/kernel/debug/usb/xhci/0000:0f:00.0
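
In case it helps to script the collection step, here is a rough sketch.
It reads each file with cat until EOF instead of trusting the reported
file size, since debugfs files commonly report a size of 0. The function
name snapshot_tree and the output directory are placeholders, and the
commented commands at the end need root.

```shell
#!/bin/sh
# snapshot_tree SRC DEST: copy every regular file under SRC into DEST,
# reading each file with cat until EOF. Hypothetical helper name.
snapshot_tree() {
    snap_src=$1
    snap_dest=$2
    [ -d "$snap_src" ] || { echo "no such directory: $snap_src" >&2; return 1; }
    ( cd "$snap_src" && find . -type f ) | while IFS= read -r snap_rel; do
        mkdir -p "$snap_dest/$(dirname "$snap_rel")"
        # Unreadable files end up empty rather than aborting the snapshot.
        cat "$snap_src/$snap_rel" > "$snap_dest/$snap_rel" 2>/dev/null || true
    done
}

# Intended use, as root (not run here):
#   echo 'module xhci_hcd +p' > /proc/dynamic_debug/control
#   snapshot_tree /sys/kernel/debug/usb/xhci/0000:0f:00.0 ./xhci-0f00-debugfs
```

The resulting directory can then be archived with any tool, since it
contains only ordinary files.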

> 2. No BIOS upgrade. ASUS PRIME B650-PLUS BIOS version 3263 dated
> 2025-06-09 across every boot from 2026-03-02 to 2026-05-08 (42 boots).
> 
> 3. Re "any new USB device": yes, and it correlates exactly. A 4-port
> USB hub appeared on bus 1 (controller 0c:00.0, AMD 600 Series USB 3.2)
> on 2026-03-16, with a USB mass-storage device behind it on port 4. It's
> the hub built into a new monitor I added around then. Per-boot
> presence:
> 
>   2026-03-02 to 2026-03-16: NO hub, NO flash drive, ~12 6.17.1
>                             suspends, 0 failures
>   2026-03-16+:              hub + flash drive present
>   2026-03-22 to 2026-03-28: 7.0-rc4 with hub present, 13 suspends,
>                             0 failures
>   2026-03-28+:              7.0-rc5+, failures begin
>   2026-04-18 to 2026-04-25: 6.17.1 (retry, with hub still present),
>                             10 suspends, 2 failures -- same kernel
>                             that was clean in March
> 
> The hub is on a different xHCI from the one that dies (0c:00.0 vs
> 0f:00.0), but they're sibling controllers on the same AMD SoC, so
> shared power/ACPI domains seem plausible.

That's over a week from the first connection to the first failure.
And these are separate chips of different types: the 600 series chipset
is not part of the CPU, while the broken controller is on the CPU I/O
die AFAIK.

It may be a coincidence. I would sooner suspect changes to devices on
the affected bus.

Regards,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-12 10:03             ` Michal Pecio
@ 2026-05-12 14:01               ` Mathias Nyman
  2026-05-28 11:52                 ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Mathias Nyman @ 2026-05-12 14:01 UTC (permalink / raw)
  To: Michal Pecio, Martin Alderson; +Cc: linux-usb

On 5/12/26 13:03, Michal Pecio wrote:
> On Sun, 10 May 2026 17:29:26 +0100, Martin Alderson wrote:
>> 1. The timing is during suspend in every single failure I have logs
>> for. I went back through 7 weeks of persistent journals and pulled the
>> context around every "HC died" event. All 9 failures show the same
>> sequence:
>>
>>    xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>>    xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>>    xhci_hcd 0000:0f:00.0: HC died; cleaning up
>>    PM: suspend devices took 5.5--6.1 seconds      <-- elevated
>>    amdgpu 0000:03:00.0: MODE1 reset
>>    ACPI: PM: Preparing to enter system sleep state S3
>>
>> So it's reliably during suspend, before S3 entry, and the elevated
>> "suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
>> clean suspend on the same boot takes ~0.46s.
> 
> The S3 state probably doesn't matter, chances are that it would also
> happen with s2idle or hibernation.
> 
> Could you enable dynamic debug before every suspend (or permanently
> on every boot) and collect a dmesg log of this happening again?
> And maybe also a snapshot of debugfs directory after resume but before
> unbinding xhci_hcd. These may contain clues what triggered it.

It's possible there is a race between queuing a command and suspend.
It looks like nothing prevents a new command from being queued while
suspend stops the host from running, which would cause commands to time out.

Suspend doesn't check whether there are pending commands, or whether the
command timer is running, either.

I wrote some debugging code; it can be found in my debug_hc_died_cmdring_race branch:
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git debug_hc_died_cmdring_race
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=debug_hc_died_cmdring_race

If it prints
   "Can't queue command, xHC not accessible (stopped?)"
or
   "Suspending and stopping xHC with pending command(s)!!!"
then we have a queue_command vs. suspend race.

Code below for reference
Mathias

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index e47e644b296e..50ce4a4a7fe3 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -4353,6 +4353,7 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
  			 u32 field3, u32 field4, bool command_must_succeed)
  {
  	int reserved_trbs = xhci->cmd_ring_reserved_trbs;
+	struct usb_hcd *hcd = xhci_to_hcd(xhci);
  	int ret;
  
  	if ((xhci->xhc_state & XHCI_STATE_DYING) ||
@@ -4362,6 +4363,14 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
  		return -ESHUTDOWN;
  	}
  
+	if (!HCD_HW_ACCESSIBLE(hcd)) {
+		xhci_err(xhci, "Can't queue command, xHC not accessible (stopped?)\n");
+		xhci_err(xhci, "called by %pS from %pS\n",
+			 __builtin_return_address(0),
+			 __builtin_return_address(1));
+		return -ESHUTDOWN;
+	}
+
  	if (!command_must_succeed)
  		reserved_trbs++;
  
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index a54f5b57f205..04279fbbe1dd 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -949,6 +949,34 @@ static bool xhci_pending_portevent(struct xhci_hcd *xhci)
  	return false;
  }
  
+static void xhci_dump_ring(struct xhci_hcd *xhci, struct xhci_ring *ring)
+{
+	struct xhci_segment	*seg;
+	union xhci_trb		*trb;
+	dma_addr_t		dma;
+	char			str[XHCI_MSG_MAX];
+	int			i, j;
+
+	seg = ring->first_seg;
+	dma = xhci_trb_virt_to_dma(ring->deq_seg, ring->dequeue);
+
+	xhci_err(xhci, "Dequeue: %pad\n", &dma);
+
+	for (i = 0; i < ring->num_segs; i++) {
+		for (j = 0; j < TRBS_PER_SEGMENT; j++) {
+			trb = &seg->trbs[j];
+			dma = seg->dma + j * sizeof(*trb);
+			xhci_err(xhci, "%pad: %s\n", &dma,
+				 xhci_decode_trb(str, XHCI_MSG_MAX,
+						 le32_to_cpu(trb->generic.field[0]),
+						 le32_to_cpu(trb->generic.field[1]),
+						 le32_to_cpu(trb->generic.field[2]),
+						 le32_to_cpu(trb->generic.field[3])));
+		}
+		seg = seg->next;
+	}
+}
+
  /*
   * Stop HC (not bus-specific)
   *
@@ -999,6 +1027,12 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
  	/* step 1: stop endpoint */
  	/* skipped assuming that port suspend has done */
  
+	/* Check if command ring is empty */
+	if (!list_empty(&xhci->cmd_list)) {
+		xhci_err(xhci, "Suspending and stopping xHC with pending command(s)!!!\n");
+		xhci_dump_ring(xhci, xhci->cmd_ring);
+	}
+
  	/* step 2: clear Run/Stop bit */
  	command = readl(&xhci->op_regs->command);
  	command &= ~CMD_RUN;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-12 14:01               ` Mathias Nyman
@ 2026-05-28 11:52                 ` Martin Alderson
  2026-05-28 22:10                   ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-05-28 11:52 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: Michal Pecio, linux-usb

Caught a fresh failure on kernel 7.0.9-205.fc44.x86_64 with xhci_hcd.dyndbg=+p:

Timeline (single suspend cycle):

  11:09:45  xhci_suspend: stopping usb1/3/5 port polling
  11:09:50  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
  11:09:50  xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
  11:09:50  xhci_hcd 0000:0f:00.0: HC died; cleaning up
  11:09:52  xhci_hcd 0000:0e:00.3: port resume event for port 1 (keyboard wake)

5-second gap between suspend start and HC died - the stop endpoint
timeout you predicted.

Command ring state at death:

  0xffffe070: Stop Ring Command: slot 1 sp 1 ep 1 flags C   (completed)
  0xffffe080: Stop Ring Command: slot 1 sp 0 ep 1 flags C   <- dequeue (stuck)
  0xffffe090: empty                                          <- enqueue

  enqueue - dequeue = 1 TRB pending
  USBCMD = 0x0   USBSTS = 0x1 (HCHalted)
  port01 portsc = 0x663  Link=U3 CCS PP PED
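
As a cross-check of the portsc decoding above, here is a small sketch
that pulls a few fields out of a raw PORTSC value. The field positions
are assumed from the xHCI specification's PORTSC layout, only a handful
of bits are covered, and decode_portsc is a made-up helper name.

```shell
#!/bin/sh
# decode_portsc VALUE: summarize a few PORTSC fields.
# Hypothetical helper; field positions assumed from the xHCI PORTSC layout.
decode_portsc() {
    portsc_val=$(( $1 ))
    portsc_pls=$(( (portsc_val >> 5) & 0xf ))     # Port Link State, bits 8:5
    portsc_speed=$(( (portsc_val >> 10) & 0xf ))  # Port Speed ID, bits 13:10
    case $portsc_pls in
        0) portsc_link=U0 ;;
        1) portsc_link=U1 ;;
        2) portsc_link=U2 ;;
        3) portsc_link=U3 ;;
        *) portsc_link="PLS$portsc_pls" ;;        # other states left numeric
    esac
    portsc_flags=""
    for portsc_pair in 0:CCS 1:PED 9:PP 16:LWS; do
        portsc_bit=${portsc_pair%%:*}
        portsc_name=${portsc_pair#*:}
        if [ $(( (portsc_val >> portsc_bit) & 1 )) -eq 1 ]; then
            portsc_flags="$portsc_flags $portsc_name"
        fi
    done
    echo "Link=$portsc_link Speed=$portsc_speed$portsc_flags"
}

decode_portsc 0x663   # prints "Link=U3 Speed=1 CCS PED PP"
```

A speed ID of 1 normally means Full Speed, so the port looks like a
connected, enabled, powered full-speed port whose link is in U3.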


I have the full snapshot: command-ring/event-ring TRB dumps, all xhci
debugfs registers, dmesg with full dyndbg trace, and the kernel
journal for the cycle. I'm not sure of the best way to send it, though,
if you need it.

 Martin



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-28 11:52                 ` Martin Alderson
@ 2026-05-28 22:10                   ` Michal Pecio
  2026-05-28 23:06                     ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-05-28 22:10 UTC (permalink / raw)
  To: Martin Alderson; +Cc: Mathias Nyman, linux-usb

On Thu, 28 May 2026 12:52:16 +0100, Martin Alderson wrote:
> Caught a fresh failure on kernel 7.0.9-205.fc44.x86_64 with xhci_hcd.dyndbg=+p:
> 
> Timeline (single suspend cycle):
> 
>   11:09:45  xhci_suspend: stopping usb1/3/5 port polling
>   11:09:50  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop
> endpoint command
>   11:09:50  xhci_hcd 0000:0f:00.0: xHCI host controller not
> responding, assume dead
>   11:09:50  xhci_hcd 0000:0f:00.0: HC died; cleaning up
>   11:09:52  xhci_hcd 0000:0e:00.3: port resume event for port 1
> (keyboard wake)
> 
> 5-second gap between suspend start and HC died - the stop endpoint
> timeout you predicted.
> 
> Command ring state at death:
> 
>   0xffffe070: Stop Ring Command: slot 1 sp 1 ep 1 flags C   (completed)
>   0xffffe080: Stop Ring Command: slot 1 sp 0 ep 1 flags C   <- dequeue (stuck)

That's odd, I wouldn't expect further Stop EP commands for the same
endpoint after one with the SP flag. Not until the USB device resumes.
Mathias may have guessed right that there is some unexpected activity
concurrently with suspend.

>   0xffffe090: empty                                          <- enqueue
> 
>   enqueue - dequeue = 1 TRB pending
>   USBCMD = 0x0   USBSTS = 0x1 (HCHalted)
>   port01 portsc = 0x663  Link=U3 CCS PP PED
> 
> 
> I have the full snapshot: command-ring/event-ring TRB dumps, all xhci
> debugfs registers, dmesg with full dyndbg trace, and the kernel
> journal for the cycle. I'm not sure the best way to send it though if
> you need it?

I suppose the debugfs zip isn't very large so an attachment would be
fine, plus another attachment with dmesg from the beginning of suspend
attempt, or complete dmesg if it doesn't exceed a few MB. Preferably
with sub-second resolution timestamps, if you have them.

Regards,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-28 22:10                   ` Michal Pecio
@ 2026-05-28 23:06                     ` Martin Alderson
  2026-05-29 10:22                       ` Michal Pecio
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-05-28 23:06 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Mathias Nyman, linux-usb

[-- Attachment #1: Type: text/plain, Size: 2077 bytes --]

Hi, please see this attachment. Thanks for all your help!


[-- Attachment #2: xhci-0f00-snapshot-20260528T100954Z.tar.gz --]
[-- Type: application/gzip, Size: 26196 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-28 23:06                     ` Martin Alderson
@ 2026-05-29 10:22                       ` Michal Pecio
  2026-05-29 12:04                         ` Martin Alderson
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Pecio @ 2026-05-29 10:22 UTC (permalink / raw)
  To: Martin Alderson; +Cc: Mathias Nyman, linux-usb

On Fri, 29 May 2026 00:06:33 +0100, Martin Alderson wrote:
> Hi, please see this attachment. Thanks for all your help!

Let's go through it.

grep xhci_suspend 20260528T100954Z/dmesg.txt 
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0e:00.4: xhci_suspend: stopping usb5 port polling.
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0c:00.0: xhci_suspend: stopping usb1 port polling.
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0e:00.3: xhci_suspend: stopping usb3 port polling.

Several HCs are suspending, but not 0000:0f:00.0. It seems that the
kernel is aware that "something" is still going on with its child USB
devices and it defers suspend until "something" finishes.
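
The comparison above can be scripted for other captures. This is only a
sketch: it assumes dmesg lines of the form "xhci_hcd <pci-address>: ..."
as in the excerpts in this thread, and the function names are invented
for illustration.

```shell
#!/bin/sh
# suspended_hcs FILE: PCI addresses of controllers that logged xhci_suspend.
suspended_hcs() {
    sed -n 's/.*xhci_hcd \([0-9a-f:.]*\): xhci_suspend:.*/\1/p' "$1" | sort -u
}

# not_suspended_hcs FILE: controllers that logged something in FILE
# but never logged an xhci_suspend line.
not_suspended_hcs() {
    sed -n 's/.*xhci_hcd \([0-9a-f:.]*\): .*/\1/p' "$1" | sort -u |
    while IFS= read -r hc_addr; do
        grep -qF "xhci_hcd $hc_addr: xhci_suspend:" "$1" || echo "$hc_addr"
    done
}

# Example (file name as used in this thread's snapshot):
#   not_suspended_hcs 20260528T100954Z/dmesg.txt
```

A controller listed by not_suspended_hcs was active in the log but never
reached xhci_suspend, which is the situation described for 0000:0f:00.0.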

grep 0000:0f:00.0 20260528T100954Z/dmesg.txt 
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Cancel URB 00000000f24bbb02, dev 1, ep 0x83, starting at offset 0xfffe5c70
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on Transfer TRB for slot 1 ep 6
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Removing canceled TD starting at 0xfffe5c70 (dma) in stream 0 URB 00000000f24bbb02
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Set TR Deq ptr 0xfffe5c80, cycle 0
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_giveback_invalidated_tds: Keep cancelled URB 00000000f24bbb02 TD as cancel_status is 2
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Successful Set TR Deq Ptr cmd, deq = @fffe5c80
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_handle_cmd_set_deq: Giveback cancelled URB 00000000f24bbb02 TD
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Giveback URB 00000000f24bbb02, len = 0, expected = 32, status = -115
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_handle_cmd_set_deq: All TDs cleared, ring doorbell
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 4
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 2
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 0
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Set port 7-1 link state, portsc: 0x603, write 0x10661

Some device under 0000:0f:00.0 is suspended after having a 32 byte URB
unlinked from EP 3 IN and a few other (idle?) endpoints stopped.
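
For readers matching "ep 0x83" in the Cancel URB line against "ep 6" in
the Stopped line: as far as I understand the driver, the second number
is a zero-based endpoint index derived from the endpoint address. A
small sketch of that mapping, with the simplifying assumption that
endpoint number 0 is the only control endpoint (ep_index is a made-up
name for this example):

```shell
#!/bin/sh
# ep_index ADDR: map a USB endpoint address (bEndpointAddress) to a
# zero-based endpoint index: 0 for the control endpoint, otherwise
# number * 2 + (1 if IN, 0 if OUT) - 1.
# Simplification: endpoint number 0 is treated as the control endpoint.
ep_index() {
    epi_addr=$(( $1 ))
    epi_num=$(( epi_addr & 0x0f ))       # endpoint number, bits 3:0
    epi_in=$(( (epi_addr >> 7) & 1 ))    # direction bit, 1 = IN
    if [ "$epi_num" -eq 0 ]; then
        echo 0
    else
        echo $(( epi_num * 2 + epi_in - 1 ))
    fi
}

for epi_a in 0x83 0x82 0x81 0x00; do
    echo "$epi_a -> ep $(ep_index "$epi_a")"
done
```

Under that assumption, 0x83 maps to ep 6, and ep 4, ep 2 and ep 0 in the
log would be IN endpoints 2 and 1 and the control endpoint.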

[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Cancel URB 00000000e74c9e14, dev 1, ep 0x0, starting at offset 0xffff4060
[Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
[Thu May 28 11:09:50 2026] xhci_hcd 0000:0f:00.0: Command timeout, USBSTS: 0x00000000
[Thu May 28 11:09:50 2026] xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command

Then some URB is unlinked from the same device's control endpoint with
no indication that the port has been resumed. Not 100% sure whether the
URB was submitted before or after device suspend. Either way, software
shouldn't do such things and your HW doesn't handle it gracefully.

It seems we will need to figure out how the offending URB gets there.
Can we identify the problematic device? Please post the output of:

lsusb -v |sed '/0000:0f:00.0/,/root hub/ !d'

Regards,
Michal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-29 10:22                       ` Michal Pecio
@ 2026-05-29 12:04                         ` Martin Alderson
       [not found]                           ` <20260530005742.25893efa.michal.pecio@gmail.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Martin Alderson @ 2026-05-29 12:04 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Mathias Nyman, linux-usb

Hi, see below:

  iSerial                 1 0000:0f:00.0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0019
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xe0
      Self Powered
      Remote Wakeup
    MaxPower                0mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         9 Hub
      bInterfaceSubClass      0 [unknown]
      bInterfaceProtocol      0 Full speed (or root) hub
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0004  1x 4 bytes
        bInterval              12
Hub Descriptor:
  bLength               9
  bDescriptorType      41
  nNbrPorts             1
  wHubCharacteristic 0x000a
    No power switching (usb 1.0)
    Per-port overcurrent protection
    TT think time 8 FS bits
  bPwrOn2PwrGood       10 * 2 milli seconds
  bHubContrCurrent      0 milli Ampere
  DeviceRemovable    0x00
  PortPwrCtrlMask    0xff
 Hub Port Status:
   Port 1: 0000.0103 power enable connect
Device Status:     0x0001
  Self Powered

Bus 007 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
Negotiated speed: Full Speed (12Mbps)
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 [unknown]
  bDeviceSubClass         0 [unknown]
  bDeviceProtocol         0
  bMaxPacketSize0         8
  idVendor           0x046d Logitech, Inc.
  idProduct          0xc52b Unifying Receiver
  bcdDevice           12.11
  iManufacturer           1 Logitech
  iProduct                2 USB Receiver
  iSerial                 0
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength       0x0054
    bNumInterfaces          3
    bConfigurationValue     1
    iConfiguration          4 RQR12.11_B0032
    bmAttributes         0xa0
      (Bus Powered)
      Remote Wakeup
    MaxPower               98mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Device
      bInterfaceSubClass      1 Boot Interface Subclass
      bInterfaceProtocol      1 Keyboard
      iInterface              0
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.11
          bCountryCode            0 Not supported
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      59
          Report Descriptors:
            ** UNAVAILABLE **
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0008  1x 8 bytes
        bInterval               8
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        1
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Device
      bInterfaceSubClass      1 Boot Interface Subclass
      bInterfaceProtocol      2 Mouse
      iInterface              0
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.11
          bCountryCode            0 Not supported
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength     148
          Report Descriptors:
            ** UNAVAILABLE **
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x82  EP 2 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0008  1x 8 bytes
        bInterval               2
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        2
      bAlternateSetting       0
      bNumEndpoints           1
      bInterfaceClass         3 Human Interface Device
      bInterfaceSubClass      0 [unknown]
      bInterfaceProtocol      0
      iInterface              0
        HID Device Descriptor:
          bLength                 9
          bDescriptorType        33
          bcdHID               1.11
          bCountryCode            0 Not supported
          bNumDescriptors         1
          bDescriptorType        34 Report
          wDescriptorLength      93
          Report Descriptors:
            ** UNAVAILABLE **
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x83  EP 3 IN
        bmAttributes            3
          Transfer Type            Interrupt
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0020  1x 32 bytes
        bInterval               2
Device Status:     0x0000
  (Bus Powered)

On Fri, May 29, 2026 at 11:22 AM Michal Pecio <michal.pecio@gmail.com> wrote:
>
> On Fri, 29 May 2026 00:06:33 +0100, Martin Alderson wrote:
> > Hi, please see this attachment. Thanks for all your help!
>
> Let's go through it.
>
> grep xhci_suspend 20260528T100954Z/dmesg.txt
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0e:00.4: xhci_suspend: stopping usb5 port polling.
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0c:00.0: xhci_suspend: stopping usb1 port polling.
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0e:00.3: xhci_suspend: stopping usb3 port polling.
>
> Several HCs are suspending, but not 0000:0f:00.0. It seems that the
> kernel is aware that "something" is still going on with its child USB
> devices and it defers suspend until "something" finishes.
>
> grep 0000:0f:00.0 20260528T100954Z/dmesg.txt
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Cancel URB 00000000f24bbb02, dev 1, ep 0x83, starting at offset 0xfffe5c70
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on Transfer TRB for slot 1 ep 6
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Removing canceled TD starting at 0xfffe5c70 (dma) in stream 0 URB 00000000f24bbb02
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Set TR Deq ptr 0xfffe5c80, cycle 0
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_giveback_invalidated_tds: Keep cancelled URB 00000000f24bbb02 TD as cancel_status is 2
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Successful Set TR Deq Ptr cmd, deq = @fffe5c80
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_handle_cmd_set_deq: Giveback cancelled URB 00000000f24bbb02 TD
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Giveback URB 00000000f24bbb02, len = 0, expected = 32, status = -115
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: xhci_handle_cmd_set_deq: All TDs cleared, ring doorbell
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 4
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 2
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Stopped on No-op or Link TRB for slot 1 ep 0
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Set port 7-1 link state, portsc: 0x603, write 0x10661
>
> Some device under 0000:0f:00.0 is suspended after having a 32 byte URB
> unlinked from EP 3 IN and a few other (idle?) endpoints stopped.
>
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: Cancel URB 00000000e74c9e14, dev 1, ep 0x0, starting at offset 0xffff4060
> [Thu May 28 11:09:45 2026] xhci_hcd 0000:0f:00.0: // Ding dong!
> [Thu May 28 11:09:50 2026] xhci_hcd 0000:0f:00.0: Command timeout, USBSTS: 0x00000000
> [Thu May 28 11:09:50 2026] xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>
> Then some URB is unlinked from the same device's control endpoint with
> no indication that the port has been resumed. Not 100% sure whether the
> URB was submitted before or after device suspend. Either way, software
> shouldn't do such things and your HW doesn't handle it gracefully.
>
> It seems we will need to figure out how the offending URB gets there.
> Can we identify the problematic device? Please post the output of:
>
> lsusb -v |sed '/0000:0f:00.0/,/root hub/ !d'
>
> Regards,
> Michal

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
       [not found]                           ` <20260530005742.25893efa.michal.pecio@gmail.com>
@ 2026-06-06 13:12                             ` Martin Alderson
  0 siblings, 0 replies; 15+ messages in thread
From: Martin Alderson @ 2026-06-06 13:12 UTC (permalink / raw)
  To: Michal Pecio; +Cc: Mathias Nyman, linux-usb

Hi Michal,

Thanks again for your continued help!

I had Claude investigate this in a bit more detail and it came up with this:

The receiver's delayedwork_callback (drivers/hid/hid-logitech-dj.c)
reacts to connect/disconnect/unknown notifications by calling
logi_dj_recv_query_paired_devices(), which sends a SET_REPORT to the
receiver via a synchronous usb_control_msg() on ep0. That work is
scheduled straight from logi_dj_raw_event() and is never stopped on
suspend — the driver has no .suspend callback (only .reset_resume),
and cancel_work_sync() only runs in .remove.

So the ordering that kills the controller is:

1. The dj work issues a control SET_REPORT on ep0; the URB lands on the ring.
2. usb_suspend_both() → usb_suspend_device() drives the port to U3.
3. Only afterwards does usb_suspend_both() set udev->can_submit = 0 and
   call usb_hcd_flush_endpoint() (drivers/usb/core/driver.c); that flush
   unlinks the still-pending ep0 URB.
4. xhci issues Stop Endpoint to an endpoint on a U3 port → 5s timeout →
   HC died.
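
The ordering above can be sketched as a toy user-space model in C. This is
not kernel code: every name in it is hypothetical, and it simply assumes
that a Stop Endpoint command never completes on a port in U3, which is the
behaviour seen on this controller. It only shows why finishing the control
transfer before the port is suspended avoids the timeout.

```c
#include <stdbool.h>

/* Toy user-space model, not kernel code. All names are hypothetical. */
enum toy_port_state { TOY_PORT_U0, TOY_PORT_U3 };

struct toy_dev {
	enum toy_port_state port;
	bool ep0_urb_pending;	/* control URB still queued on the ep0 ring */
	bool hc_dead;		/* set when a Stop Endpoint command times out */
};

/* Step 1: the dj work queues a control transfer on ep0. */
void toy_submit_ep0(struct toy_dev *d)
{
	d->ep0_urb_pending = true;
}

/*
 * What a .suspend callback with cancel_work_sync() is meant to achieve:
 * the work (and its synchronous control transfer) finishes while the
 * port is still in U0, so nothing is left on the ring.
 */
void toy_quiesce_driver(struct toy_dev *d)
{
	d->ep0_urb_pending = false;
}

/* Step 2: the port is driven to U3. */
void toy_suspend_port(struct toy_dev *d)
{
	d->port = TOY_PORT_U3;
}

/*
 * Steps 3 and 4: the flush unlinks whatever is still pending. Unlinking
 * needs a Stop Endpoint command, which this model assumes never
 * completes on a U3 port.
 */
void toy_flush_endpoints(struct toy_dev *d)
{
	if (!d->ep0_urb_pending)
		return;
	if (d->port == TOY_PORT_U3)
		d->hc_dead = true;
	d->ep0_urb_pending = false;
}

/* Runs one suspend and reports whether the host controller died. */
bool toy_suspend(bool quiesce_first)
{
	struct toy_dev d = { TOY_PORT_U0, false, false };

	toy_submit_ep0(&d);
	if (quiesce_first)
		toy_quiesce_driver(&d);
	toy_suspend_port(&d);
	toy_flush_endpoints(&d);
	return d.hc_dead;
}
```

toy_suspend(false) follows steps 1 to 4 as listed and ends with the
controller marked dead; toy_suspend(true) quiesces the driver first and
does not.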

That matches the trace exactly: the "Cancel URB ... ep 0x0" appears
after "Set port 7-1 link state ... U3", and the debugfs command ring
shows the single stuck Stop Endpoint TRB (slot 1, ep 1).
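
As a cross-check of the "U3" reading of that trace line, here is a small C
sketch that pulls the link state and speed fields out of the two PORTSC
values in "portsc: 0x603, write 0x10661". The field positions are the xHCI
PORTSC layout as I understand it (PLS in bits 8:5, speed in bits 13:10, LWS
in bit 16); the macro and function names are made up for this example.

```c
#include <stdint.h>

/* Hypothetical names; field positions assumed from the xHCI PORTSC layout. */
#define TOY_PORTSC_CCS		(1u << 0)	/* current connect status */
#define TOY_PORTSC_PED		(1u << 1)	/* port enabled */
#define TOY_PORTSC_PLS_SHIFT	5		/* port link state, 4 bits */
#define TOY_PORTSC_SPEED_SHIFT	10		/* port speed ID, 4 bits */
#define TOY_PORTSC_LWS		(1u << 16)	/* link state write strobe */

/* Port link state: 0 is U0 (active), 3 is U3 (suspended). */
unsigned int toy_portsc_pls(uint32_t portsc)
{
	return (portsc >> TOY_PORTSC_PLS_SHIFT) & 0xfu;
}

/* Port speed ID: 1 is Full Speed with the default speed ID mapping. */
unsigned int toy_portsc_speed(uint32_t portsc)
{
	return (portsc >> TOY_PORTSC_SPEED_SHIFT) & 0xfu;
}
```

With these, 0x603 decodes as connected, enabled, link state 0 (U0) and
speed 1 (Full Speed, matching the lsusb output), and the written value
0x10661 has LWS set with link state 3 (U3).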

I had Claude patch the driver and this seems to fix it:

--- /tmp/hid-logitech-dj.orig.c 2026-06-06 14:08:26.580516662 +0100
+++ hid-logitech-dj.c 2026-06-06 13:42:15.702948099 +0100
@@ -150,6 +150,7 @@
 	unsigned long last_query; /* in jiffies */
 	bool ready;
 	bool dj_mode;
+	bool suspended;
 	enum recvr_type type;
 	unsigned int unnumbered_application;
 	spinlock_t lock;
@@ -908,6 +909,17 @@
 		return;
 	}

+	/*
+	 * Don't issue control reports while the receiver is suspended; leave
+	 * the notifications queued and reschedule from resume.  A SET_REPORT
+	 * submitted as the USB device enters U3 leaves a Stop Endpoint command
+	 * pending on a suspended port, which times out and kills the xHC.
+	 */
+	if (djrcv_dev->suspended) {
+		spin_unlock_irqrestore(&djrcv_dev->lock, flags);
+		return;
+	}
+
 	count = kfifo_out(&djrcv_dev->notif_fifo, &workitem, sizeof(workitem));

 	if (count != sizeof(workitem)) {
@@ -1983,11 +1995,63 @@
 	return retval;
 }

+static int logi_dj_suspend(struct hid_device *hdev, pm_message_t message)
+{
+	struct dj_receiver_dev *djrcv_dev = hid_get_drvdata(hdev);
+	unsigned long flags;
+
+	if (!djrcv_dev)
+		return 0;
+
+	/*
+	 * Stop the notification work from issuing control reports while the
+	 * receiver suspends.  Setting ->suspended makes any requeued work a
+	 * no-op (see delayedwork_callback); cancel_work_sync() then waits for
+	 * an instance already running.  Without this, a SET_REPORT submitted
+	 * as the device enters U3 leaves a Stop Endpoint command pending on a
+	 * suspended port, which times out and kills the xHCI host.
+	 */
+	spin_lock_irqsave(&djrcv_dev->lock, flags);
+	djrcv_dev->suspended = true;
+	spin_unlock_irqrestore(&djrcv_dev->lock, flags);
+
+	cancel_work_sync(&djrcv_dev->work);
+	return 0;
+}
+
+static void logi_dj_resume_common(struct dj_receiver_dev *djrcv_dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&djrcv_dev->lock, flags);
+	djrcv_dev->suspended = false;
+	/* Drain notifications that arrived (and were deferred) during suspend. */
+	if (!kfifo_is_empty(&djrcv_dev->notif_fifo))
+		schedule_work(&djrcv_dev->work);
+	spin_unlock_irqrestore(&djrcv_dev->lock, flags);
+}
+
+static int logi_dj_resume(struct hid_device *hdev)
+{
+	struct dj_receiver_dev *djrcv_dev = hid_get_drvdata(hdev);
+
+	if (!djrcv_dev)
+		return 0;
+
+	logi_dj_resume_common(djrcv_dev);
+	return 0;
+}
+
 static int logi_dj_reset_resume(struct hid_device *hdev)
 {
 	struct dj_receiver_dev *djrcv_dev = hid_get_drvdata(hdev);

-	if (!djrcv_dev || djrcv_dev->hidpp != hdev)
+	if (!djrcv_dev)
+		return 0;
+
+	logi_dj_resume_common(djrcv_dev);
+
+	if (djrcv_dev->hidpp != hdev)
 		return 0;

 	logi_dj_recv_switch_to_dj_mode(djrcv_dev, 0);
@@ -2148,6 +2212,8 @@
 	.probe = logi_dj_probe,
 	.remove = logi_dj_remove,
 	.raw_event = logi_dj_raw_event,
+	.suspend = pm_ptr(logi_dj_suspend),
+	.resume = pm_ptr(logi_dj_resume),
 	.reset_resume = pm_ptr(logi_dj_reset_resume),
 };

Not sure if this is helpful for you? I can submit this to the
linux-input list once I've run it for a bit longer (it has survived ~15
suspend cycles so far with no issues I can detect). Maybe they could
take this as a basis to fix the driver (I am not a kernel expert
whatsoever!).

Thanks
Martin

On Fri, May 29, 2026 at 11:57 PM Michal Pecio <michal.pecio@gmail.com> wrote:
>
> On Fri, 29 May 2026 13:04:49 +0100, Martin Alderson wrote:
> > Bus 007 Device 002: ID 046d:c52b Logitech, Inc. Unifying Receiver
>
> So this is the problem device. Until we have any better idea what to
> try, please add usbcore and usbhid dynamic debug and keep collecting
> logs - maybe something useful will show up.
>
> Looking at usbhid, it seems it may fail to suspend if some operations
> are ongoing. And then usbcore may apparently suspend the device anyway
> and usbhid will presumably try to continue its thing on the suspended
> device. AFAIK any URB submissions should fail then, but there might be
> a bug. I haven't yet looked closely at how it all works.
>
> BTW, are you able to test patched kernels in case dynamic debug proves
> insufficient to figure out what's going on?
>
> Regards,
> Michal

end of thread, other threads:[~2026-06-06 13:12 UTC | newest]

Thread overview: 15+ messages
2026-03-29 21:52 xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend martinalderson
2026-03-30  0:07 ` Michal Pecio
2026-04-04 12:04   ` Martin Alderson
2026-04-04 13:24     ` Michal Pecio
2026-05-09 14:51       ` Martin Alderson
2026-05-09 16:06         ` Michal Pecio
2026-05-10 16:29           ` Martin Alderson
2026-05-12 10:03             ` Michal Pecio
2026-05-12 14:01               ` Mathias Nyman
2026-05-28 11:52                 ` Martin Alderson
2026-05-28 22:10                   ` Michal Pecio
2026-05-28 23:06                     ` Martin Alderson
2026-05-29 10:22                       ` Michal Pecio
2026-05-29 12:04                         ` Martin Alderson
     [not found]                           ` <20260530005742.25893efa.michal.pecio@gmail.com>
2026-06-06 13:12                             ` Martin Alderson
