xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend

Linux USB

* xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
@ 2026-03-29 21:52 martinalderson
  2026-03-30  0:07 ` Michal Pecio
  0 siblings, 1 reply; 9+ messages in thread
From: martinalderson @ 2026-03-29 21:52 UTC (permalink / raw)
  To: linux-usb

[BUG] xhci_hcd 0000:0f:00.0: controller declared dead on resume from suspend

Hardware:
  CPU: AMD Ryzen 9 7900 12-Core Processor
  Board: ASUS PRIME B650-PLUS
  Controller: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
  Subsystem: ASUSTeK Computer Inc. [1043:8877]
  PCI: 0000:0f:00.0 (IOMMU group 30)

Software:
  Kernel: 7.0.0-rc5 (commit be762d8b, built 2026-03-28)
  Distro: Fedora 43 (Workstation)
  Desktop: GNOME on Wayland

Description:
  On the first suspend/resume cycle after boot, the xHCI controller at
  0000:0f:00.0 (AMD Raphael/Granite Ridge USB 2.0) fails to resume and
  is declared dead. A Logitech Unifying Receiver (046d:c52b) on this
  controller is disconnected and the mouse (Logitech M720 Triathlon)
  stops functioning.

  A second xHCI controller on the same system (0000:0c:00.0, AMD 600
  Series Chipset USB 3.2 [1022:43f7]) also errors on resume (USBSTS
  0x401) but successfully recovers via reinit. The 0f:00.0 controller
  does not recover.

  Regression from rc4: suspend/resume worked correctly on 7.0-rc4 and
  earlier kernels on the same hardware.

Reproduce:
  1. Boot with USB device attached to a port on the 0000:0f:00.0 controller
  2. Suspend (systemd suspend)
  3. Resume

dmesg on resume:
  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
  xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
  xhci_hcd 0000:0f:00.0: HC died; cleaning up
  xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
  usb usb1: root hub lost power or was reset
  usb usb2: root hub lost power or was reset
  usb 1-7: WARN: invalid context state for evaluate context command.
  usb 1-10: WARN: invalid context state for evaluate context command.
  usb 7-1: USB disconnect, device number 2

Workaround:
  PCI remove + rescan recovers the controller:
    echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
    echo 1 > /sys/bus/pci/rescan

  A simple PCI device reset (echo 1 > .../reset) was insufficient -- the
  controller came back but did not re-enumerate the attached device.
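
The remove + rescan sequence could be wrapped in a small helper, sketched
here. The function name pci_remove_rescan and the SYSFS_PCI override are
illustrative and not part of any existing tool; the override exists only so
the function can be tried against a scratch directory. On a real system it
has to run as root, and the one-second pause is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: detach one PCI function, then ask the PCI core to rediscover it.
# SYSFS_PCI is a hypothetical override for dry runs; the default is real sysfs.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

pci_remove_rescan() {
    addr="$1"                                 # e.g. 0000:0f:00.0
    dev="$SYSFS_PCI/devices/$addr"
    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi
    echo 1 > "$dev/remove" || return 1        # unbinds the driver, drops the device
    sleep 1                                   # arbitrary pause before rediscovery
    echo 1 > "$SYSFS_PCI/rescan"              # rediscovers the device, driver reprobes
}
```

Usage would be "pci_remove_rescan 0000:0f:00.0".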

Notes:
  - The 0f:00.0 controller is USB 2.0 only (USB3 root hub has no ports)
  - hci version 0x120, hcc params 0x0110ffc5, quirks 0x0000000200000010

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-03-29 21:52 xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend martinalderson
@ 2026-03-30  0:07 ` Michal Pecio
  2026-04-04 12:04   ` Martin Alderson
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Pecio @ 2026-03-30  0:07 UTC (permalink / raw)
  To: martinalderson; +Cc: linux-usb

On Sun, 29 Mar 2026 17:52:39 -0400, martinalderson@gmail.com wrote:
> [BUG] xhci_hcd 0000:0f:00.0: controller declared dead on resume from
> suspend
> 
> Hardware:
>   CPU: AMD Ryzen 9 7900 12-Core Processor
>   Board: ASUS PRIME B650-PLUS
>   Controller: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
>   Subsystem: ASUSTeK Computer Inc. [1043:8877]
>   PCI: 0000:0f:00.0 (IOMMU group 30)
> 
> Software:
>   Kernel: 7.0.0-rc5 (commit be762d8b, built 2026-03-28)
>   Distro: Fedora 43 (Workstation)
>   Desktop: GNOME on Wayland
> 
> Description:
>   On the first suspend/resume cycle after boot, the xHCI controller at
>   0000:0f:00.0 (AMD Raphael/Granite Ridge USB 2.0) fails to resume and
>   is declared dead. A Logitech Unifying Receiver (046d:c52b) on this
>   controller is disconnected and the mouse (Logitech M720 Triathlon)
>   stops functioning.
> 
>   A second xHCI controller on the same system (0000:0c:00.0, AMD 600
>   Series Chipset USB 3.2 [1022:43f7]) also errors on resume (USBSTS
>   0x401) but successfully recovers via reinit. The 0f:00.0 controller
>   does not recover.
> 
>   Regression from rc4: suspend/resume worked correctly on 7.0-rc4 and
>   earlier kernels on the same hardware.

That's interesting because there were no USB subsystem changes
between 7.0-rc4 and 7.0-rc5.

Any chance you could git-bisect this?
Are both kernels built with the same .config?

> Reproduce:
>   1. Boot with USB device attached to a port on the 0000:0f:00.0
>      controller
>   2. Suspend (systemd suspend)
>   3. Resume

By the way, are you using this affected controller to resume
(with a keyboard or something like that)?
 
> dmesg on resume:
>   xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>   xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>   xhci_hcd 0000:0f:00.0: HC died; cleaning up
>   xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
>   usb usb1: root hub lost power or was reset
>   usb usb2: root hub lost power or was reset
>   usb 1-7: WARN: invalid context state for evaluate context command.
>   usb 1-10: WARN: invalid context state for evaluate context command.
>   usb 7-1: USB disconnect, device number 2
> 
> Workaround:
>   PCI remove + rescan recovers the controller:
>     echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
>     echo 1 > /sys/bus/pci/rescan
> 
>   A simple PCI device reset (echo 1 > .../reset) was insufficient -- the
>   controller came back but did not re-enumerate the attached device.

What about the unbind/bind procedure described here?
https://bugzilla.kernel.org/show_bug.cgi?id=221073

> Notes:
>   - The 0f:00.0 controller is USB 2.0 only (USB3 root hub has no ports)
>   - hci version 0x120, hcc params 0x0110ffc5, quirks 0x0000000200000010


* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-03-30  0:07 ` Michal Pecio
@ 2026-04-04 12:04   ` Martin Alderson
  2026-04-04 13:24     ` Michal Pecio
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Alderson @ 2026-04-04 12:04 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi,

Just for clarity this never happened to me with the 6.19 kernel I was
on before (suspend/resumed many times on that kernel with no issues).
It's happened twice now (once with rc5, now with rc6) in a short space
of time. It may just be random luck rather than a specific regression,
though - sorry if I confused things there.

I'm not sure I can bisect this: it's very intermittent, so reproducing
it at each step would take an age, sorry.

Previously I was on the Fedora 43 default kernel series, now I
switched to the COPR for 7.x (to try and fix something else).

Thanks for the bugzilla, I'll look at some of those workarounds.




* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-04-04 12:04   ` Martin Alderson
@ 2026-04-04 13:24     ` Michal Pecio
  2026-05-09 14:51       ` Martin Alderson
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Pecio @ 2026-04-04 13:24 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb

On Sat, 4 Apr 2026 13:04:02 +0100, Martin Alderson wrote:
> Just for clarity this never happened to me with the 6.19 kernel I was
> on before (suspend/resumed many times on that kernel with no issues).
> It's happened twice now (once with rc5, now with rc6) in a short space
> of time.

So apparently about once per week. That's not very easy to debug.
One trick I have seen people use to accelerate such tests is running
"rtcwake -s 5 -m freeze" in a loop. This puts the system in s2idle and
resumes automatically after 5 seconds.
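
A loop along those lines might look like the sketch below. The function
names and the RTCWAKE, KLOG and SETTLE overrides are made up for this
example; only "rtcwake -s 5 -m freeze" and the "HC died" message come from
this thread. It assumes root, and that dmesg still holds the messages from
the latest cycle.

```shell
#!/bin/sh
# Sketch: repeat short s2idle cycles and stop at the first controller death.
# RTCWAKE, KLOG and SETTLE are hypothetical overrides so the loop can be dry-run.
RTCWAKE="${RTCWAKE:-rtcwake}"
KLOG="${KLOG:-dmesg}"
SETTLE="${SETTLE:-10}"     # arbitrary pause between cycles, in seconds

count_hc_died() {
    # Number of "HC died" lines logged for one controller, e.g. 0000:0f:00.0.
    $KLOG | grep -cF "xhci_hcd $1: HC died" || true
}

suspend_loop() {
    addr="$1"; cycles="$2"
    before=$(count_hc_died "$addr")
    i=1
    while [ "$i" -le "$cycles" ]; do
        $RTCWAKE -s 5 -m freeze >/dev/null || return 2
        sleep "$SETTLE"
        if [ "$(count_hc_died "$addr")" -gt "$before" ]; then
            echo "controller $addr died on cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $cycles cycles"
}
```

For example, "suspend_loop 0000:0f:00.0 50". rtcwake also accepts "-m mem"
for suspend-to-RAM; whether either mode reproduces this failure is not
established here.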

Do you have more complete dmesg from those failures, with timestamps?
From suspend until everything has calmed down after resume, or
also including whatever you have done later to restore operation.

> Previously I was on the Fedora 43 default kernel series, now I
> switched to the COPR for 7.x (to try and fix something else).

Not sure what COPR is, but I gather it went like this:
1. Fedora 6.19 kernel was OK for a long time
2. Some other kernel, possibly other config, 7.0-rc4 still worked, but
   only used for a short time. What about 7.0-rc1 to -rc3? 
3. After updating to -rc5 it's definitely broken.

> Thanks for the bugzilla, I'll look at some of those workarounds.

Particularly, collecting dynamic debug and debugfs could tell if it's
the same problem with missing IRQ after resume or something else.
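
As a rough sketch of that collection step: the helper below assumes a
kernel built with dynamic debug, debugfs mounted at /sys/kernel/debug, and
the usb/xhci/<PCI address> debugfs layout of the xhci driver. The function
names and the DEBUGFS override are illustrative only, and the bugzilla
entry may ask for something different.

```shell
#!/bin/sh
# Sketch: turn on xhci_hcd debug messages and snapshot the controller's debugfs.
# DEBUGFS is a hypothetical override for dry runs; real use needs root.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug}"

enable_xhci_dyndbg() {
    # "+p" enables printing for every dynamic debug site in the module.
    echo 'module xhci_hcd +p' > "$DEBUGFS/dynamic_debug/control"
}

snapshot_xhci_debugfs() {
    addr="$1"; out="$2"                      # e.g. 0000:0f:00.0 ./xhci-debug
    src="$DEBUGFS/usb/xhci/$addr"
    [ -d "$src" ] || { echo "no debugfs entry for $addr" >&2; return 1; }
    mkdir -p "$out" || return 1
    # Copy file by file: debugfs files report size 0 and some may fail to read.
    ( cd "$src" && find . -type f ) | while IFS= read -r f; do
        mkdir -p "$out/$(dirname "$f")"
        cat "$src/$f" > "$out/$f" 2>/dev/null || true
    done
}
```

Enabling the messages before suspending and taking the snapshot right after
a failed cycle, before any remove/rescan, would keep the failed state
visible.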

Regards,
Michal


* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-04-04 13:24     ` Michal Pecio
@ 2026-05-09 14:51       ` Martin Alderson
  2026-05-09 16:06         ` Michal Pecio
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Alderson @ 2026-05-09 14:51 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi, I'm still experiencing this on 7.0.2. I pulled the logs together
to try to get to the bottom of it (I've tried a few different
kernels):

Kernel                          Suspends   xHCI 0f:00.0 deaths   Rate
------------------------------  --------   -------------------   -----
6.17.1-300.fc43 (March)             ~12             0             0%
6.18.16-200.fc43                     10             0             0%
6.19.7/8-200.fc43                     5             0             0%
7.0-rc4   (build 260320)             13             0             0%
7.0-rc5   (build 260328)              7             2            ~28%
7.0-rc6   (build 260401)             10             4             40%
7.0-rc7   (build 260409)              7             2            ~28%
7.0.0-261.vanilla.fc43                7             2            ~28%
6.17.1-300.fc43 (April, retry)       10             2             20%   <-- same bug, stable kernel
7.0.1-262.vanilla.fc43                7             2            ~28%
7.0.2-300.vanilla.fc44                6             4            ~66%
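
The Rate column is deaths divided by suspends. A small helper to recompute
it, as a sketch (failure_rate is just an illustrative name):

```shell
#!/bin/sh
# Sketch: percentage of suspend cycles that ended with a dead controller.
failure_rate() {
    deaths="$1"; suspends="$2"
    awk -v d="$deaths" -v s="$suspends" \
        'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", 100 * d / s }'
}
```

"failure_rate 2 7" prints 28.6 and "failure_rate 4 6" prints 66.7, so the
~28% and ~66% entries are truncated rather than rounded. With only 5 to 13
suspends per kernel, the per-kernel rates are rough estimates.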

May 09 15:29:37 fedora kernel: Freezing user space processes completed (elapsed 0.001 seconds)
May 09 15:29:37 fedora kernel: OOM killer disabled.
May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks
May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
May 09 15:29:37 fedora kernel: printk: Suspending console(s) (use no_console_suspend to debug)
May 09 15:29:37 fedora kernel: sd 6:0:0:0: [sdb] Synchronizing SCSI cache
May 09 15:29:37 fedora kernel: serial 00:01: disabled
May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
May 09 15:29:37 fedora kernel: ata1.00: Entering standby power mode
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: HC died; cleaning up
May 09 15:29:37 fedora kernel: PM: suspend devices took 5.758 seconds
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: MODE1 reset
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: GPU mode1 reset
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: GPU smu mode1 reset
May 09 15:29:37 fedora kernel: ACPI: PM: Preparing to enter system sleep state S3
May 09 15:29:37 fedora kernel: ACPI: PM: Saving platform NVS memory
May 09 15:29:37 fedora kernel: Disabling non-boot CPUs ...
May 09 15:29:37 fedora kernel: smpboot: CPU 23 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 22 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 21 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 20 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 19 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 18 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 17 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 16 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 15 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 14 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 13 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 12 is now offline
May 09 15:29:37 fedora kernel: Spectre V2 : Update user space SMT mitigation: STIBP off
May 09 15:29:37 fedora kernel: smpboot: CPU 11 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 10 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 9 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 8 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 7 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 6 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 5 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 4 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 3 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 2 is now offline
May 09 15:29:37 fedora kernel: smpboot: CPU 1 is now offline
May 09 15:29:37 fedora kernel: ACPI: PM: Low-level resume complete
May 09 15:29:37 fedora kernel: ACPI: PM: Restoring platform NVS memory
May 09 15:29:37 fedora kernel: AMD-Vi: Virtual APIC enabled
May 09 15:29:37 fedora kernel: AMD-Vi: Virtual APIC enabled
May 09 15:29:37 fedora kernel: LVT offset 0 assigned for vector 0x400
May 09 15:29:37 fedora kernel: Enabling non-boot CPUs ...
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
May 09 15:29:37 fedora kernel: CPU1 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
May 09 15:29:37 fedora kernel: CPU2 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 3 APIC 0x6
May 09 15:29:37 fedora kernel: CPU3 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 4 APIC 0x8
May 09 15:29:37 fedora kernel: CPU4 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 5 APIC 0xa
May 09 15:29:37 fedora kernel: CPU5 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 6 APIC 0x10
May 09 15:29:37 fedora kernel: CPU6 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 7 APIC 0x12
May 09 15:29:37 fedora kernel: CPU7 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 8 APIC 0x14
May 09 15:29:37 fedora kernel: CPU8 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 9 APIC 0x16
May 09 15:29:37 fedora kernel: CPU9 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 10 APIC 0x18
May 09 15:29:37 fedora kernel: CPU10 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 11 APIC 0x1a
May 09 15:29:37 fedora kernel: CPU11 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 12 APIC 0x1
May 09 15:29:37 fedora kernel: Spectre V2 : Update user space SMT mitigation: STIBP always-on
May 09 15:29:37 fedora kernel: CPU12 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 13 APIC 0x3
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#13, should never happen.
May 09 15:29:37 fedora kernel: CPU13 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 14 APIC 0x5
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#14, should never happen.
May 09 15:29:37 fedora kernel: CPU14 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 15 APIC 0x7
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#15, should never happen.
May 09 15:29:37 fedora kernel: CPU15 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 16 APIC 0x9
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#16, should never happen.
May 09 15:29:37 fedora kernel: CPU16 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 17 APIC 0xb
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#17, should never happen.
May 09 15:29:37 fedora kernel: CPU17 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 18 APIC 0x11
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#18, should never happen.
May 09 15:29:37 fedora kernel: CPU18 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 19 APIC 0x13
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#19, should never happen.
May 09 15:29:37 fedora kernel: CPU19 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 20 APIC 0x15
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#20, should never happen.
May 09 15:29:37 fedora kernel: CPU20 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 21 APIC 0x17
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#21, should never happen.
May 09 15:29:37 fedora kernel: CPU21 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 22 APIC 0x19
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#22, should never happen.
May 09 15:29:37 fedora kernel: CPU22 is up
May 09 15:29:37 fedora kernel: smpboot: Booting Node 0 Processor 23 APIC 0x1b
May 09 15:29:37 fedora kernel: Spurious APIC interrupt (vector 0xFF) on CPU#23, should never happen.
May 09 15:29:37 fedora kernel: CPU23 is up
May 09 15:29:37 fedora kernel: ACPI: PM: Waking up from system sleep state S3
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: [drm] PCIE GART of 512M enabled (table at 0x00000083DAB00000).
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: PSP is resuming...
May 09 15:29:37 fedora kernel: nvme nvme0: D3 entry latency set to 10 seconds
May 09 15:29:37 fedora kernel: xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit
May 09 15:29:37 fedora kernel: usb usb1: root hub lost power or was reset
May 09 15:29:37 fedora kernel: usb usb2: root hub lost power or was reset
May 09 15:29:37 fedora kernel: serial 00:01: activated
May 09 15:29:37 fedora kernel: nvme nvme0: 24/0/0 default/read/poll queues
May 09 15:29:37 fedora kernel: usb 5-2: reset full-speed USB device number 2 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-7: WARN: invalid context state for evaluate context command.
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SMU is resuming...
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: smu driver if version = 0x0000002e, smu fw if version = 0x00000033, smu fw program = 0, smu fw version = 0x00684c00 (104.76.0)
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: SMU is resumed successfully!
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: program CP_MES_CNTL : 0x4000000
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: program CP_MES_CNTL : 0xc000000
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: [drm] DMUB hardware initialized: version=0x0A003500
May 09 15:29:37 fedora kernel: ata4: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: ata3: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 09 15:29:37 fedora kernel: usb 1-7: reset full-speed USB device number 3 using xhci_hcd
May 09 15:29:37 fedora kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Starting disk
May 09 15:29:37 fedora kernel: ata1.00: configured for UDMA/133
May 09 15:29:37 fedora kernel: ahci 0000:0d:00.0: port does not support device sleep
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 9 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 10 on hub 0
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
May 09 15:29:37 fedora kernel: amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 1 on hub 8
May 09 15:29:37 fedora kernel: usb 1-10: WARN: invalid context state for evaluate context command.
May 09 15:29:37 fedora kernel: usb 1-10: reset full-speed USB device number 6 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
May 09 15:29:37 fedora kernel: usb 1-1.4: reset high-speed USB device number 4 using xhci_hcd
May 09 15:29:37 fedora kernel: PM: resume devices took 2.046 seconds
May 09 15:29:37 fedora kernel: OOM killer enabled.
May 09 15:29:37 fedora kernel: Restarting tasks: Starting
May 09 15:29:37 fedora kernel: usb 7-1: USB disconnect, device number 2
May 09 15:29:37 fedora kernel: Restarting tasks: Done
May 09 15:29:37 fedora kernel: efivarfs: resyncing variable state
May 09 15:29:37 fedora kernel: efivarfs: finished resyncing variable state
May 09 15:29:37 fedora kernel: random: crng reseeded on system resumption
May 09 15:29:37 fedora kernel: PM: suspend exit
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: examining hci_ver=0a hci_rev=000b lmp_ver=0a lmp_subver=8761
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: rom_version status=0 version=1
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: btrtl_initialize: key id 0
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: loading rtl_bt/rtl8761bu_fw.bin
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: loading rtl_bt/rtl8761bu_config.bin
May 09 15:29:37 fedora kernel: Bluetooth: hci0: RTL: cfg_sz 6, total sz 30210
May 09 15:29:37 fedora kernel: Realtek Internal NBASE-T PHY r8169-0-a00:00: attached PHY driver (mii_bus:phy_addr=r8169-0-a00:00, irq=MAC)
May 09 15:29:37 fedora kernel: r8169 0000:0a:00.0 eno1: Link is Down
May 09 15:29:38 fedora kernel: Bluetooth: hci0: RTL: fw version 0xdfc6d922
May 09 15:29:38 fedora kernel: Bluetooth: MGMT ver 1.23
May 09 15:29:41 fedora kernel: r8169 0000:0a:00.0 eno1: Link is Up - 2.5Gbps/Full - flow control rx/tx
May 09 15:30:00 fedora kernel: input: soundcore Space One (AVRCP) as /devices/virtual/input/input32
May 09 15:31:40 fedora kernel: xhci_hcd 0000:0f:00.0: remove, state 1
May 09 15:31:40 fedora kernel: usb usb7: USB disconnect, device number 1
May 09 15:31:40 fedora kernel: xhci_hcd 0000:0f:00.0: USB bus 7 deregistered
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: [1022:15b8] type 00 class 0x0c0330 PCIe Endpoint
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: BAR 0 [mem 0xf6e00000-0xf6efffff 64bit]
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: PME# supported from D0 D3hot D3cold
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: Adding to iommu group 30
May 09 15:31:41 fedora kernel: pci 0000:0f:00.0: BAR 0 [mem 0xf6e00000-0xf6efffff 64bit]: assigned
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI Host Controller
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: new USB bus registered, assigned bus number 7
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: USB3 root hub has no ports
May 09 15:31:41 fedora kernel: xhci_hcd 0000:0f:00.0: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000200000010
May 09 15:31:41 fedora kernel: usb usb7: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 7.00
May 09 15:31:41 fedora kernel: usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
May 09 15:31:41 fedora kernel: usb usb7: Product: xHCI Host Controller
May 09 15:31:41 fedora kernel: usb usb7: Manufacturer: Linux 7.0.2-300.vanilla.fc44.x86_64 xhci-hcd
May 09 15:31:41 fedora kernel: usb usb7: SerialNumber: 0000:0f:00.0
May 09 15:31:41 fedora kernel: hub 7-0:1.0: USB hub found
May 09 15:31:41 fedora kernel: hub 7-0:1.0: 1 port detected
May 09 15:31:41 fedora kernel: usb 7-1: new full-speed USB device number 2 using xhci_hcd
May 09 15:31:42 fedora kernel: usb 7-1: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.11
May 09 15:31:42 fedora kernel: usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 09 15:31:42 fedora kernel: usb 7-1: Product: USB Receiver
May 09 15:31:42 fedora kernel: usb 7-1: Manufacturer: Logitech
May 09 15:31:42 fedora kernel: logitech-djreceiver 0003:046D:C52B.000C: hiddev96,hidraw1: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:0f:00.0-1/input2
May 09 15:31:42 fedora kernel: input: Logitech M720 Triathlon as /devices/pci0000:00/0000:00:08.3/0000:0f:00.0/usb7/7-1/7-1:1.2/0003:046D:C52B.000C/0003:046D:405E.000D/input/input33
May 09 15:31:42 fedora kernel: logitech-hidpp-device 0003:046D:405E.000D: input,hidraw3: USB HID v1.11 Keyboard [Logitech M720 Triathlon] on usb-0000:0f:00.0-1/input2:1
May 09 15:31:42 fedora kernel: logitech-hidpp-device 0003:046D:405E.000D: HID++ 4.5 device connected.

At 15:31 you can see the result of me running

    echo 1 > /sys/bus/pci/devices/0000:0f:00.0/remove
    echo 1 > /sys/bus/pci/rescan

This restores the controller 100% of the time, and I have now wired it
up to a systemd unit. Let me know if I can provide anything else that
would help.
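
One way to automate that recovery is sketched below as a systemd
system-sleep hook. This is not the unit mentioned above, whose contents
are not shown here. systemd runs executables placed in
/usr/lib/systemd/system-sleep/ with two arguments: "pre" or "post", and
the sleep type. The function names, the ADDR, KLOG and SYSFS_PCI
overrides, and the choice to trigger on the "HC died" message are all
assumptions made for this example.

```shell
#!/bin/sh
# Sketch of a system-sleep hook: after resume, recover the controller only
# if the newest xhci_hcd message for it says it died.
ADDR="${ADDR:-0000:0f:00.0}"
KLOG="${KLOG:-dmesg}"                       # hypothetical override for dry runs
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"      # hypothetical override for dry runs

controller_died() {
    # A later re-probe logs newer xhci_hcd lines, so an old death is ignored.
    last=$($KLOG | grep -F "xhci_hcd $ADDR:" | tail -n 1)
    case "$last" in
        *"HC died"*) return 0 ;;
        *) return 1 ;;
    esac
}

sleep_hook() {
    # $1 is "pre" or "post"; $2 is the sleep type (suspend, hibernate, ...).
    [ "$1" = "post" ] || return 0
    controller_died || return 0
    echo 1 > "$SYSFS_PCI/devices/$ADDR/remove" || return 1
    sleep 1                                  # arbitrary pause
    echo 1 > "$SYSFS_PCI/rescan"
}
```

Installed as a hook, the file would end with a call such as
sleep_hook "$@". Whether the "post" phase is late enough for the rescan to
succeed has not been checked.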



* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-09 14:51       ` Martin Alderson
@ 2026-05-09 16:06         ` Michal Pecio
  2026-05-10 16:29           ` Martin Alderson
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Pecio @ 2026-05-09 16:06 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb

On Sat, 9 May 2026 15:51:03 +0100, Martin Alderson wrote:
> Hi, still experiencing this on 7.0.2. I tried to pull the logs
> together to get to the bottom of this (I've tried a few different
> kernels)
> 
> Kernel                          Suspends   xHCI 0f:00.0 deaths   Rate
> ------------------------------  --------   -------------------   -----
> 6.17.1-300.fc43 (March)             ~12             0             0%
> 6.18.16-200.fc43                     10             0             0%
> 6.19.7/8-200.fc43                     5             0             0%
> 7.0-rc4   (build 260320)             13             0             0%
> 7.0-rc5   (build 260328)              7             2            ~28%
> 7.0-rc6   (build 260401)             10             4             40%
> 7.0-rc7   (build 260409)              7             2            ~28%
> 7.0.0-261.vanilla.fc43                7             2            ~28%
> 6.17.1-300.fc43 (April, retry)       10             2             20%   <-- same bug, stable kernel

Looks like it's not a regression then, but not sure what else may have
caused it.

Any new USB device that wasn't connected before?
Perhaps a BIOS upgrade?

> 7.0.1-262.vanilla.fc43                7             2            ~28%
> 7.0.2-300.vanilla.fc44                6             4            ~66%
> 
> 
> May 09 15:29:37 fedora kernel: Freezing user space processes completed (elapsed 0.001 seconds)
> May 09 15:29:37 fedora kernel: OOM killer disabled.
> May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks
> May 09 15:29:37 fedora kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> May 09 15:29:37 fedora kernel: printk: Suspending console(s) (use no_console_suspend to debug)
> May 09 15:29:37 fedora kernel: sd 6:0:0:0: [sdb] Synchronizing SCSI cache
> May 09 15:29:37 fedora kernel: serial 00:01: disabled
> May 09 15:29:37 fedora kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
> May 09 15:29:37 fedora kernel: ata1.00: Entering standby power mode
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
> May 09 15:29:37 fedora kernel: xhci_hcd 0000:0f:00.0: HC died; cleaning up
> May 09 15:29:37 fedora kernel: PM: suspend devices took 5.758 seconds

That's not resume, it's during suspend. Are other logs also like that?

Regards,
Michal


* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-09 16:06         ` Michal Pecio
@ 2026-05-10 16:29           ` Martin Alderson
  2026-05-12 10:03             ` Michal Pecio
  0 siblings, 1 reply; 9+ messages in thread
From: Martin Alderson @ 2026-05-10 16:29 UTC (permalink / raw)
  To: Michal Pecio; +Cc: linux-usb

Hi,

Two answers, plus a hypothesis:

1. The timing is during suspend in every single failure I have logs for.
I went back through 7 weeks of persistent journals and pulled the
context around every "HC died" event. All 9 failures show the same
sequence:

  xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
  xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
  xhci_hcd 0000:0f:00.0: HC died; cleaning up
  PM: suspend devices took 5.5--6.1 seconds      <-- elevated
  amdgpu 0000:03:00.0: MODE1 reset
  ACPI: PM: Preparing to enter system sleep state S3

So it's reliably during suspend, before S3 entry, and the elevated
"suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
clean suspend on the same boot takes ~0.46s.
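
A minimal sketch of how that comparison could be pulled out of a saved
kernel log, assuming the "PM: suspend devices took N seconds" line format
shown above. The function name flag_slow_suspends and the default 5 s
limit are illustrative choices, not part of any existing tool.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each
# "PM: suspend devices took N seconds" duration, tagged "slow" when it
# reaches the limit (default 5 s, the stop-endpoint timeout noted above).
flag_slow_suspends() {
    awk -v limit="${1:-5}" '
        /PM: suspend devices took/ {
            # The duration is the field right after the word "took".
            for (i = 1; i <= NF; i++)
                if ($i == "took") secs = $(i + 1)
            printf "%s %s\n", secs, (secs + 0 >= limit + 0 ? "slow" : "ok")
        }'
}

# Example use on the previous boot's kernel messages:
#   journalctl -k -b -1 | flag_slow_suspends
```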

2. No BIOS upgrade. ASUS PRIME B650-PLUS BIOS version 3263 dated
2025-06-09 across every boot from 2026-03-02 to 2026-05-08 (42 boots).

3. Re "any new USB device": yes, and it correlates exactly. A 4-port
USB hub appeared on bus 1 (controller 0c:00.0, AMD 600 Series USB 3.2)
on 2026-03-16, with a USB mass-storage device behind it on port 4. It's
the hub built into a new monitor I added around then. Per-boot
presence:

  2026-03-02 to 2026-03-16: NO hub, NO flash drive, ~12 6.17.1
                            suspends, 0 failures
  2026-03-16+:              hub + flash drive present
  2026-03-22 to 2026-03-28: 7.0-rc4 with hub present, 13 suspends,
                            0 failures
  2026-03-28+:              7.0-rc5+, failures begin
  2026-04-18 to 2026-04-25: 6.17.1 (retry, with hub still present),
                            10 suspends, 2 failures -- same kernel
                            that was clean in March

The hub is on a different xHCI from the one that dies (0c:00.0 vs
0f:00.0), but they're sibling controllers on the same AMD SoC, so
shared power/ACPI domains seem plausible.

Even with the hub identified as the trigger, I think there are still
two kernel-side issues worth flagging:

(a) Recovery: when the stop-endpoint timeout hits, 0f:00.0 is marked
"HC died" and never comes back without a manual PCI remove+rescan.
The other controller on the same machine recovers itself on resume:

  xhci_hcd 0000:0c:00.0: xHC error in resume, USBSTS 0x401, Reinit

There doesn't seem to be an equivalent recovery path for the
suspend-time stop-endpoint timeout on 0f:00.0.
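
For reference, the manual remove+rescan mentioned above can be sketched
with the standard PCI sysfs attributes "remove" and "rescan". The
function name pci_remove_rescan and the optional second argument (an
alternate sysfs root, only useful for a dry run) are assumptions made
for this illustration.

```shell
# Hypothetical helper: detach a PCI function and ask the PCI core to
# rediscover it. Needs root when run against the real /sys.
pci_remove_rescan() {
    dev="$1"            # e.g. 0000:0f:00.0
    root="${2:-/sys}"   # alternate root for dry runs (assumption)

    # Writing 1 to "remove" unbinds the driver and drops the device.
    echo 1 > "$root/bus/pci/devices/$dev/remove" || return 1
    sleep 1
    # Writing 1 to the bus-level "rescan" re-enumerates PCI devices.
    echo 1 > "$root/bus/pci/rescan"
}

# Example use (as root):
#   pci_remove_rescan 0000:0f:00.0
```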

Regards,
Martin




* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-10 16:29           ` Martin Alderson
@ 2026-05-12 10:03             ` Michal Pecio
  2026-05-12 14:01               ` Mathias Nyman
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Pecio @ 2026-05-12 10:03 UTC (permalink / raw)
  To: Martin Alderson; +Cc: linux-usb, Mathias Nyman

On Sun, 10 May 2026 17:29:26 +0100, Martin Alderson wrote:
> 1. The timing is during suspend in every single failure I have logs
> for. I went back through 7 weeks of persistent journals and pulled the
> context around every "HC died" event. All 9 failures show the same
> sequence:
> 
>   xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>   xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>   xhci_hcd 0000:0f:00.0: HC died; cleaning up
>   PM: suspend devices took 5.5--6.1 seconds      <-- elevated
>   amdgpu 0000:03:00.0: MODE1 reset
>   ACPI: PM: Preparing to enter system sleep state S3
> 
> So it's reliably during suspend, before S3 entry, and the elevated
> "suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
> clean suspend on the same boot takes ~0.46s.

The S3 state probably doesn't matter, chances are that it would also
happen with s2idle or hibernation.

Could you enable dynamic debug before every suspend (or permanently
on every boot) and collect a dmesg log of this happening again?
And maybe also a snapshot of the debugfs directory after resume but
before unbinding xhci_hcd. These may contain clues as to what
triggered it.

echo 'module xhci_hcd +p' >/proc/dynamic_debug/control
zip -r debugfs.zip /sys/kernel/debug/usb/xhci/0000:0f:00.0
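
A small sketch of the first command wrapped so it can run before every
suspend. It assumes the dynamic debug control file is at one of the two
usual locations; the function name enable_xhci_debug is hypothetical.

```shell
# Hypothetical helper: write the xhci_hcd dynamic debug query to the
# first writable control file among the paths given as arguments.
enable_xhci_debug() {
    for ctl in "$@"; do
        if [ -w "$ctl" ]; then
            echo 'module xhci_hcd +p' > "$ctl" && return 0
        fi
    done
    # No writable control file found (not root, or dyndbg not built in).
    return 1
}

# Example use (as root):
#   enable_xhci_debug /proc/dynamic_debug/control \
#                     /sys/kernel/debug/dynamic_debug/control
```

For the "permanently on every boot" variant, the dynamic debug
documentation describes a per-module boot parameter of the form
xhci_hcd.dyndbg=+p, which may be simpler than a script if it works on
this kernel build.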

> 2. No BIOS upgrade. ASUS PRIME B650-PLUS BIOS version 3263 dated
> 2025-06-09 across every boot from 2026-03-02 to 2026-05-08 (42 boots).
> 
> 3. Re "any new USB device": yes, and it correlates exactly. A 4-port
> USB hub appeared on bus 1 (controller 0c:00.0, AMD 600 Series USB 3.2)
> on 2026-03-16, with a USB mass-storage device behind it on port 4. It's
> the hub built into a new monitor I added around then. Per-boot
> presence:
> 
>   2026-03-02 to 2026-03-16: NO hub, NO flash drive, ~12 6.17.1
>                             suspends, 0 failures
>   2026-03-16+:              hub + flash drive present
>   2026-03-22 to 2026-03-28: 7.0-rc4 with hub present, 13 suspends,
>                             0 failures
>   2026-03-28+:              7.0-rc5+, failures begin
>   2026-04-18 to 2026-04-25: 6.17.1 (retry, with hub still present),
>                             10 suspends, 2 failures -- same kernel
>                             that was clean in March
> 
> The hub is on a different xHCI from the one that dies (0c:00.0 vs
> 0f:00.0), but they're sibling controllers on the same AMD SoC, so
> shared power/ACPI domains seem plausible.

That's over a week from the first connection to the first failure.
And these are separate chips of different types: the 600 series
chipset is not part of the CPU, while the failing controller is on
the CPU I/O die, AFAIK.

It may be a coincidence. I would sooner suspect changes to devices on
the affected bus.

Regards,
Michal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend
  2026-05-12 10:03             ` Michal Pecio
@ 2026-05-12 14:01               ` Mathias Nyman
  0 siblings, 0 replies; 9+ messages in thread
From: Mathias Nyman @ 2026-05-12 14:01 UTC (permalink / raw)
  To: Michal Pecio, Martin Alderson; +Cc: linux-usb

On 5/12/26 13:03, Michal Pecio wrote:
> On Sun, 10 May 2026 17:29:26 +0100, Martin Alderson wrote:
>> 1. The timing is during suspend in every single failure I have logs
>> for. I went back through 7 weeks of persistent journals and pulled the
>> context around every "HC died" event. All 9 failures show the same
>> sequence:
>>
>>    xhci_hcd 0000:0f:00.0: xHCI host not responding to stop endpoint command
>>    xhci_hcd 0000:0f:00.0: xHCI host controller not responding, assume dead
>>    xhci_hcd 0000:0f:00.0: HC died; cleaning up
>>    PM: suspend devices took 5.5--6.1 seconds      <-- elevated
>>    amdgpu 0000:03:00.0: MODE1 reset
>>    ACPI: PM: Preparing to enter system sleep state S3
>>
>> So it's reliably during suspend, before S3 entry, and the elevated
>> "suspend devices took" matches the 5s xHCI stop-endpoint timeout. A
>> clean suspend on the same boot takes ~0.46s.
> 
> The S3 state probably doesn't matter, chances are that it would also
> happen with s2idle or hibernation.
> 
> Could you enable dynamic debug before every suspend (or permanently
> on every boot) and collect a dmesg log of this happening again?
> And maybe also a snapshot of the debugfs directory after resume but
> before unbinding xhci_hcd. These may contain clues as to what
> triggered it.

It's possible there is a race between queuing a command and suspend.
It looks like nothing prevents a new command from being queued while
suspend stops the host from running, which would cause commands to
time out.

Suspend doesn't check whether there are pending commands, or whether
the command timer is running, either.

I wrote some debugging code; it can be found in my debug_hc_died_cmdring_race branch:
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git debug_hc_died_cmdring_race
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=debug_hc_died_cmdring_race

If it prints
   "Can't queue command, xHC not accessible (stopped?)"
or
   "Suspending and stopping xHC with pending command(s)!!!"
Then we have a queue_command - suspend race.
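
A minimal sketch for checking a saved log for either message, assuming
the kernel was built from the debug branch above. The function name
detect_cmd_suspend_race is hypothetical; the matched strings are the two
messages quoted above, with the trailing "!!!" left out of the pattern.

```shell
# Hypothetical helper: read kernel log lines on stdin and print any line
# carrying one of the two debug messages. Exit status 0 means at least
# one was found, i.e. the queue_command vs. suspend race was observed.
detect_cmd_suspend_race() {
    # -F treats the patterns as fixed strings, so "(stopped?)" and
    # "(s)" are matched literally rather than as regex syntax.
    grep -F \
        -e "Can't queue command, xHC not accessible (stopped?)" \
        -e 'Suspending and stopping xHC with pending command(s)'
}

# Example use after a failed suspend:
#   dmesg | detect_cmd_suspend_race && echo "race observed"
```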

Code below for reference
Mathias

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index e47e644b296e..50ce4a4a7fe3 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -4353,6 +4353,7 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
  			 u32 field3, u32 field4, bool command_must_succeed)
  {
  	int reserved_trbs = xhci->cmd_ring_reserved_trbs;
+	struct usb_hcd *hcd = xhci_to_hcd(xhci);
  	int ret;
  
  	if ((xhci->xhc_state & XHCI_STATE_DYING) ||
@@ -4362,6 +4363,14 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
  		return -ESHUTDOWN;
  	}
  
+	if (!HCD_HW_ACCESSIBLE(hcd)) {
+		xhci_err(xhci, "Can't queue command, xHC not accessible (stopped?)\n");
+		xhci_err(xhci, "called by %pS from %pS\n",
+			 __builtin_return_address(0),
+			 __builtin_return_address(1));
+		return -ESHUTDOWN;
+	}
+
  	if (!command_must_succeed)
  		reserved_trbs++;
  
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index a54f5b57f205..04279fbbe1dd 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -949,6 +949,34 @@ static bool xhci_pending_portevent(struct xhci_hcd *xhci)
  	return false;
  }
  
+static void xhci_dump_ring(struct xhci_hcd *xhci, struct xhci_ring *ring)
+{
+	struct xhci_segment	*seg;
+	union xhci_trb		*trb;
+	dma_addr_t		dma;
+	char			str[XHCI_MSG_MAX];
+	int			i, j;
+
+	seg = ring->first_seg;
+	dma = xhci_trb_virt_to_dma(ring->deq_seg, ring->dequeue);
+
+	xhci_err(xhci, "Dequeue: %pad\n", &dma);
+
+	for (i = 0; i < ring->num_segs; i++) {
+		for (j = 0; j < TRBS_PER_SEGMENT; j++) {
+			trb = &seg->trbs[j];
+			dma = seg->dma + j * sizeof(*trb);
+			xhci_err(xhci, "%pad: %s\n", &dma,
+				 xhci_decode_trb(str, XHCI_MSG_MAX,
+						 le32_to_cpu(trb->generic.field[0]),
+						 le32_to_cpu(trb->generic.field[1]),
+						 le32_to_cpu(trb->generic.field[2]),
+						 le32_to_cpu(trb->generic.field[3])));
+		}
+		seg = seg->next;
+	}
+}
+
  /*
   * Stop HC (not bus-specific)
   *
@@ -999,6 +1027,12 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
  	/* step 1: stop endpoint */
  	/* skipped assuming that port suspend has done */
  
+	/* Check if command ring is empty */
+	if (!list_empty(&xhci->cmd_list)) {
+		xhci_err(xhci, "Suspending and stopping xHC with pending command(s)!!!\n");
+		xhci_dump_ring(xhci, xhci->cmd_ring);
+	}
+
  	/* step 2: clear Run/Stop bit */
  	command = readl(&xhci->op_regs->command);
  	command &= ~CMD_RUN;


Thread overview: 9+ messages
2026-03-29 21:52 xhci_hcd: AMD Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8] dies on resume from suspend martinalderson
2026-03-30  0:07 ` Michal Pecio
2026-04-04 12:04   ` Martin Alderson
2026-04-04 13:24     ` Michal Pecio
2026-05-09 14:51       ` Martin Alderson
2026-05-09 16:06         ` Michal Pecio
2026-05-10 16:29           ` Martin Alderson
2026-05-12 10:03             ` Michal Pecio
2026-05-12 14:01               ` Mathias Nyman
