PCI passthrough of XHCI on Framework AMD crashes the host

* PCI passthrough of XHCI on Framework AMD crashes the host
@ 2025-07-23 12:35 Marek Marczykowski-Górecki
  2025-07-23 12:55 ` Tu Dinh
  2025-07-23 13:17 ` Andrew Cooper
  0 siblings, 2 replies; 5+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-07-23 12:35 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Stefano Stabellini

Hi,

There is yet another issue affecting Framework AMD... When I start a
domU with XHCI controller attached (PCI passthrough), the whole host
resets if there was a USB device plugged into it. I don't get any panic
message (neither on XHCI console - which is connected to a different
XHCI controller, nor on VGA), and the reboot reason register shows
0x08000800 ("an uncorrected error caused a data fabric sync flood
event") according to [1].
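
To look the value up in the table at [1], it helps to split it into
individual bit positions first. A minimal sketch (the helper name
set_bits is made up for illustration, and it says nothing about what
each bit means; that mapping has to come from [1]):

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


# The reboot reason value quoted above.
print(set_bits(0x08000800))
```

For 0x08000800 this lists two set bits, 11 and 27.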

This is Framework AMD with AMD Ryzen 5 7640U.

The crash itself happens quite early on domU startup - specifically when
SeaBIOS tries to initialize XHCI. I tracked it down to the second
readl() in xhci_controller_setup() [2]. Interestingly, it's specifically
the second readl(), regardless of which of those comes first. I tried
swapping their order, or even repeating the read from the same register -
the second call always triggers the crash. The first one succeeds and
returns some value (for example 0x1200020 for HCCPARAMS).

If I start the domU when no USB devices are connected, it doesn't crash.

If I manually unbind the device from the dom0 driver (echo 0000:c3:00.4 >
/sys/bus/pci/drivers/xhci_hcd/unbind), it doesn't crash. Note I have
seize=1 in domU config, so the `xl pci-assignable-add` call is implicit.
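
For repeated test runs, the manual unbind above can be scripted. This is
only a sketch of the same sysfs write as the echo command; the function
name and the sysfs_root parameter are hypothetical (the latter exists
only so the helper can be tried against a scratch directory):

```python
import re
from pathlib import Path

# PCI address in domain:bus:device.function form, e.g. 0000:c3:00.4
BDF_RE = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def unbind(bdf: str, driver: str = "xhci_hcd", sysfs_root: str = "/sys") -> Path:
    """Write the device address to the driver's sysfs unbind file (needs root)."""
    if not BDF_RE.fullmatch(bdf):
        raise ValueError(f"not a PCI address: {bdf!r}")
    target = Path(sysfs_root) / "bus" / "pci" / "drivers" / driver / "unbind"
    target.write_text(bdf)
    return target
```

Calling unbind("0000:c3:00.4") as root in dom0 should be equivalent to
the echo above.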

If the system doesn't crash (either by not having any USB devices
connected initially, or by the manual unbind), the USB controller in
domU works fine. I can later connect devices and they appear inside
domU.

This system has a couple of XHCI controllers, and the same behavior is
observed on at least two of them.

The controller works just fine when used in dom0.

If I pass through another PCI device instead (tried wifi card and audio
card), it doesn't crash.

The value read from HCCPARAMS (BAR + 0x10) differs between good and bad case:
- 0x01200020 when it crashes
- 0x0110ffc5 when it works

It's weird to have this many differences here, given most bits in this
register are about device capabilities[3], not its runtime state...
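
To see which fields differ between the two values, they can be decoded
side by side. A minimal sketch, assuming the HCCPARAMS1 field layout as
I read it from the xHCI spec [3] (flag bits 0-11, MaxPSASize in bits
15:12, xECP in bits 31:16); the function names are made up for
illustration, and the layout should be checked against the spec:

```python
# Single-bit capability flags, listed from bit 0 upwards (assumed layout, see [3]).
FLAGS = ["AC64", "BNC", "CSZ", "PPC", "PIND", "LHRC",
         "LTC", "NSS", "PAE", "SPC", "SEC", "CFC"]


def decode_hccparams1(value: int) -> dict:
    """Split a 32-bit HCCPARAMS1 value into named fields."""
    fields = {name: bool((value >> bit) & 1) for bit, name in enumerate(FLAGS)}
    fields["MaxPSASize"] = (value >> 12) & 0xF
    fields["xECP"] = (value >> 16) & 0xFFFF
    return fields


def differing_fields(a: int, b: int) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    da, db = decode_hccparams1(a), decode_hccparams1(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


# Crashing case first, working case second.
for name, (bad, good) in differing_fields(0x01200020, 0x0110FFC5).items():
    print(f"{name:10} crash={bad!s:6} works={good}")
```

Under that layout, only BNC, PPC and PIND agree between the two values;
every other field differs, including the extended capabilities pointer.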

On this system my main debugging tool is the XHCI console. But I also
tried without enabling the XHCI console, and it still crashes, so it
looks like it isn't caused by the XHCI console.

I also tried disabling XHCI initialization in SeaBIOS, and then it
proceeds to boot domU's kernel. But as soon as Linux gets into
initializing that USB controller, it crashes the same way. So, it isn't
just SeaBIOS doing something weird (or at least not just that).

With PVH dom0, the behavior is a bit different:
1. Initially, the controller works fine in dom0.
2. When starting domU, instead of a clean unbind, this happens:

    [   11.248760] xhci_hcd 0000:c3:00.4: Controller not ready at resume -19
    [   11.248765] xhci_hcd 0000:c3:00.4: PCI post-resume error -19!
    [   11.248767] xhci_hcd 0000:c3:00.4: HC died; cleaning up
    [   11.249010] xhci_hcd 0000:c3:00.4: remove, state 4
    [   11.249013] usb usb8: USB disconnect, device number 1
    [   11.249437] xhci_hcd 0000:c3:00.4: USB bus 8 deregistered
    [   11.249832] xhci_hcd 0000:c3:00.4: remove, state 4
    [   11.249835] usb usb7: USB disconnect, device number 1
    [   11.250074] xhci_hcd 0000:c3:00.4: Host halt failed, -19
    [   11.250076] xhci_hcd 0000:c3:00.4: Host not accessible, reset failed.
    [   11.250389] xhci_hcd 0000:c3:00.4: USB bus 7 deregistered
    [   11.251011] pciback 0000:c3:00.4: xen_pciback: seizing device
    [   11.335120] pciback 0000:c3:00.4: xen_pciback: vpci: assign to virtual slot 0
    [   11.335544] pciback 0000:c3:00.4: registering for 1

3. Reading from BAR in domU (in SeaBIOS, and later Linux) returns
0xffffffff.
4. Does not crash the host.
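
The -19 in the log lines above can be translated into a symbolic error
name. A small sketch, assuming the trailing numbers are negated Linux
errno values (the usual convention in kernel messages); the helper name
and the regular expression are made up for illustration:

```python
import errno
import os
import re

# Matches a trailing negative number such as "-19" or "-19!" at the end of a line.
TRAILING_CODE = re.compile(r"-(\d+)!?\s*$")


def errno_summary(log: str) -> dict[int, str]:
    """Map each trailing error code found in the log to its errno name."""
    codes = {}
    for line in log.splitlines():
        match = TRAILING_CODE.search(line)
        if match:
            code = int(match.group(1))
            codes[code] = errno.errorcode.get(code, "unknown")
    return codes


sample = """\
xhci_hcd 0000:c3:00.4: Controller not ready at resume -19
xhci_hcd 0000:c3:00.4: PCI post-resume error -19!
xhci_hcd 0000:c3:00.4: Host halt failed, -19
"""
for code, name in errno_summary(sample).items():
    print(code, name, os.strerror(code))
```

On Linux, errno 19 is ENODEV, i.e. the driver no longer sees the device.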

Any ideas?

I don't have any other system with Zen4 to try on. The hw11 gitlab
runner is Ryzen 7 7735HS, and it doesn't have this issue. It's also
possible this is something related to Framework's firmware, but given all
the observations above, I find it less likely.

[1] https://docs.kernel.org/arch/x86/amd-debugging.html#random-reboot-issues
[2] https://github.com/coreboot/seabios/blob/master/src/hw/usb-xhci.c#L553
[3] https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/extensible-host-controler-interface-usb-xhci.pdf (page 385)
-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: PCI passthrough of XHCI on Framework AMD crashes the host
  2025-07-23 12:35 PCI passthrough of XHCI on Framework AMD crashes the host Marek Marczykowski-Górecki
@ 2025-07-23 12:55 ` Tu Dinh
  2025-07-23 13:10   ` Marek Marczykowski-Górecki
  2025-07-23 13:17 ` Andrew Cooper
  1 sibling, 1 reply; 5+ messages in thread
From: Tu Dinh @ 2025-07-23 12:55 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, xen-devel
  Cc: Andrew Cooper, Stefano Stabellini

On 23/07/2025 14:35, Marek Marczykowski-Górecki wrote:
[...]

I had a similar problem with a Beelink mini PC with the Ryzen 5800U
after a recent Qubes upgrade.

If the USB controller is passed through to sys-usb then the system
simply resets without warning.


Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




* Re: PCI passthrough of XHCI on Framework AMD crashes the host
  2025-07-23 12:55 ` Tu Dinh
@ 2025-07-23 13:10   ` Marek Marczykowski-Górecki
  2025-07-23 13:13     ` Tu Dinh
  0 siblings, 1 reply; 5+ messages in thread
From: Marek Marczykowski-Górecki @ 2025-07-23 13:10 UTC (permalink / raw)
  To: Tu Dinh; +Cc: xen-devel, Andrew Cooper, Stefano Stabellini

On Wed, Jul 23, 2025 at 12:55:53PM +0000, Tu Dinh wrote:
> On 23/07/2025 14:35, Marek Marczykowski-Górecki wrote:
> > [...]
> 
> I had a similar problem with a Beelink mini PC with the Ryzen 5800U
> after a recent Qubes upgrade.
> 
> If the USB controller is passed through to sys-usb then the system
> simply resets without warning.

Do you know if that also happens when no USB devices are connected at
that time? There could be several causes for a similar issue, and a
common one I've seen is a dom0 kernel panic on the unbind operation
(which would be a different issue than the one I have here).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


* Re: PCI passthrough of XHCI on Framework AMD crashes the host
  2025-07-23 13:10   ` Marek Marczykowski-Górecki
@ 2025-07-23 13:13     ` Tu Dinh
  0 siblings, 0 replies; 5+ messages in thread
From: Tu Dinh @ 2025-07-23 13:13 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: xen-devel, Andrew Cooper, Stefano Stabellini

On 23/07/2025 15:10, Marek Marczykowski-Górecki wrote:
> On Wed, Jul 23, 2025 at 12:55:53PM +0000, Tu Dinh wrote:

[...]

>> I had a similar problem with a Beelink mini PC with the Ryzen 5800U
>> after a recent Qubes upgrade.
>>
>> If the USB controller is passed through to sys-usb then the system
>> simply resets without warning.
>
> Do you know if that happens also when no USB devices are connected at
> that time? There could be more reasons for similar issue, and a common
> one I've seen is dom0 kernel panic on unbind operation (which would be a
> different issue than the one I have here).
>

I had a USB mouse and keyboard connected at the time to control the PC.
I don't think it was a dom0 kernel panic: no combination of reboot= or
panic= on the Xen and dom0 command lines managed to stop the reboot.


Ngoc Tu Dinh | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




* Re: PCI passthrough of XHCI on Framework AMD crashes the host
  2025-07-23 12:35 PCI passthrough of XHCI on Framework AMD crashes the host Marek Marczykowski-Górecki
  2025-07-23 12:55 ` Tu Dinh
@ 2025-07-23 13:17 ` Andrew Cooper
  1 sibling, 0 replies; 5+ messages in thread
From: Andrew Cooper @ 2025-07-23 13:17 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki, xen-devel; +Cc: Stefano Stabellini

On 23/07/2025 1:35 pm, Marek Marczykowski-Górecki wrote:
> Hi,
>
> There is yet another issue affecting Framework AMD... When I start a
> domU with XHCI controller attached (PCI passthrough), the whole host
> resets if there was a USB device plugged into it. I don't get any panic
> message (neither on XHCI console - which is connected to a different
> XHCI controller, nor on VGA), and the reboot reason register shows
> 0x08000800 ("an uncorrected error caused a data fabric sync flood
> event") according to [1].

I checked with a contact at AMD, and "data fabric sync flood" is a final
emergency reaction, and is a "stop everything" action intended to
prevent further data corruption.

The hardware is fatally unhappy with something...

~Andrew

