[Bug 221402] New: eGPU breaks permanently after hibernate with TB4 tunnel active on AMD Strix Halo (Ryzen AI Max+ 395)

public inbox for dri-devel@lists.freedesktop.org

* [Bug 221402] New: eGPU breaks permanently after hibernate with TB4 tunnel active on AMD Strix Halo (Ryzen AI Max+ 395) — Linux-specific
@ 2026-04-22 16:03 bugzilla-daemon
From: bugzilla-daemon @ 2026-04-22 16:03 UTC
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=221402

            Bug ID: 221402
           Summary: eGPU breaks permanently after hibernate with TB4
                    tunnel active on AMD Strix Halo (Ryzen AI Max+ 395) —
                    Linux-specific
           Product: Drivers
           Version: 2.5
          Hardware: AMD
                OS: Linux
            Status: NEW
          Severity: high
          Priority: P3
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@kernel-bugs.osdl.org
          Reporter: conrad.dobrowolski@gmail.com
        Regression: No

# eGPU breaks permanently after hibernate with TB4 tunnel active on AMD Strix
Halo (Ryzen AI Max+ 395) — Linux-specific, Windows unaffected

## Summary

On an ASUS ROG Flow Z13 (GZ302EA) with AMD Ryzen AI Max+ 395 "Strix Halo," an
eGPU connected via TB4 works perfectly on first boot but is permanently broken
at the Linux kernel level after a single hibernate cycle with the eGPU
attached. The failure persists across reboots, EC resets, cable swaps, BIOS
defaults, and multiple distro/kernel combinations.

**Critical finding:** The same hardware (laptop + enclosure + GPU + cable)
works correctly on Windows 11 with AMD drivers. This confirms:
- The hardware is not damaged
- The TB4 controller firmware is not permanently bricked
- The bug is in Linux's handling of amdgpu + Thunderbolt + Strix Halo
specifically

## Hardware

- **Laptop:** ASUS ROG Flow Z13 GZ302EA
- **SoC:** AMD Ryzen AI Max+ 395 (Strix Halo, 16C/32T, Radeon 8060S iGPU
gfx1151)
- **RAM:** 128 GiB unified LPDDR5X
- **BIOS:** GZ302EA.311 (American Megatrends, Sept 19 2025 — latest available
from ASUS as of Apr 2026)
- **eGPU enclosure:** ADTLINK UT4G (USB4 40 Gb/s)
- **eGPU GPU:** AMD Radeon RX 7900 XTX (Navi 31, 1002:744c) — reproduced with
RX 5700 XT
- **Cable:** Both TB4 and TB5 rated cables tested — identical failure

## Software

- **OS:** Fedora Linux 43 (KDE Plasma Desktop Edition)
- **Kernel (installed):** 6.19.12-200.fc43.x86_64
- **Kernel (live USB):** 6.17.1-300.fc43.x86_64 — tested, same failure
- **Mesa:** 25.3.6
- **ROCm:** 6.4.2
- **fwupd:** 2.0.20

## Reproduction

1. Fresh boot with eGPU attached — works perfectly. `lspci` shows full chain
(ASMedia 03:00/04:00 → Navi 31 at 05:00.0), amdgpu binds, rocminfo reports both
`gfx1151` (iGPU) and `gfx1100` (eGPU), monitors on the eGPU display correctly,
ROCm/Ollama/PyTorch all function.
2. `sudo systemctl hibernate` (with eGPU still attached)
3. Wake from hibernate
4. eGPU no longer functions and never recovers

## Failure Modes Observed

After the break, the following failure modes appear, varying from one reboot to
the next:

### Mode A — Discovery failure (most common, reproduces 100% on clean live USB)
amdgpu 0000:05:00.0: enabling device (0000 -> 0003)
amdgpu 0000:05:00.0: amdgpu: initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x744C 0xC8).
amdgpu 0000:05:00.0: amdgpu: register mmio base: 0xBC000000
amdgpu 0000:05:00.0: amdgpu: register mmio size: 1048576
amdgpu 0000:05:00.0: amdgpu: failed to read discovery info from memory, vram size read: 0
amdgpu 0000:05:00.0: amdgpu: [drm] ERROR discovery failed: -2
amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:05:00.0: amdgpu: finishing device.
amdgpu 0000:05:00.0: probe with driver amdgpu failed with error -2

### Mode B — TB4 port training timeout loop
thunderbolt 0000:c6:00.5: 0:2: failed to reach state TB_PORT_UP. Ignoring port...
thunderbolt 0000:c6:00.5: 0:2: lost during suspend, disconnecting

Note: "lost during suspend" fires even when no suspend has occurred since boot.

### Mode C — SR-IOV guest path triggered on physical device (highly unusual)
amdgpu 0000:05:00.0: amdgpu: trn=2 ACK should not assert! wait again !
xgpu_nv_mailbox_trans_msg: 2471 callbacks suppressed

The `xgpu_nv_*` codepath belongs to the Navi SR-IOV virtual-function guest
driver and should not execute on a physical 7900 XTX. Its appearance strongly
suggests that the TB4 tunnel is presenting corrupted PCIe capabilities after
the hibernate event.

### Mode D — Partial PCIe enumeration
Upstream ASMedia switch appears but Navi 10 XL downstream switch never probes.
PCIe hot-plug event `Card present / Link Up` fires on the root port, but child
enumeration stalls.

### Mode E — TB4 handshake succeeds but no PCIe tunnel built
`boltctl list` shows `authorized` at 40 Gb/s RX/TX. `/sys/class/thunderbolt/`
populates correctly. But `lspci` shows nothing new under the TB4 root port.
Enclosure cycles power every ~60 seconds in a failed retry loop.
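
Because the mode changes between reboots, it can help to tag each boot's log automatically. The following is a minimal Python sketch, not part of the original testing, that matches the log signatures quoted above for Modes A, B, and C. Modes D and E are described without log text, so they are not covered. The function name and the regular expressions are illustrative assumptions.

```python
import re

# Signatures copied from the log excerpts quoted for Modes A-C.
# Modes D and E have no quoted log lines, so they are deliberately absent.
SIGNATURES = {
    "A": re.compile(r"amdgpu.*discovery failed: -2"),
    "B": re.compile(r"thunderbolt.*failed to reach state TB_PORT_UP"),
    "C": re.compile(r"xgpu_nv_mailbox_trans_msg|ACK should not assert"),
}


def classify_dmesg(text):
    """Return the sorted list of mode letters whose signature appears in text."""
    found = set()
    for line in text.splitlines():
        for mode, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(mode)
    return sorted(found)
```

A typical use would be to pass the captured output of `dmesg` from each boot to `classify_dmesg` and record the result alongside the kernel version.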

## Recovery attempts (all failed on Linux)

1. Soft reboot (`sudo reboot`)
2. EC reset (40-second power button + AC unplugged)
3. Cold boot with eGPU pre-attached (not hot-plugged)
4. `boltctl forget <uuid>` + fresh reauthorize
5. PCIe rescan: `echo 1 > /sys/bus/pci/rescan`
6. PCIe tunnel teardown + rebuild
7. Driver unbind/rebind on Navi 31 endpoint
8. Cable swap (TB4-certified → TB5-certified, both fail)
9. BIOS "Load Optimized Defaults"
10. `fwupdmgr reinstall <system-firmware-UUID>` — LVFS reports no releases
available
11. Fresh Fedora 43 live USB boot (kernel 6.17.1, clean userspace, no prior
state) — same failure
12. Second GPU tested (RX 5700 XT) — identical failure modes
13. Both USB4 ports on Z13 tested — identical
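
For reference, attempts 5 and 7 come down to a handful of sysfs writes. The sketch below is a hedged illustration of those writes using the standard PCI sysfs files (`rescan`, and the driver's `unbind`/`bind`); the helper names are made up for this example, and the default is a dry run that only prints the equivalent shell commands.

```python
SYSFS_PCI = "/sys/bus/pci"


def rescan_writes():
    """Attempt 5: ask the PCI core to re-enumerate all buses."""
    return [(f"{SYSFS_PCI}/rescan", "1")]


def rebind_writes(bdf, driver="amdgpu"):
    """Attempt 7: detach the driver from one endpoint, then attach it again."""
    drv = f"{SYSFS_PCI}/drivers/{driver}"
    return [(f"{drv}/unbind", bdf), (f"{drv}/bind", bdf)]


def apply(writes, dry_run=True):
    """Print each write; perform it only when dry_run is False (needs root)."""
    for path, value in writes:
        print(f"echo {value} > {path}")
        if not dry_run:
            with open(path, "w") as handle:
                handle.write(value)
```

For the endpoint in this report, `apply(rebind_writes("0000:05:00.0"))` prints the unbind and bind commands without touching the system.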

## What conclusively proves this is a Linux software issue

After extensive testing:

1. **Same 7900 XTX + ADTLINK + cable on separate x86 laptop (Linux):** Works
perfectly, full functionality
2. **Same Z13 + ADTLINK + RX 5700 XT on Windows 11 (fresh install + AMD
Adrenalin drivers):** Works correctly — GPU enumerates in Device Manager,
displays function normally, no failure modes reproduce
3. **Clean Fedora 43 live USB on Z13 (zero prior userspace state):** Reproduces
the same Mode A `discovery failed: -2` error immediately on first eGPU plug-in

This is conclusive: **the hardware is good**, and **the issue lies in how
Linux's amdgpu driver interacts with the AMD Strix Halo TB4 controller state**,
a state that Windows' driver stack handles correctly.

## Working theory

The AMD Strix Halo TB4/USB4 host controller appears to enter a state after
hibernate where:
- The TB4 handshake completes correctly (`boltctl` reports authorized at full
40 Gb/s)
- The PCIe tunnel is built and enumerates successfully (full chain visible in
`lspci -tv`)
- Memory BARs are assigned correctly by the PCI core
- **But reads of the GPU's discovery info return zeros** (`vram size read: 0`
in the Mode A log)

Windows' AMD chipset + TB4 driver apparently performs additional
reinitialization of the TB4 host controller on boot that Linux's
amdgpu/thunderbolt drivers do not.

### Possible upstream fixes to investigate

1. **amdgpu_discovery.c** should validate discovery info reads and attempt
retry/reset if returned values are all zero, instead of failing the device
probe immediately
2. **amdgpu_virt.c / mxgpu_nv.c** — the SR-IOV guest path (Mode C) should not
activate on devices whose capability reads are inconsistent; this appears to be
a separate genuine bug where corrupted capability data triggers the virt
codepath on physical hardware
3. **drivers/thunderbolt** — AMD Strix Halo USB4 host controller may need an
explicit reinit sequence after system hibernate/resume, which Intel TB4
controllers apparently do not require
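
The control flow suggested in item 1 can be modelled outside the kernel. The sketch below is a Python illustration only, not amdgpu code: `read_fn` and `reset_fn` are hypothetical stand-ins for whatever read and reset primitives the driver would use, and whether a retry would actually recover the device is untested.

```python
import time


def read_with_retry(read_fn, attempts=3, delay_s=0.1, reset_fn=None):
    """Call read_fn until it returns data that is not empty or all zero.

    Returns the data, or None if every attempt produced empty or all-zero
    bytes. reset_fn, if given, runs between attempts (hypothetical hook).
    """
    for attempt in range(attempts):
        data = read_fn()
        if data and any(data):  # any() is True once a non-zero byte is seen
            return data
        if attempt + 1 < attempts:
            if reset_fn is not None:
                reset_fn()
            time.sleep(delay_s)
    return None
```

The point of the design is that an all-zero read is treated as "link not ready yet" rather than as a fatal probe error, with the failure reported only after the retries are exhausted.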

## Relevant source files

- `drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c` — discovery info reads (Mode
A origin)
- `drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c` — SR-IOV detection logic
- `drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c` — `xgpu_nv_mailbox_trans_msg` (Mode C
origin)
- `drivers/thunderbolt/*` — TB4 host controller handling for AMD USB4

## Scope / other affected users

This bug appears specific to AMD Strix Halo platforms with external GPUs over
TB4. Other users on r/FlowZ13 and egpu.io forums with the same or similar
hardware (ROG Flow Z13 2025, HP ZBook Ultra 14, Framework Desktop) report
comparable issues. The bug does not appear on AMD desktop platforms or Intel
TB4 laptops under the same conditions.

## Environment output
$ uname -a
Linux fedora 6.19.12-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Apr 12 15:26:33 UTC 2026 x86_64 GNU/Linux
$ lsb_release -a
Distributor ID: Fedora
Description:    Fedora Linux 43 (KDE Plasma Desktop Edition)
Release:        43
$ cat /sys/class/dmi/id/product_name
ROG Flow Z13 GZ302EA_GZ302EA
BIOS version: GZ302EA.311 (confirmed via ASUS support site; latest available as
of 2026-04-22)

## Workaround

None on Linux. The current workaround is to dual-boot Windows 11 for eGPU
workloads. On Linux, only the Strix Halo iGPU (gfx1151) with unified memory is
used for AI/compute workloads, and it functions correctly.

## Prevention

Never hibernate with the eGPU attached. Upstream could consider a systemd-sleep
hook that detects a GPU behind an active TB4 PCIe tunnel and refuses to
hibernate, or a kernel-side teardown of the TB4 tunnel before hibernate.
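
The detection half of such a check could look like the sketch below. It is a rough illustration under two assumptions that have not been verified on this machine: that the kernel exposes the per-device `removable` sysfs attribute, and that the firmware marks the USB4 root port as external-facing so devices behind it read `removable`. The function name and the `sysfs_root` parameter are hypothetical; the parameter exists so the logic can be exercised against a fake tree.

```python
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or '' if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


def external_gpus(sysfs_root="/sys"):
    """Return PCI addresses of display-class devices marked removable."""
    base = os.path.join(sysfs_root, "bus", "pci", "devices")
    try:
        entries = sorted(os.listdir(base))
    except FileNotFoundError:
        return []
    found = []
    for bdf in entries:
        dev = os.path.join(base, bdf)
        is_display = _read(os.path.join(dev, "class")).startswith("0x03")
        is_removable = _read(os.path.join(dev, "removable")) == "removable"
        if is_display and is_removable:
            found.append(bdf)
    return found
```

A wrapper could decline to request hibernation while `external_gpus()` returns a non-empty list. How that decision is wired into the sleep path is left open here.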

---

Report filed: 2026-04-22

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 221402] eGPU breaks permanently after hibernate with TB4 tunnel active on AMD Strix Halo (Ryzen AI Max+ 395) — Linux-specific
@ 2026-04-23 15:03 bugzilla-daemon
From: bugzilla-daemon @ 2026-04-23 15:03 UTC
  To: dri-devel

https://bugzilla.kernel.org/show_bug.cgi?id=221402

Artem S. Tashkinov (aros@gmx.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |ANSWERED

--- Comment #1 from Artem S. Tashkinov (aros@gmx.com) ---
Please refile here: https://gitlab.freedesktop.org/drm/amd/-/issues

