* [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset
From: Geramy Loveless @ 2026-04-09  0:05 UTC
  To: amd-gfx; +Cc: alexander.deucher, christian.koenig

When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1
reset, the TB driver receives no notification and the tunnel
stays up while the endpoint is unreachable. All subsequent PCIe
reads return 0xFFFFFFFF and MES firmware cannot reinitialize,
triggering an infinite reset loop that hangs the system.
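
For readers unfamiliar with the symptom, here is a minimal user-space
sketch (not part of the patch) of how an unreachable endpoint looks to
software: reads come back as all ones, so the vendor ID in the first
config-space dword reads as 0xFFFF. Both helper names are hypothetical
and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a read from an unreachable endpoint is
 * typically reported to software as all ones. */
static bool read_looks_dead(uint32_t val)
{
	return val == UINT32_MAX;
}

/* Hypothetical helper: the low 16 bits of config dword 0 hold the
 * vendor ID, and 0xFFFF is not a valid vendor ID. */
static bool vendor_id_valid(uint32_t dword0)
{
	return (dword0 & 0xFFFFu) != 0xFFFFu;
}
```

A value such as 0x744c1002 (vendor ID 0x1002 in the low half) passes
both checks, while 0xFFFFFFFF fails both.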

After MODE1 reset completes, check whether the PCIe endpoint is still
reachable using pci_device_is_present(). If the device is behind
Thunderbolt and the link is dead, walk up parent bridges calling
pci_bridge_secondary_bus_reset() to retrain the physical PCIe link
inside the dock. If recovery fails, return -ENODEV to prevent the
reset retry loop.
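
The walk described above can be modelled outside the kernel. This is
only a sketch under stated assumptions: struct model_bridge and
model_recover_link() are hypothetical stand-ins, the reset and the
presence check are reduced to a boolean flag, and the 100 ms settle
delay is omitted. It is meant to show the control flow, not the real
PCI API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI bridge; not the kernel's struct pci_dev. */
struct model_bridge {
	const char *name;
	struct model_bridge *upstream;	/* NULL above the root port */
	bool on_root_bus;		/* true for the root port itself */
	bool reset_revives_link;	/* does a reset here bring the endpoint back? */
};

/*
 * Walk from the endpoint's immediate upstream bridge toward the root,
 * resetting each bridge in turn and stopping at the first one whose
 * reset makes the endpoint reachable again. Bridges on the root bus
 * are not reset, mirroring the loop condition in the patch.
 * Returns the bridge that recovered the link, or NULL on failure
 * (the case where the patch returns -ENODEV).
 */
static const struct model_bridge *
model_recover_link(struct model_bridge *first, int *resets_tried)
{
	struct model_bridge *b;

	*resets_tried = 0;
	for (b = first; b && !b->on_root_bus; b = b->upstream) {
		(*resets_tried)++;		/* models pci_bridge_secondary_bus_reset() */
		if (b->reset_revives_link)	/* models pci_device_is_present() */
			return b;
	}
	return NULL;
}
```

With a chain of dock downstream port, dock upstream port and root
port, where only a reset at the dock upstream port revives the link,
the model returns that bridge after two resets. If no bridge below the
root revives the link, it returns NULL after trying each of them once.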

The failure also leaves the GPU fan stuck at 100%. If it happens while
you are away, you come back to a GPU with its fan at 100% that cannot
be reset.

I also want to note some other things I ran into before starting on
these kernel and amdgpu patches. Sometimes a crash happens in the
driver and the GPU fan speed then hits 100%, with hot air coming out
even though there is no workload. Other times I have seen the fan
barely spinning while the GPU heats up more than it should for the fan
speed it is currently running at. I have seen both with this GPU in a
TB5 dock while the driver was unstable. I'm not sure exactly what is
going on there, but since I'm already communicating through these
patches I figured I might as well bring you up to speed. supermario
has been a great help throughout my effort to get the AMD AI R9700 Pro
working on my MS-S1 Halo Strix with a TB5 / USB4v2 dock!

BAR resizing finally seems to be working with my kernel patch, which
allows an empty switch bridge at the device end to be released safely
and then rebuilt afterwards with the larger BAR. This was done on
kernel 7.0-rc7, I believe, with the latest changes from the
pci/resource branch plus my patch here:

https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u

Thank you!

Signed-off-by: Geramy Loveless <gloveless@jqluv.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 31a60173c..91d01d538 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
 	/* ensure no_hw_access is updated before we access hw */
 	smp_mb();
 
+	/*
+	 * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe
+	 * endpoint but the TB tunnel stays up unaware. Detect the
+	 * dead link and attempt recovery by resetting parent bridges
+	 * to retrain the physical PCIe link inside the dock.
+	 */
+	if (!pci_device_is_present(adev->pdev) &&
+	    pci_is_thunderbolt_attached(adev->pdev)) {
+		struct pci_dev *bridge;
+		bool recovered = false;
+
+		dev_info(adev->dev,
+			 "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n");
+
+		bridge = pci_upstream_bridge(adev->pdev);
+		while (bridge && !pci_is_root_bus(bridge->bus)) {
+			dev_info(adev->dev,
+				 "attempting link recovery via %s\n",
+				 pci_name(bridge));
+			pci_bridge_secondary_bus_reset(bridge);
+			msleep(100);
+			if (pci_device_is_present(adev->pdev)) {
+				recovered = true;
+				break;
+			}
+			bridge = pci_upstream_bridge(bridge);
+		}
+
+		if (!recovered) {
+			dev_err(adev->dev,
+				"Thunderbolt PCIe link recovery failed\n");
+			ret = -ENODEV;
+			goto mode1_reset_failed;
+		}
+
+		dev_info(adev->dev,
+			 "Thunderbolt PCIe link recovered via %s\n",
+			 pci_name(bridge));
+	}
+
 	amdgpu_device_load_pci_state(adev->pdev);
 	ret = amdgpu_psp_wait_for_bootloader(adev);
 	if (ret)
-- 
2.51.0


Thread overview: 14+ messages
2026-04-09  0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless
2026-04-09 11:42 ` Christian König
2026-04-09 12:13   ` Geramy Loveless
2026-04-09 12:57     ` Christian König
2026-04-09 13:12       ` Geramy Loveless
2026-04-09 13:45         ` Christian König
2026-04-09 18:12   ` Mario Limonciello
2026-04-10  0:12     ` Geramy Loveless
2026-04-10  6:22       ` Geramy Loveless
2026-04-10  7:07         ` Geramy Loveless
2026-04-10  9:13     ` Christian König
2026-04-10 11:25     ` Lazar, Lijo
2026-04-10 19:17       ` Geramy Loveless
2026-04-10 22:42         ` Geramy Loveless
