* [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset @ 2026-04-09 0:05 Geramy Loveless 2026-04-09 11:42 ` Christian König 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 0:05 UTC (permalink / raw) To: amd-gfx; +Cc: alexander.deucher, christian.koenig When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 reset, the Thunderbolt driver receives no notification and the tunnel stays up while the endpoint is unreachable. All subsequent PCIe reads return 0xFFFFFFFF and MES firmware cannot reinitialize, triggering an infinite reset loop that hangs the system. After MODE1 reset completes, check whether the PCIe endpoint is still reachable using pci_device_is_present(). If the device is behind Thunderbolt and the link is dead, walk up the parent bridges calling pci_bridge_secondary_bus_reset() to retrain the physical PCIe link inside the dock. If recovery fails, return -ENODEV to prevent the reset retry loop. When this happens the GPU fan also goes to 100%; if you are not there when it happens, you end up with a GPU whose fan is stuck at 100% and that cannot be reset. I also want to note some other things I have been running into during this adventure of patches to the kernel and the amdgpu driver. Sometimes a crash happens in the driver and the GPU fan speed hits 100% with hot air coming out despite no workload; other times I have seen it run with barely any fan speed at all and heat up more than it should for the fan level it is currently operating at. These are things I have seen with this GPU in a TB5 dock along with general driver instability. I'm not sure exactly what is going on there, but since I am already communicating through these patches I might as well bring you up to speed, and supermario has been a great help throughout my attempts to get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / USB4v2 dock! BAR resizing finally seems to be working after my kernel patch, which allows safely releasing an empty switch bridge at the device end and then rebuilding it afterwards with the increased BAR. This was done on kernel 7.0-rc7, I believe, plus the latest changes from the pci/resource branch together with my patch here. https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u Thank you! Signed-off-by: Geramy Loveless <gloveless@jqluv.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 31a60173c..91d01d538 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) /* ensure no_hw_access is updated before we access hw */ smp_mb(); + /* + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe + * endpoint but the TB tunnel stays up unaware. Detect the + * dead link and attempt recovery by resetting parent bridges + * to retrain the physical PCIe link inside the dock.
+ */ + if (!pci_device_is_present(adev->pdev) && + pci_is_thunderbolt_attached(adev->pdev)) { + struct pci_dev *bridge; + bool recovered = false; + + dev_info(adev->dev, + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); + + bridge = pci_upstream_bridge(adev->pdev); + while (bridge && !pci_is_root_bus(bridge->bus)) { + dev_info(adev->dev, + "attempting link recovery via %s\n", + pci_name(bridge)); + pci_bridge_secondary_bus_reset(bridge); + msleep(100); + if (pci_device_is_present(adev->pdev)) { + recovered = true; + break; + } + bridge = pci_upstream_bridge(bridge); + } + + if (!recovered) { + dev_err(adev->dev, + "Thunderbolt PCIe link recovery failed\n"); + ret = -ENODEV; + goto mode1_reset_failed; + } + + dev_info(adev->dev, + "Thunderbolt PCIe link recovered via %s\n", + pci_name(bridge)); + } + amdgpu_device_load_pci_state(adev->pdev); ret = amdgpu_psp_wait_for_bootloader(adev); if (ret) -- 2.51.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless @ 2026-04-09 11:42 ` Christian König 2026-04-09 12:13 ` Geramy Loveless 2026-04-09 18:12 ` Mario Limonciello 0 siblings, 2 replies; 14+ messages in thread From: Christian König @ 2026-04-09 11:42 UTC (permalink / raw) To: Geramy Loveless, amd-gfx, Mario Limonciello; +Cc: alexander.deucher On 4/9/26 02:05, Geramy Loveless wrote: > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > Thunderbolt the TB driver receives no notification and the tunnel > stays up while the endpoint is unreachable. IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > All subsequent PCIe > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > triggering an infinite reset loop that hangs the system. That sounds more like the MODE1 reset failed. > After MODE1 reset completes, check whether the PCIe endpoint is still > reachable using pci_device_is_present(). If the device is behind > Thunderbolt and the link is dead, walk up parent bridges calling > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > inside the dock. Well that is then a bus reset. I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > If recovery fails, return -ENODEV to prevent the > reset retry loop. > > This also causes the GPU fan to be at 100% and basically when it > happens and you are not there, you now have a GPU with fan at 100% and > cant reset it. > I wanted to notate some other things I am finding sometimes before > this adventure of patches to the kernel and amdgpu driver. > Sometimes a crash could happen in the drive and then the GPU fan speed > hits 100% and the air is hot coming out without any workload, other > times > I have seen it have barely any fan speed at all and heat up more than > it should at the fan level its curently operating at. These are things > I have seen with this gpu in a TB5 dock with the driver and > instability. I'm not sure exactly whats going on there but I figured > since im communicating with these patches I might as well bring you up > to speed and supermario has been great help throughout me trying to > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > USB4v2 dock! Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > It seems to be finally working with bar resizing after my kernel > patch. Which allows you to safely release a empty switch bridge at the > device end. > Then it rebuilds it afterwords with the increased bar. This was done > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > branch with my patch here. > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u Where is the MMIO register BAR before and after the rebuild? Regards, Christian. > > Thank you! 
> > Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > 1 file changed, 40 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 31a60173c..91d01d538 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > /* ensure no_hw_access is updated before we access hw */ > smp_mb(); > + /* > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > + * endpoint but the TB tunnel stays up unaware. Detect the > + * dead link and attempt recovery by resetting parent bridges > + * to retrain the physical PCIe link inside the dock. > + */ > + if (!pci_device_is_present(adev->pdev) && > + pci_is_thunderbolt_attached(adev->pdev)) { > + struct pci_dev *bridge; > + bool recovered = false; > + > + dev_info(adev->dev, > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > + > + bridge = pci_upstream_bridge(adev->pdev); > + while (bridge && !pci_is_root_bus(bridge->bus)) { > + dev_info(adev->dev, > + "attempting link recovery via %s\n", > + pci_name(bridge)); > + pci_bridge_secondary_bus_reset(bridge); > + msleep(100); > + if (pci_device_is_present(adev->pdev)) { > + recovered = true; > + break; > + } > + bridge = pci_upstream_bridge(bridge); > + } > + > + if (!recovered) { > + dev_err(adev->dev, > + "Thunderbolt PCIe link recovery failed\n"); > + ret = -ENODEV; > + goto mode1_reset_failed; > + } > + > + dev_info(adev->dev, > + "Thunderbolt PCIe link recovered via %s\n", > + pci_name(bridge)); > + } > + > amdgpu_device_load_pci_state(adev->pdev); > ret = amdgpu_psp_wait_for_bootloader(adev); > if (ret) > -- > 2.51.0 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 11:42 ` Christian König @ 2026-04-09 12:13 ` Geramy Loveless 2026-04-09 12:57 ` Christian König 2026-04-09 18:12 ` Mario Limonciello 1 sibling, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 12:13 UTC (permalink / raw) To: Christian König; +Cc: amd-gfx, Mario Limonciello, alexander.deucher [-- Attachment #1: Type: text/plain, Size: 5463 bytes --] Hi Christian, I appreciate the speedy response, What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com> wrote: > On 4/9/26 02:05, Geramy Loveless wrote: > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > Thunderbolt the TB driver receives no notification and the tunnel > > stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should > still be reachable. > > > All subsequent PCIe > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > reachable using pci_device_is_present(). If the device is behind > > Thunderbolt and the link is dead, walk up parent bridges calling > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the > question is rather why does the MODE1 reset fails in the first place? > > > If recovery fails, return -ENODEV to prevent the > > reset retry loop. > > > > This also causes the GPU fan to be at 100% and basically when it > > happens and you are not there, you now have a GPU with fan at 100% and > > cant reset it. > > I wanted to notate some other things I am finding sometimes before > > this adventure of patches to the kernel and amdgpu driver. > > Sometimes a crash could happen in the drive and then the GPU fan speed > > hits 100% and the air is hot coming out without any workload, other > > times > > I have seen it have barely any fan speed at all and heat up more than > > it should at the fan level its curently operating at. These are things > > I have seen with this gpu in a TB5 dock with the driver and > > instability. I'm not sure exactly whats going on there but I figured > > since im communicating with these patches I might as well bring you up > > to speed and supermario has been great help throughout me trying to > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which > would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on > here? e.g. what is the root cause? > > > > > It seems to be finally working with bar resizing after my kernel > > patch. Which allows you to safely release a empty switch bridge at the > > device end. > > Then it rebuilds it afterwords with the increased bar. 
This was done > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > branch with my patch here. > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > > > > > Thank you! > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > 1 file changed, 40 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 31a60173c..91d01d538 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > amdgpu_device *adev) > > /* ensure no_hw_access is updated before we access hw */ > > smp_mb(); > > + /* > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > + * endpoint but the TB tunnel stays up unaware. Detect the > > + * dead link and attempt recovery by resetting parent bridges > > + * to retrain the physical PCIe link inside the dock. > > + */ > > + if (!pci_device_is_present(adev->pdev) && > > + pci_is_thunderbolt_attached(adev->pdev)) { > > + struct pci_dev *bridge; > > + bool recovered = false; > > + > > + dev_info(adev->dev, > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > + > > + bridge = pci_upstream_bridge(adev->pdev); > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > + dev_info(adev->dev, > > + "attempting link recovery via %s\n", > > + pci_name(bridge)); > > + pci_bridge_secondary_bus_reset(bridge); > > + msleep(100); > > + if (pci_device_is_present(adev->pdev)) { > > + recovered = true; > > + break; > > + } > > + bridge = pci_upstream_bridge(bridge); > > + } > > + > > + if (!recovered) { > > + dev_err(adev->dev, > > + "Thunderbolt PCIe link recovery failed\n"); > > + ret = -ENODEV; > > + goto mode1_reset_failed; > > + } > > + > > + dev_info(adev->dev, > > + "Thunderbolt PCIe link recovered via %s\n", > > + pci_name(bridge)); > > + } > > + > > amdgpu_device_load_pci_state(adev->pdev); > > ret = amdgpu_psp_wait_for_bootloader(adev); > > if (ret) > > -- > > 2.51.0 > > [-- Attachment #2: Type: text/html, Size: 6744 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 12:13 ` Geramy Loveless @ 2026-04-09 12:57 ` Christian König 2026-04-09 13:12 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Christian König @ 2026-04-09 12:57 UTC (permalink / raw) To: Geramy Loveless; +Cc: amd-gfx, Mario Limonciello, alexander.deucher Hi Geramy, On 4/9/26 14:13, Geramy Loveless wrote: > Hi Christian, > > I appreciate the speedy response, > What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. To just give you a bit background on what happens here: AMD GPUs have an embedded micro controller called SMU which takes care of things like voltages, clocks, temperature, fan speed etc.. and reset. So when the kernel driver detects that it needs to do a reset it sends a MODE1 reset command to the SMU. But instead of the SMU coming back a short time later noting that the reset was done the device just drops off the bus (e.g. all reads return 0xffffffff). The cause of that can be anything, e.g. from power fluctuations to a dirty fan which doesn't starts to rotate again after it was stopped. I would try to narrow it down step by step, e.g. if it work on older kernels, if yes what feature/patch broke the behavior. You can also try to disable certain power management features like ASPM (try amdgpu.aspm=0 on the kernel command line). Hope that helps, Christian. > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com>> wrote: > > On 4/9/26 02:05, Geramy Loveless wrote: > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > Thunderbolt the TB driver receives no notification and the tunnel > > stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > All subsequent PCIe > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > reachable using pci_device_is_present(). If the device is behind > > Thunderbolt and the link is dead, walk up parent bridges calling > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > If recovery fails, return -ENODEV to prevent the > > reset retry loop. > > > > This also causes the GPU fan to be at 100% and basically when it > > happens and you are not there, you now have a GPU with fan at 100% and > > cant reset it. > > I wanted to notate some other things I am finding sometimes before > > this adventure of patches to the kernel and amdgpu driver. > > Sometimes a crash could happen in the drive and then the GPU fan speed > > hits 100% and the air is hot coming out without any workload, other > > times > > I have seen it have barely any fan speed at all and heat up more than > > it should at the fan level its curently operating at. 
These are things > > I have seen with this gpu in a TB5 dock with the driver and > > instability. I'm not sure exactly whats going on there but I figured > > since im communicating with these patches I might as well bring you up > > to speed and supermario has been great help throughout me trying to > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > It seems to be finally working with bar resizing after my kernel > > patch. Which allows you to safely release a empty switch bridge at the > > device end. > > Then it rebuilds it afterwords with the increased bar. This was done > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > branch with my patch here. > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u> > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > > > > > Thank you! > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto:gloveless@jqluv.com>> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > 1 file changed, 40 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 31a60173c..91d01d538 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > /* ensure no_hw_access is updated before we access hw */ > > smp_mb(); > > + /* > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > + * endpoint but the TB tunnel stays up unaware. Detect the > > + * dead link and attempt recovery by resetting parent bridges > > + * to retrain the physical PCIe link inside the dock. > > + */ > > + if (!pci_device_is_present(adev->pdev) && > > + pci_is_thunderbolt_attached(adev->pdev)) { > > + struct pci_dev *bridge; > > + bool recovered = false; > > + > > + dev_info(adev->dev, > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > + > > + bridge = pci_upstream_bridge(adev->pdev); > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > + dev_info(adev->dev, > > + "attempting link recovery via %s\n", > > + pci_name(bridge)); > > + pci_bridge_secondary_bus_reset(bridge); > > + msleep(100); > > + if (pci_device_is_present(adev->pdev)) { > > + recovered = true; > > + break; > > + } > > + bridge = pci_upstream_bridge(bridge); > > + } > > + > > + if (!recovered) { > > + dev_err(adev->dev, > > + "Thunderbolt PCIe link recovery failed\n"); > > + ret = -ENODEV; > > + goto mode1_reset_failed; > > + } > > + > > + dev_info(adev->dev, > > + "Thunderbolt PCIe link recovered via %s\n", > > + pci_name(bridge)); > > + } > > + > > amdgpu_device_load_pci_state(adev->pdev); > > ret = amdgpu_psp_wait_for_bootloader(adev); > > if (ret) > > -- > > 2.51.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 12:57 ` Christian König @ 2026-04-09 13:12 ` Geramy Loveless 2026-04-09 13:45 ` Christian König 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 13:12 UTC (permalink / raw) To: Christian König; +Cc: amd-gfx, Mario Limonciello, alexander.deucher [-- Attachment #1: Type: text/plain, Size: 9091 bytes --] Christian, This is going to be very interesting. The AMD Radeon AI R9700 Pro is running over USB4v2 into a Razer dock on a 650W power supply. I have two of these setups, with cards provided by AMD, for a project testing ROCm AI broadcast messages over TB5, with the two machines connected over TB5 via RDMA; sorry if I lose you here. What I am seeing, in my opinion, is that Thunderbolt is not very well fleshed out on Linux in terms of stability and accessible feature sets. After my kernel PCI patch I have gotten it to work, which we went over already. I have one machine successfully running Qwen3.5 27B A3B on the GPU with llama.cpp / lemonade, but yesterday, after loading the model (which it did, into VRAM), it crashed; ten minutes earlier it had been running a 73k-token prompt at 40 tps. As a side note, the power management on Thunderbolt seems to clash with the power management on the GPU. What I was seeing is that when the GPU doesn't detect a display it "suspends", which may or may not be correct, but it doesn't matter either way in this scenario. I believe the Thunderbolt fabric then shuts down as well and kills the link. I got past this by disabling all PM systems. So I'm not 100% sure what's going on or how to find the actual problem yet; maybe if I spend more time and brain power on this I'll find where in power management the problem is, or whether that's the real problem at all. Let me know if you can offer any more guidance on where I should dig. I appreciate the help so far, and the SMU information helps give more context. On Thu, Apr 9, 2026 at 5:57 AM Christian König <christian.koenig@amd.com> wrote: > Hi Geramy, > > On 4/9/26 14:13, Geramy Loveless wrote: > > Hi Christian, > > > > I appreciate the speedy response, > > What your saying makes sense they are basically wrapping symptoms that > could at least from what I seen now at this point only continue and > eventually create a web of useless code to try to catch all code paths it > hits during crashing. Let me investigate the real reason as to why it’s > crashing more rather then where. > > To just give you a bit background on what happens here: > > AMD GPUs have an embedded micro controller called SMU which takes care of > things like voltages, clocks, temperature, fan speed etc.. and reset. > > So when the kernel driver detects that it needs to do a reset it sends a > MODE1 reset command to the SMU. But instead of the SMU coming back a short > time later noting that the reset was done the device just drops off the bus > (e.g. all reads return 0xffffffff). > > The cause of that can be anything, e.g. from power fluctuations to a dirty > fan which doesn't starts to rotate again after it was stopped. > > I would try to narrow it down step by step, e.g. if it work on older > kernels, if yes what feature/patch broke the behavior. You can also try to > disable certain power management features like ASPM (try amdgpu.aspm=0 on > the kernel command line). > > Hope that helps, > Christian.
> > > > > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com > <mailto:christian.koenig@amd.com>> wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 > on > > > Thunderbolt the TB driver receives no notification and the tunnel > > > stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint > should still be reachable. > > > > > All subsequent PCIe > > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > > > After MODE1 reset completes, check whether the PCIe endpoint is > still > > > reachable using pci_device_is_present(). If the device is behind > > > Thunderbolt and the link is dead, walk up parent bridges calling > > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, > but the question is rather why does the MODE1 reset fails in the first > place? > > > > > If recovery fails, return -ENODEV to prevent the > > > reset retry loop. > > > > > > This also causes the GPU fan to be at 100% and basically when it > > > happens and you are not there, you now have a GPU with fan at 100% > and > > > cant reset it. > > > I wanted to notate some other things I am finding sometimes before > > > this adventure of patches to the kernel and amdgpu driver. > > > Sometimes a crash could happen in the drive and then the GPU fan > speed > > > hits 100% and the air is hot coming out without any workload, other > > > times > > > I have seen it have barely any fan speed at all and heat up more > than > > > it should at the fan level its curently operating at. These are > things > > > I have seen with this gpu in a TB5 dock with the driver and > > > instability. I'm not sure exactly whats going on there but I > figured > > > since im communicating with these patches I might as well bring > you up > > > to speed and supermario has been great help throughout me trying to > > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 > / > > > USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU > which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually > going on here? e.g. what is the root cause? > > > > > > > > It seems to be finally working with bar resizing after my kernel > > > patch. Which allows you to safely release a empty switch bridge at > the > > > device end. > > > Then it rebuilds it afterwords with the increased bar. This was > done > > > on Kernel 7.0-rc7 i believe it is and latest changes from > pci/resource > > > branch with my patch here. > > > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > < > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > > > > > > Thank you! 
> > > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto: > gloveless@jqluv.com>> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 > ++++++++++++++++++++++ > > > 1 file changed, 40 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > index 31a60173c..91d01d538 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > amdgpu_device *adev) > > > /* ensure no_hw_access is updated before we access hw */ > > > smp_mb(); > > > + /* > > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > + * endpoint but the TB tunnel stays up unaware. Detect the > > > + * dead link and attempt recovery by resetting parent bridges > > > + * to retrain the physical PCIe link inside the dock. > > > + */ > > > + if (!pci_device_is_present(adev->pdev) && > > > + pci_is_thunderbolt_attached(adev->pdev)) { > > > + struct pci_dev *bridge; > > > + bool recovered = false; > > > + > > > + dev_info(adev->dev, > > > + "PCIe link lost after mode1 reset, attempting Thunderbolt > recovery\n"); > > > + > > > + bridge = pci_upstream_bridge(adev->pdev); > > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > + dev_info(adev->dev, > > > + "attempting link recovery via %s\n", > > > + pci_name(bridge)); > > > + pci_bridge_secondary_bus_reset(bridge); > > > + msleep(100); > > > + if (pci_device_is_present(adev->pdev)) { > > > + recovered = true; > > > + break; > > > + } > > > + bridge = pci_upstream_bridge(bridge); > > > + } > > > + > > > + if (!recovered) { > > > + dev_err(adev->dev, > > > + "Thunderbolt PCIe link recovery failed\n"); > > > + ret = -ENODEV; > > > + goto mode1_reset_failed; > > > + } > > > + > > > + dev_info(adev->dev, > > > + "Thunderbolt PCIe link recovered via %s\n", > > > + pci_name(bridge)); > > > + } > > > + > > > amdgpu_device_load_pci_state(adev->pdev); > > > ret = amdgpu_psp_wait_for_bootloader(adev); > > > if (ret) > > > -- > > > 2.51.0 > > > > [-- Attachment #2: Type: text/html, Size: 11435 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 13:12 ` Geramy Loveless @ 2026-04-09 13:45 ` Christian König 0 siblings, 0 replies; 14+ messages in thread From: Christian König @ 2026-04-09 13:45 UTC (permalink / raw) To: Geramy Loveless; +Cc: amd-gfx, Mario Limonciello, alexander.deucher Hi Geramy, On 4/9/26 15:12, Geramy Loveless wrote: > Christian, > > This is going to be very interesting. > So the AMD Radeon AI R9700 Pro is running over USB4v2 into a Razor Dock on a 650W power supply I have two of these setups cards provided by AMD for some project testing on tb5 rocm ai broadcast messages for testing two machines connected over tb5 with rdma sorry if I loose you here. Just as a side note, 650W for two Radeon AI R9700 Pro is a bit tight. IIRC each of those GPUs can consume a maximum of 300W. Our test benches with two of those usually have 750W power supplies. But I don't think that this is the issue here, power supply problems usually look different. Is the whole box provided by AMD? Who is your contact person anyway? > What I am seeing or finding in my opinion is thunderbolt is not very well flushed out on Linux in terms of stability and accessible feature sets. Yeah agree, thunderbolt setups are also not something we usually have in our production testing. So it can work, but absolutely no guarantee for that. > After my kernel pci patch I have gotten it to work which we went over already. Your patch looks valid offhand. But you need to be super careful which resources to release on BAR resize. For example AMD GPUs usually have two resizable 64bit BARs and one 32bit BAR, *don't* release the 32bit BAR or you most likely run into trouble. So it would be helpful if you provide an lspci -vvvv -s $bdf of the two devices with and without your patch. > > I have one machine running Qwen3.5 27B A3B on the GPU with llamacpp / lemonade successfully but yesterday after attempting to load the model which it did into vram it crashed which ten minutes prior was running a prompt 73k tokens at 40tps. A side note The power management on Thunderbolt seems to clash with the power management on the GPU. What I was seeing is when the GPU doesn’t detect a display it “suspends” which may be correct or not but it doesn’t matter either way in this scenario. That behavior is perfectly correct but the message is a bit misleading. The GPU suspends when there is nothing displayed nor any software client connected. So every time all applications stop the GPU is powered down, and every time any application starts it is powered up again. This of course exercises all the drivers and firmware involved, so there is a lot which can go wrong and I'm not surprised that you have seen problems with that. > The Thunderbolt fabric I believe shuts down also and kills the link. I got passed this by disabling all PM systems. How did you do that? E.g. which parameter did you use? > So I’m not 100% sure what’s going on and how to find the actual problem yet maybe if I spend more time and brain power on this I’ll find where in power management it’s having the problem or if that’s the real problem. Let me know if you can offer anymore guidance into where I should dig. I appreciate the help this far and the SMU information helps get more context. I'm here to help. Cheers, Christian.
> > On Thu, Apr 9, 2026 at 5:57 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com>> wrote: > > Hi Geramy, > > On 4/9/26 14:13, Geramy Loveless wrote: > > Hi Christian, > > > > I appreciate the speedy response, > > What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. > > To just give you a bit background on what happens here: > > AMD GPUs have an embedded micro controller called SMU which takes care of things like voltages, clocks, temperature, fan speed etc.. and reset. > > So when the kernel driver detects that it needs to do a reset it sends a MODE1 reset command to the SMU. But instead of the SMU coming back a short time later noting that the reset was done the device just drops off the bus (e.g. all reads return 0xffffffff). > > The cause of that can be anything, e.g. from power fluctuations to a dirty fan which doesn't starts to rotate again after it was stopped. > > I would try to narrow it down step by step, e.g. if it work on older kernels, if yes what feature/patch broke the behavior. You can also try to disable certain power management features like ASPM (try amdgpu.aspm=0 on the kernel command line). > > Hope that helps, > Christian. > > > > > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com> <mailto:christian.koenig@amd.com <mailto:christian.koenig@amd.com>>> wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > > Thunderbolt the TB driver receives no notification and the tunnel > > > stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > All subsequent PCIe > > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > > reachable using pci_device_is_present(). If the device is behind > > > Thunderbolt and the link is dead, walk up parent bridges calling > > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > If recovery fails, return -ENODEV to prevent the > > > reset retry loop. > > > > > > This also causes the GPU fan to be at 100% and basically when it > > > happens and you are not there, you now have a GPU with fan at 100% and > > > cant reset it. > > > I wanted to notate some other things I am finding sometimes before > > > this adventure of patches to the kernel and amdgpu driver. > > > Sometimes a crash could happen in the drive and then the GPU fan speed > > > hits 100% and the air is hot coming out without any workload, other > > > times > > > I have seen it have barely any fan speed at all and heat up more than > > > it should at the fan level its curently operating at. These are things > > > I have seen with this gpu in a TB5 dock with the driver and > > > instability. 
I'm not sure exactly whats going on there but I figured > > > since im communicating with these patches I might as well bring you up > > > to speed and supermario has been great help throughout me trying to > > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > > USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > > > > It seems to be finally working with bar resizing after my kernel > > > patch. Which allows you to safely release a empty switch bridge at the > > > device end. > > > Then it rebuilds it afterwords with the increased bar. This was done > > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > > branch with my patch here. > > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u> <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u>> > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > > > > > > Thank you! > > > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto:gloveless@jqluv.com> <mailto:gloveless@jqluv.com <mailto:gloveless@jqluv.com>>> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > > 1 file changed, 40 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > index 31a60173c..91d01d538 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > > /* ensure no_hw_access is updated before we access hw */ > > > smp_mb(); > > > + /* > > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > + * endpoint but the TB tunnel stays up unaware. Detect the > > > + * dead link and attempt recovery by resetting parent bridges > > > + * to retrain the physical PCIe link inside the dock. 
> > > + */ > > > + if (!pci_device_is_present(adev->pdev) && > > > + pci_is_thunderbolt_attached(adev->pdev)) { > > > + struct pci_dev *bridge; > > > + bool recovered = false; > > > + > > > + dev_info(adev->dev, > > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > > + > > > + bridge = pci_upstream_bridge(adev->pdev); > > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > + dev_info(adev->dev, > > > + "attempting link recovery via %s\n", > > > + pci_name(bridge)); > > > + pci_bridge_secondary_bus_reset(bridge); > > > + msleep(100); > > > + if (pci_device_is_present(adev->pdev)) { > > > + recovered = true; > > > + break; > > > + } > > > + bridge = pci_upstream_bridge(bridge); > > > + } > > > + > > > + if (!recovered) { > > > + dev_err(adev->dev, > > > + "Thunderbolt PCIe link recovery failed\n"); > > > + ret = -ENODEV; > > > + goto mode1_reset_failed; > > > + } > > > + > > > + dev_info(adev->dev, > > > + "Thunderbolt PCIe link recovered via %s\n", > > > + pci_name(bridge)); > > > + } > > > + > > > amdgpu_device_load_pci_state(adev->pdev); > > > ret = amdgpu_psp_wait_for_bootloader(adev); > > > if (ret) > > > -- > > > 2.51.0 > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 11:42 ` Christian König 2026-04-09 12:13 ` Geramy Loveless @ 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless ` (2 more replies) 1 sibling, 3 replies; 14+ messages in thread From: Mario Limonciello @ 2026-04-09 18:12 UTC (permalink / raw) To: Christian König, Geramy Loveless, amd-gfx; +Cc: alexander.deucher On 4/9/26 06:42, Christian König wrote: > On 4/9/26 02:05, Geramy Loveless wrote: >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >> Thunderbolt the TB driver receives no notification and the tunnel >> stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > >> All subsequent PCIe >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >> triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > >> After MODE1 reset completes, check whether the PCIe endpoint is still >> reachable using pci_device_is_present(). If the device is behind >> Thunderbolt and the link is dead, walk up parent bridges calling >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >> inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > >> If recovery fails, return -ENODEV to prevent the >> reset retry loop. >> >> This also causes the GPU fan to be at 100% and basically when it >> happens and you are not there, you now have a GPU with fan at 100% and >> cant reset it. >> I wanted to notate some other things I am finding sometimes before >> this adventure of patches to the kernel and amdgpu driver. >> Sometimes a crash could happen in the drive and then the GPU fan speed >> hits 100% and the air is hot coming out without any workload, other >> times >> I have seen it have barely any fan speed at all and heat up more than >> it should at the fan level its curently operating at. These are things >> I have seen with this gpu in a TB5 dock with the driver and >> instability. I'm not sure exactly whats going on there but I figured >> since im communicating with these patches I might as well bring you up >> to speed and supermario has been great help throughout me trying to >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >> USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? We don't spend a lot of time in recovery scenarios for when 💩 hits the fan. I think in addition to finding and fixing the real root cause having a reproducible workload to cause the crash is a good opportunity to try to put in place better recovery too. Generally speaking I like the idea of if a mode1 reset fails to do a harder reset. At least in the path that we have GPU recovery (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do a full device reset makes sense to me. I think the placement is wrong though. amdgpu_device_mode1_reset() has a bunch of callers, and if you end up with a mode1 reset doing a full reset that might be a surprise to those callers. So I think a more logical place to put this would be explicitly in the GPU recovery path (amdgpu_device_gpu_recover). 
Maybe as part of the mode1 reset failure you can: set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); And then the GPU recovery path can jump right into a full reset? Not sure if that jives with your stack trace though. Furthermore; even though you reproduced this on Thunderbolt; I have no reason to believe it's specific to thunderbolt. An SMU crash can happen in any hardware. We may as well try full reset for recovery for any hardware. > >> >> It seems to be finally working with bar resizing after my kernel >> patch. Which allows you to safely release a empty switch bridge at the >> device end. >> Then it rebuilds it afterwords with the increased bar. This was done >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >> branch with my patch here. >> >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > >> >> Thank you! >> >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >> 1 file changed, 40 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index 31a60173c..91d01d538 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) >> /* ensure no_hw_access is updated before we access hw */ >> smp_mb(); >> + /* >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >> + * endpoint but the TB tunnel stays up unaware. Detect the >> + * dead link and attempt recovery by resetting parent bridges >> + * to retrain the physical PCIe link inside the dock. >> + */ >> + if (!pci_device_is_present(adev->pdev) && >> + pci_is_thunderbolt_attached(adev->pdev)) { >> + struct pci_dev *bridge; >> + bool recovered = false; >> + >> + dev_info(adev->dev, >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); >> + >> + bridge = pci_upstream_bridge(adev->pdev); >> + while (bridge && !pci_is_root_bus(bridge->bus)) { >> + dev_info(adev->dev, >> + "attempting link recovery via %s\n", >> + pci_name(bridge)); >> + pci_bridge_secondary_bus_reset(bridge); >> + msleep(100); >> + if (pci_device_is_present(adev->pdev)) { >> + recovered = true; >> + break; >> + } >> + bridge = pci_upstream_bridge(bridge); >> + } >> + >> + if (!recovered) { >> + dev_err(adev->dev, >> + "Thunderbolt PCIe link recovery failed\n"); >> + ret = -ENODEV; >> + goto mode1_reset_failed; >> + } >> + >> + dev_info(adev->dev, >> + "Thunderbolt PCIe link recovered via %s\n", >> + pci_name(bridge)); >> + } >> + >> amdgpu_device_load_pci_state(adev->pdev); >> ret = amdgpu_psp_wait_for_bootloader(adev); >> if (ret) >> -- >> 2.51.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
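A rough sketch of the fallback Mario describes above, placed in the recovery path rather than inside amdgpu_device_mode1_reset() itself. This is only an illustration of the idea: the exact integration point inside amdgpu_device_gpu_recover(), the availability of reset_context and the device list at that point, the helper name, and the use of amdgpu_do_asic_reset() as the "full reset" entry are assumptions, not the actual upstream code.

```c
/*
 * Illustrative sketch only (not the actual upstream code): if a MODE1
 * reset fails, flag a full device reset and let the normal recovery
 * machinery handle it. Helper name and calling context are assumptions.
 */
static int amdgpu_try_mode1_then_full_reset(struct amdgpu_device *adev,
					    struct list_head *device_list_handle,
					    struct amdgpu_reset_context *reset_context)
{
	int r;

	r = amdgpu_device_mode1_reset(adev);
	if (!r)
		return 0;

	dev_warn(adev->dev,
		 "MODE1 reset failed (%d), falling back to full reset\n", r);

	/* Ask the recovery path for a full reset, as suggested above. */
	set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
	return amdgpu_do_asic_reset(device_list_handle, reset_context);
}
```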
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello @ 2026-04-10 0:12 ` Geramy Loveless 2026-04-10 6:22 ` Geramy Loveless 2026-04-10 9:13 ` Christian König 2026-04-10 11:25 ` Lazar, Lijo 2 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 0:12 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher Hey, I have nearly finished my patch; I need to double check some things first, but here is a summary of the real cause as far as I can see. See below for an in-depth analysis. I am not sure, technically, whether Thunderbolt or amdgpu is at fault in this scenario; please let me know your opinion on this and where it should be patched. Of course the patch I'm working on provides a resolution path if it happens, which is a nice recovery mode, but it does not stop it from occurring. ## Summary R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. The GPU initializes fully, but MMIO later becomes unreachable while PCIe config space continues to work, cascading into SDMA timeouts and GPU reset loops. Reproduced on two separate boots (10s and 96s after init). The split between working config space and dead MMIO points to the Thunderbolt PCIe tunnel selectively dropping memory transactions while continuing to pass configuration transactions. ## Hardware - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 - Dock: Razer Core X V2 (TB4, FW 59.82) - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled - PCIe topology through dock: ``` 66:03.0 TB bridge (32GB pref window) 93:00.0 Intel 5786 Upstream Switch 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) 95:00.0 AMD 1478 Upstream Switch 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) 97:00.0 GPU [1002:7551] ``` - No display connected to eGPU ## Kernel ``` Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 ``` ## The evidence ### 1. GPU initializes successfully on both boots Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): ``` 16:45:46.038 SMU is initialized successfully! 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 16:45:46.131 runtime pm is manually disabled 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 ``` Boot 0 (7.0.0-rc7-egpu+, dmesg): ``` [9551.162] SMU is initialized successfully! [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 [9551.248] runtime pm is manually disabled [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 ``` All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to all init-time messages. No errors during initialization. ### 2. SMU becomes unreachable after variable delay Boot -1 — **10 seconds** after init: ``` 16:45:56.192 Failed to disable gfxoff! 16:45:56.192 SMU is in hanged state, failed to send smu message! 16:45:56.192 Failed to export SMU metrics table! (repeated ~30 times) ``` Boot 0 — **96 seconds** after init: ``` [9647.872] Failed to export SMU metrics table! [9647.872] SMU is in hanged state, failed to send smu message! (repeated) [9661.567] Failed to disable gfxoff! ``` The delay is not consistent (10s vs 96s), ruling out a fixed firmware timer.
The first failing operation varies (gfxoff disable vs metrics export), suggesting the SMU itself isn't crashing — the communication path to it is dying. ### 3. Config space alive, MMIO dead (proved during boot 0 crash) Tested during the active crash with the GPU in "SMU hanged" state: **Config space (works):** ``` $ sudo setpci -s 97:00.0 0x00.l 75511002 ← correct vendor/device ID $ sudo setpci -s 97:00.0 0x04.l 00100406 ← status/command register OK ``` **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** ```python fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) data = os.read(fd, 4) # OSError: [Errno 5] Input/output error ``` **MMIO BAR0 at 0x8880000000 — VRAM (dead):** ```python fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) data = os.read(fd, 4) # OSError: [Errno 5] Input/output error ``` Config transactions reach the device through the TB tunnel. Memory transactions do not. This is the root cause of the "SMU hanged state" — the SMU firmware is likely fine, but the MMIO writes to its mailbox registers never arrive. ### 4. Thunderbolt host router and dock are runtime-suspended Monitored with 2-second polling during the crash: ``` 16:54:06 host=suspended dock=suspended gpu_errors=1 16:54:09 host=suspended dock=suspended gpu_errors=3 16:54:11 host=suspended dock=suspended gpu_errors=5 ... 16:54:57 host=suspended dock=suspended gpu_errors=119 ``` TB host router runtime PM configuration: ``` /sys/bus/thunderbolt/devices/0-0/power/control = auto /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended ``` Both the TB host router and dock switch show `suspended` throughout the crash. The PCIe tunnel was activated while they were in this state. ### 5. Waking TB host router does not restore MMIO ``` $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status active $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" # OSError: [Errno 5] Input/output error ← still dead ``` Once MMIO is lost, waking the TB host router doesn't recover it. The damage to the memory transaction path persists until device removal and re-enumeration (or reboot). ### 6. Full crash cascade (boot -1) ``` 16:45:46 Init complete 16:45:56 SMU hanged (MMIO path dead) 16:45:58 SDMA ring timeout → ring reset succeeds 16:46:00 SDMA ring timeout again → ring reset succeeds 16:51:16 Full GPU reset (MODE1 via SMU) 16:51:33 GPU reset succeeded, SMU resumed GPU runs for 37 more minutes 17:28:17 SDMA timeout → ring reset FAILS 17:28:17 GPU reset returns -ENODEV (device gone from bus) Infinite reset loop, system unusable ``` MODE1 reset succeeds once (re-establishing the MMIO path temporarily) but the problem recurs, and the second MODE1 reset kills the PCIe link entirely (-ENODEV). ## Configuration notes - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not active. This does not prevent the crash. - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management is disabled. This does not prevent the crash. - TB host router runtime PM (`power/control=auto`) is NOT disabled by any of the above kernel parameters. - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. Init succeeds despite mismatch. - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. 
## Related - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) - SMU driver if version 0x2e, fw if version 0x32 (mismatch) - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 - TB dock: Razer Core X V2, FW 59.82 Geramy L. Loveless Founder & Chief Innovation Officer JQluv.net, Inc. Site: JQluv.com Mobile: 559.999.1557 Office: 1 (877) 44 JQluv On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello <mario.limonciello@amd.com> wrote: > > > > On 4/9/26 06:42, Christian König wrote: > > On 4/9/26 02:05, Geramy Loveless wrote: > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > >> Thunderbolt the TB driver receives no notification and the tunnel > >> stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > >> All subsequent PCIe > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > >> triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > >> reachable using pci_device_is_present(). If the device is behind > >> Thunderbolt and the link is dead, walk up parent bridges calling > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > >> inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > >> If recovery fails, return -ENODEV to prevent the > >> reset retry loop. > >> > >> This also causes the GPU fan to be at 100% and basically when it > >> happens and you are not there, you now have a GPU with fan at 100% and > >> cant reset it. > >> I wanted to notate some other things I am finding sometimes before > >> this adventure of patches to the kernel and amdgpu driver. > >> Sometimes a crash could happen in the drive and then the GPU fan speed > >> hits 100% and the air is hot coming out without any workload, other > >> times > >> I have seen it have barely any fan speed at all and heat up more than > >> it should at the fan level its curently operating at. These are things > >> I have seen with this gpu in a TB5 dock with the driver and > >> instability. I'm not sure exactly whats going on there but I figured > >> since im communicating with these patches I might as well bring you up > >> to speed and supermario has been great help throughout me trying to > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > >> USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > fan. I think in addition to finding and fixing the real root cause > having a reproducible workload to cause the crash is a good opportunity > to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a > harder reset. At least in the path that we have GPU recovery > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > a full device reset makes sense to me. > > I think the placement is wrong though. 
amdgpu_device_mode1_reset() has > a bunch of callers, and if you end up with a mode1 reset doing a full > reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not > sure if that jives with your stack trace though. > > Furthermore; even though you reproduced this on Thunderbolt; I have no > reason to believe it's specific to thunderbolt. An SMU crash can happen > in any hardware. We may as well try full reset for recovery for any > hardware. > > > > >> > >> It seems to be finally working with bar resizing after my kernel > >> patch. Which allows you to safely release a empty switch bridge at the > >> device end. > >> Then it rebuilds it afterwords with the increased bar. This was done > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > >> branch with my patch here. > >> > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > >> > >> Thank you! > >> > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > >> 1 file changed, 40 insertions(+) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> index 31a60173c..91d01d538 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > >> /* ensure no_hw_access is updated before we access hw */ > >> smp_mb(); > >> + /* > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > >> + * endpoint but the TB tunnel stays up unaware. Detect the > >> + * dead link and attempt recovery by resetting parent bridges > >> + * to retrain the physical PCIe link inside the dock. > >> + */ > >> + if (!pci_device_is_present(adev->pdev) && > >> + pci_is_thunderbolt_attached(adev->pdev)) { > >> + struct pci_dev *bridge; > >> + bool recovered = false; > >> + > >> + dev_info(adev->dev, > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > >> + > >> + bridge = pci_upstream_bridge(adev->pdev); > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > >> + dev_info(adev->dev, > >> + "attempting link recovery via %s\n", > >> + pci_name(bridge)); > >> + pci_bridge_secondary_bus_reset(bridge); > >> + msleep(100); > >> + if (pci_device_is_present(adev->pdev)) { > >> + recovered = true; > >> + break; > >> + } > >> + bridge = pci_upstream_bridge(bridge); > >> + } > >> + > >> + if (!recovered) { > >> + dev_err(adev->dev, > >> + "Thunderbolt PCIe link recovery failed\n"); > >> + ret = -ENODEV; > >> + goto mode1_reset_failed; > >> + } > >> + > >> + dev_info(adev->dev, > >> + "Thunderbolt PCIe link recovered via %s\n", > >> + pci_name(bridge)); > >> + } > >> + > >> amdgpu_device_load_pci_state(adev->pdev); > >> ret = amdgpu_psp_wait_for_bootloader(adev); > >> if (ret) > >> -- > >> 2.51.0 > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 0:12 ` Geramy Loveless @ 2026-04-10 6:22 ` Geramy Loveless 2026-04-10 7:07 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 6:22 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher Before you guys waste your time reading all the below, dont. I had made a mistake in my patch to PCI basically causing the entire tree of devices to get released, I was using an unsafe version of the pci API. It has been corrected following suggestions by the reviewer on the original patch. https://lore.kernel.org/linux-pci/20260410052918.5556-2-gloveless@jqluv.com/ If you would like I still could submit some safety patch work as mario was suggesting that its not a bad idea to have the ability to handle edge case situations to prevent crashing in the future if something goes awry. There was also one basically the GPU would not get initialized correctly and it would attempt to access null reference rings which never got filled. I appreciate the help and explanation of how the systems on the GPU work and look forward to learning more but hopefully not because of a bug that either I cause or run across haha :) I am going to start backing out the kernel parameters in hopes that everything is happy with the pci fix I implemented and hopefully i dont need to set params. On Thu, Apr 9, 2026 at 5:12 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > Hey, > > I have nearly finished my patch, I need to double check some stuff > first but here is a summary of the real cause that I can see. > See below for a in depth in depth analysis. I am not sure technically > if thunderbolt is at fault in this scenario or the amdgpu please let > me know what your opinion is on this and where it should be patched > at. Of course the patch i'm working on allows for a resolution path if > it happens which is a nice recovery mode but it doesnt solve it > occuring. > > ## Summary > > R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. GPU > initializes fully but MMIO becomes unreachable while PCIe config space > continues to work. MMIO becomes unresponsive, > cascading into SDMA timeouts and GPU reset loops. Reproduced on two > separate boots (10s and 96s after init). > > The split between working config space and dead MMIO points to the > Thunderbolt PCIe tunnel selectively dropping memory transactions while > continuing to pass configuration transactions. > > ## Hardware > > - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU > - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 > - Dock: Razer Core X V2 (TB4, FW 59.82) > - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 > - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane > - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled > - PCIe topology through dock: > ``` > 66:03.0 TB bridge (32GB pref window) > 93:00.0 Intel 5786 Upstream Switch > 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) > 95:00.0 AMD 1478 Upstream Switch > 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) > 97:00.0 GPU [1002:7551] > ``` > - No display connected to eGPU > > ## Kernel > > ``` > Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC > cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 > ``` > > ## The evidence > > ### 1. GPU initializes successfully on both boots > > Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): > ``` > 16:45:46.038 SMU is initialized successfully! 
> 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > 16:45:46.131 runtime pm is manually disabled > 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > ``` > > Boot 0 (7.0.0-rc7-egpu+, dmesg): > ``` > [9551.162] SMU is initialized successfully! > [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > [9551.248] runtime pm is manually disabled > [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > ``` > > All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to > all init-time messages. No errors during initialization. > > ### 2. SMU becomes unreachable after variable delay > > Boot -1 — **10 seconds** after init: > ``` > 16:45:56.192 Failed to disable gfxoff! > 16:45:56.192 SMU is in hanged state, failed to send smu message! > 16:45:56.192 Failed to export SMU metrics table! > (repeated ~30 times) > ``` > > Boot 0 — **96 seconds** after init: > ``` > [9647.872] Failed to export SMU metrics table! > [9647.872] SMU is in hanged state, failed to send smu message! > (repeated) > [9661.567] Failed to disable gfxoff! > ``` > > The delay is not consistent (10s vs 96s), ruling out a fixed firmware > timer. The first failing operation varies (gfxoff disable vs metrics > export), suggesting the SMU itself isn't crashing — the communication > path to it is dying. > > ### 3. Config space alive, MMIO dead (proved during boot 0 crash) > > Tested during the active crash with the GPU in "SMU hanged" state: > > **Config space (works):** > ``` > $ sudo setpci -s 97:00.0 0x00.l > 75511002 ← correct vendor/device ID > $ sudo setpci -s 97:00.0 0x04.l > 00100406 ← status/command register OK > ``` > > **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** > ```python > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) > data = os.read(fd, 4) > # OSError: [Errno 5] Input/output error > ``` > > **MMIO BAR0 at 0x8880000000 — VRAM (dead):** > ```python > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) > data = os.read(fd, 4) > # OSError: [Errno 5] Input/output error > ``` > > Config transactions reach the device through the TB tunnel. Memory > transactions do not. This is the root cause of the "SMU hanged state" — > the SMU firmware is likely fine, but the MMIO writes to its mailbox > registers never arrive. > > ### 4. Thunderbolt host router and dock are runtime-suspended > > Monitored with 2-second polling during the crash: > ``` > 16:54:06 host=suspended dock=suspended gpu_errors=1 > 16:54:09 host=suspended dock=suspended gpu_errors=3 > 16:54:11 host=suspended dock=suspended gpu_errors=5 > ... > 16:54:57 host=suspended dock=suspended gpu_errors=119 > ``` > > TB host router runtime PM configuration: > ``` > /sys/bus/thunderbolt/devices/0-0/power/control = auto > /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 > /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended > ``` > > Both the TB host router and dock switch show `suspended` throughout > the crash. The PCIe tunnel was activated while they were in this state. > > ### 5. Waking TB host router does not restore MMIO > > ``` > $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control > $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status > active > > $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" > # OSError: [Errno 5] Input/output error ← still dead > ``` > > Once MMIO is lost, waking the TB host router doesn't recover it. 
The > damage to the memory transaction path persists until device removal > and re-enumeration (or reboot). > > ### 6. Full crash cascade (boot -1) > > ``` > 16:45:46 Init complete > 16:45:56 SMU hanged (MMIO path dead) > 16:45:58 SDMA ring timeout → ring reset succeeds > 16:46:00 SDMA ring timeout again → ring reset succeeds > 16:51:16 Full GPU reset (MODE1 via SMU) > 16:51:33 GPU reset succeeded, SMU resumed > GPU runs for 37 more minutes > 17:28:17 SDMA timeout → ring reset FAILS > 17:28:17 GPU reset returns -ENODEV (device gone from bus) > Infinite reset loop, system unusable > ``` > > MODE1 reset succeeds once (re-establishing the MMIO path temporarily) > but the problem recurs, and the second MODE1 reset kills the PCIe link > entirely (-ENODEV). > > ## Configuration notes > > - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not > active. This does not prevent the crash. > - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management > is disabled. This does not prevent the crash. > - TB host router runtime PM (`power/control=auto`) is NOT disabled by > any of the above kernel parameters. > - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. > Init succeeds despite mismatch. > - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. > > ## Related > > - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 > - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) > - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) > - SMU driver if version 0x2e, fw if version 0x32 (mismatch) > - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 > - TB dock: Razer Core X V2, FW 59.82 > > > > > Geramy L. Loveless > Founder & Chief Innovation Officer > > JQluv.net, Inc. > Site: JQluv.com > Mobile: 559.999.1557 > Office: 1 (877) 44 JQluv > > > > > On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello > <mario.limonciello@amd.com> wrote: > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > > On 4/9/26 02:05, Geramy Loveless wrote: > > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > >> Thunderbolt the TB driver receives no notification and the tunnel > > >> stays up while the endpoint is unreachable. > > > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > >> All subsequent PCIe > > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > >> triggering an infinite reset loop that hangs the system. > > > > > > That sounds more like the MODE1 reset failed. > > > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > > >> reachable using pci_device_is_present(). If the device is behind > > >> Thunderbolt and the link is dead, walk up parent bridges calling > > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > >> inside the dock. > > > > > > Well that is then a bus reset. > > > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > >> If recovery fails, return -ENODEV to prevent the > > >> reset retry loop. > > >> > > >> This also causes the GPU fan to be at 100% and basically when it > > >> happens and you are not there, you now have a GPU with fan at 100% and > > >> cant reset it. > > >> I wanted to notate some other things I am finding sometimes before > > >> this adventure of patches to the kernel and amdgpu driver. 
> > >> Sometimes a crash could happen in the drive and then the GPU fan speed > > >> hits 100% and the air is hot coming out without any workload, other > > >> times > > >> I have seen it have barely any fan speed at all and heat up more than > > >> it should at the fan level its curently operating at. These are things > > >> I have seen with this gpu in a TB5 dock with the driver and > > >> instability. I'm not sure exactly whats going on there but I figured > > >> since im communicating with these patches I might as well bring you up > > >> to speed and supermario has been great help throughout me trying to > > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > >> USB4v2 dock! > > > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > fan. I think in addition to finding and fixing the real root cause > > having a reproducible workload to cause the crash is a good opportunity > > to try to put in place better recovery too. > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > harder reset. At least in the path that we have GPU recovery > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > a full device reset makes sense to me. > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > a bunch of callers, and if you end up with a mode1 reset doing a full > > reset that might be a surprise to those callers. > > > > So I think a more logical place to put this would be explicitly in the > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > mode1 reset failure you can: > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > And then the GPU recovery path can jump right into a full reset? Not > > sure if that jives with your stack trace though. > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > in any hardware. We may as well try full reset for recovery for any > > hardware. > > > > > > > >> > > >> It seems to be finally working with bar resizing after my kernel > > >> patch. Which allows you to safely release a empty switch bridge at the > > >> device end. > > >> Then it rebuilds it afterwords with the increased bar. This was done > > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > >> branch with my patch here. > > >> > > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > > > Regards, > > > Christian. > > > > > >> > > >> Thank you! 
> > >> > > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > >> --- > > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > >> 1 file changed, 40 insertions(+) > > >> > > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> index 31a60173c..91d01d538 100644 > > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > >> /* ensure no_hw_access is updated before we access hw */ > > >> smp_mb(); > > >> + /* > > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > >> + * endpoint but the TB tunnel stays up unaware. Detect the > > >> + * dead link and attempt recovery by resetting parent bridges > > >> + * to retrain the physical PCIe link inside the dock. > > >> + */ > > >> + if (!pci_device_is_present(adev->pdev) && > > >> + pci_is_thunderbolt_attached(adev->pdev)) { > > >> + struct pci_dev *bridge; > > >> + bool recovered = false; > > >> + > > >> + dev_info(adev->dev, > > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > >> + > > >> + bridge = pci_upstream_bridge(adev->pdev); > > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > >> + dev_info(adev->dev, > > >> + "attempting link recovery via %s\n", > > >> + pci_name(bridge)); > > >> + pci_bridge_secondary_bus_reset(bridge); > > >> + msleep(100); > > >> + if (pci_device_is_present(adev->pdev)) { > > >> + recovered = true; > > >> + break; > > >> + } > > >> + bridge = pci_upstream_bridge(bridge); > > >> + } > > >> + > > >> + if (!recovered) { > > >> + dev_err(adev->dev, > > >> + "Thunderbolt PCIe link recovery failed\n"); > > >> + ret = -ENODEV; > > >> + goto mode1_reset_failed; > > >> + } > > >> + > > >> + dev_info(adev->dev, > > >> + "Thunderbolt PCIe link recovered via %s\n", > > >> + pci_name(bridge)); > > >> + } > > >> + > > >> amdgpu_device_load_pci_state(adev->pdev); > > >> ret = amdgpu_psp_wait_for_bootloader(adev); > > >> if (ret) > > >> -- > > >> 2.51.0 > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 6:22 ` Geramy Loveless @ 2026-04-10 7:07 ` Geramy Loveless 0 siblings, 0 replies; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 7:07 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher One last update, we are more stable but there is still something crashing the GPU somewhere please give guidance on where to look. ## SMU Firmware Version ``` smu driver if version = 0x0000002e smu fw if version = 0x00000032 smu fw program = 0 smu fw version = 0x00684b00 (104.75.0) ``` Note: Driver interface version (0x2e / 46) does not match firmware interface version (0x32 / 50). ## PCI Topology ``` 65:00.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) 66:00.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → NHI 66:01.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → empty hotplug port 66:02.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → USB 66:03.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → dock 93:00.0 PCI bridge: Intel Barlow Ridge Hub 80G (rev 85) → dock switch 94:00.0 PCI bridge: Intel Barlow Ridge Hub 80G (rev 85) → downstream 95:00.0 PCI bridge: AMD Navi 10 XL Upstream Port (rev 24) 96:00.0 PCI bridge: AMD Navi 10 XL Downstream Port (rev 24) 97:00.0 VGA: AMD [1002:7551] (rev c0) ← GPU 97:00.1 Audio: AMD [1002:ab40] ``` ## Workload GPU compute via llama.cpp (ROCm/HIP backend), running Qwen3.5-35B-A3B-Q4_K_M.gguf model (20.49 GiB, fully offloaded to VRAM). Flash attention enabled, 128K context, 32 threads. ## Crash Timeline All timestamps from `dmesg -T`, kernel boot-relative times in brackets. ### GPU initialization (successful) ``` [603.644s] GPU probe: IP DISCOVERY 0x1002:0x7551 [603.653s] Detected IP block: smu_v14_0_0, gfx_v12_0_0 [603.771s] Detected VRAM RAM=32624M, BAR=32768M, RAM width 256bits GDDR6 [604.014s] SMU driver IF 0x2e, FW IF 0x32, FW version 104.75.0 [604.049s] SMU is initialized successfully! [604.119s] Runtime PM manually disabled (amdgpu.runpm=0) [604.119s] Initialized amdgpu 3.64.0 for 0000:97:00.0 ``` ### SMU stops responding [T+4238s after init, ~70 minutes] ``` [4841.828s] SMU: No response msg_reg: 12 resp_reg: 0 [4841.828s] [smu_v14_0_2_get_power_profile_mode] Failed to get activity monitor! [4849.393s] SMU: No response msg_reg: 12 resp_reg: 0 [4849.393s] Failed to export SMU metrics table! ``` 15 consecutive `SMU: No response` messages logged between [4841s] and [4948s], approximately every 7-8 seconds. All with `msg_reg: 12 resp_reg: 0`. Failed operations include: - `smu_v14_0_2_get_power_profile_mode` — Failed to get activity monitor - `Failed to export SMU metrics table` - `Failed to get current clock freq` ### Page faults begin [T+4349s after init, ~111s after first SMU failure] ``` [4948.927s] [gfxhub] page fault (src_id:0 ring:40 vmid:9 pasid:108) Process llama-cli pid 35632 GCVM_L2_PROTECTION_FAULT_STATUS: 0x00941051 Faulty UTCL2 client ID: TCP (0x8) PERMISSION_FAULTS: 0x5 WALKER_ERROR: 0x0 MAPPING_ERROR: 0x0 RW: 0x1 (write) ``` 10 page faults logged at [4948s], all from TCP (Texture Cache Pipe), all PERMISSION_FAULTS=0x5, WALKER_ERROR=0x0, MAPPING_ERROR=0x0. 
7 unique faulting addresses: - 0x000072ce90828000 - 0x000072ce90a88000 - 0x000072ce90a89000 - 0x000072ce90cde000 - 0x000072ce90ce1000 - 0x000072ce90f51000 - 0x000072ce90f52000 ### MES failure and GPU reset [T+4349s] ``` [4952.809s] MES(0) failed to respond to msg=REMOVE_QUEUE [4952.809s] failed to remove hardware queue from MES, doorbell=0x1806 [4952.809s] MES might be in unrecoverable state, issue a GPU reset [4952.809s] Failed to evict queue 4 [4952.809s] Failed to evict process queues [4952.809s] GPU reset begin!. Source: 3 ``` ### GPU reset fails ``` [4953.121s] Failed to evict queue 4 [4953.121s] Failed to suspend process pid 28552 [4953.121s] remove_all_kfd_queues_mes: Failed to remove queue 3 for dev 62536 ``` 6 MES(1) REMOVE_QUEUE failures, each timing out after ~2.5 seconds: ``` [4955.720s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4958.283s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4960.847s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4963.411s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4965.976s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4968.540s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue ``` ### PSP suspend fails ``` [4971.164s] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0) [4971.164s] Failed to terminate ras ta [4971.164s] suspend of IP block <psp> failed -22 ``` ### Suspend unwind fails — SMU not ready ``` [4971.164s] SMU is resuming... [4971.164s] SMC is not ready [4971.164s] SMC engine is not correctly up! [4971.164s] resume of IP block <smu> failed -5 [4971.164s] amdgpu_device_ip_resume_phase2 failed during unwind: -5 [4971.164s] GPU pre asic reset failed with err, -22 for drm dev, 0000:97:00.0 ``` ### MODE1 reset — SMU still dead ``` [4971.164s] MODE1 reset [4971.164s] GPU mode1 reset [4971.164s] GPU smu mode1 reset [4972.193s] GPU reset succeeded, trying to resume [4972.193s] VRAM is lost due to GPU reset! [4972.193s] SMU is resuming... [4972.193s] SMC is not ready [4972.193s] SMC engine is not correctly up! [4972.193s] resume of IP block <smu> failed -5 [4972.193s] GPU reset end with ret = -5 ``` Geramy L. Loveless Founder & Chief Innovation Officer JQluv.net, Inc. Site: JQluv.com Mobile: 559.999.1557 Office: 1 (877) 44 JQluv On Thu, Apr 9, 2026 at 11:22 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > Before you guys waste your time reading all the below, dont. > I had made a mistake in my patch to PCI basically causing the entire > tree of devices to get released, I was using an unsafe version of the > pci API. > It has been corrected following suggestions by the reviewer on the > original patch. > > https://lore.kernel.org/linux-pci/20260410052918.5556-2-gloveless@jqluv.com/ > > If you would like I still could submit some safety patch work as mario > was suggesting that its not a bad idea to have the ability > to handle edge case situations to prevent crashing in the future if > something goes awry. There was also one basically the GPU would not > get initialized correctly and it would attempt to access null > reference rings which never got filled. 
> > I appreciate the help and explanation of how the systems on the GPU > work and look forward to learning more but hopefully not because of a > bug that either > I cause or run across haha :) I am going to start backing out the > kernel parameters in hopes that everything is happy with the pci fix I > implemented and hopefully i dont need to set params. > > > > On Thu, Apr 9, 2026 at 5:12 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > > > Hey, > > > > I have nearly finished my patch, I need to double check some stuff > > first but here is a summary of the real cause that I can see. > > See below for a in depth in depth analysis. I am not sure technically > > if thunderbolt is at fault in this scenario or the amdgpu please let > > me know what your opinion is on this and where it should be patched > > at. Of course the patch i'm working on allows for a resolution path if > > it happens which is a nice recovery mode but it doesnt solve it > > occuring. > > > > ## Summary > > > > R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. GPU > > initializes fully but MMIO becomes unreachable while PCIe config space > > continues to work. MMIO becomes unresponsive, > > cascading into SDMA timeouts and GPU reset loops. Reproduced on two > > separate boots (10s and 96s after init). > > > > The split between working config space and dead MMIO points to the > > Thunderbolt PCIe tunnel selectively dropping memory transactions while > > continuing to pass configuration transactions. > > > > ## Hardware > > > > - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU > > - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 > > - Dock: Razer Core X V2 (TB4, FW 59.82) > > - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 > > - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane > > - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled > > - PCIe topology through dock: > > ``` > > 66:03.0 TB bridge (32GB pref window) > > 93:00.0 Intel 5786 Upstream Switch > > 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) > > 95:00.0 AMD 1478 Upstream Switch > > 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) > > 97:00.0 GPU [1002:7551] > > ``` > > - No display connected to eGPU > > > > ## Kernel > > > > ``` > > Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC > > cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 > > ``` > > > > ## The evidence > > > > ### 1. GPU initializes successfully on both boots > > > > Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): > > ``` > > 16:45:46.038 SMU is initialized successfully! > > 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > > 16:45:46.131 runtime pm is manually disabled > > 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > > ``` > > > > Boot 0 (7.0.0-rc7-egpu+, dmesg): > > ``` > > [9551.162] SMU is initialized successfully! > > [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > > [9551.248] runtime pm is manually disabled > > [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > > ``` > > > > All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to > > all init-time messages. No errors during initialization. > > > > ### 2. SMU becomes unreachable after variable delay > > > > Boot -1 — **10 seconds** after init: > > ``` > > 16:45:56.192 Failed to disable gfxoff! > > 16:45:56.192 SMU is in hanged state, failed to send smu message! > > 16:45:56.192 Failed to export SMU metrics table! 
> > (repeated ~30 times) > > ``` > > > > Boot 0 — **96 seconds** after init: > > ``` > > [9647.872] Failed to export SMU metrics table! > > [9647.872] SMU is in hanged state, failed to send smu message! > > (repeated) > > [9661.567] Failed to disable gfxoff! > > ``` > > > > The delay is not consistent (10s vs 96s), ruling out a fixed firmware > > timer. The first failing operation varies (gfxoff disable vs metrics > > export), suggesting the SMU itself isn't crashing — the communication > > path to it is dying. > > > > ### 3. Config space alive, MMIO dead (proved during boot 0 crash) > > > > Tested during the active crash with the GPU in "SMU hanged" state: > > > > **Config space (works):** > > ``` > > $ sudo setpci -s 97:00.0 0x00.l > > 75511002 ← correct vendor/device ID > > $ sudo setpci -s 97:00.0 0x04.l > > 00100406 ← status/command register OK > > ``` > > > > **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** > > ```python > > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) > > data = os.read(fd, 4) > > # OSError: [Errno 5] Input/output error > > ``` > > > > **MMIO BAR0 at 0x8880000000 — VRAM (dead):** > > ```python > > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) > > data = os.read(fd, 4) > > # OSError: [Errno 5] Input/output error > > ``` > > > > Config transactions reach the device through the TB tunnel. Memory > > transactions do not. This is the root cause of the "SMU hanged state" — > > the SMU firmware is likely fine, but the MMIO writes to its mailbox > > registers never arrive. > > > > ### 4. Thunderbolt host router and dock are runtime-suspended > > > > Monitored with 2-second polling during the crash: > > ``` > > 16:54:06 host=suspended dock=suspended gpu_errors=1 > > 16:54:09 host=suspended dock=suspended gpu_errors=3 > > 16:54:11 host=suspended dock=suspended gpu_errors=5 > > ... > > 16:54:57 host=suspended dock=suspended gpu_errors=119 > > ``` > > > > TB host router runtime PM configuration: > > ``` > > /sys/bus/thunderbolt/devices/0-0/power/control = auto > > /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 > > /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended > > ``` > > > > Both the TB host router and dock switch show `suspended` throughout > > the crash. The PCIe tunnel was activated while they were in this state. > > > > ### 5. Waking TB host router does not restore MMIO > > > > ``` > > $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control > > $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status > > active > > > > $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" > > # OSError: [Errno 5] Input/output error ← still dead > > ``` > > > > Once MMIO is lost, waking the TB host router doesn't recover it. The > > damage to the memory transaction path persists until device removal > > and re-enumeration (or reboot). > > > > ### 6. 
Full crash cascade (boot -1) > > > > ``` > > 16:45:46 Init complete > > 16:45:56 SMU hanged (MMIO path dead) > > 16:45:58 SDMA ring timeout → ring reset succeeds > > 16:46:00 SDMA ring timeout again → ring reset succeeds > > 16:51:16 Full GPU reset (MODE1 via SMU) > > 16:51:33 GPU reset succeeded, SMU resumed > > GPU runs for 37 more minutes > > 17:28:17 SDMA timeout → ring reset FAILS > > 17:28:17 GPU reset returns -ENODEV (device gone from bus) > > Infinite reset loop, system unusable > > ``` > > > > MODE1 reset succeeds once (re-establishing the MMIO path temporarily) > > but the problem recurs, and the second MODE1 reset kills the PCIe link > > entirely (-ENODEV). > > > > ## Configuration notes > > > > - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not > > active. This does not prevent the crash. > > - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management > > is disabled. This does not prevent the crash. > > - TB host router runtime PM (`power/control=auto`) is NOT disabled by > > any of the above kernel parameters. > > - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. > > Init succeeds despite mismatch. > > - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. > > > > ## Related > > > > - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 > > - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) > > - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) > > - SMU driver if version 0x2e, fw if version 0x32 (mismatch) > > - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 > > - TB dock: Razer Core X V2, FW 59.82 > > > > > > > > > > Geramy L. Loveless > > Founder & Chief Innovation Officer > > > > JQluv.net, Inc. > > Site: JQluv.com > > Mobile: 559.999.1557 > > Office: 1 (877) 44 JQluv > > > > > > > > > > On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello > > <mario.limonciello@amd.com> wrote: > > > > > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > > >> Thunderbolt the TB driver receives no notification and the tunnel > > > >> stays up while the endpoint is unreachable. > > > > > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > > > >> All subsequent PCIe > > > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > >> triggering an infinite reset loop that hangs the system. > > > > > > > > That sounds more like the MODE1 reset failed. > > > > > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > > > >> reachable using pci_device_is_present(). If the device is behind > > > >> Thunderbolt and the link is dead, walk up parent bridges calling > > > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > >> inside the dock. > > > > > > > > Well that is then a bus reset. > > > > > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > > > >> If recovery fails, return -ENODEV to prevent the > > > >> reset retry loop. > > > >> > > > >> This also causes the GPU fan to be at 100% and basically when it > > > >> happens and you are not there, you now have a GPU with fan at 100% and > > > >> cant reset it. 
> > > >> I wanted to notate some other things I am finding sometimes before > > > >> this adventure of patches to the kernel and amdgpu driver. > > > >> Sometimes a crash could happen in the drive and then the GPU fan speed > > > >> hits 100% and the air is hot coming out without any workload, other > > > >> times > > > >> I have seen it have barely any fan speed at all and heat up more than > > > >> it should at the fan level its curently operating at. These are things > > > >> I have seen with this gpu in a TB5 dock with the driver and > > > >> instability. I'm not sure exactly whats going on there but I figured > > > >> since im communicating with these patches I might as well bring you up > > > >> to speed and supermario has been great help throughout me trying to > > > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > > >> USB4v2 dock! > > > > > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > > fan. I think in addition to finding and fixing the real root cause > > > having a reproducible workload to cause the crash is a good opportunity > > > to try to put in place better recovery too. > > > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > > harder reset. At least in the path that we have GPU recovery > > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > > a full device reset makes sense to me. > > > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > > a bunch of callers, and if you end up with a mode1 reset doing a full > > > reset that might be a surprise to those callers. > > > > > > So I think a more logical place to put this would be explicitly in the > > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > > mode1 reset failure you can: > > > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > > > And then the GPU recovery path can jump right into a full reset? Not > > > sure if that jives with your stack trace though. > > > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > > in any hardware. We may as well try full reset for recovery for any > > > hardware. > > > > > > > > > > >> > > > >> It seems to be finally working with bar resizing after my kernel > > > >> patch. Which allows you to safely release a empty switch bridge at the > > > >> device end. > > > >> Then it rebuilds it afterwords with the increased bar. This was done > > > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > > >> branch with my patch here. > > > >> > > > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > > > > > Regards, > > > > Christian. > > > > > > > >> > > > >> Thank you! 
> > > >> > > > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > > >> --- > > > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > > >> 1 file changed, 40 insertions(+) > > > >> > > > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> index 31a60173c..91d01d538 100644 > > > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > > >> /* ensure no_hw_access is updated before we access hw */ > > > >> smp_mb(); > > > >> + /* > > > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > >> + * endpoint but the TB tunnel stays up unaware. Detect the > > > >> + * dead link and attempt recovery by resetting parent bridges > > > >> + * to retrain the physical PCIe link inside the dock. > > > >> + */ > > > >> + if (!pci_device_is_present(adev->pdev) && > > > >> + pci_is_thunderbolt_attached(adev->pdev)) { > > > >> + struct pci_dev *bridge; > > > >> + bool recovered = false; > > > >> + > > > >> + dev_info(adev->dev, > > > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > > >> + > > > >> + bridge = pci_upstream_bridge(adev->pdev); > > > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > >> + dev_info(adev->dev, > > > >> + "attempting link recovery via %s\n", > > > >> + pci_name(bridge)); > > > >> + pci_bridge_secondary_bus_reset(bridge); > > > >> + msleep(100); > > > >> + if (pci_device_is_present(adev->pdev)) { > > > >> + recovered = true; > > > >> + break; > > > >> + } > > > >> + bridge = pci_upstream_bridge(bridge); > > > >> + } > > > >> + > > > >> + if (!recovered) { > > > >> + dev_err(adev->dev, > > > >> + "Thunderbolt PCIe link recovery failed\n"); > > > >> + ret = -ENODEV; > > > >> + goto mode1_reset_failed; > > > >> + } > > > >> + > > > >> + dev_info(adev->dev, > > > >> + "Thunderbolt PCIe link recovered via %s\n", > > > >> + pci_name(bridge)); > > > >> + } > > > >> + > > > >> amdgpu_device_load_pci_state(adev->pdev); > > > >> ret = amdgpu_psp_wait_for_bootloader(adev); > > > >> if (ret) > > > >> -- > > > >> 2.51.0 > > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless @ 2026-04-10 9:13 ` Christian König 2026-04-10 11:25 ` Lazar, Lijo 2 siblings, 0 replies; 14+ messages in thread From: Christian König @ 2026-04-10 9:13 UTC (permalink / raw) To: Mario Limonciello, Geramy Loveless, amd-gfx Cc: alexander.deucher, Pelloux-Prayer, Pierre-Eric Hi Mario, On 4/9/26 20:12, Mario Limonciello wrote: > > > On 4/9/26 06:42, Christian König wrote: >> On 4/9/26 02:05, Geramy Loveless wrote: >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >>> Thunderbolt the TB driver receives no notification and the tunnel >>> stays up while the endpoint is unreachable. >> >> IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. >> >>> All subsequent PCIe >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >>> triggering an infinite reset loop that hangs the system. >> >> That sounds more like the MODE1 reset failed. >> >>> After MODE1 reset completes, check whether the PCIe endpoint is still >>> reachable using pci_device_is_present(). If the device is behind >>> Thunderbolt and the link is dead, walk up parent bridges calling >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >>> inside the dock. >> >> Well that is then a bus reset. >> >> I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? >> >>> If recovery fails, return -ENODEV to prevent the >>> reset retry loop. >>> >>> This also causes the GPU fan to be at 100% and basically when it >>> happens and you are not there, you now have a GPU with fan at 100% and >>> cant reset it. >>> I wanted to notate some other things I am finding sometimes before >>> this adventure of patches to the kernel and amdgpu driver. >>> Sometimes a crash could happen in the drive and then the GPU fan speed >>> hits 100% and the air is hot coming out without any workload, other >>> times >>> I have seen it have barely any fan speed at all and heat up more than >>> it should at the fan level its curently operating at. These are things >>> I have seen with this gpu in a TB5 dock with the driver and >>> instability. I'm not sure exactly whats going on there but I figured >>> since im communicating with these patches I might as well bring you up >>> to speed and supermario has been great help throughout me trying to >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >>> USB4v2 dock! >> >> Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. >> >> But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the fan. I think in addition to finding and fixing the real root cause having a reproducible workload to cause the crash is a good opportunity to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a harder reset. At least in the path that we have GPU recovery (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do a full device reset makes sense to me. Well I just realized that Pierre-Eric is already working on that and I've forgotten to add him to the mail thread. 
But the general idea is that when you can't recover the GPU that the driver send a WEDGE udev event noting that a GPU recovery didn't worked and it basically needs a bus reset. > I think the placement is wrong though. amdgpu_device_mode1_reset() has a bunch of callers, and if you end up with a mode1 reset doing a full reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not sure if that jives with your stack trace though. The MODE1 reset is already the full reset in this case. > Furthermore; even though you reproduced this on Thunderbolt; I have no reason to believe it's specific to thunderbolt. An SMU crash can happen in any hardware. We may as well try full reset for recovery for any hardware. Yeah agree, we need some more general WEDGE event handling. E.g. basically what Pierre-Eric is already working on. Regards, Christian. > >> >>> >>> It seems to be finally working with bar resizing after my kernel >>> patch. Which allows you to safely release a empty switch bridge at the >>> device end. >>> Then it rebuilds it afterwords with the increased bar. This was done >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >>> branch with my patch here. >>> >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u >> >> Where is the MMIO register BAR before and after the rebuild? >> >> Regards, >> Christian. >> >>> >>> Thank you! >>> >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >>> 1 file changed, 40 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 31a60173c..91d01d538 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) >>> /* ensure no_hw_access is updated before we access hw */ >>> smp_mb(); >>> + /* >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >>> + * endpoint but the TB tunnel stays up unaware. Detect the >>> + * dead link and attempt recovery by resetting parent bridges >>> + * to retrain the physical PCIe link inside the dock. 
>>> + */ >>> + if (!pci_device_is_present(adev->pdev) && >>> + pci_is_thunderbolt_attached(adev->pdev)) { >>> + struct pci_dev *bridge; >>> + bool recovered = false; >>> + >>> + dev_info(adev->dev, >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); >>> + >>> + bridge = pci_upstream_bridge(adev->pdev); >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { >>> + dev_info(adev->dev, >>> + "attempting link recovery via %s\n", >>> + pci_name(bridge)); >>> + pci_bridge_secondary_bus_reset(bridge); >>> + msleep(100); >>> + if (pci_device_is_present(adev->pdev)) { >>> + recovered = true; >>> + break; >>> + } >>> + bridge = pci_upstream_bridge(bridge); >>> + } >>> + >>> + if (!recovered) { >>> + dev_err(adev->dev, >>> + "Thunderbolt PCIe link recovery failed\n"); >>> + ret = -ENODEV; >>> + goto mode1_reset_failed; >>> + } >>> + >>> + dev_info(adev->dev, >>> + "Thunderbolt PCIe link recovered via %s\n", >>> + pci_name(bridge)); >>> + } >>> + >>> amdgpu_device_load_pci_state(adev->pdev); >>> ret = amdgpu_psp_wait_for_bootloader(adev); >>> if (ret) >>> -- >>> 2.51.0 >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless 2026-04-10 9:13 ` Christian König @ 2026-04-10 11:25 ` Lazar, Lijo 2026-04-10 19:17 ` Geramy Loveless 2 siblings, 1 reply; 14+ messages in thread From: Lazar, Lijo @ 2026-04-10 11:25 UTC (permalink / raw) To: Mario Limonciello, Christian König, Geramy Loveless, amd-gfx Cc: alexander.deucher On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > On 4/9/26 06:42, Christian König wrote: >> On 4/9/26 02:05, Geramy Loveless wrote: >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >>> Thunderbolt the TB driver receives no notification and the tunnel >>> stays up while the endpoint is unreachable. >> >> IIRC a MODE1 reset should keep the bus active and so the endpoint >> should still be reachable. >> >>> All subsequent PCIe >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >>> triggering an infinite reset loop that hangs the system. >> >> That sounds more like the MODE1 reset failed. >> >>> After MODE1 reset completes, check whether the PCIe endpoint is still >>> reachable using pci_device_is_present(). If the device is behind >>> Thunderbolt and the link is dead, walk up parent bridges calling >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >>> inside the dock. >> >> Well that is then a bus reset. >> >> I mean that is a reasonable mitigation when a MODE1 reset failed, but >> the question is rather why does the MODE1 reset fails in the first place? >> >>> If recovery fails, return -ENODEV to prevent the >>> reset retry loop. >>> >>> This also causes the GPU fan to be at 100% and basically when it >>> happens and you are not there, you now have a GPU with fan at 100% and >>> cant reset it. >>> I wanted to notate some other things I am finding sometimes before >>> this adventure of patches to the kernel and amdgpu driver. >>> Sometimes a crash could happen in the drive and then the GPU fan speed >>> hits 100% and the air is hot coming out without any workload, other >>> times >>> I have seen it have barely any fan speed at all and heat up more than >>> it should at the fan level its curently operating at. These are things >>> I have seen with this gpu in a TB5 dock with the driver and >>> instability. I'm not sure exactly whats going on there but I figured >>> since im communicating with these patches I might as well bring you up >>> to speed and supermario has been great help throughout me trying to >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >>> USB4v2 dock! >> >> Adding Mario as well. That strongly sounds like you crashed the SMU >> which would also explain the failed MODE1 reset. >> >> But all of that are only symptoms. Question is what is actually going >> on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > fan. I think in addition to finding and fixing the real root cause > having a reproducible workload to cause the crash is a good opportunity > to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a > harder reset. At least in the path that we have GPU recovery > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > a full device reset makes sense to me. > > I think the placement is wrong though. 
amdgpu_device_mode1_reset() has > a bunch of callers, and if you end up with a mode1 reset doing a full > reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not > sure if that jives with your stack trace though. > > Furthermore; even though you reproduced this on Thunderbolt; I have no > reason to believe it's specific to thunderbolt. An SMU crash can happen > in any hardware. We may as well try full reset for recovery for any > hardware. FWIW, if SMU crashes then SBR also shouldn't work since SBR handling needs some firmware support as well. A kernel module triggering chain-reset by going one level up and resetting all devices under the bridge (in a while loop) also doesn't look like an acceptable solution. Thanks, Lijo > >> >>> >>> It seems to be finally working with bar resizing after my kernel >>> patch. Which allows you to safely release a empty switch bridge at the >>> device end. >>> Then it rebuilds it afterwords with the increased bar. This was done >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >>> branch with my patch here. >>> >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u >> >> Where is the MMIO register BAR before and after the rebuild? >> >> Regards, >> Christian. >> >>> >>> Thank you! >>> >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >>> 1 file changed, 40 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 31a60173c..91d01d538 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct >>> amdgpu_device *adev) >>> /* ensure no_hw_access is updated before we access hw */ >>> smp_mb(); >>> + /* >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >>> + * endpoint but the TB tunnel stays up unaware. Detect the >>> + * dead link and attempt recovery by resetting parent bridges >>> + * to retrain the physical PCIe link inside the dock. 
>>> + */ >>> + if (!pci_device_is_present(adev->pdev) && >>> + pci_is_thunderbolt_attached(adev->pdev)) { >>> + struct pci_dev *bridge; >>> + bool recovered = false; >>> + >>> + dev_info(adev->dev, >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt >>> recovery\n"); >>> + >>> + bridge = pci_upstream_bridge(adev->pdev); >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { >>> + dev_info(adev->dev, >>> + "attempting link recovery via %s\n", >>> + pci_name(bridge)); >>> + pci_bridge_secondary_bus_reset(bridge); >>> + msleep(100); >>> + if (pci_device_is_present(adev->pdev)) { >>> + recovered = true; >>> + break; >>> + } >>> + bridge = pci_upstream_bridge(bridge); >>> + } >>> + >>> + if (!recovered) { >>> + dev_err(adev->dev, >>> + "Thunderbolt PCIe link recovery failed\n"); >>> + ret = -ENODEV; >>> + goto mode1_reset_failed; >>> + } >>> + >>> + dev_info(adev->dev, >>> + "Thunderbolt PCIe link recovered via %s\n", >>> + pci_name(bridge)); >>> + } >>> + >>> amdgpu_device_load_pci_state(adev->pdev); >>> ret = amdgpu_psp_wait_for_bootloader(adev); >>> if (ret) >>> -- >>> 2.51.0 >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 11:25 ` Lazar, Lijo @ 2026-04-10 19:17 ` Geramy Loveless 2026-04-10 22:42 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 19:17 UTC (permalink / raw) To: Lazar, Lijo Cc: Mario Limonciello, Christian König, amd-gfx, alexander.deucher, Cristian Cocos It seems there is another person having the same problems im having or at least similar. I am going to loop his logs in here and add him maybe we can tackle this together and find the underlying problem easier or faster. Its always better to have two logs from too different points sometimes you get things in one you dont get in the other out of pure chance, haha. https://pcforum.amd.com/s/question/0D5Pd00001S3Av9KAF/linux-9060xt-egpuoverthunderbolt-bugs-galore Let me know if I can be of use, or if you need extra bandwidth for making patches point my in the direction. Thanks everyone! On Fri, Apr 10, 2026 at 4:25 AM Lazar, Lijo <lijo.lazar@amd.com> wrote: > > > > On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > > > > On 4/9/26 06:42, Christian König wrote: > >> On 4/9/26 02:05, Geramy Loveless wrote: > >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > >>> Thunderbolt the TB driver receives no notification and the tunnel > >>> stays up while the endpoint is unreachable. > >> > >> IIRC a MODE1 reset should keep the bus active and so the endpoint > >> should still be reachable. > >> > >>> All subsequent PCIe > >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > >>> triggering an infinite reset loop that hangs the system. > >> > >> That sounds more like the MODE1 reset failed. > >> > >>> After MODE1 reset completes, check whether the PCIe endpoint is still > >>> reachable using pci_device_is_present(). If the device is behind > >>> Thunderbolt and the link is dead, walk up parent bridges calling > >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > >>> inside the dock. > >> > >> Well that is then a bus reset. > >> > >> I mean that is a reasonable mitigation when a MODE1 reset failed, but > >> the question is rather why does the MODE1 reset fails in the first place? > >> > >>> If recovery fails, return -ENODEV to prevent the > >>> reset retry loop. > >>> > >>> This also causes the GPU fan to be at 100% and basically when it > >>> happens and you are not there, you now have a GPU with fan at 100% and > >>> cant reset it. > >>> I wanted to notate some other things I am finding sometimes before > >>> this adventure of patches to the kernel and amdgpu driver. > >>> Sometimes a crash could happen in the drive and then the GPU fan speed > >>> hits 100% and the air is hot coming out without any workload, other > >>> times > >>> I have seen it have barely any fan speed at all and heat up more than > >>> it should at the fan level its curently operating at. These are things > >>> I have seen with this gpu in a TB5 dock with the driver and > >>> instability. I'm not sure exactly whats going on there but I figured > >>> since im communicating with these patches I might as well bring you up > >>> to speed and supermario has been great help throughout me trying to > >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > >>> USB4v2 dock! > >> > >> Adding Mario as well. That strongly sounds like you crashed the SMU > >> which would also explain the failed MODE1 reset. > >> > >> But all of that are only symptoms. 
Question is what is actually going > >> on here? e.g. what is the root cause? > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > fan. I think in addition to finding and fixing the real root cause > > having a reproducible workload to cause the crash is a good opportunity > > to try to put in place better recovery too. > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > harder reset. At least in the path that we have GPU recovery > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > a full device reset makes sense to me. > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > a bunch of callers, and if you end up with a mode1 reset doing a full > > reset that might be a surprise to those callers. > > > > So I think a more logical place to put this would be explicitly in the > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > mode1 reset failure you can: > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > And then the GPU recovery path can jump right into a full reset? Not > > sure if that jives with your stack trace though. > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > in any hardware. We may as well try full reset for recovery for any > > hardware. > > FWIW, if SMU crashes then SBR also shouldn't work since SBR handling > needs some firmware support as well. > > A kernel module triggering chain-reset by going one level up and > resetting all devices under the bridge (in a while loop) also doesn't > look like an acceptable solution. > > Thanks, > Lijo > > > > >> > >>> > >>> It seems to be finally working with bar resizing after my kernel > >>> patch. Which allows you to safely release a empty switch bridge at the > >>> device end. > >>> Then it rebuilds it afterwords with the increased bar. This was done > >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > >>> branch with my patch here. > >>> > >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- > >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > >> > >> Where is the MMIO register BAR before and after the rebuild? > >> > >> Regards, > >> Christian. > >> > >>> > >>> Thank you! > >>> > >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > >>> 1 file changed, 40 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> index 31a60173c..91d01d538 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > >>> amdgpu_device *adev) > >>> /* ensure no_hw_access is updated before we access hw */ > >>> smp_mb(); > >>> + /* > >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > >>> + * endpoint but the TB tunnel stays up unaware. Detect the > >>> + * dead link and attempt recovery by resetting parent bridges > >>> + * to retrain the physical PCIe link inside the dock. 
> >>> + */ > >>> + if (!pci_device_is_present(adev->pdev) && > >>> + pci_is_thunderbolt_attached(adev->pdev)) { > >>> + struct pci_dev *bridge; > >>> + bool recovered = false; > >>> + > >>> + dev_info(adev->dev, > >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt > >>> recovery\n"); > >>> + > >>> + bridge = pci_upstream_bridge(adev->pdev); > >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { > >>> + dev_info(adev->dev, > >>> + "attempting link recovery via %s\n", > >>> + pci_name(bridge)); > >>> + pci_bridge_secondary_bus_reset(bridge); > >>> + msleep(100); > >>> + if (pci_device_is_present(adev->pdev)) { > >>> + recovered = true; > >>> + break; > >>> + } > >>> + bridge = pci_upstream_bridge(bridge); > >>> + } > >>> + > >>> + if (!recovered) { > >>> + dev_err(adev->dev, > >>> + "Thunderbolt PCIe link recovery failed\n"); > >>> + ret = -ENODEV; > >>> + goto mode1_reset_failed; > >>> + } > >>> + > >>> + dev_info(adev->dev, > >>> + "Thunderbolt PCIe link recovered via %s\n", > >>> + pci_name(bridge)); > >>> + } > >>> + > >>> amdgpu_device_load_pci_state(adev->pdev); > >>> ret = amdgpu_psp_wait_for_bootloader(adev); > >>> if (ret) > >>> -- > >>> 2.51.0 > >> > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
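To make the fallback Mario describes above more concrete: the idea is that when a mode1 reset fails and GPU recovery is enabled, the recovery path should escalate to a full device reset rather than retrying mode1 forever. A rough sketch follows; the helper name and exact call site are assumptions, not code from this thread:

	/*
	 * Hypothetical escalation path: if mode1 reset fails while
	 * amdgpu.gpu_recovery is enabled, flag the reset context so
	 * amdgpu_device_gpu_recover() takes the full-reset path
	 * instead of looping on mode1.
	 */
	static int amdgpu_device_mode1_or_full_reset(struct amdgpu_device *adev,
						     struct amdgpu_reset_context *reset_context)
	{
		int r = amdgpu_device_mode1_reset(adev);

		if (r && amdgpu_gpu_recovery) {
			dev_warn(adev->dev,
				 "mode1 reset failed (%d), requesting full reset\n", r);
			set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
			r = 0; /* let the recovery path continue into the full reset */
		}

		return r;
	}

Keeping the escalation in a wrapper like this would leave amdgpu_device_mode1_reset() itself unchanged for its other callers, which was the concern raised about placing the fallback inside the mode1 routine.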
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 19:17 ` Geramy Loveless @ 2026-04-10 22:42 ` Geramy Loveless 0 siblings, 0 replies; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 22:42 UTC (permalink / raw) To: Lazar, Lijo Cc: Mario Limonciello, Christian König, amd-gfx, alexander.deucher, Cristian Cocos
I found something really interesting: dev_is_removable() is not working on my Minisforum MS-S1. It reports the gfx card as not removable, even though it should get the HotPlug+ from the Thunderbolt link, but it doesn't. I bet if you also add || pci_is_thunderbolt_attached it could clear up a lot of problems, just my two cents. :) I am running some tests now.
On Fri, Apr 10, 2026 at 12:17 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > It seems there is another person having the same problems im having or > at least similar. > I am going to loop his logs in here and add him maybe we can tackle > this together and find the underlying problem easier or faster. > Its always better to have two logs from too different points sometimes > you get things in one you dont get in the other out of pure chance, > haha. > > https://pcforum.amd.com/s/question/0D5Pd00001S3Av9KAF/linux-9060xt-egpuoverthunderbolt-bugs-galore > > Let me know if I can be of use, or if you need extra bandwidth for > making patches point my in the direction. > > Thanks everyone! > > On Fri, Apr 10, 2026 at 4:25 AM Lazar, Lijo <lijo.lazar@amd.com> wrote: > > > > > > > > On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > >> On 4/9/26 02:05, Geramy Loveless wrote: > > >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > >>> Thunderbolt the TB driver receives no notification and the tunnel > > >>> stays up while the endpoint is unreachable. > > >> > > >> IIRC a MODE1 reset should keep the bus active and so the endpoint > > >> should still be reachable. > > >> > > >>> All subsequent PCIe > > >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > >>> triggering an infinite reset loop that hangs the system. > > >> > > >> That sounds more like the MODE1 reset failed. > > >> > > >>> After MODE1 reset completes, check whether the PCIe endpoint is still > > >>> reachable using pci_device_is_present(). If the device is behind > > >>> Thunderbolt and the link is dead, walk up parent bridges calling > > >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > >>> inside the dock. > > >> > > >> Well that is then a bus reset. > > >> > > >> I mean that is a reasonable mitigation when a MODE1 reset failed, but > > >> the question is rather why does the MODE1 reset fails in the first place? > > >> > > >>> If recovery fails, return -ENODEV to prevent the > > >>> reset retry loop. > > >>> > > >>> This also causes the GPU fan to be at 100% and basically when it > > >>> happens and you are not there, you now have a GPU with fan at 100% and > > >>> cant reset it. > > >>> I wanted to notate some other things I am finding sometimes before > > >>> this adventure of patches to the kernel and amdgpu driver. > > >>> Sometimes a crash could happen in the drive and then the GPU fan speed > > >>> hits 100% and the air is hot coming out without any workload, other > > >>> times > > >>> I have seen it have barely any fan speed at all and heat up more than > > >>> it should at the fan level its curently operating at.
These are things > > >>> I have seen with this gpu in a TB5 dock with the driver and > > >>> instability. I'm not sure exactly whats going on there but I figured > > >>> since im communicating with these patches I might as well bring you up > > >>> to speed and supermario has been great help throughout me trying to > > >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > >>> USB4v2 dock! > > >> > > >> Adding Mario as well. That strongly sounds like you crashed the SMU > > >> which would also explain the failed MODE1 reset. > > >> > > >> But all of that are only symptoms. Question is what is actually going > > >> on here? e.g. what is the root cause? > > > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > > fan. I think in addition to finding and fixing the real root cause > > > having a reproducible workload to cause the crash is a good opportunity > > > to try to put in place better recovery too. > > > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > > harder reset. At least in the path that we have GPU recovery > > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > > a full device reset makes sense to me. > > > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > > a bunch of callers, and if you end up with a mode1 reset doing a full > > > reset that might be a surprise to those callers. > > > > > > So I think a more logical place to put this would be explicitly in the > > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > > mode1 reset failure you can: > > > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > > > And then the GPU recovery path can jump right into a full reset? Not > > > sure if that jives with your stack trace though. > > > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > > in any hardware. We may as well try full reset for recovery for any > > > hardware. > > > > FWIW, if SMU crashes then SBR also shouldn't work since SBR handling > > needs some firmware support as well. > > > > A kernel module triggering chain-reset by going one level up and > > resetting all devices under the bridge (in a while loop) also doesn't > > look like an acceptable solution. > > > > Thanks, > > Lijo > > > > > > > >> > > >>> > > >>> It seems to be finally working with bar resizing after my kernel > > >>> patch. Which allows you to safely release a empty switch bridge at the > > >>> device end. > > >>> Then it rebuilds it afterwords with the increased bar. This was done > > >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > >>> branch with my patch here. > > >>> > > >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- > > >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > >> > > >> Where is the MMIO register BAR before and after the rebuild? > > >> > > >> Regards, > > >> Christian. > > >> > > >>> > > >>> Thank you! 
> > >>> > > >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > >>> --- > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > >>> 1 file changed, 40 insertions(+) > > >>> > > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> index 31a60173c..91d01d538 100644 > > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > > >>> amdgpu_device *adev) > > >>> /* ensure no_hw_access is updated before we access hw */ > > >>> smp_mb(); > > >>> + /* > > >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > >>> + * endpoint but the TB tunnel stays up unaware. Detect the > > >>> + * dead link and attempt recovery by resetting parent bridges > > >>> + * to retrain the physical PCIe link inside the dock. > > >>> + */ > > >>> + if (!pci_device_is_present(adev->pdev) && > > >>> + pci_is_thunderbolt_attached(adev->pdev)) { > > >>> + struct pci_dev *bridge; > > >>> + bool recovered = false; > > >>> + > > >>> + dev_info(adev->dev, > > >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt > > >>> recovery\n"); > > >>> + > > >>> + bridge = pci_upstream_bridge(adev->pdev); > > >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > >>> + dev_info(adev->dev, > > >>> + "attempting link recovery via %s\n", > > >>> + pci_name(bridge)); > > >>> + pci_bridge_secondary_bus_reset(bridge); > > >>> + msleep(100); > > >>> + if (pci_device_is_present(adev->pdev)) { > > >>> + recovered = true; > > >>> + break; > > >>> + } > > >>> + bridge = pci_upstream_bridge(bridge); > > >>> + } > > >>> + > > >>> + if (!recovered) { > > >>> + dev_err(adev->dev, > > >>> + "Thunderbolt PCIe link recovery failed\n"); > > >>> + ret = -ENODEV; > > >>> + goto mode1_reset_failed; > > >>> + } > > >>> + > > >>> + dev_info(adev->dev, > > >>> + "Thunderbolt PCIe link recovered via %s\n", > > >>> + pci_name(bridge)); > > >>> + } > > >>> + > > >>> amdgpu_device_load_pci_state(adev->pdev); > > >>> ret = amdgpu_psp_wait_for_bootloader(adev); > > >>> if (ret) > > >>> -- > > >>> 2.51.0 > > >> > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
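On the dev_is_removable() observation above: the suggestion is that code which keys decisions off removability could also accept a Thunderbolt-attached device, since on the MS-S1 the eGPU is reported as non-removable even though it sits behind a USB4/Thunderbolt tunnel. A tiny sketch of that combined check, with the helper name and any amdgpu call sites being assumptions:

	#include <linux/device.h>
	#include <linux/pci.h>

	/*
	 * Hypothetical helper: treat a device as externally attached if the
	 * platform marks it removable, or if it sits behind a Thunderbolt
	 * (USB4 PCIe tunnel) link that the platform failed to describe.
	 */
	static bool amdgpu_pdev_is_external(struct pci_dev *pdev)
	{
		return dev_is_removable(&pdev->dev) ||
		       pci_is_thunderbolt_attached(pdev);
	}

Whether patching callers like this is preferable to fixing why dev_is_removable() misreports the MS-S1 is still open; the tests mentioned above should help decide.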
end of thread, other threads:[~2026-04-13 8:03 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09  0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless
2026-04-09 11:42 ` Christian König
2026-04-09 12:13 ` Geramy Loveless
2026-04-09 12:57 ` Christian König
2026-04-09 13:12 ` Geramy Loveless
2026-04-09 13:45 ` Christian König
2026-04-09 18:12 ` Mario Limonciello
2026-04-10  0:12 ` Geramy Loveless
2026-04-10  6:22 ` Geramy Loveless
2026-04-10  7:07 ` Geramy Loveless
2026-04-10  9:13 ` Christian König
2026-04-10 11:25 ` Lazar, Lijo
2026-04-10 19:17 ` Geramy Loveless
2026-04-10 22:42 ` Geramy Loveless