* [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset @ 2026-04-09 0:05 Geramy Loveless 2026-04-09 11:42 ` Christian König 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 0:05 UTC (permalink / raw) To: amd-gfx; +Cc: alexander.deucher, christian.koenig When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 reset, the Thunderbolt driver receives no notification and the tunnel stays up while the endpoint is unreachable. All subsequent PCIe reads return 0xFFFFFFFF and MES firmware cannot reinitialize, triggering an infinite reset loop that hangs the system. After MODE1 reset completes, check whether the PCIe endpoint is still reachable using pci_device_is_present(). If the device is behind Thunderbolt and the link is dead, walk up the parent bridges calling pci_bridge_secondary_bus_reset() to retrain the physical PCIe link inside the dock. If recovery fails, return -ENODEV to prevent the reset retry loop. When this happens the GPU fan also goes to 100%; if you are not there when it happens, you end up with a GPU whose fan is stuck at 100% and that cannot be reset. I also want to note some other things I have been running into during this adventure of patches to the kernel and the amdgpu driver. Sometimes a crash happens in the driver and the GPU fan speed hits 100% with hot air coming out despite no workload; other times I have seen it run with barely any fan speed at all and heat up more than it should for the fan level it is currently operating at. These are things I have seen with this GPU in a TB5 dock along with general driver instability. I'm not sure exactly what is going on there, but since I am already communicating through these patches I might as well bring you up to speed, and supermario has been a great help throughout my attempts to get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / USB4v2 dock! BAR resizing finally seems to be working after my kernel patch, which allows safely releasing an empty switch bridge at the device end and then rebuilding it afterwards with the increased BAR. This was done on kernel 7.0-rc7, I believe, plus the latest changes from the pci/resource branch together with my patch here. https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u Thank you! Signed-off-by: Geramy Loveless <gloveless@jqluv.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 31a60173c..91d01d538 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) /* ensure no_hw_access is updated before we access hw */ smp_mb(); + /* + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe + * endpoint but the TB tunnel stays up unaware. Detect the + * dead link and attempt recovery by resetting parent bridges + * to retrain the physical PCIe link inside the dock.
+ */ + if (!pci_device_is_present(adev->pdev) && + pci_is_thunderbolt_attached(adev->pdev)) { + struct pci_dev *bridge; + bool recovered = false; + + dev_info(adev->dev, + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); + + bridge = pci_upstream_bridge(adev->pdev); + while (bridge && !pci_is_root_bus(bridge->bus)) { + dev_info(adev->dev, + "attempting link recovery via %s\n", + pci_name(bridge)); + pci_bridge_secondary_bus_reset(bridge); + msleep(100); + if (pci_device_is_present(adev->pdev)) { + recovered = true; + break; + } + bridge = pci_upstream_bridge(bridge); + } + + if (!recovered) { + dev_err(adev->dev, + "Thunderbolt PCIe link recovery failed\n"); + ret = -ENODEV; + goto mode1_reset_failed; + } + + dev_info(adev->dev, + "Thunderbolt PCIe link recovered via %s\n", + pci_name(bridge)); + } + amdgpu_device_load_pci_state(adev->pdev); ret = amdgpu_psp_wait_for_bootloader(adev); if (ret) -- 2.51.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless @ 2026-04-09 11:42 ` Christian König 2026-04-09 12:13 ` Geramy Loveless 2026-04-09 18:12 ` Mario Limonciello 0 siblings, 2 replies; 14+ messages in thread From: Christian König @ 2026-04-09 11:42 UTC (permalink / raw) To: Geramy Loveless, amd-gfx, Mario Limonciello; +Cc: alexander.deucher On 4/9/26 02:05, Geramy Loveless wrote: > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > Thunderbolt the TB driver receives no notification and the tunnel > stays up while the endpoint is unreachable. IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > All subsequent PCIe > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > triggering an infinite reset loop that hangs the system. That sounds more like the MODE1 reset failed. > After MODE1 reset completes, check whether the PCIe endpoint is still > reachable using pci_device_is_present(). If the device is behind > Thunderbolt and the link is dead, walk up parent bridges calling > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > inside the dock. Well that is then a bus reset. I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > If recovery fails, return -ENODEV to prevent the > reset retry loop. > > This also causes the GPU fan to be at 100% and basically when it > happens and you are not there, you now have a GPU with fan at 100% and > cant reset it. > I wanted to notate some other things I am finding sometimes before > this adventure of patches to the kernel and amdgpu driver. > Sometimes a crash could happen in the drive and then the GPU fan speed > hits 100% and the air is hot coming out without any workload, other > times > I have seen it have barely any fan speed at all and heat up more than > it should at the fan level its curently operating at. These are things > I have seen with this gpu in a TB5 dock with the driver and > instability. I'm not sure exactly whats going on there but I figured > since im communicating with these patches I might as well bring you up > to speed and supermario has been great help throughout me trying to > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > USB4v2 dock! Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > It seems to be finally working with bar resizing after my kernel > patch. Which allows you to safely release a empty switch bridge at the > device end. > Then it rebuilds it afterwords with the increased bar. This was done > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > branch with my patch here. > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u Where is the MMIO register BAR before and after the rebuild? Regards, Christian. > > Thank you! 
> > Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > 1 file changed, 40 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 31a60173c..91d01d538 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > /* ensure no_hw_access is updated before we access hw */ > smp_mb(); > + /* > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > + * endpoint but the TB tunnel stays up unaware. Detect the > + * dead link and attempt recovery by resetting parent bridges > + * to retrain the physical PCIe link inside the dock. > + */ > + if (!pci_device_is_present(adev->pdev) && > + pci_is_thunderbolt_attached(adev->pdev)) { > + struct pci_dev *bridge; > + bool recovered = false; > + > + dev_info(adev->dev, > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > + > + bridge = pci_upstream_bridge(adev->pdev); > + while (bridge && !pci_is_root_bus(bridge->bus)) { > + dev_info(adev->dev, > + "attempting link recovery via %s\n", > + pci_name(bridge)); > + pci_bridge_secondary_bus_reset(bridge); > + msleep(100); > + if (pci_device_is_present(adev->pdev)) { > + recovered = true; > + break; > + } > + bridge = pci_upstream_bridge(bridge); > + } > + > + if (!recovered) { > + dev_err(adev->dev, > + "Thunderbolt PCIe link recovery failed\n"); > + ret = -ENODEV; > + goto mode1_reset_failed; > + } > + > + dev_info(adev->dev, > + "Thunderbolt PCIe link recovered via %s\n", > + pci_name(bridge)); > + } > + > amdgpu_device_load_pci_state(adev->pdev); > ret = amdgpu_psp_wait_for_bootloader(adev); > if (ret) > -- > 2.51.0 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 11:42 ` Christian König @ 2026-04-09 12:13 ` Geramy Loveless 2026-04-09 12:57 ` Christian König 2026-04-09 18:12 ` Mario Limonciello 1 sibling, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 12:13 UTC (permalink / raw) To: Christian König; +Cc: amd-gfx, Mario Limonciello, alexander.deucher [-- Attachment #1: Type: text/plain, Size: 5463 bytes --] Hi Christian, I appreciate the speedy response, What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com> wrote: > On 4/9/26 02:05, Geramy Loveless wrote: > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > Thunderbolt the TB driver receives no notification and the tunnel > > stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should > still be reachable. > > > All subsequent PCIe > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > reachable using pci_device_is_present(). If the device is behind > > Thunderbolt and the link is dead, walk up parent bridges calling > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the > question is rather why does the MODE1 reset fails in the first place? > > > If recovery fails, return -ENODEV to prevent the > > reset retry loop. > > > > This also causes the GPU fan to be at 100% and basically when it > > happens and you are not there, you now have a GPU with fan at 100% and > > cant reset it. > > I wanted to notate some other things I am finding sometimes before > > this adventure of patches to the kernel and amdgpu driver. > > Sometimes a crash could happen in the drive and then the GPU fan speed > > hits 100% and the air is hot coming out without any workload, other > > times > > I have seen it have barely any fan speed at all and heat up more than > > it should at the fan level its curently operating at. These are things > > I have seen with this gpu in a TB5 dock with the driver and > > instability. I'm not sure exactly whats going on there but I figured > > since im communicating with these patches I might as well bring you up > > to speed and supermario has been great help throughout me trying to > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which > would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on > here? e.g. what is the root cause? > > > > > It seems to be finally working with bar resizing after my kernel > > patch. Which allows you to safely release a empty switch bridge at the > > device end. > > Then it rebuilds it afterwords with the increased bar. 
This was done > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > branch with my patch here. > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > > > > > Thank you! > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > 1 file changed, 40 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 31a60173c..91d01d538 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > amdgpu_device *adev) > > /* ensure no_hw_access is updated before we access hw */ > > smp_mb(); > > + /* > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > + * endpoint but the TB tunnel stays up unaware. Detect the > > + * dead link and attempt recovery by resetting parent bridges > > + * to retrain the physical PCIe link inside the dock. > > + */ > > + if (!pci_device_is_present(adev->pdev) && > > + pci_is_thunderbolt_attached(adev->pdev)) { > > + struct pci_dev *bridge; > > + bool recovered = false; > > + > > + dev_info(adev->dev, > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > + > > + bridge = pci_upstream_bridge(adev->pdev); > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > + dev_info(adev->dev, > > + "attempting link recovery via %s\n", > > + pci_name(bridge)); > > + pci_bridge_secondary_bus_reset(bridge); > > + msleep(100); > > + if (pci_device_is_present(adev->pdev)) { > > + recovered = true; > > + break; > > + } > > + bridge = pci_upstream_bridge(bridge); > > + } > > + > > + if (!recovered) { > > + dev_err(adev->dev, > > + "Thunderbolt PCIe link recovery failed\n"); > > + ret = -ENODEV; > > + goto mode1_reset_failed; > > + } > > + > > + dev_info(adev->dev, > > + "Thunderbolt PCIe link recovered via %s\n", > > + pci_name(bridge)); > > + } > > + > > amdgpu_device_load_pci_state(adev->pdev); > > ret = amdgpu_psp_wait_for_bootloader(adev); > > if (ret) > > -- > > 2.51.0 > > [-- Attachment #2: Type: text/html, Size: 6744 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 12:13 ` Geramy Loveless @ 2026-04-09 12:57 ` Christian König 2026-04-09 13:12 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Christian König @ 2026-04-09 12:57 UTC (permalink / raw) To: Geramy Loveless; +Cc: amd-gfx, Mario Limonciello, alexander.deucher Hi Geramy, On 4/9/26 14:13, Geramy Loveless wrote: > Hi Christian, > > I appreciate the speedy response, > What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. To just give you a bit background on what happens here: AMD GPUs have an embedded micro controller called SMU which takes care of things like voltages, clocks, temperature, fan speed etc.. and reset. So when the kernel driver detects that it needs to do a reset it sends a MODE1 reset command to the SMU. But instead of the SMU coming back a short time later noting that the reset was done the device just drops off the bus (e.g. all reads return 0xffffffff). The cause of that can be anything, e.g. from power fluctuations to a dirty fan which doesn't starts to rotate again after it was stopped. I would try to narrow it down step by step, e.g. if it work on older kernels, if yes what feature/patch broke the behavior. You can also try to disable certain power management features like ASPM (try amdgpu.aspm=0 on the kernel command line). Hope that helps, Christian. > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com>> wrote: > > On 4/9/26 02:05, Geramy Loveless wrote: > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > Thunderbolt the TB driver receives no notification and the tunnel > > stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > All subsequent PCIe > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > reachable using pci_device_is_present(). If the device is behind > > Thunderbolt and the link is dead, walk up parent bridges calling > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > If recovery fails, return -ENODEV to prevent the > > reset retry loop. > > > > This also causes the GPU fan to be at 100% and basically when it > > happens and you are not there, you now have a GPU with fan at 100% and > > cant reset it. > > I wanted to notate some other things I am finding sometimes before > > this adventure of patches to the kernel and amdgpu driver. > > Sometimes a crash could happen in the drive and then the GPU fan speed > > hits 100% and the air is hot coming out without any workload, other > > times > > I have seen it have barely any fan speed at all and heat up more than > > it should at the fan level its curently operating at. 
These are things > > I have seen with this gpu in a TB5 dock with the driver and > > instability. I'm not sure exactly whats going on there but I figured > > since im communicating with these patches I might as well bring you up > > to speed and supermario has been great help throughout me trying to > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > It seems to be finally working with bar resizing after my kernel > > patch. Which allows you to safely release a empty switch bridge at the > > device end. > > Then it rebuilds it afterwords with the increased bar. This was done > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > branch with my patch here. > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u> > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > > > > > Thank you! > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto:gloveless@jqluv.com>> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > 1 file changed, 40 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 31a60173c..91d01d538 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > /* ensure no_hw_access is updated before we access hw */ > > smp_mb(); > > + /* > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > + * endpoint but the TB tunnel stays up unaware. Detect the > > + * dead link and attempt recovery by resetting parent bridges > > + * to retrain the physical PCIe link inside the dock. > > + */ > > + if (!pci_device_is_present(adev->pdev) && > > + pci_is_thunderbolt_attached(adev->pdev)) { > > + struct pci_dev *bridge; > > + bool recovered = false; > > + > > + dev_info(adev->dev, > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > + > > + bridge = pci_upstream_bridge(adev->pdev); > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > + dev_info(adev->dev, > > + "attempting link recovery via %s\n", > > + pci_name(bridge)); > > + pci_bridge_secondary_bus_reset(bridge); > > + msleep(100); > > + if (pci_device_is_present(adev->pdev)) { > > + recovered = true; > > + break; > > + } > > + bridge = pci_upstream_bridge(bridge); > > + } > > + > > + if (!recovered) { > > + dev_err(adev->dev, > > + "Thunderbolt PCIe link recovery failed\n"); > > + ret = -ENODEV; > > + goto mode1_reset_failed; > > + } > > + > > + dev_info(adev->dev, > > + "Thunderbolt PCIe link recovered via %s\n", > > + pci_name(bridge)); > > + } > > + > > amdgpu_device_load_pci_state(adev->pdev); > > ret = amdgpu_psp_wait_for_bootloader(adev); > > if (ret) > > -- > > 2.51.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 12:57 ` Christian König @ 2026-04-09 13:12 ` Geramy Loveless 2026-04-09 13:45 ` Christian König 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-09 13:12 UTC (permalink / raw) To: Christian König; +Cc: amd-gfx, Mario Limonciello, alexander.deucher [-- Attachment #1: Type: text/plain, Size: 9091 bytes --] Christian, This is going to be very interesting. The AMD Radeon AI R9700 Pro is running over USB4v2 into a Razer dock on a 650W power supply. I have two of these setups, with cards provided by AMD, for a project testing ROCm AI broadcast messages over TB5, with the two machines connected over TB5 via RDMA; sorry if I lose you here. What I am seeing, in my opinion, is that Thunderbolt is not very well fleshed out on Linux in terms of stability and accessible feature sets. After my kernel PCI patch I have gotten it to work, which we went over already. I have one machine successfully running Qwen3.5 27B A3B on the GPU with llama.cpp / lemonade, but yesterday, after loading the model (which it did, into VRAM), it crashed; ten minutes earlier it had been running a 73k-token prompt at 40 tps. As a side note, the power management on Thunderbolt seems to clash with the power management on the GPU. What I was seeing is that when the GPU doesn't detect a display it "suspends", which may or may not be correct, but it doesn't matter either way in this scenario. I believe the Thunderbolt fabric then shuts down as well and kills the link. I got past this by disabling all PM systems. So I'm not 100% sure what's going on or how to find the actual problem yet; maybe if I spend more time and brain power on this I'll find where in power management the problem is, or whether that's the real problem at all. Let me know if you can offer any more guidance on where I should dig. I appreciate the help so far, and the SMU information helps give more context. On Thu, Apr 9, 2026 at 5:57 AM Christian König <christian.koenig@amd.com> wrote: > Hi Geramy, > > On 4/9/26 14:13, Geramy Loveless wrote: > > Hi Christian, > > > > I appreciate the speedy response, > > What your saying makes sense they are basically wrapping symptoms that > could at least from what I seen now at this point only continue and > eventually create a web of useless code to try to catch all code paths it > hits during crashing. Let me investigate the real reason as to why it’s > crashing more rather then where. > > To just give you a bit background on what happens here: > > AMD GPUs have an embedded micro controller called SMU which takes care of > things like voltages, clocks, temperature, fan speed etc.. and reset. > > So when the kernel driver detects that it needs to do a reset it sends a > MODE1 reset command to the SMU. But instead of the SMU coming back a short > time later noting that the reset was done the device just drops off the bus > (e.g. all reads return 0xffffffff). > > The cause of that can be anything, e.g. from power fluctuations to a dirty > fan which doesn't starts to rotate again after it was stopped. > > I would try to narrow it down step by step, e.g. if it work on older > kernels, if yes what feature/patch broke the behavior. You can also try to > disable certain power management features like ASPM (try amdgpu.aspm=0 on > the kernel command line). > > Hope that helps, > Christian.
> > > > > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com > <mailto:christian.koenig@amd.com>> wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 > on > > > Thunderbolt the TB driver receives no notification and the tunnel > > > stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint > should still be reachable. > > > > > All subsequent PCIe > > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > > > After MODE1 reset completes, check whether the PCIe endpoint is > still > > > reachable using pci_device_is_present(). If the device is behind > > > Thunderbolt and the link is dead, walk up parent bridges calling > > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, > but the question is rather why does the MODE1 reset fails in the first > place? > > > > > If recovery fails, return -ENODEV to prevent the > > > reset retry loop. > > > > > > This also causes the GPU fan to be at 100% and basically when it > > > happens and you are not there, you now have a GPU with fan at 100% > and > > > cant reset it. > > > I wanted to notate some other things I am finding sometimes before > > > this adventure of patches to the kernel and amdgpu driver. > > > Sometimes a crash could happen in the drive and then the GPU fan > speed > > > hits 100% and the air is hot coming out without any workload, other > > > times > > > I have seen it have barely any fan speed at all and heat up more > than > > > it should at the fan level its curently operating at. These are > things > > > I have seen with this gpu in a TB5 dock with the driver and > > > instability. I'm not sure exactly whats going on there but I > figured > > > since im communicating with these patches I might as well bring > you up > > > to speed and supermario has been great help throughout me trying to > > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 > / > > > USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU > which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually > going on here? e.g. what is the root cause? > > > > > > > > It seems to be finally working with bar resizing after my kernel > > > patch. Which allows you to safely release a empty switch bridge at > the > > > device end. > > > Then it rebuilds it afterwords with the increased bar. This was > done > > > on Kernel 7.0-rc7 i believe it is and latest changes from > pci/resource > > > branch with my patch here. > > > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > < > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > > > > > > Thank you! 
> > > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto: > gloveless@jqluv.com>> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 > ++++++++++++++++++++++ > > > 1 file changed, 40 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > index 31a60173c..91d01d538 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > amdgpu_device *adev) > > > /* ensure no_hw_access is updated before we access hw */ > > > smp_mb(); > > > + /* > > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > + * endpoint but the TB tunnel stays up unaware. Detect the > > > + * dead link and attempt recovery by resetting parent bridges > > > + * to retrain the physical PCIe link inside the dock. > > > + */ > > > + if (!pci_device_is_present(adev->pdev) && > > > + pci_is_thunderbolt_attached(adev->pdev)) { > > > + struct pci_dev *bridge; > > > + bool recovered = false; > > > + > > > + dev_info(adev->dev, > > > + "PCIe link lost after mode1 reset, attempting Thunderbolt > recovery\n"); > > > + > > > + bridge = pci_upstream_bridge(adev->pdev); > > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > + dev_info(adev->dev, > > > + "attempting link recovery via %s\n", > > > + pci_name(bridge)); > > > + pci_bridge_secondary_bus_reset(bridge); > > > + msleep(100); > > > + if (pci_device_is_present(adev->pdev)) { > > > + recovered = true; > > > + break; > > > + } > > > + bridge = pci_upstream_bridge(bridge); > > > + } > > > + > > > + if (!recovered) { > > > + dev_err(adev->dev, > > > + "Thunderbolt PCIe link recovery failed\n"); > > > + ret = -ENODEV; > > > + goto mode1_reset_failed; > > > + } > > > + > > > + dev_info(adev->dev, > > > + "Thunderbolt PCIe link recovered via %s\n", > > > + pci_name(bridge)); > > > + } > > > + > > > amdgpu_device_load_pci_state(adev->pdev); > > > ret = amdgpu_psp_wait_for_bootloader(adev); > > > if (ret) > > > -- > > > 2.51.0 > > > > [-- Attachment #2: Type: text/html, Size: 11435 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 13:12 ` Geramy Loveless @ 2026-04-09 13:45 ` Christian König 0 siblings, 0 replies; 14+ messages in thread From: Christian König @ 2026-04-09 13:45 UTC (permalink / raw) To: Geramy Loveless; +Cc: amd-gfx, Mario Limonciello, alexander.deucher Hi Geramy, On 4/9/26 15:12, Geramy Loveless wrote: > Christian, > > This is going to be very interesting. > So the AMD Radeon AI R9700 Pro is running over USB4v2 into a Razor Dock on a 650W power supply I have two of these setups cards provided by AMD for some project testing on tb5 rocm ai broadcast messages for testing two machines connected over tb5 with rdma sorry if I loose you here. Just as a side note, 650W for two Radeon AI R9700 Pro is a bit tight. IIRC each of those GPUs can consume a maximum of 300W. Our test benches with two of those usually have 750W power supplies. But I don't think that this is the issue here, power supply problems usually look different. Is the whole box provided by AMD? Who is your contact person anyway? > What I am seeing or finding in my opinion is thunderbolt is not very well flushed out on Linux in terms of stability and accessible feature sets. Yeah agree, thunderbolt setups are also not something we usually have in our production testing. So it can work, but absolutely no guarantee for that. > After my kernel pci patch I have gotten it to work which we went over already. Your patch looks valid offhand. But you need to be super careful which resources to release on BAR resize. For example AMD GPUs usually have two resizable 64bit BARs and one 32bit BAR, *don't* release the 32bit BAR or you most likely run into trouble. So it would be helpful if you provide an lspci -vvvv -s $bdf of the two devices with and without your patch. > > I have one machine running Qwen3.5 27B A3B on the GPU with llamacpp / lemonade successfully but yesterday after attempting to load the model which it did into vram it crashed which ten minutes prior was running a prompt 73k tokens at 40tps. A side note The power management on Thunderbolt seems to clash with the power management on the GPU. What I was seeing is when the GPU doesn’t detect a display it “suspends” which may be correct or not but it doesn’t matter either way in this scenario. That behavior is perfectly correct but the message is a bit misleading. The GPU suspends when there is nothing displayed nor any software client connected. So every time all applications stop the GPU is powered down, and every time any application starts it is powered up again. This of course exercises all the drivers and firmware involved, so there is a lot which can go wrong and I'm not surprised that you have seen problems with that. > The Thunderbolt fabric I believe shuts down also and kills the link. I got passed this by disabling all PM systems. How did you do that? E.g. which parameter did you use? > So I’m not 100% sure what’s going on and how to find the actual problem yet maybe if I spend more time and brain power on this I’ll find where in power management it’s having the problem or if that’s the real problem. Let me know if you can offer anymore guidance into where I should dig. I appreciate the help this far and the SMU information helps get more context. I'm here to help. Cheers, Christian.
> > On Thu, Apr 9, 2026 at 5:57 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com>> wrote: > > Hi Geramy, > > On 4/9/26 14:13, Geramy Loveless wrote: > > Hi Christian, > > > > I appreciate the speedy response, > > What your saying makes sense they are basically wrapping symptoms that could at least from what I seen now at this point only continue and eventually create a web of useless code to try to catch all code paths it hits during crashing. Let me investigate the real reason as to why it’s crashing more rather then where. > > To just give you a bit background on what happens here: > > AMD GPUs have an embedded micro controller called SMU which takes care of things like voltages, clocks, temperature, fan speed etc.. and reset. > > So when the kernel driver detects that it needs to do a reset it sends a MODE1 reset command to the SMU. But instead of the SMU coming back a short time later noting that the reset was done the device just drops off the bus (e.g. all reads return 0xffffffff). > > The cause of that can be anything, e.g. from power fluctuations to a dirty fan which doesn't starts to rotate again after it was stopped. > > I would try to narrow it down step by step, e.g. if it work on older kernels, if yes what feature/patch broke the behavior. You can also try to disable certain power management features like ASPM (try amdgpu.aspm=0 on the kernel command line). > > Hope that helps, > Christian. > > > > > > > On Thu, Apr 9, 2026 at 4:42 AM Christian König <christian.koenig@amd.com <mailto:christian.koenig@amd.com> <mailto:christian.koenig@amd.com <mailto:christian.koenig@amd.com>>> wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > > Thunderbolt the TB driver receives no notification and the tunnel > > > stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > All subsequent PCIe > > > reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > > > After MODE1 reset completes, check whether the PCIe endpoint is still > > > reachable using pci_device_is_present(). If the device is behind > > > Thunderbolt and the link is dead, walk up parent bridges calling > > > pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > If recovery fails, return -ENODEV to prevent the > > > reset retry loop. > > > > > > This also causes the GPU fan to be at 100% and basically when it > > > happens and you are not there, you now have a GPU with fan at 100% and > > > cant reset it. > > > I wanted to notate some other things I am finding sometimes before > > > this adventure of patches to the kernel and amdgpu driver. > > > Sometimes a crash could happen in the drive and then the GPU fan speed > > > hits 100% and the air is hot coming out without any workload, other > > > times > > > I have seen it have barely any fan speed at all and heat up more than > > > it should at the fan level its curently operating at. These are things > > > I have seen with this gpu in a TB5 dock with the driver and > > > instability. 
I'm not sure exactly whats going on there but I figured > > > since im communicating with these patches I might as well bring you up > > > to speed and supermario has been great help throughout me trying to > > > get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > > USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > > > > It seems to be finally working with bar resizing after my kernel > > > patch. Which allows you to safely release a empty switch bridge at the > > > device end. > > > Then it rebuilds it afterwords with the increased bar. This was done > > > on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > > branch with my patch here. > > > > > > https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u> <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u <https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u>> > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > > > > > > Thank you! > > > > > > Signed-off-by: Geramy Loveless <gloveless@jqluv.com <mailto:gloveless@jqluv.com> <mailto:gloveless@jqluv.com <mailto:gloveless@jqluv.com>>> > > > --- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > > 1 file changed, 40 insertions(+) > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > index 31a60173c..91d01d538 100644 > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > > /* ensure no_hw_access is updated before we access hw */ > > > smp_mb(); > > > + /* > > > + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > + * endpoint but the TB tunnel stays up unaware. Detect the > > > + * dead link and attempt recovery by resetting parent bridges > > > + * to retrain the physical PCIe link inside the dock. 
> > > + */ > > > + if (!pci_device_is_present(adev->pdev) && > > > + pci_is_thunderbolt_attached(adev->pdev)) { > > > + struct pci_dev *bridge; > > > + bool recovered = false; > > > + > > > + dev_info(adev->dev, > > > + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > > + > > > + bridge = pci_upstream_bridge(adev->pdev); > > > + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > + dev_info(adev->dev, > > > + "attempting link recovery via %s\n", > > > + pci_name(bridge)); > > > + pci_bridge_secondary_bus_reset(bridge); > > > + msleep(100); > > > + if (pci_device_is_present(adev->pdev)) { > > > + recovered = true; > > > + break; > > > + } > > > + bridge = pci_upstream_bridge(bridge); > > > + } > > > + > > > + if (!recovered) { > > > + dev_err(adev->dev, > > > + "Thunderbolt PCIe link recovery failed\n"); > > > + ret = -ENODEV; > > > + goto mode1_reset_failed; > > > + } > > > + > > > + dev_info(adev->dev, > > > + "Thunderbolt PCIe link recovered via %s\n", > > > + pci_name(bridge)); > > > + } > > > + > > > amdgpu_device_load_pci_state(adev->pdev); > > > ret = amdgpu_psp_wait_for_bootloader(adev); > > > if (ret) > > > -- > > > 2.51.0 > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 11:42 ` Christian König 2026-04-09 12:13 ` Geramy Loveless @ 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless ` (2 more replies) 1 sibling, 3 replies; 14+ messages in thread From: Mario Limonciello @ 2026-04-09 18:12 UTC (permalink / raw) To: Christian König, Geramy Loveless, amd-gfx; +Cc: alexander.deucher On 4/9/26 06:42, Christian König wrote: > On 4/9/26 02:05, Geramy Loveless wrote: >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >> Thunderbolt the TB driver receives no notification and the tunnel >> stays up while the endpoint is unreachable. > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > >> All subsequent PCIe >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >> triggering an infinite reset loop that hangs the system. > > That sounds more like the MODE1 reset failed. > >> After MODE1 reset completes, check whether the PCIe endpoint is still >> reachable using pci_device_is_present(). If the device is behind >> Thunderbolt and the link is dead, walk up parent bridges calling >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >> inside the dock. > > Well that is then a bus reset. > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > >> If recovery fails, return -ENODEV to prevent the >> reset retry loop. >> >> This also causes the GPU fan to be at 100% and basically when it >> happens and you are not there, you now have a GPU with fan at 100% and >> cant reset it. >> I wanted to notate some other things I am finding sometimes before >> this adventure of patches to the kernel and amdgpu driver. >> Sometimes a crash could happen in the drive and then the GPU fan speed >> hits 100% and the air is hot coming out without any workload, other >> times >> I have seen it have barely any fan speed at all and heat up more than >> it should at the fan level its curently operating at. These are things >> I have seen with this gpu in a TB5 dock with the driver and >> instability. I'm not sure exactly whats going on there but I figured >> since im communicating with these patches I might as well bring you up >> to speed and supermario has been great help throughout me trying to >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >> USB4v2 dock! > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? We don't spend a lot of time in recovery scenarios for when 💩 hits the fan. I think in addition to finding and fixing the real root cause having a reproducible workload to cause the crash is a good opportunity to try to put in place better recovery too. Generally speaking I like the idea of if a mode1 reset fails to do a harder reset. At least in the path that we have GPU recovery (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do a full device reset makes sense to me. I think the placement is wrong though. amdgpu_device_mode1_reset() has a bunch of callers, and if you end up with a mode1 reset doing a full reset that might be a surprise to those callers. So I think a more logical place to put this would be explicitly in the GPU recovery path (amdgpu_device_gpu_recover). 
Maybe as part of the mode1 reset failure you can: set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); And then the GPU recovery path can jump right into a full reset? Not sure if that jives with your stack trace though. Furthermore; even though you reproduced this on Thunderbolt; I have no reason to believe it's specific to thunderbolt. An SMU crash can happen in any hardware. We may as well try full reset for recovery for any hardware. > >> >> It seems to be finally working with bar resizing after my kernel >> patch. Which allows you to safely release a empty switch bridge at the >> device end. >> Then it rebuilds it afterwords with the increased bar. This was done >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >> branch with my patch here. >> >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > Where is the MMIO register BAR before and after the rebuild? > > Regards, > Christian. > >> >> Thank you! >> >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >> 1 file changed, 40 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index 31a60173c..91d01d538 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) >> /* ensure no_hw_access is updated before we access hw */ >> smp_mb(); >> + /* >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >> + * endpoint but the TB tunnel stays up unaware. Detect the >> + * dead link and attempt recovery by resetting parent bridges >> + * to retrain the physical PCIe link inside the dock. >> + */ >> + if (!pci_device_is_present(adev->pdev) && >> + pci_is_thunderbolt_attached(adev->pdev)) { >> + struct pci_dev *bridge; >> + bool recovered = false; >> + >> + dev_info(adev->dev, >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); >> + >> + bridge = pci_upstream_bridge(adev->pdev); >> + while (bridge && !pci_is_root_bus(bridge->bus)) { >> + dev_info(adev->dev, >> + "attempting link recovery via %s\n", >> + pci_name(bridge)); >> + pci_bridge_secondary_bus_reset(bridge); >> + msleep(100); >> + if (pci_device_is_present(adev->pdev)) { >> + recovered = true; >> + break; >> + } >> + bridge = pci_upstream_bridge(bridge); >> + } >> + >> + if (!recovered) { >> + dev_err(adev->dev, >> + "Thunderbolt PCIe link recovery failed\n"); >> + ret = -ENODEV; >> + goto mode1_reset_failed; >> + } >> + >> + dev_info(adev->dev, >> + "Thunderbolt PCIe link recovered via %s\n", >> + pci_name(bridge)); >> + } >> + >> amdgpu_device_load_pci_state(adev->pdev); >> ret = amdgpu_psp_wait_for_bootloader(adev); >> if (ret) >> -- >> 2.51.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
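A rough sketch of the fallback Mario describes above, placed in the recovery path rather than inside amdgpu_device_mode1_reset() itself. This is only an illustration of the idea: the exact integration point inside amdgpu_device_gpu_recover(), the availability of reset_context and the device list at that point, the helper name, and the use of amdgpu_do_asic_reset() as the "full reset" entry are assumptions, not the actual upstream code.

```c
/*
 * Illustrative sketch only (not the actual upstream code): if a MODE1
 * reset fails, flag a full device reset and let the normal recovery
 * machinery handle it. Helper name and calling context are assumptions.
 */
static int amdgpu_try_mode1_then_full_reset(struct amdgpu_device *adev,
					    struct list_head *device_list_handle,
					    struct amdgpu_reset_context *reset_context)
{
	int r;

	r = amdgpu_device_mode1_reset(adev);
	if (!r)
		return 0;

	dev_warn(adev->dev,
		 "MODE1 reset failed (%d), falling back to full reset\n", r);

	/* Ask the recovery path for a full reset, as suggested above. */
	set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
	return amdgpu_do_asic_reset(device_list_handle, reset_context);
}
```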
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello @ 2026-04-10 0:12 ` Geramy Loveless 2026-04-10 6:22 ` Geramy Loveless 2026-04-10 9:13 ` Christian König 2026-04-10 11:25 ` Lazar, Lijo 2 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 0:12 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher Hey, I have nearly finished my patch; I need to double check some things first, but here is a summary of the real cause as far as I can see. See below for an in-depth analysis. I am not sure, technically, whether Thunderbolt or amdgpu is at fault in this scenario; please let me know your opinion on this and where it should be patched. Of course the patch I'm working on provides a resolution path if it happens, which is a nice recovery mode, but it does not stop it from occurring. ## Summary R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. The GPU initializes fully, but MMIO later becomes unreachable while PCIe config space continues to work, cascading into SDMA timeouts and GPU reset loops. Reproduced on two separate boots (10s and 96s after init). The split between working config space and dead MMIO points to the Thunderbolt PCIe tunnel selectively dropping memory transactions while continuing to pass configuration transactions. ## Hardware - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 - Dock: Razer Core X V2 (TB4, FW 59.82) - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled - PCIe topology through dock: ``` 66:03.0 TB bridge (32GB pref window) 93:00.0 Intel 5786 Upstream Switch 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) 95:00.0 AMD 1478 Upstream Switch 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) 97:00.0 GPU [1002:7551] ``` - No display connected to eGPU ## Kernel ``` Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 ``` ## The evidence ### 1. GPU initializes successfully on both boots Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): ``` 16:45:46.038 SMU is initialized successfully! 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 16:45:46.131 runtime pm is manually disabled 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 ``` Boot 0 (7.0.0-rc7-egpu+, dmesg): ``` [9551.162] SMU is initialized successfully! [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 [9551.248] runtime pm is manually disabled [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 ``` All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to all init-time messages. No errors during initialization. ### 2. SMU becomes unreachable after variable delay Boot -1 — **10 seconds** after init: ``` 16:45:56.192 Failed to disable gfxoff! 16:45:56.192 SMU is in hanged state, failed to send smu message! 16:45:56.192 Failed to export SMU metrics table! (repeated ~30 times) ``` Boot 0 — **96 seconds** after init: ``` [9647.872] Failed to export SMU metrics table! [9647.872] SMU is in hanged state, failed to send smu message! (repeated) [9661.567] Failed to disable gfxoff! ``` The delay is not consistent (10s vs 96s), ruling out a fixed firmware timer.
The first failing operation varies (gfxoff disable vs metrics export), suggesting the SMU itself isn't crashing — the communication path to it is dying. ### 3. Config space alive, MMIO dead (proved during boot 0 crash) Tested during the active crash with the GPU in "SMU hanged" state: **Config space (works):** ``` $ sudo setpci -s 97:00.0 0x00.l 75511002 ← correct vendor/device ID $ sudo setpci -s 97:00.0 0x04.l 00100406 ← status/command register OK ``` **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** ```python fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) data = os.read(fd, 4) # OSError: [Errno 5] Input/output error ``` **MMIO BAR0 at 0x8880000000 — VRAM (dead):** ```python fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) data = os.read(fd, 4) # OSError: [Errno 5] Input/output error ``` Config transactions reach the device through the TB tunnel. Memory transactions do not. This is the root cause of the "SMU hanged state" — the SMU firmware is likely fine, but the MMIO writes to its mailbox registers never arrive. ### 4. Thunderbolt host router and dock are runtime-suspended Monitored with 2-second polling during the crash: ``` 16:54:06 host=suspended dock=suspended gpu_errors=1 16:54:09 host=suspended dock=suspended gpu_errors=3 16:54:11 host=suspended dock=suspended gpu_errors=5 ... 16:54:57 host=suspended dock=suspended gpu_errors=119 ``` TB host router runtime PM configuration: ``` /sys/bus/thunderbolt/devices/0-0/power/control = auto /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended ``` Both the TB host router and dock switch show `suspended` throughout the crash. The PCIe tunnel was activated while they were in this state. ### 5. Waking TB host router does not restore MMIO ``` $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status active $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" # OSError: [Errno 5] Input/output error ← still dead ``` Once MMIO is lost, waking the TB host router doesn't recover it. The damage to the memory transaction path persists until device removal and re-enumeration (or reboot). ### 6. Full crash cascade (boot -1) ``` 16:45:46 Init complete 16:45:56 SMU hanged (MMIO path dead) 16:45:58 SDMA ring timeout → ring reset succeeds 16:46:00 SDMA ring timeout again → ring reset succeeds 16:51:16 Full GPU reset (MODE1 via SMU) 16:51:33 GPU reset succeeded, SMU resumed GPU runs for 37 more minutes 17:28:17 SDMA timeout → ring reset FAILS 17:28:17 GPU reset returns -ENODEV (device gone from bus) Infinite reset loop, system unusable ``` MODE1 reset succeeds once (re-establishing the MMIO path temporarily) but the problem recurs, and the second MODE1 reset kills the PCIe link entirely (-ENODEV). ## Configuration notes - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not active. This does not prevent the crash. - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management is disabled. This does not prevent the crash. - TB host router runtime PM (`power/control=auto`) is NOT disabled by any of the above kernel parameters. - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. Init succeeds despite mismatch. - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. 
## Related - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) - SMU driver if version 0x2e, fw if version 0x32 (mismatch) - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 - TB dock: Razer Core X V2, FW 59.82 Geramy L. Loveless Founder & Chief Innovation Officer JQluv.net, Inc. Site: JQluv.com Mobile: 559.999.1557 Office: 1 (877) 44 JQluv On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello <mario.limonciello@amd.com> wrote: > > > > On 4/9/26 06:42, Christian König wrote: > > On 4/9/26 02:05, Geramy Loveless wrote: > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > >> Thunderbolt the TB driver receives no notification and the tunnel > >> stays up while the endpoint is unreachable. > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > >> All subsequent PCIe > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > >> triggering an infinite reset loop that hangs the system. > > > > That sounds more like the MODE1 reset failed. > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > >> reachable using pci_device_is_present(). If the device is behind > >> Thunderbolt and the link is dead, walk up parent bridges calling > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > >> inside the dock. > > > > Well that is then a bus reset. > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > >> If recovery fails, return -ENODEV to prevent the > >> reset retry loop. > >> > >> This also causes the GPU fan to be at 100% and basically when it > >> happens and you are not there, you now have a GPU with fan at 100% and > >> cant reset it. > >> I wanted to notate some other things I am finding sometimes before > >> this adventure of patches to the kernel and amdgpu driver. > >> Sometimes a crash could happen in the drive and then the GPU fan speed > >> hits 100% and the air is hot coming out without any workload, other > >> times > >> I have seen it have barely any fan speed at all and heat up more than > >> it should at the fan level its curently operating at. These are things > >> I have seen with this gpu in a TB5 dock with the driver and > >> instability. I'm not sure exactly whats going on there but I figured > >> since im communicating with these patches I might as well bring you up > >> to speed and supermario has been great help throughout me trying to > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > >> USB4v2 dock! > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > fan. I think in addition to finding and fixing the real root cause > having a reproducible workload to cause the crash is a good opportunity > to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a > harder reset. At least in the path that we have GPU recovery > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > a full device reset makes sense to me. > > I think the placement is wrong though. 
amdgpu_device_mode1_reset() has > a bunch of callers, and if you end up with a mode1 reset doing a full > reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not > sure if that jives with your stack trace though. > > Furthermore; even though you reproduced this on Thunderbolt; I have no > reason to believe it's specific to thunderbolt. An SMU crash can happen > in any hardware. We may as well try full reset for recovery for any > hardware. > > > > >> > >> It seems to be finally working with bar resizing after my kernel > >> patch. Which allows you to safely release a empty switch bridge at the > >> device end. > >> Then it rebuilds it afterwords with the increased bar. This was done > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > >> branch with my patch here. > >> > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > Where is the MMIO register BAR before and after the rebuild? > > > > Regards, > > Christian. > > > >> > >> Thank you! > >> > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > >> 1 file changed, 40 insertions(+) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> index 31a60173c..91d01d538 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > >> /* ensure no_hw_access is updated before we access hw */ > >> smp_mb(); > >> + /* > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > >> + * endpoint but the TB tunnel stays up unaware. Detect the > >> + * dead link and attempt recovery by resetting parent bridges > >> + * to retrain the physical PCIe link inside the dock. > >> + */ > >> + if (!pci_device_is_present(adev->pdev) && > >> + pci_is_thunderbolt_attached(adev->pdev)) { > >> + struct pci_dev *bridge; > >> + bool recovered = false; > >> + > >> + dev_info(adev->dev, > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > >> + > >> + bridge = pci_upstream_bridge(adev->pdev); > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > >> + dev_info(adev->dev, > >> + "attempting link recovery via %s\n", > >> + pci_name(bridge)); > >> + pci_bridge_secondary_bus_reset(bridge); > >> + msleep(100); > >> + if (pci_device_is_present(adev->pdev)) { > >> + recovered = true; > >> + break; > >> + } > >> + bridge = pci_upstream_bridge(bridge); > >> + } > >> + > >> + if (!recovered) { > >> + dev_err(adev->dev, > >> + "Thunderbolt PCIe link recovery failed\n"); > >> + ret = -ENODEV; > >> + goto mode1_reset_failed; > >> + } > >> + > >> + dev_info(adev->dev, > >> + "Thunderbolt PCIe link recovered via %s\n", > >> + pci_name(bridge)); > >> + } > >> + > >> amdgpu_device_load_pci_state(adev->pdev); > >> ret = amdgpu_psp_wait_for_bootloader(adev); > >> if (ret) > >> -- > >> 2.51.0 > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 0:12 ` Geramy Loveless @ 2026-04-10 6:22 ` Geramy Loveless 2026-04-10 7:07 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 6:22 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher Before you guys waste your time reading all the below, dont. I had made a mistake in my patch to PCI basically causing the entire tree of devices to get released, I was using an unsafe version of the pci API. It has been corrected following suggestions by the reviewer on the original patch. https://lore.kernel.org/linux-pci/20260410052918.5556-2-gloveless@jqluv.com/ If you would like I still could submit some safety patch work as mario was suggesting that its not a bad idea to have the ability to handle edge case situations to prevent crashing in the future if something goes awry. There was also one basically the GPU would not get initialized correctly and it would attempt to access null reference rings which never got filled. I appreciate the help and explanation of how the systems on the GPU work and look forward to learning more but hopefully not because of a bug that either I cause or run across haha :) I am going to start backing out the kernel parameters in hopes that everything is happy with the pci fix I implemented and hopefully i dont need to set params. On Thu, Apr 9, 2026 at 5:12 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > Hey, > > I have nearly finished my patch, I need to double check some stuff > first but here is a summary of the real cause that I can see. > See below for a in depth in depth analysis. I am not sure technically > if thunderbolt is at fault in this scenario or the amdgpu please let > me know what your opinion is on this and where it should be patched > at. Of course the patch i'm working on allows for a resolution path if > it happens which is a nice recovery mode but it doesnt solve it > occuring. > > ## Summary > > R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. GPU > initializes fully but MMIO becomes unreachable while PCIe config space > continues to work. MMIO becomes unresponsive, > cascading into SDMA timeouts and GPU reset loops. Reproduced on two > separate boots (10s and 96s after init). > > The split between working config space and dead MMIO points to the > Thunderbolt PCIe tunnel selectively dropping memory transactions while > continuing to pass configuration transactions. > > ## Hardware > > - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU > - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 > - Dock: Razer Core X V2 (TB4, FW 59.82) > - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 > - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane > - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled > - PCIe topology through dock: > ``` > 66:03.0 TB bridge (32GB pref window) > 93:00.0 Intel 5786 Upstream Switch > 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) > 95:00.0 AMD 1478 Upstream Switch > 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) > 97:00.0 GPU [1002:7551] > ``` > - No display connected to eGPU > > ## Kernel > > ``` > Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC > cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 > ``` > > ## The evidence > > ### 1. GPU initializes successfully on both boots > > Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): > ``` > 16:45:46.038 SMU is initialized successfully! 
> 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > 16:45:46.131 runtime pm is manually disabled > 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > ``` > > Boot 0 (7.0.0-rc7-egpu+, dmesg): > ``` > [9551.162] SMU is initialized successfully! > [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > [9551.248] runtime pm is manually disabled > [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > ``` > > All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to > all init-time messages. No errors during initialization. > > ### 2. SMU becomes unreachable after variable delay > > Boot -1 — **10 seconds** after init: > ``` > 16:45:56.192 Failed to disable gfxoff! > 16:45:56.192 SMU is in hanged state, failed to send smu message! > 16:45:56.192 Failed to export SMU metrics table! > (repeated ~30 times) > ``` > > Boot 0 — **96 seconds** after init: > ``` > [9647.872] Failed to export SMU metrics table! > [9647.872] SMU is in hanged state, failed to send smu message! > (repeated) > [9661.567] Failed to disable gfxoff! > ``` > > The delay is not consistent (10s vs 96s), ruling out a fixed firmware > timer. The first failing operation varies (gfxoff disable vs metrics > export), suggesting the SMU itself isn't crashing — the communication > path to it is dying. > > ### 3. Config space alive, MMIO dead (proved during boot 0 crash) > > Tested during the active crash with the GPU in "SMU hanged" state: > > **Config space (works):** > ``` > $ sudo setpci -s 97:00.0 0x00.l > 75511002 ← correct vendor/device ID > $ sudo setpci -s 97:00.0 0x04.l > 00100406 ← status/command register OK > ``` > > **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** > ```python > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) > data = os.read(fd, 4) > # OSError: [Errno 5] Input/output error > ``` > > **MMIO BAR0 at 0x8880000000 — VRAM (dead):** > ```python > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) > data = os.read(fd, 4) > # OSError: [Errno 5] Input/output error > ``` > > Config transactions reach the device through the TB tunnel. Memory > transactions do not. This is the root cause of the "SMU hanged state" — > the SMU firmware is likely fine, but the MMIO writes to its mailbox > registers never arrive. > > ### 4. Thunderbolt host router and dock are runtime-suspended > > Monitored with 2-second polling during the crash: > ``` > 16:54:06 host=suspended dock=suspended gpu_errors=1 > 16:54:09 host=suspended dock=suspended gpu_errors=3 > 16:54:11 host=suspended dock=suspended gpu_errors=5 > ... > 16:54:57 host=suspended dock=suspended gpu_errors=119 > ``` > > TB host router runtime PM configuration: > ``` > /sys/bus/thunderbolt/devices/0-0/power/control = auto > /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 > /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended > ``` > > Both the TB host router and dock switch show `suspended` throughout > the crash. The PCIe tunnel was activated while they were in this state. > > ### 5. Waking TB host router does not restore MMIO > > ``` > $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control > $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status > active > > $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" > # OSError: [Errno 5] Input/output error ← still dead > ``` > > Once MMIO is lost, waking the TB host router doesn't recover it. 
The > damage to the memory transaction path persists until device removal > and re-enumeration (or reboot). > > ### 6. Full crash cascade (boot -1) > > ``` > 16:45:46 Init complete > 16:45:56 SMU hanged (MMIO path dead) > 16:45:58 SDMA ring timeout → ring reset succeeds > 16:46:00 SDMA ring timeout again → ring reset succeeds > 16:51:16 Full GPU reset (MODE1 via SMU) > 16:51:33 GPU reset succeeded, SMU resumed > GPU runs for 37 more minutes > 17:28:17 SDMA timeout → ring reset FAILS > 17:28:17 GPU reset returns -ENODEV (device gone from bus) > Infinite reset loop, system unusable > ``` > > MODE1 reset succeeds once (re-establishing the MMIO path temporarily) > but the problem recurs, and the second MODE1 reset kills the PCIe link > entirely (-ENODEV). > > ## Configuration notes > > - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not > active. This does not prevent the crash. > - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management > is disabled. This does not prevent the crash. > - TB host router runtime PM (`power/control=auto`) is NOT disabled by > any of the above kernel parameters. > - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. > Init succeeds despite mismatch. > - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. > > ## Related > > - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 > - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) > - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) > - SMU driver if version 0x2e, fw if version 0x32 (mismatch) > - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 > - TB dock: Razer Core X V2, FW 59.82 > > > > > Geramy L. Loveless > Founder & Chief Innovation Officer > > JQluv.net, Inc. > Site: JQluv.com > Mobile: 559.999.1557 > Office: 1 (877) 44 JQluv > > > > > On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello > <mario.limonciello@amd.com> wrote: > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > > On 4/9/26 02:05, Geramy Loveless wrote: > > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > >> Thunderbolt the TB driver receives no notification and the tunnel > > >> stays up while the endpoint is unreachable. > > > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > >> All subsequent PCIe > > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > >> triggering an infinite reset loop that hangs the system. > > > > > > That sounds more like the MODE1 reset failed. > > > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > > >> reachable using pci_device_is_present(). If the device is behind > > >> Thunderbolt and the link is dead, walk up parent bridges calling > > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > >> inside the dock. > > > > > > Well that is then a bus reset. > > > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > >> If recovery fails, return -ENODEV to prevent the > > >> reset retry loop. > > >> > > >> This also causes the GPU fan to be at 100% and basically when it > > >> happens and you are not there, you now have a GPU with fan at 100% and > > >> cant reset it. > > >> I wanted to notate some other things I am finding sometimes before > > >> this adventure of patches to the kernel and amdgpu driver. 
> > >> Sometimes a crash could happen in the drive and then the GPU fan speed > > >> hits 100% and the air is hot coming out without any workload, other > > >> times > > >> I have seen it have barely any fan speed at all and heat up more than > > >> it should at the fan level its curently operating at. These are things > > >> I have seen with this gpu in a TB5 dock with the driver and > > >> instability. I'm not sure exactly whats going on there but I figured > > >> since im communicating with these patches I might as well bring you up > > >> to speed and supermario has been great help throughout me trying to > > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > >> USB4v2 dock! > > > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > fan. I think in addition to finding and fixing the real root cause > > having a reproducible workload to cause the crash is a good opportunity > > to try to put in place better recovery too. > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > harder reset. At least in the path that we have GPU recovery > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > a full device reset makes sense to me. > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > a bunch of callers, and if you end up with a mode1 reset doing a full > > reset that might be a surprise to those callers. > > > > So I think a more logical place to put this would be explicitly in the > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > mode1 reset failure you can: > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > And then the GPU recovery path can jump right into a full reset? Not > > sure if that jives with your stack trace though. > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > in any hardware. We may as well try full reset for recovery for any > > hardware. > > > > > > > >> > > >> It seems to be finally working with bar resizing after my kernel > > >> patch. Which allows you to safely release a empty switch bridge at the > > >> device end. > > >> Then it rebuilds it afterwords with the increased bar. This was done > > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > >> branch with my patch here. > > >> > > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > > > Regards, > > > Christian. > > > > > >> > > >> Thank you! 
> > >> > > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > >> --- > > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > >> 1 file changed, 40 insertions(+) > > >> > > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> index 31a60173c..91d01d538 100644 > > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > >> /* ensure no_hw_access is updated before we access hw */ > > >> smp_mb(); > > >> + /* > > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > >> + * endpoint but the TB tunnel stays up unaware. Detect the > > >> + * dead link and attempt recovery by resetting parent bridges > > >> + * to retrain the physical PCIe link inside the dock. > > >> + */ > > >> + if (!pci_device_is_present(adev->pdev) && > > >> + pci_is_thunderbolt_attached(adev->pdev)) { > > >> + struct pci_dev *bridge; > > >> + bool recovered = false; > > >> + > > >> + dev_info(adev->dev, > > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > >> + > > >> + bridge = pci_upstream_bridge(adev->pdev); > > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > >> + dev_info(adev->dev, > > >> + "attempting link recovery via %s\n", > > >> + pci_name(bridge)); > > >> + pci_bridge_secondary_bus_reset(bridge); > > >> + msleep(100); > > >> + if (pci_device_is_present(adev->pdev)) { > > >> + recovered = true; > > >> + break; > > >> + } > > >> + bridge = pci_upstream_bridge(bridge); > > >> + } > > >> + > > >> + if (!recovered) { > > >> + dev_err(adev->dev, > > >> + "Thunderbolt PCIe link recovery failed\n"); > > >> + ret = -ENODEV; > > >> + goto mode1_reset_failed; > > >> + } > > >> + > > >> + dev_info(adev->dev, > > >> + "Thunderbolt PCIe link recovered via %s\n", > > >> + pci_name(bridge)); > > >> + } > > >> + > > >> amdgpu_device_load_pci_state(adev->pdev); > > >> ret = amdgpu_psp_wait_for_bootloader(adev); > > >> if (ret) > > >> -- > > >> 2.51.0 > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 6:22 ` Geramy Loveless @ 2026-04-10 7:07 ` Geramy Loveless 0 siblings, 0 replies; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 7:07 UTC (permalink / raw) To: Mario Limonciello; +Cc: Christian König, amd-gfx, alexander.deucher One last update, we are more stable but there is still something crashing the GPU somewhere please give guidance on where to look. ## SMU Firmware Version ``` smu driver if version = 0x0000002e smu fw if version = 0x00000032 smu fw program = 0 smu fw version = 0x00684b00 (104.75.0) ``` Note: Driver interface version (0x2e / 46) does not match firmware interface version (0x32 / 50). ## PCI Topology ``` 65:00.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) 66:00.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → NHI 66:01.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → empty hotplug port 66:02.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → USB 66:03.0 PCI bridge: Intel Barlow Ridge Host 80G (rev 84) → dock 93:00.0 PCI bridge: Intel Barlow Ridge Hub 80G (rev 85) → dock switch 94:00.0 PCI bridge: Intel Barlow Ridge Hub 80G (rev 85) → downstream 95:00.0 PCI bridge: AMD Navi 10 XL Upstream Port (rev 24) 96:00.0 PCI bridge: AMD Navi 10 XL Downstream Port (rev 24) 97:00.0 VGA: AMD [1002:7551] (rev c0) ← GPU 97:00.1 Audio: AMD [1002:ab40] ``` ## Workload GPU compute via llama.cpp (ROCm/HIP backend), running Qwen3.5-35B-A3B-Q4_K_M.gguf model (20.49 GiB, fully offloaded to VRAM). Flash attention enabled, 128K context, 32 threads. ## Crash Timeline All timestamps from `dmesg -T`, kernel boot-relative times in brackets. ### GPU initialization (successful) ``` [603.644s] GPU probe: IP DISCOVERY 0x1002:0x7551 [603.653s] Detected IP block: smu_v14_0_0, gfx_v12_0_0 [603.771s] Detected VRAM RAM=32624M, BAR=32768M, RAM width 256bits GDDR6 [604.014s] SMU driver IF 0x2e, FW IF 0x32, FW version 104.75.0 [604.049s] SMU is initialized successfully! [604.119s] Runtime PM manually disabled (amdgpu.runpm=0) [604.119s] Initialized amdgpu 3.64.0 for 0000:97:00.0 ``` ### SMU stops responding [T+4238s after init, ~70 minutes] ``` [4841.828s] SMU: No response msg_reg: 12 resp_reg: 0 [4841.828s] [smu_v14_0_2_get_power_profile_mode] Failed to get activity monitor! [4849.393s] SMU: No response msg_reg: 12 resp_reg: 0 [4849.393s] Failed to export SMU metrics table! ``` 15 consecutive `SMU: No response` messages logged between [4841s] and [4948s], approximately every 7-8 seconds. All with `msg_reg: 12 resp_reg: 0`. Failed operations include: - `smu_v14_0_2_get_power_profile_mode` — Failed to get activity monitor - `Failed to export SMU metrics table` - `Failed to get current clock freq` ### Page faults begin [T+4349s after init, ~111s after first SMU failure] ``` [4948.927s] [gfxhub] page fault (src_id:0 ring:40 vmid:9 pasid:108) Process llama-cli pid 35632 GCVM_L2_PROTECTION_FAULT_STATUS: 0x00941051 Faulty UTCL2 client ID: TCP (0x8) PERMISSION_FAULTS: 0x5 WALKER_ERROR: 0x0 MAPPING_ERROR: 0x0 RW: 0x1 (write) ``` 10 page faults logged at [4948s], all from TCP (Texture Cache Pipe), all PERMISSION_FAULTS=0x5, WALKER_ERROR=0x0, MAPPING_ERROR=0x0. 
7 unique faulting addresses: - 0x000072ce90828000 - 0x000072ce90a88000 - 0x000072ce90a89000 - 0x000072ce90cde000 - 0x000072ce90ce1000 - 0x000072ce90f51000 - 0x000072ce90f52000 ### MES failure and GPU reset [T+4349s] ``` [4952.809s] MES(0) failed to respond to msg=REMOVE_QUEUE [4952.809s] failed to remove hardware queue from MES, doorbell=0x1806 [4952.809s] MES might be in unrecoverable state, issue a GPU reset [4952.809s] Failed to evict queue 4 [4952.809s] Failed to evict process queues [4952.809s] GPU reset begin!. Source: 3 ``` ### GPU reset fails ``` [4953.121s] Failed to evict queue 4 [4953.121s] Failed to suspend process pid 28552 [4953.121s] remove_all_kfd_queues_mes: Failed to remove queue 3 for dev 62536 ``` 6 MES(1) REMOVE_QUEUE failures, each timing out after ~2.5 seconds: ``` [4955.720s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4958.283s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4960.847s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4963.411s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4965.976s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue [4968.540s] MES(1) failed to respond to msg=REMOVE_QUEUE → failed to unmap legacy queue ``` ### PSP suspend fails ``` [4971.164s] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x0) [4971.164s] Failed to terminate ras ta [4971.164s] suspend of IP block <psp> failed -22 ``` ### Suspend unwind fails — SMU not ready ``` [4971.164s] SMU is resuming... [4971.164s] SMC is not ready [4971.164s] SMC engine is not correctly up! [4971.164s] resume of IP block <smu> failed -5 [4971.164s] amdgpu_device_ip_resume_phase2 failed during unwind: -5 [4971.164s] GPU pre asic reset failed with err, -22 for drm dev, 0000:97:00.0 ``` ### MODE1 reset — SMU still dead ``` [4971.164s] MODE1 reset [4971.164s] GPU mode1 reset [4971.164s] GPU smu mode1 reset [4972.193s] GPU reset succeeded, trying to resume [4972.193s] VRAM is lost due to GPU reset! [4972.193s] SMU is resuming... [4972.193s] SMC is not ready [4972.193s] SMC engine is not correctly up! [4972.193s] resume of IP block <smu> failed -5 [4972.193s] GPU reset end with ret = -5 ``` Geramy L. Loveless Founder & Chief Innovation Officer JQluv.net, Inc. Site: JQluv.com Mobile: 559.999.1557 Office: 1 (877) 44 JQluv On Thu, Apr 9, 2026 at 11:22 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > Before you guys waste your time reading all the below, dont. > I had made a mistake in my patch to PCI basically causing the entire > tree of devices to get released, I was using an unsafe version of the > pci API. > It has been corrected following suggestions by the reviewer on the > original patch. > > https://lore.kernel.org/linux-pci/20260410052918.5556-2-gloveless@jqluv.com/ > > If you would like I still could submit some safety patch work as mario > was suggesting that its not a bad idea to have the ability > to handle edge case situations to prevent crashing in the future if > something goes awry. There was also one basically the GPU would not > get initialized correctly and it would attempt to access null > reference rings which never got filled. 
> > I appreciate the help and explanation of how the systems on the GPU > work and look forward to learning more but hopefully not because of a > bug that either > I cause or run across haha :) I am going to start backing out the > kernel parameters in hopes that everything is happy with the pci fix I > implemented and hopefully i dont need to set params. > > > > On Thu, Apr 9, 2026 at 5:12 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > > > Hey, > > > > I have nearly finished my patch, I need to double check some stuff > > first but here is a summary of the real cause that I can see. > > See below for a in depth in depth analysis. I am not sure technically > > if thunderbolt is at fault in this scenario or the amdgpu please let > > me know what your opinion is on this and where it should be patched > > at. Of course the patch i'm working on allows for a resolution path if > > it happens which is a nice recovery mode but it doesnt solve it > > occuring. > > > > ## Summary > > > > R9700 Pro [1002:7551] (gfx1201) connected via Thunderbolt 5 dock. GPU > > initializes fully but MMIO becomes unreachable while PCIe config space > > continues to work. MMIO becomes unresponsive, > > cascading into SDMA timeouts and GPU reset loops. Reproduced on two > > separate boots (10s and 96s after init). > > > > The split between working config space and dead MMIO points to the > > Thunderbolt PCIe tunnel selectively dropping memory transactions while > > continuing to pass configuration transactions. > > > > ## Hardware > > > > - Host: MSI MS-S1 MAX (Strix Halo), AMD IOMMU > > - Thunderbolt host controller: Intel Barlow Ridge TB5 [8086:5780] at 67:00.0 > > - Dock: Razer Core X V2 (TB4, FW 59.82) > > - GPU: AMD R9700 Pro [1002:7551] gfx1201, 32GB GDDR6 > > - Connection: TB5 host → TB4 dock, 40 Gb/s dual lane > > - TB tunnel: PCIe 0:10 <-> 3:9, extended encapsulation enabled > > - PCIe topology through dock: > > ``` > > 66:03.0 TB bridge (32GB pref window) > > 93:00.0 Intel 5786 Upstream Switch > > 94:00.0 Intel 5786 Downstream Switch (Gen4 x4 to AMD switch) > > 95:00.0 AMD 1478 Upstream Switch > > 96:00.0 AMD 1479 Downstream Switch (Gen5 x16 to GPU) > > 97:00.0 GPU [1002:7551] > > ``` > > - No display connected to eGPU > > > > ## Kernel > > > > ``` > > Linux 7.0.0-rc7-egpu+ #7 SMP PREEMPT_DYNAMIC > > cmdline: pcie_port_pm=off pcie_aspm=off amdgpu.runpm=0 > > ``` > > > > ## The evidence > > > > ### 1. GPU initializes successfully on both boots > > > > Boot -1 (7.0.0-rc7-egpu+, journalctl, precise timestamps): > > ``` > > 16:45:46.038 SMU is initialized successfully! > > 16:45:46.038 [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > > 16:45:46.131 runtime pm is manually disabled > > 16:45:46.131 [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > > ``` > > > > Boot 0 (7.0.0-rc7-egpu+, dmesg): > > ``` > > [9551.162] SMU is initialized successfully! > > [9551.163] [drm] Display Core v3.2.369 initialized on DCN 4.0.1 > > [9551.248] runtime pm is manually disabled > > [9551.249] [drm] Initialized amdgpu 3.64.0 for 0000:97:00.0 on minor 0 > > ``` > > > > All IP blocks come up clean. 32624MB VRAM. 64 CUs. SMU responds to > > all init-time messages. No errors during initialization. > > > > ### 2. SMU becomes unreachable after variable delay > > > > Boot -1 — **10 seconds** after init: > > ``` > > 16:45:56.192 Failed to disable gfxoff! > > 16:45:56.192 SMU is in hanged state, failed to send smu message! > > 16:45:56.192 Failed to export SMU metrics table! 
> > (repeated ~30 times) > > ``` > > > > Boot 0 — **96 seconds** after init: > > ``` > > [9647.872] Failed to export SMU metrics table! > > [9647.872] SMU is in hanged state, failed to send smu message! > > (repeated) > > [9661.567] Failed to disable gfxoff! > > ``` > > > > The delay is not consistent (10s vs 96s), ruling out a fixed firmware > > timer. The first failing operation varies (gfxoff disable vs metrics > > export), suggesting the SMU itself isn't crashing — the communication > > path to it is dying. > > > > ### 3. Config space alive, MMIO dead (proved during boot 0 crash) > > > > Tested during the active crash with the GPU in "SMU hanged" state: > > > > **Config space (works):** > > ``` > > $ sudo setpci -s 97:00.0 0x00.l > > 75511002 ← correct vendor/device ID > > $ sudo setpci -s 97:00.0 0x04.l > > 00100406 ← status/command register OK > > ``` > > > > **MMIO BAR5 at 0xc4000000 — SMU register space (dead):** > > ```python > > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource5', os.O_RDONLY) > > data = os.read(fd, 4) > > # OSError: [Errno 5] Input/output error > > ``` > > > > **MMIO BAR0 at 0x8880000000 — VRAM (dead):** > > ```python > > fd = os.open('/sys/bus/pci/devices/0000:97:00.0/resource0', os.O_RDONLY) > > data = os.read(fd, 4) > > # OSError: [Errno 5] Input/output error > > ``` > > > > Config transactions reach the device through the TB tunnel. Memory > > transactions do not. This is the root cause of the "SMU hanged state" — > > the SMU firmware is likely fine, but the MMIO writes to its mailbox > > registers never arrive. > > > > ### 4. Thunderbolt host router and dock are runtime-suspended > > > > Monitored with 2-second polling during the crash: > > ``` > > 16:54:06 host=suspended dock=suspended gpu_errors=1 > > 16:54:09 host=suspended dock=suspended gpu_errors=3 > > 16:54:11 host=suspended dock=suspended gpu_errors=5 > > ... > > 16:54:57 host=suspended dock=suspended gpu_errors=119 > > ``` > > > > TB host router runtime PM configuration: > > ``` > > /sys/bus/thunderbolt/devices/0-0/power/control = auto > > /sys/bus/thunderbolt/devices/0-0/power/autosuspend_delay_ms = 15000 > > /sys/bus/thunderbolt/devices/0-0/power/runtime_status = suspended > > ``` > > > > Both the TB host router and dock switch show `suspended` throughout > > the crash. The PCIe tunnel was activated while they were in this state. > > > > ### 5. Waking TB host router does not restore MMIO > > > > ``` > > $ echo "on" > /sys/bus/thunderbolt/devices/0-0/power/control > > $ cat /sys/bus/thunderbolt/devices/0-0/power/runtime_status > > active > > > > $ python3 -c "os.read(os.open('/sys/bus/pci/devices/.../resource5', ...), 4)" > > # OSError: [Errno 5] Input/output error ← still dead > > ``` > > > > Once MMIO is lost, waking the TB host router doesn't recover it. The > > damage to the memory transaction path persists until device removal > > and re-enumeration (or reboot). > > > > ### 6. 
Full crash cascade (boot -1) > > > > ``` > > 16:45:46 Init complete > > 16:45:56 SMU hanged (MMIO path dead) > > 16:45:58 SDMA ring timeout → ring reset succeeds > > 16:46:00 SDMA ring timeout again → ring reset succeeds > > 16:51:16 Full GPU reset (MODE1 via SMU) > > 16:51:33 GPU reset succeeded, SMU resumed > > GPU runs for 37 more minutes > > 17:28:17 SDMA timeout → ring reset FAILS > > 17:28:17 GPU reset returns -ENODEV (device gone from bus) > > Infinite reset loop, system unusable > > ``` > > > > MODE1 reset succeeds once (re-establishing the MMIO path temporarily) > > but the problem recurs, and the second MODE1 reset kills the PCIe link > > entirely (-ENODEV). > > > > ## Configuration notes > > > > - `amdgpu.runpm=0` is set and confirmed. GPU runtime PM (BOCO) is not > > active. This does not prevent the crash. > > - `pcie_port_pm=off pcie_aspm=off` are set. PCIe link power management > > is disabled. This does not prevent the crash. > > - TB host router runtime PM (`power/control=auto`) is NOT disabled by > > any of the above kernel parameters. > > - SMU FW version mismatch: driver expects interface 0x2e, FW reports 0x32. > > Init succeeds despite mismatch. > > - "PCIE atomic ops is not supported" — TB bridge doesn't support AtomicOps. > > > > ## Related > > > > - GitLab issue: https://gitlab.freedesktop.org/drm/amd/-/work_items/4978 > > - Device: [1002:7551] (gfx1201, Navi 48, R9700 Pro) > > - SMU FW: smu_v14_0_2, version 0x00684a00 (104.74.0) > > - SMU driver if version 0x2e, fw if version 0x32 (mismatch) > > - TB host controller: Intel Barlow Ridge [8086:5780], FW 61.83 > > - TB dock: Razer Core X V2, FW 59.82 > > > > > > > > > > Geramy L. Loveless > > Founder & Chief Innovation Officer > > > > JQluv.net, Inc. > > Site: JQluv.com > > Mobile: 559.999.1557 > > Office: 1 (877) 44 JQluv > > > > > > > > > > On Thu, Apr 9, 2026 at 11:12 AM Mario Limonciello > > <mario.limonciello@amd.com> wrote: > > > > > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > > > On 4/9/26 02:05, Geramy Loveless wrote: > > > >> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > > >> Thunderbolt the TB driver receives no notification and the tunnel > > > >> stays up while the endpoint is unreachable. > > > > > > > > IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. > > > > > > > >> All subsequent PCIe > > > >> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > > >> triggering an infinite reset loop that hangs the system. > > > > > > > > That sounds more like the MODE1 reset failed. > > > > > > > >> After MODE1 reset completes, check whether the PCIe endpoint is still > > > >> reachable using pci_device_is_present(). If the device is behind > > > >> Thunderbolt and the link is dead, walk up parent bridges calling > > > >> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > > >> inside the dock. > > > > > > > > Well that is then a bus reset. > > > > > > > > I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? > > > > > > > >> If recovery fails, return -ENODEV to prevent the > > > >> reset retry loop. > > > >> > > > >> This also causes the GPU fan to be at 100% and basically when it > > > >> happens and you are not there, you now have a GPU with fan at 100% and > > > >> cant reset it. 
> > > >> I wanted to notate some other things I am finding sometimes before > > > >> this adventure of patches to the kernel and amdgpu driver. > > > >> Sometimes a crash could happen in the drive and then the GPU fan speed > > > >> hits 100% and the air is hot coming out without any workload, other > > > >> times > > > >> I have seen it have barely any fan speed at all and heat up more than > > > >> it should at the fan level its curently operating at. These are things > > > >> I have seen with this gpu in a TB5 dock with the driver and > > > >> instability. I'm not sure exactly whats going on there but I figured > > > >> since im communicating with these patches I might as well bring you up > > > >> to speed and supermario has been great help throughout me trying to > > > >> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > > >> USB4v2 dock! > > > > > > > > Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. > > > > > > > > But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > > fan. I think in addition to finding and fixing the real root cause > > > having a reproducible workload to cause the crash is a good opportunity > > > to try to put in place better recovery too. > > > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > > harder reset. At least in the path that we have GPU recovery > > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > > a full device reset makes sense to me. > > > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > > a bunch of callers, and if you end up with a mode1 reset doing a full > > > reset that might be a surprise to those callers. > > > > > > So I think a more logical place to put this would be explicitly in the > > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > > mode1 reset failure you can: > > > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > > > And then the GPU recovery path can jump right into a full reset? Not > > > sure if that jives with your stack trace though. > > > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > > in any hardware. We may as well try full reset for recovery for any > > > hardware. > > > > > > > > > > >> > > > >> It seems to be finally working with bar resizing after my kernel > > > >> patch. Which allows you to safely release a empty switch bridge at the > > > >> device end. > > > >> Then it rebuilds it afterwords with the increased bar. This was done > > > >> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > > >> branch with my patch here. > > > >> > > > >> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > > > > > > > Where is the MMIO register BAR before and after the rebuild? > > > > > > > > Regards, > > > > Christian. > > > > > > > >> > > > >> Thank you! 
> > > >> > > > >> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > > >> --- > > > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > > >> 1 file changed, 40 insertions(+) > > > >> > > > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> index 31a60173c..91d01d538 100644 > > > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > > >> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) > > > >> /* ensure no_hw_access is updated before we access hw */ > > > >> smp_mb(); > > > >> + /* > > > >> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > > >> + * endpoint but the TB tunnel stays up unaware. Detect the > > > >> + * dead link and attempt recovery by resetting parent bridges > > > >> + * to retrain the physical PCIe link inside the dock. > > > >> + */ > > > >> + if (!pci_device_is_present(adev->pdev) && > > > >> + pci_is_thunderbolt_attached(adev->pdev)) { > > > >> + struct pci_dev *bridge; > > > >> + bool recovered = false; > > > >> + > > > >> + dev_info(adev->dev, > > > >> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); > > > >> + > > > >> + bridge = pci_upstream_bridge(adev->pdev); > > > >> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > > >> + dev_info(adev->dev, > > > >> + "attempting link recovery via %s\n", > > > >> + pci_name(bridge)); > > > >> + pci_bridge_secondary_bus_reset(bridge); > > > >> + msleep(100); > > > >> + if (pci_device_is_present(adev->pdev)) { > > > >> + recovered = true; > > > >> + break; > > > >> + } > > > >> + bridge = pci_upstream_bridge(bridge); > > > >> + } > > > >> + > > > >> + if (!recovered) { > > > >> + dev_err(adev->dev, > > > >> + "Thunderbolt PCIe link recovery failed\n"); > > > >> + ret = -ENODEV; > > > >> + goto mode1_reset_failed; > > > >> + } > > > >> + > > > >> + dev_info(adev->dev, > > > >> + "Thunderbolt PCIe link recovered via %s\n", > > > >> + pci_name(bridge)); > > > >> + } > > > >> + > > > >> amdgpu_device_load_pci_state(adev->pdev); > > > >> ret = amdgpu_psp_wait_for_bootloader(adev); > > > >> if (ret) > > > >> -- > > > >> 2.51.0 > > > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless @ 2026-04-10 9:13 ` Christian König 2026-04-10 11:25 ` Lazar, Lijo 2 siblings, 0 replies; 14+ messages in thread From: Christian König @ 2026-04-10 9:13 UTC (permalink / raw) To: Mario Limonciello, Geramy Loveless, amd-gfx Cc: alexander.deucher, Pelloux-Prayer, Pierre-Eric Hi Mario, On 4/9/26 20:12, Mario Limonciello wrote: > > > On 4/9/26 06:42, Christian König wrote: >> On 4/9/26 02:05, Geramy Loveless wrote: >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >>> Thunderbolt the TB driver receives no notification and the tunnel >>> stays up while the endpoint is unreachable. >> >> IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable. >> >>> All subsequent PCIe >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >>> triggering an infinite reset loop that hangs the system. >> >> That sounds more like the MODE1 reset failed. >> >>> After MODE1 reset completes, check whether the PCIe endpoint is still >>> reachable using pci_device_is_present(). If the device is behind >>> Thunderbolt and the link is dead, walk up parent bridges calling >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >>> inside the dock. >> >> Well that is then a bus reset. >> >> I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fails in the first place? >> >>> If recovery fails, return -ENODEV to prevent the >>> reset retry loop. >>> >>> This also causes the GPU fan to be at 100% and basically when it >>> happens and you are not there, you now have a GPU with fan at 100% and >>> cant reset it. >>> I wanted to notate some other things I am finding sometimes before >>> this adventure of patches to the kernel and amdgpu driver. >>> Sometimes a crash could happen in the drive and then the GPU fan speed >>> hits 100% and the air is hot coming out without any workload, other >>> times >>> I have seen it have barely any fan speed at all and heat up more than >>> it should at the fan level its curently operating at. These are things >>> I have seen with this gpu in a TB5 dock with the driver and >>> instability. I'm not sure exactly whats going on there but I figured >>> since im communicating with these patches I might as well bring you up >>> to speed and supermario has been great help throughout me trying to >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >>> USB4v2 dock! >> >> Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset. >> >> But all of that are only symptoms. Question is what is actually going on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the fan. I think in addition to finding and fixing the real root cause having a reproducible workload to cause the crash is a good opportunity to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a harder reset. At least in the path that we have GPU recovery (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do a full device reset makes sense to me. Well I just realized that Pierre-Eric is already working on that and I've forgotten to add him to the mail thread. 
But the general idea is that when you can't recover the GPU that the driver send a WEDGE udev event noting that a GPU recovery didn't worked and it basically needs a bus reset. > I think the placement is wrong though. amdgpu_device_mode1_reset() has a bunch of callers, and if you end up with a mode1 reset doing a full reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not sure if that jives with your stack trace though. The MODE1 reset is already the full reset in this case. > Furthermore; even though you reproduced this on Thunderbolt; I have no reason to believe it's specific to thunderbolt. An SMU crash can happen in any hardware. We may as well try full reset for recovery for any hardware. Yeah agree, we need some more general WEDGE event handling. E.g. basically what Pierre-Eric is already working on. Regards, Christian. > >> >>> >>> It seems to be finally working with bar resizing after my kernel >>> patch. Which allows you to safely release a empty switch bridge at the >>> device end. >>> Then it rebuilds it afterwords with the increased bar. This was done >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >>> branch with my patch here. >>> >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u >> >> Where is the MMIO register BAR before and after the rebuild? >> >> Regards, >> Christian. >> >>> >>> Thank you! >>> >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >>> 1 file changed, 40 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 31a60173c..91d01d538 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev) >>> /* ensure no_hw_access is updated before we access hw */ >>> smp_mb(); >>> + /* >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >>> + * endpoint but the TB tunnel stays up unaware. Detect the >>> + * dead link and attempt recovery by resetting parent bridges >>> + * to retrain the physical PCIe link inside the dock. 
>>> + */ >>> + if (!pci_device_is_present(adev->pdev) && >>> + pci_is_thunderbolt_attached(adev->pdev)) { >>> + struct pci_dev *bridge; >>> + bool recovered = false; >>> + >>> + dev_info(adev->dev, >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n"); >>> + >>> + bridge = pci_upstream_bridge(adev->pdev); >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { >>> + dev_info(adev->dev, >>> + "attempting link recovery via %s\n", >>> + pci_name(bridge)); >>> + pci_bridge_secondary_bus_reset(bridge); >>> + msleep(100); >>> + if (pci_device_is_present(adev->pdev)) { >>> + recovered = true; >>> + break; >>> + } >>> + bridge = pci_upstream_bridge(bridge); >>> + } >>> + >>> + if (!recovered) { >>> + dev_err(adev->dev, >>> + "Thunderbolt PCIe link recovery failed\n"); >>> + ret = -ENODEV; >>> + goto mode1_reset_failed; >>> + } >>> + >>> + dev_info(adev->dev, >>> + "Thunderbolt PCIe link recovered via %s\n", >>> + pci_name(bridge)); >>> + } >>> + >>> amdgpu_device_load_pci_state(adev->pdev); >>> ret = amdgpu_psp_wait_for_bootloader(adev); >>> if (ret) >>> -- >>> 2.51.0 >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-09 18:12 ` Mario Limonciello 2026-04-10 0:12 ` Geramy Loveless 2026-04-10 9:13 ` Christian König @ 2026-04-10 11:25 ` Lazar, Lijo 2026-04-10 19:17 ` Geramy Loveless 2 siblings, 1 reply; 14+ messages in thread From: Lazar, Lijo @ 2026-04-10 11:25 UTC (permalink / raw) To: Mario Limonciello, Christian König, Geramy Loveless, amd-gfx Cc: alexander.deucher On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > On 4/9/26 06:42, Christian König wrote: >> On 4/9/26 02:05, Geramy Loveless wrote: >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on >>> Thunderbolt the TB driver receives no notification and the tunnel >>> stays up while the endpoint is unreachable. >> >> IIRC a MODE1 reset should keep the bus active and so the endpoint >> should still be reachable. >> >>> All subsequent PCIe >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, >>> triggering an infinite reset loop that hangs the system. >> >> That sounds more like the MODE1 reset failed. >> >>> After MODE1 reset completes, check whether the PCIe endpoint is still >>> reachable using pci_device_is_present(). If the device is behind >>> Thunderbolt and the link is dead, walk up parent bridges calling >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link >>> inside the dock. >> >> Well that is then a bus reset. >> >> I mean that is a reasonable mitigation when a MODE1 reset failed, but >> the question is rather why does the MODE1 reset fails in the first place? >> >>> If recovery fails, return -ENODEV to prevent the >>> reset retry loop. >>> >>> This also causes the GPU fan to be at 100% and basically when it >>> happens and you are not there, you now have a GPU with fan at 100% and >>> cant reset it. >>> I wanted to notate some other things I am finding sometimes before >>> this adventure of patches to the kernel and amdgpu driver. >>> Sometimes a crash could happen in the drive and then the GPU fan speed >>> hits 100% and the air is hot coming out without any workload, other >>> times >>> I have seen it have barely any fan speed at all and heat up more than >>> it should at the fan level its curently operating at. These are things >>> I have seen with this gpu in a TB5 dock with the driver and >>> instability. I'm not sure exactly whats going on there but I figured >>> since im communicating with these patches I might as well bring you up >>> to speed and supermario has been great help throughout me trying to >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / >>> USB4v2 dock! >> >> Adding Mario as well. That strongly sounds like you crashed the SMU >> which would also explain the failed MODE1 reset. >> >> But all of that are only symptoms. Question is what is actually going >> on here? e.g. what is the root cause? > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > fan. I think in addition to finding and fixing the real root cause > having a reproducible workload to cause the crash is a good opportunity > to try to put in place better recovery too. > > Generally speaking I like the idea of if a mode1 reset fails to do a > harder reset. At least in the path that we have GPU recovery > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > a full device reset makes sense to me. > > I think the placement is wrong though. 
amdgpu_device_mode1_reset() has > a bunch of callers, and if you end up with a mode1 reset doing a full > reset that might be a surprise to those callers. > > So I think a more logical place to put this would be explicitly in the > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > mode1 reset failure you can: > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > And then the GPU recovery path can jump right into a full reset? Not > sure if that jives with your stack trace though. > > Furthermore; even though you reproduced this on Thunderbolt; I have no > reason to believe it's specific to thunderbolt. An SMU crash can happen > in any hardware. We may as well try full reset for recovery for any > hardware. FWIW, if SMU crashes then SBR also shouldn't work since SBR handling needs some firmware support as well. A kernel module triggering chain-reset by going one level up and resetting all devices under the bridge (in a while loop) also doesn't look like an acceptable solution. Thanks, Lijo > >> >>> >>> It seems to be finally working with bar resizing after my kernel >>> patch. Which allows you to safely release a empty switch bridge at the >>> device end. >>> Then it rebuilds it afterwords with the increased bar. This was done >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource >>> branch with my patch here. >>> >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u >> >> Where is the MMIO register BAR before and after the rebuild? >> >> Regards, >> Christian. >> >>> >>> Thank you! >>> >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ >>> 1 file changed, 40 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 31a60173c..91d01d538 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct >>> amdgpu_device *adev) >>> /* ensure no_hw_access is updated before we access hw */ >>> smp_mb(); >>> + /* >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe >>> + * endpoint but the TB tunnel stays up unaware. Detect the >>> + * dead link and attempt recovery by resetting parent bridges >>> + * to retrain the physical PCIe link inside the dock. 
>>> + */ >>> + if (!pci_device_is_present(adev->pdev) && >>> + pci_is_thunderbolt_attached(adev->pdev)) { >>> + struct pci_dev *bridge; >>> + bool recovered = false; >>> + >>> + dev_info(adev->dev, >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt >>> recovery\n"); >>> + >>> + bridge = pci_upstream_bridge(adev->pdev); >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { >>> + dev_info(adev->dev, >>> + "attempting link recovery via %s\n", >>> + pci_name(bridge)); >>> + pci_bridge_secondary_bus_reset(bridge); >>> + msleep(100); >>> + if (pci_device_is_present(adev->pdev)) { >>> + recovered = true; >>> + break; >>> + } >>> + bridge = pci_upstream_bridge(bridge); >>> + } >>> + >>> + if (!recovered) { >>> + dev_err(adev->dev, >>> + "Thunderbolt PCIe link recovery failed\n"); >>> + ret = -ENODEV; >>> + goto mode1_reset_failed; >>> + } >>> + >>> + dev_info(adev->dev, >>> + "Thunderbolt PCIe link recovered via %s\n", >>> + pci_name(bridge)); >>> + } >>> + >>> amdgpu_device_load_pci_state(adev->pdev); >>> ret = amdgpu_psp_wait_for_bootloader(adev); >>> if (ret) >>> -- >>> 2.51.0 >> > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 11:25 ` Lazar, Lijo @ 2026-04-10 19:17 ` Geramy Loveless 2026-04-10 22:42 ` Geramy Loveless 0 siblings, 1 reply; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 19:17 UTC (permalink / raw) To: Lazar, Lijo Cc: Mario Limonciello, Christian König, amd-gfx, alexander.deucher, Cristian Cocos It seems there is another person having the same problems im having or at least similar. I am going to loop his logs in here and add him maybe we can tackle this together and find the underlying problem easier or faster. Its always better to have two logs from too different points sometimes you get things in one you dont get in the other out of pure chance, haha. https://pcforum.amd.com/s/question/0D5Pd00001S3Av9KAF/linux-9060xt-egpuoverthunderbolt-bugs-galore Let me know if I can be of use, or if you need extra bandwidth for making patches point my in the direction. Thanks everyone! On Fri, Apr 10, 2026 at 4:25 AM Lazar, Lijo <lijo.lazar@amd.com> wrote: > > > > On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > > > > On 4/9/26 06:42, Christian König wrote: > >> On 4/9/26 02:05, Geramy Loveless wrote: > >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > >>> Thunderbolt the TB driver receives no notification and the tunnel > >>> stays up while the endpoint is unreachable. > >> > >> IIRC a MODE1 reset should keep the bus active and so the endpoint > >> should still be reachable. > >> > >>> All subsequent PCIe > >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > >>> triggering an infinite reset loop that hangs the system. > >> > >> That sounds more like the MODE1 reset failed. > >> > >>> After MODE1 reset completes, check whether the PCIe endpoint is still > >>> reachable using pci_device_is_present(). If the device is behind > >>> Thunderbolt and the link is dead, walk up parent bridges calling > >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > >>> inside the dock. > >> > >> Well that is then a bus reset. > >> > >> I mean that is a reasonable mitigation when a MODE1 reset failed, but > >> the question is rather why does the MODE1 reset fails in the first place? > >> > >>> If recovery fails, return -ENODEV to prevent the > >>> reset retry loop. > >>> > >>> This also causes the GPU fan to be at 100% and basically when it > >>> happens and you are not there, you now have a GPU with fan at 100% and > >>> cant reset it. > >>> I wanted to notate some other things I am finding sometimes before > >>> this adventure of patches to the kernel and amdgpu driver. > >>> Sometimes a crash could happen in the drive and then the GPU fan speed > >>> hits 100% and the air is hot coming out without any workload, other > >>> times > >>> I have seen it have barely any fan speed at all and heat up more than > >>> it should at the fan level its curently operating at. These are things > >>> I have seen with this gpu in a TB5 dock with the driver and > >>> instability. I'm not sure exactly whats going on there but I figured > >>> since im communicating with these patches I might as well bring you up > >>> to speed and supermario has been great help throughout me trying to > >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > >>> USB4v2 dock! > >> > >> Adding Mario as well. That strongly sounds like you crashed the SMU > >> which would also explain the failed MODE1 reset. > >> > >> But all of that are only symptoms. 
Question is what is actually going > >> on here? e.g. what is the root cause? > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > fan. I think in addition to finding and fixing the real root cause > > having a reproducible workload to cause the crash is a good opportunity > > to try to put in place better recovery too. > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > harder reset. At least in the path that we have GPU recovery > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > a full device reset makes sense to me. > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > a bunch of callers, and if you end up with a mode1 reset doing a full > > reset that might be a surprise to those callers. > > > > So I think a more logical place to put this would be explicitly in the > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > mode1 reset failure you can: > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > And then the GPU recovery path can jump right into a full reset? Not > > sure if that jives with your stack trace though. > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > in any hardware. We may as well try full reset for recovery for any > > hardware. > > FWIW, if SMU crashes then SBR also shouldn't work since SBR handling > needs some firmware support as well. > > A kernel module triggering chain-reset by going one level up and > resetting all devices under the bridge (in a while loop) also doesn't > look like an acceptable solution. > > Thanks, > Lijo > > > > >> > >>> > >>> It seems to be finally working with bar resizing after my kernel > >>> patch. Which allows you to safely release a empty switch bridge at the > >>> device end. > >>> Then it rebuilds it afterwords with the increased bar. This was done > >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > >>> branch with my patch here. > >>> > >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- > >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > >> > >> Where is the MMIO register BAR before and after the rebuild? > >> > >> Regards, > >> Christian. > >> > >>> > >>> Thank you! > >>> > >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > >>> 1 file changed, 40 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> index 31a60173c..91d01d538 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > >>> amdgpu_device *adev) > >>> /* ensure no_hw_access is updated before we access hw */ > >>> smp_mb(); > >>> + /* > >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > >>> + * endpoint but the TB tunnel stays up unaware. Detect the > >>> + * dead link and attempt recovery by resetting parent bridges > >>> + * to retrain the physical PCIe link inside the dock. 
> >>> + */ > >>> + if (!pci_device_is_present(adev->pdev) && > >>> + pci_is_thunderbolt_attached(adev->pdev)) { > >>> + struct pci_dev *bridge; > >>> + bool recovered = false; > >>> + > >>> + dev_info(adev->dev, > >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt > >>> recovery\n"); > >>> + > >>> + bridge = pci_upstream_bridge(adev->pdev); > >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { > >>> + dev_info(adev->dev, > >>> + "attempting link recovery via %s\n", > >>> + pci_name(bridge)); > >>> + pci_bridge_secondary_bus_reset(bridge); > >>> + msleep(100); > >>> + if (pci_device_is_present(adev->pdev)) { > >>> + recovered = true; > >>> + break; > >>> + } > >>> + bridge = pci_upstream_bridge(bridge); > >>> + } > >>> + > >>> + if (!recovered) { > >>> + dev_err(adev->dev, > >>> + "Thunderbolt PCIe link recovery failed\n"); > >>> + ret = -ENODEV; > >>> + goto mode1_reset_failed; > >>> + } > >>> + > >>> + dev_info(adev->dev, > >>> + "Thunderbolt PCIe link recovered via %s\n", > >>> + pci_name(bridge)); > >>> + } > >>> + > >>> amdgpu_device_load_pci_state(adev->pdev); > >>> ret = amdgpu_psp_wait_for_bootloader(adev); > >>> if (ret) > >>> -- > >>> 2.51.0 > >> > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
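To make the fallback Mario describes above more concrete: the idea is that when a mode1 reset fails and GPU recovery is enabled, the recovery path should escalate to a full device reset rather than retrying mode1 forever. A rough sketch follows; the helper name and exact call site are assumptions, not code from this thread:

	/*
	 * Hypothetical escalation path: if mode1 reset fails while
	 * amdgpu.gpu_recovery is enabled, flag the reset context so
	 * amdgpu_device_gpu_recover() takes the full-reset path
	 * instead of looping on mode1.
	 */
	static int amdgpu_device_mode1_or_full_reset(struct amdgpu_device *adev,
						     struct amdgpu_reset_context *reset_context)
	{
		int r = amdgpu_device_mode1_reset(adev);

		if (r && amdgpu_gpu_recovery) {
			dev_warn(adev->dev,
				 "mode1 reset failed (%d), requesting full reset\n", r);
			set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
			r = 0; /* let the recovery path continue into the full reset */
		}

		return r;
	}

Keeping the escalation in a wrapper like this would leave amdgpu_device_mode1_reset() itself unchanged for its other callers, which was the concern raised about placing the fallback inside the mode1 routine.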
* Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset 2026-04-10 19:17 ` Geramy Loveless @ 2026-04-10 22:42 ` Geramy Loveless 0 siblings, 0 replies; 14+ messages in thread From: Geramy Loveless @ 2026-04-10 22:42 UTC (permalink / raw) To: Lazar, Lijo Cc: Mario Limonciello, Christian König, amd-gfx, alexander.deucher, Cristian Cocos
I found something really interesting: dev_is_removable() is not working on my Minisforum MS-S1. It reports the gfx card as not removable, even though it should get the HotPlug+ from the Thunderbolt link, but it doesn't. I bet if you also add || pci_is_thunderbolt_attached it could clear up a lot of problems, just my two cents. :) I am running some tests now.
On Fri, Apr 10, 2026 at 12:17 PM Geramy Loveless <gloveless@jqluv.com> wrote: > > It seems there is another person having the same problems im having or > at least similar. > I am going to loop his logs in here and add him maybe we can tackle > this together and find the underlying problem easier or faster. > Its always better to have two logs from too different points sometimes > you get things in one you dont get in the other out of pure chance, > haha. > > https://pcforum.amd.com/s/question/0D5Pd00001S3Av9KAF/linux-9060xt-egpuoverthunderbolt-bugs-galore > > Let me know if I can be of use, or if you need extra bandwidth for > making patches point my in the direction. > > Thanks everyone! > > On Fri, Apr 10, 2026 at 4:25 AM Lazar, Lijo <lijo.lazar@amd.com> wrote: > > > > > > > > On 09-Apr-26 11:42 PM, Mario Limonciello wrote: > > > > > > > > > On 4/9/26 06:42, Christian König wrote: > > >> On 4/9/26 02:05, Geramy Loveless wrote: > > >>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1 on > > >>> Thunderbolt the TB driver receives no notification and the tunnel > > >>> stays up while the endpoint is unreachable. > > >> > > >> IIRC a MODE1 reset should keep the bus active and so the endpoint > > >> should still be reachable. > > >> > > >>> All subsequent PCIe > > >>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize, > > >>> triggering an infinite reset loop that hangs the system. > > >> > > >> That sounds more like the MODE1 reset failed. > > >> > > >>> After MODE1 reset completes, check whether the PCIe endpoint is still > > >>> reachable using pci_device_is_present(). If the device is behind > > >>> Thunderbolt and the link is dead, walk up parent bridges calling > > >>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link > > >>> inside the dock. > > >> > > >> Well that is then a bus reset. > > >> > > >> I mean that is a reasonable mitigation when a MODE1 reset failed, but > > >> the question is rather why does the MODE1 reset fails in the first place? > > >> > > >>> If recovery fails, return -ENODEV to prevent the > > >>> reset retry loop. > > >>> > > >>> This also causes the GPU fan to be at 100% and basically when it > > >>> happens and you are not there, you now have a GPU with fan at 100% and > > >>> cant reset it. > > >>> I wanted to notate some other things I am finding sometimes before > > >>> this adventure of patches to the kernel and amdgpu driver. > > >>> Sometimes a crash could happen in the drive and then the GPU fan speed > > >>> hits 100% and the air is hot coming out without any workload, other > > >>> times > > >>> I have seen it have barely any fan speed at all and heat up more than > > >>> it should at the fan level its curently operating at.
These are things > > >>> I have seen with this gpu in a TB5 dock with the driver and > > >>> instability. I'm not sure exactly whats going on there but I figured > > >>> since im communicating with these patches I might as well bring you up > > >>> to speed and supermario has been great help throughout me trying to > > >>> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 / > > >>> USB4v2 dock! > > >> > > >> Adding Mario as well. That strongly sounds like you crashed the SMU > > >> which would also explain the failed MODE1 reset. > > >> > > >> But all of that are only symptoms. Question is what is actually going > > >> on here? e.g. what is the root cause? > > > > > > We don't spend a lot of time in recovery scenarios for when 💩 hits the > > > fan. I think in addition to finding and fixing the real root cause > > > having a reproducible workload to cause the crash is a good opportunity > > > to try to put in place better recovery too. > > > > > > Generally speaking I like the idea of if a mode1 reset fails to do a > > > harder reset. At least in the path that we have GPU recovery > > > (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do > > > a full device reset makes sense to me. > > > > > > I think the placement is wrong though. amdgpu_device_mode1_reset() has > > > a bunch of callers, and if you end up with a mode1 reset doing a full > > > reset that might be a surprise to those callers. > > > > > > So I think a more logical place to put this would be explicitly in the > > > GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the > > > mode1 reset failure you can: > > > > > > set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags); > > > > > > And then the GPU recovery path can jump right into a full reset? Not > > > sure if that jives with your stack trace though. > > > > > > Furthermore; even though you reproduced this on Thunderbolt; I have no > > > reason to believe it's specific to thunderbolt. An SMU crash can happen > > > in any hardware. We may as well try full reset for recovery for any > > > hardware. > > > > FWIW, if SMU crashes then SBR also shouldn't work since SBR handling > > needs some firmware support as well. > > > > A kernel module triggering chain-reset by going one level up and > > resetting all devices under the bridge (in a while loop) also doesn't > > look like an acceptable solution. > > > > Thanks, > > Lijo > > > > > > > >> > > >>> > > >>> It seems to be finally working with bar resizing after my kernel > > >>> patch. Which allows you to safely release a empty switch bridge at the > > >>> device end. > > >>> Then it rebuilds it afterwords with the increased bar. This was done > > >>> on Kernel 7.0-rc7 i believe it is and latest changes from pci/resource > > >>> branch with my patch here. > > >>> > > >>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF- > > >>> tmThxOvk2XUDpEzw@mail.gmail.com/T/#u > > >> > > >> Where is the MMIO register BAR before and after the rebuild? > > >> > > >> Regards, > > >> Christian. > > >> > > >>> > > >>> Thank you! 
> > >>> > > >>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com> > > >>> --- > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++ > > >>> 1 file changed, 40 insertions(+) > > >>> > > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> index 31a60173c..91d01d538 100644 > > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > >>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct > > >>> amdgpu_device *adev) > > >>> /* ensure no_hw_access is updated before we access hw */ > > >>> smp_mb(); > > >>> + /* > > >>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe > > >>> + * endpoint but the TB tunnel stays up unaware. Detect the > > >>> + * dead link and attempt recovery by resetting parent bridges > > >>> + * to retrain the physical PCIe link inside the dock. > > >>> + */ > > >>> + if (!pci_device_is_present(adev->pdev) && > > >>> + pci_is_thunderbolt_attached(adev->pdev)) { > > >>> + struct pci_dev *bridge; > > >>> + bool recovered = false; > > >>> + > > >>> + dev_info(adev->dev, > > >>> + "PCIe link lost after mode1 reset, attempting Thunderbolt > > >>> recovery\n"); > > >>> + > > >>> + bridge = pci_upstream_bridge(adev->pdev); > > >>> + while (bridge && !pci_is_root_bus(bridge->bus)) { > > >>> + dev_info(adev->dev, > > >>> + "attempting link recovery via %s\n", > > >>> + pci_name(bridge)); > > >>> + pci_bridge_secondary_bus_reset(bridge); > > >>> + msleep(100); > > >>> + if (pci_device_is_present(adev->pdev)) { > > >>> + recovered = true; > > >>> + break; > > >>> + } > > >>> + bridge = pci_upstream_bridge(bridge); > > >>> + } > > >>> + > > >>> + if (!recovered) { > > >>> + dev_err(adev->dev, > > >>> + "Thunderbolt PCIe link recovery failed\n"); > > >>> + ret = -ENODEV; > > >>> + goto mode1_reset_failed; > > >>> + } > > >>> + > > >>> + dev_info(adev->dev, > > >>> + "Thunderbolt PCIe link recovered via %s\n", > > >>> + pci_name(bridge)); > > >>> + } > > >>> + > > >>> amdgpu_device_load_pci_state(adev->pdev); > > >>> ret = amdgpu_psp_wait_for_bootloader(adev); > > >>> if (ret) > > >>> -- > > >>> 2.51.0 > > >> > > > > > ^ permalink raw reply [flat|nested] 14+ messages in thread
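On the dev_is_removable() observation above: the suggestion is that code which keys decisions off removability could also accept a Thunderbolt-attached device, since on the MS-S1 the eGPU is reported as non-removable even though it sits behind a USB4/Thunderbolt tunnel. A tiny sketch of that combined check, with the helper name and any amdgpu call sites being assumptions:

	#include <linux/device.h>
	#include <linux/pci.h>

	/*
	 * Hypothetical helper: treat a device as externally attached if the
	 * platform marks it removable, or if it sits behind a Thunderbolt
	 * (USB4 PCIe tunnel) link that the platform failed to describe.
	 */
	static bool amdgpu_pdev_is_external(struct pci_dev *pdev)
	{
		return dev_is_removable(&pdev->dev) ||
		       pci_is_thunderbolt_attached(pdev);
	}

Whether patching callers like this is preferable to fixing why dev_is_removable() misreports the MS-S1 is still open; the tests mentioned above should help decide.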
end of thread, other threads:[~2026-04-13 8:03 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09  0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless
2026-04-09 11:42 ` Christian König
2026-04-09 12:13 ` Geramy Loveless
2026-04-09 12:57 ` Christian König
2026-04-09 13:12 ` Geramy Loveless
2026-04-09 13:45 ` Christian König
2026-04-09 18:12 ` Mario Limonciello
2026-04-10  0:12 ` Geramy Loveless
2026-04-10  6:22 ` Geramy Loveless
2026-04-10  7:07 ` Geramy Loveless
2026-04-10  9:13 ` Christian König
2026-04-10 11:25 ` Lazar, Lijo
2026-04-10 19:17 ` Geramy Loveless
2026-04-10 22:42 ` Geramy Loveless