From: "Christian König" <christian.koenig@amd.com>
To: Mario Limonciello <mario.limonciello@amd.com>,
Geramy Loveless <gloveless@jqluv.com>,
amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com, "Pelloux-Prayer,
Pierre-Eric" <Pierre-eric.Pelloux-prayer@amd.com>
Subject: Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset
Date: Fri, 10 Apr 2026 11:13:13 +0200 [thread overview]
Message-ID: <cd093c11-df4f-472a-93df-1d875f479745@amd.com> (raw)
In-Reply-To: <47306de6-cbf6-4b2d-847e-d1e5d933516d@amd.com>
Hi Mario,
On 4/9/26 20:12, Mario Limonciello wrote:
>
>
> On 4/9/26 06:42, Christian König wrote:
>> On 4/9/26 02:05, Geramy Loveless wrote:
>>> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1
>>> reset, the TB driver receives no notification and the tunnel stays
>>> up while the endpoint is unreachable.
>>
>> IIRC a MODE1 reset should keep the bus active and so the endpoint should still be reachable.
>>
>>> All subsequent PCIe
>>> reads return 0xFFFFFFFF and MES firmware cannot reinitialize,
>>> triggering an infinite reset loop that hangs the system.
>>
>> That sounds more like the MODE1 reset failed.
>>
>>> After MODE1 reset completes, check whether the PCIe endpoint is still
>>> reachable using pci_device_is_present(). If the device is behind
>>> Thunderbolt and the link is dead, walk up parent bridges calling
>>> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link
>>> inside the dock.
>>
>> Well that is then a bus reset.
>>
>> I mean that is a reasonable mitigation when a MODE1 reset failed, but the question is rather why does the MODE1 reset fail in the first place?
>>
>>> If recovery fails, return -ENODEV to prevent the
>>> reset retry loop.
>>>
>>> This also drives the GPU fan to 100%, and if it happens while you
>>> are not there, you end up with a GPU whose fan is stuck at 100% and
>>> cannot be reset.
>>> I also wanted to note some other things I have been seeing before
>>> this adventure of patches to the kernel and amdgpu driver.
>>> Sometimes a crash happens in the driver and the GPU fan speed hits
>>> 100% with hot air coming out under no workload; other times I have
>>> seen it with barely any fan speed at all, heating up more than it
>>> should at the fan level it is currently operating at. These are
>>> things I have seen with this GPU in a TB5 dock with the driver
>>> instability. I'm not sure exactly what is going on there, but since
>>> I'm communicating via these patches I might as well bring you up to
>>> speed. Supermario has been a great help throughout my trying to get
>>> the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 /
>>> USB4v2 dock!
>>
>> Adding Mario as well. That strongly sounds like you crashed the SMU which would also explain the failed MODE1 reset.
>>
>> But all of those are only symptoms. The question is what is actually going on here, e.g. what is the root cause?
>
> We don't spend a lot of time on recovery scenarios for when 💩 hits the fan. I think in addition to finding and fixing the real root cause, having a reproducible workload that causes the crash is a good opportunity to put better recovery in place too.
>
> Generally speaking I like the idea of if a mode1 reset fails to do a harder reset. At least in the path that we have GPU recovery (amdgpu.gpu_recovery module parameter) set, adding a fallback case to do a full device reset makes sense to me.
Well I just realized that Pierre-Eric is already working on that and I've forgotten to add him to the mail thread.
But the general idea is that when you can't recover the GPU, the driver sends a WEDGE udev event noting that GPU recovery didn't work and the device basically needs a bus reset.
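To make that idea concrete, here is a toy model of the flow (Python purely for illustration; none of these names are real kernel or udev interfaces): the driver gives up after a failed reset, emits a WEDGE event instead of retrying forever, and a separate userspace consumer performs the bus reset the driver can't do itself.

```python
# Toy model of the proposed flow: when in-driver GPU recovery fails,
# the driver emits a "WEDGE" event rather than looping on resets, and
# a consumer (think: udev rule / daemon) escalates to a bus reset.
# All names here are illustrative, not real kernel interfaces.

class ToyGpu:
    def __init__(self):
        self.events = []          # stands in for the udev event stream
        self.needs_bus_reset = False

    def mode1_reset(self):
        return False              # simulate the failing MODE1 reset

    def recover(self):
        if self.mode1_reset():
            return "recovered"
        # Don't retry forever: report that we are wedged and stop.
        self.events.append("WEDGE")
        self.needs_bus_reset = True
        return "wedged"

def userspace_consumer(gpu):
    # A udev-rule-like consumer reacts to the WEDGE event with a
    # secondary bus reset, which is outside the driver's own reach.
    if "WEDGE" in gpu.events and gpu.needs_bus_reset:
        gpu.needs_bus_reset = False
        return "bus reset performed"
    return "nothing to do"

gpu = ToyGpu()
result = gpu.recover()
action = userspace_consumer(gpu)
```

The point of the split is that the reset retry loop ends inside the driver; the escalation decision moves to whoever consumes the event.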
> I think the placement is wrong though. amdgpu_device_mode1_reset() has a bunch of callers, and if you end up with a mode1 reset doing a full reset that might be a surprise to those callers.
>
> So I think a more logical place to put this would be explicitly in the GPU recovery path (amdgpu_device_gpu_recover). Maybe as part of the mode1 reset failure you can:
>
> set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>
> And then the GPU recovery path can jump right into a full reset? Not sure if that jibes with your stack trace though.
The MODE1 reset is already the full reset in this case.
> Furthermore, even though you reproduced this on Thunderbolt, I have no reason to believe it's specific to Thunderbolt. An SMU crash can happen on any hardware, so we may as well try a full reset for recovery on any hardware.
Yeah agree, we need some more general WEDGE event handling. E.g. basically what Pierre-Eric is already working on.
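For illustration, the escalation the patch below implements, namely walk up the chain of upstream bridges, reset each secondary bus, and stop at the first reset that makes the endpoint visible again, boils down to this toy model (illustrative names only, not the PCI core API):

```python
def recover_link(bridges, device_present_after):
    """Toy model of the patch's bridge walk.

    `bridges` is the chain of upstream bridges, nearest-first;
    `device_present_after` maps a bridge name to whether resetting that
    bridge's secondary bus brings the endpoint back. The real code uses
    pci_upstream_bridge(), pci_bridge_secondary_bus_reset() and
    pci_device_is_present() for these three roles.
    """
    for bridge in bridges:
        # Reset this bridge's secondary bus, then re-check presence.
        if device_present_after.get(bridge, False):
            return bridge        # recovered via this bridge
    return None                  # walked to the root without recovery

chain = ["dock-downstream", "dock-upstream", "host-router"]
# Simulate: only resetting the dock's upstream port revives the link.
winner = recover_link(chain, {"dock-upstream": True})
```

If the walk reaches the root without the endpoint reappearing, the only sane option is to give up and report the failure, which is exactly where a WEDGE-style event would fit.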
Regards,
Christian.
>
>>
>>>
>>> BAR resizing finally seems to be working after my kernel patch,
>>> which allows you to safely release an empty switch bridge at the
>>> device end and then rebuilds it afterwards with the increased BAR.
>>> This was done on kernel 7.0-rc7, I believe, plus the latest changes
>>> from the pci/resource branch with my patch here:
>>>
>>> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u
>>
>> Where is the MMIO register BAR before and after the rebuild?
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thank you!
>>>
>>> Signed-off-by: Geramy Loveless <gloveless@jqluv.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++
>>> 1 file changed, 40 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 31a60173c..91d01d538 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
>>> /* ensure no_hw_access is updated before we access hw */
>>> smp_mb();
>>> + /*
>>> + * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe
>>> + * endpoint but the TB tunnel stays up unaware. Detect the
>>> + * dead link and attempt recovery by resetting parent bridges
>>> + * to retrain the physical PCIe link inside the dock.
>>> + */
>>> + if (!pci_device_is_present(adev->pdev) &&
>>> + pci_is_thunderbolt_attached(adev->pdev)) {
>>> + struct pci_dev *bridge;
>>> + bool recovered = false;
>>> +
>>> + dev_info(adev->dev,
>>> + "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n");
>>> +
>>> + bridge = pci_upstream_bridge(adev->pdev);
>>> + while (bridge && !pci_is_root_bus(bridge->bus)) {
>>> + dev_info(adev->dev,
>>> + "attempting link recovery via %s\n",
>>> + pci_name(bridge));
>>> + pci_bridge_secondary_bus_reset(bridge);
>>> + msleep(100);
>>> + if (pci_device_is_present(adev->pdev)) {
>>> + recovered = true;
>>> + break;
>>> + }
>>> + bridge = pci_upstream_bridge(bridge);
>>> + }
>>> +
>>> + if (!recovered) {
>>> + dev_err(adev->dev,
>>> + "Thunderbolt PCIe link recovery failed\n");
>>> + ret = -ENODEV;
>>> + goto mode1_reset_failed;
>>> + }
>>> +
>>> + dev_info(adev->dev,
>>> + "Thunderbolt PCIe link recovered via %s\n",
>>> + pci_name(bridge));
>>> + }
>>> +
>>> amdgpu_device_load_pci_state(adev->pdev);
>>> ret = amdgpu_psp_wait_for_bootloader(adev);
>>> if (ret)
>>> --
>>> 2.51.0
>>
>
Thread overview: 14+ messages
2026-04-09 0:05 [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset Geramy Loveless
2026-04-09 11:42 ` Christian König
2026-04-09 12:13 ` Geramy Loveless
2026-04-09 12:57 ` Christian König
2026-04-09 13:12 ` Geramy Loveless
2026-04-09 13:45 ` Christian König
2026-04-09 18:12 ` Mario Limonciello
2026-04-10 0:12 ` Geramy Loveless
2026-04-10 6:22 ` Geramy Loveless
2026-04-10 7:07 ` Geramy Loveless
2026-04-10 9:13 ` Christian König [this message]
2026-04-10 11:25 ` Lazar, Lijo
2026-04-10 19:17 ` Geramy Loveless
2026-04-10 22:42 ` Geramy Loveless