From: "Christian König" <christian.koenig@amd.com>
To: Geramy Loveless <gloveless@jqluv.com>,
	amd-gfx@lists.freedesktop.org,
	Mario Limonciello <mario.limonciello@amd.com>
Cc: alexander.deucher@amd.com
Subject: Re: [PATCH] amdgpu: recover Thunderbolt PCIe link after MODE1 GPU reset
Date: Thu, 9 Apr 2026 13:42:36 +0200
Message-ID: <243af06e-912b-4915-bc64-5aa16dad7db0@amd.com>
In-Reply-To: <CAGpo2mebCaP4vFuTnn6jgu6OjjE_ssS7i8ENepuUjwwHXddCHA@mail.gmail.com>

On 4/9/26 02:05, Geramy Loveless wrote:
> When an AMD GPU behind a Thunderbolt PCIe tunnel undergoes a MODE1
> reset, the TB driver receives no notification and the tunnel
> stays up while the endpoint is unreachable.

IIRC a MODE1 reset should keep the bus active, so the endpoint should still be reachable.

> All subsequent PCIe
> reads return 0xFFFFFFFF and MES firmware cannot reinitialize,
> triggering an infinite reset loop that hangs the system.

That sounds more like the MODE1 reset failed.
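
For reference, reads coming back as all ones are the usual signature of an endpoint that no longer answers. A minimal userspace sketch of that check follows, assuming only that the first config dword carries the vendor ID in its low 16 bits. The helper names are made up for illustration, and this is a simplification, not what pci_device_is_present() does internally:

```c
#include <stdbool.h>
#include <stdint.h>

/* Value a PCIe read typically returns when nothing completes the request. */
#define PCI_ALL_ONES 0xFFFFFFFFu

/* Hypothetical helper: true if a 32-bit register read looks like a dead link. */
static bool read_looks_dead(uint32_t value)
{
	return value == PCI_ALL_ONES;
}

/*
 * Hypothetical helper: judge presence from the first config dword
 * (vendor ID in bits 15:0, device ID in bits 31:16). A vendor ID of
 * 0xFFFF is treated as "no device responding".
 */
static bool endpoint_looks_present(uint32_t id_dword)
{
	uint16_t vendor = (uint16_t)(id_dword & 0xFFFFu);

	return vendor != 0xFFFFu;
}
```

A read of 0x12341002 (AMD vendor ID 0x1002 with an arbitrary example device ID) passes the presence check, while 0xFFFFFFFF does not.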

> After MODE1 reset completes, check whether the PCIe endpoint is still
> reachable using pci_device_is_present(). If the device is behind
> Thunderbolt and the link is dead, walk up parent bridges calling
> pci_bridge_secondary_bus_reset() to retrain the physical PCIe link
> inside the dock.

Well that is then a bus reset.

I mean, that is a reasonable mitigation when a MODE1 reset has failed, but the question is rather: why does the MODE1 reset fail in the first place?
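
To make the escalation explicit, here is a small userspace model of the bridge walk described in the quoted text: try the nearest bridge first, move upstream, stop below the root bus. The struct, the field names and the reset_revives_endpoint knob are all hypothetical stand-ins for simulation; no kernel API is involved:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bridge on the path from endpoint to root. */
struct model_bridge {
	const char *name;
	const struct model_bridge *upstream; /* NULL above the root port */
	bool on_root_bus;                    /* true for the root port */
	bool reset_revives_endpoint;         /* simulation knob only */
};

/*
 * Walk from the nearest bridge towards the root. Each step models
 * "reset this bridge's secondary bus, then probe the endpoint".
 * Returns the first bridge whose reset revives the endpoint, or NULL
 * if the walk reaches the root bus without success.
 */
static const struct model_bridge *
recover_via_bridges(const struct model_bridge *nearest)
{
	const struct model_bridge *b;

	for (b = nearest; b && !b->on_root_bus; b = b->upstream) {
		if (b->reset_revives_endpoint)
			return b;
	}
	return NULL;
}
```

A NULL result corresponds to the -ENODEV path in the patch. Note that the root port itself is never reset in this model, matching the loop condition in the quoted code.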

> If recovery fails, return -ENODEV to prevent the
> reset retry loop.
> 
> This also causes the GPU fan to run at 100%, and if it happens while
> you are away, you are left with a GPU whose fan is stuck at 100% and
> that can't be reset.
> I also wanted to note some other things I sometimes saw before this
> adventure of patches to the kernel and the amdgpu driver.
> Sometimes a crash happens in the driver, and then the GPU fan speed
> hits 100% and hot air comes out without any workload. Other times I
> have seen it run with barely any fan speed at all and heat up more
> than it should at the fan level it's currently operating at. These are
> things I have seen with this GPU in a TB5 dock, with the driver being
> unstable. I'm not sure exactly what's going on there, but I figured
> since I'm communicating with these patches I might as well bring you
> up to speed. supermario has been a great help throughout me trying to
> get the AMD AI R9700 Pro working on my MS-S1 Halo Strix with a TB5 /
> USB4v2 dock!

Adding Mario as well. That strongly sounds like you crashed the SMU, which would also explain the failed MODE1 reset.

But all of those are only symptoms. The question is what is actually going on here, i.e. what is the root cause?

> 
> It finally seems to be working with BAR resizing after my kernel
> patch, which allows you to safely release an empty switch bridge at
> the device end and then rebuilds it afterwards with the increased BAR.
> This was done on kernel 7.0-rc7, I believe, with the latest changes
> from the pci/resource branch and my patch here:
> 
> https://lore.kernel.org/linux-pci/CAGpo2meKY6SXsESU-D0PGgbESLqdF8UBF-tmThxOvk2XUDpEzw@mail.gmail.com/T/#u

Where is the MMIO register BAR before and after the rebuild?
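
One way to answer that is to capture /sys/bus/pci/devices/<BDF>/resource before and after the rebuild; each line there is "start end flags" in hex, one line per resource. A minimal sketch for parsing such a line, assuming that format (the struct and function names are hypothetical; on recent ASICs the register MMIO BAR should be BAR 5, if I remember correctly):

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One parsed line of the sysfs "resource" file. */
struct bar_range {
	uint64_t start;
	uint64_t end;
	uint64_t flags;
};

/* Parse a "0x<start> 0x<end> 0x<flags>" line; returns false on malformed input. */
static bool parse_resource_line(const char *line, struct bar_range *out)
{
	return sscanf(line, "%" SCNx64 " %" SCNx64 " %" SCNx64,
		      &out->start, &out->end, &out->flags) == 3;
}

/* Size in bytes; an all-zero range is treated as an unassigned BAR. */
static uint64_t bar_size(const struct bar_range *r)
{
	if (r->start == 0 && r->end == 0)
		return 0;
	return r->end - r->start + 1;
}
```

Comparing the parsed start and size of the register BAR line from before and after the rebuild would show whether it moved or was left unassigned.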

Regards,
Christian.

> 
> Thank you!
> 
> Signed-off-by: Geramy Loveless <gloveless@jqluv.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 ++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 31a60173c..91d01d538 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5770,6 +5770,46 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
>  	/* ensure no_hw_access is updated before we access hw */
>  	smp_mb();
> +	/*
> +	 * On Thunderbolt-attached GPUs, MODE1 reset kills the PCIe
> +	 * endpoint but the TB tunnel stays up unaware. Detect the
> +	 * dead link and attempt recovery by resetting parent bridges
> +	 * to retrain the physical PCIe link inside the dock.
> +	 */
> +	if (!pci_device_is_present(adev->pdev) &&
> +	    pci_is_thunderbolt_attached(adev->pdev)) {
> +		struct pci_dev *bridge;
> +		bool recovered = false;
> +
> +		dev_info(adev->dev,
> +			 "PCIe link lost after mode1 reset, attempting Thunderbolt recovery\n");
> +
> +		bridge = pci_upstream_bridge(adev->pdev);
> +		while (bridge && !pci_is_root_bus(bridge->bus)) {
> +			dev_info(adev->dev,
> +				 "attempting link recovery via %s\n",
> +				 pci_name(bridge));
> +			pci_bridge_secondary_bus_reset(bridge);
> +			msleep(100);
> +			if (pci_device_is_present(adev->pdev)) {
> +				recovered = true;
> +				break;
> +			}
> +			bridge = pci_upstream_bridge(bridge);
> +		}
> +
> +		if (!recovered) {
> +			dev_err(adev->dev,
> +				"Thunderbolt PCIe link recovery failed\n");
> +			ret = -ENODEV;
> +			goto mode1_reset_failed;
> +		}
> +
> +		dev_info(adev->dev,
> +			 "Thunderbolt PCIe link recovered via %s\n",
> +			 pci_name(bridge));
> +	}
> +
>  	amdgpu_device_load_pci_state(adev->pdev);
>  	ret = amdgpu_psp_wait_for_bootloader(adev);
>  	if (ret)
> --
> 2.51.0

