From: "Ionut Nechita (Sunlight Linux)" <sunlightlinux@gmail.com>
To: "Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Mario Limonciello" <superm1@kernel.org>,
	"Ionut Nechita" <ionut_n2001@yahoo.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
Date: Tue,  6 Jan 2026 14:59:33 +0200
Message-ID: <20260106125929.25214-6-sunlightlinux@gmail.com>
In-Reply-To: <20260106125929.25214-3-sunlightlinux@gmail.com>

From: Ionut Nechita <ionut_n2001@yahoo.com>

After resuming from hibernation, the amdgpu driver hits TLB flush
failures and logs errors such as:

  amdgpu: TLB flush failed for PASID xxxxx
  amdgpu: failed to write reg 28b4 wait reg 28c6
  amdgpu: failed to write reg 1a6f4 wait reg 1a706

Root Cause:
-----------
The KIQ (Kernel Interface Queue) ring is marked as ready
(ring.sched.ready = true) during resume, but the hardware is not
fully functional yet. When TLB invalidation attempts to use KIQ
for register access, the commands fail because the GPU hasn't
completed initialization.

Solution:
---------
1. Add a resume_gpu_stable flag (set to false on resume)
2. Force TLB invalidation to use the direct MMIO path instead of KIQ
   while resume_gpu_stable is false
3. After ring tests pass in gfx_v9_0_cp_resume(), set
   resume_gpu_stable to true
4. From that point on, use the faster KIQ path for TLB invalidation

This ensures TLB flushes work correctly during early resume while
still benefiting from KIQ-based invalidation after the GPU is stable.
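
The gating described above can be modelled outside the kernel. This is
only a minimal sketch of the intended state machine, not driver code:
struct model_dev, enum flush_path and the three helper functions are
hypothetical names made up for illustration. Only the
resume_gpu_stable flag and the ring "ready" condition come from this
patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of device state involved. */
struct model_dev {
	bool kiq_ready;          /* models kiq ring.sched.ready */
	bool resume_gpu_stable;  /* models the flag added by this patch */
};

enum flush_path { FLUSH_MMIO, FLUSH_KIQ };

/*
 * Pick the TLB flush path: KIQ is used only when the ring reports
 * ready AND the GPU has been marked stable; otherwise fall back to
 * direct MMIO.
 */
static enum flush_path pick_flush_path(const struct model_dev *d)
{
	if (d->kiq_ready && d->resume_gpu_stable)
		return FLUSH_KIQ;
	return FLUSH_MMIO;
}

/* Early resume: the ring already claims ready, the GPU is not yet trusted. */
static void model_resume_begin(struct model_dev *d)
{
	d->kiq_ready = true;
	d->resume_gpu_stable = false;
}

/* Ring tests have passed: KIQ may be used from here on. */
static void model_ring_tests_passed(struct model_dev *d)
{
	d->resume_gpu_stable = true;
}
```

In this model, pick_flush_path() returns FLUSH_MMIO between
model_resume_begin() and model_ring_tests_passed(), and FLUSH_KIQ
afterwards, which is the behaviour the four steps aim for.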

Changes:
--------
- amdgpu.h: Add resume_gpu_stable flag to amdgpu_device
- amdgpu_device.c: Initialize resume_gpu_stable to false on resume
- amdgpu_gmc.c: Check resume_gpu_stable in flush_gpu_tlb_pasid
- gfx_v9_0.c: Set resume_gpu_stable after ring tests pass
- gmc_v9_0.c: Check resume_gpu_stable before using KIQ path

Tested on AMD Cezanne (Renoir) with ROCm workloads after hibernation.
Result: Eliminates TLB flush failures on resume.

Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9f9774f58ce1c..6bf4f6c90164c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1225,6 +1225,7 @@ struct amdgpu_device {
 	bool				in_s4;
 	bool				in_s0ix;
 	suspend_state_t			last_suspend_state;
+	bool				resume_gpu_stable;
 
 	enum pp_mp1_state               mp1_state;
 	struct amdgpu_doorbell_index doorbell_index;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 12201b8e99b3f..440d86ed1e0d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5457,6 +5457,12 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		goto exit;
 	}
 
+	/*
+	 * Set resume_gpu_stable to false BEFORE KFD resume so that TLB
+	 * flushes use the direct MMIO path instead of KIQ during recovery
+	 */
+	adev->resume_gpu_stable = false;
+
 	r = amdgpu_amdkfd_resume(adev, !amdgpu_sriov_vf(adev) && !adev->in_runpm);
 	if (r)
 		goto exit;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 869bceb0fe2c6..83fe30f0ada75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -731,7 +731,12 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 	if (!down_read_trylock(&adev->reset_domain->sem))
 		return 0;
 
-	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
+	/*
+	 * After hibernation resume, KIQ may report as ready but not be fully
+	 * functional. Use direct MMIO path until GPU is confirmed stable.
+	 */
+	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
+	    !adev->resume_gpu_stable) {
 		if (adev->gmc.flush_tlb_needs_extra_type_2)
 			adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 								 2, all_hub,
@@ -835,9 +840,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
 		goto failed_kiq;
 
 	might_sleep();
+
 	while (r < 1 && cnt++ < MAX_KIQ_REG_TRY &&
 	       !amdgpu_reset_pending(adev->reset_domain)) {
-
 		msleep(MAX_KIQ_REG_BAILOUT_INTERVAL);
 		r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 0148d7ff34d99..fbd07b455b915 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3985,6 +3985,16 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 		amdgpu_ring_test_helper(ring);
 	}
 
+	/*
+	 * After successful ring tests, mark GPU as stable for resume.
+	 * This allows KIQ-based TLB invalidation to be used instead of
+	 * slower direct MMIO path.
+	 */
+	if (!adev->resume_gpu_stable) {
+		adev->resume_gpu_stable = true;
+		dev_info(adev->dev, "GPU rings verified, enabling KIQ path\n");
+	}
+
 	gfx_v9_0_enable_gui_idle_interrupt(adev, true);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 8ad7519f7b581..8a0202f6b3e3c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -855,9 +855,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 
 	/* This is necessary for SRIOV as well as for GFXOFF to function
 	 * properly under bare metal
+	 *
+	 * After hibernation resume, KIQ may report as ready but not be fully
+	 * functional. Use direct MMIO path until GPU is confirmed stable.
 	 */
 	if (adev->gfx.kiq[inst].ring.sched.ready &&
-	    (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
+	    (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
+	    adev->resume_gpu_stable) {
 		uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
 		uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
 
-- 
2.52.0


Thread overview: 15+ messages
2026-01-06 12:59 [PATCH 0/1] drm/amdgpu: Fix TLB flush failures after hibernation resume Ionut Nechita (Sunlight Linux)
2026-01-06 12:59 ` Ionut Nechita (Sunlight Linux) [this message]
2026-01-06 16:26   ` [PATCH 1/1] " Alex Deucher
2026-01-07 10:52     ` Ionut Nechita (Sunlight Linux)
2026-01-08 12:36   ` Christian König
2026-01-26 19:40     ` Ionut Nechita (Sunlight Linux)
2026-01-26 20:25       ` Alex Deucher
2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
2026-01-26 20:32           ` Mario Limonciello (AMD) (kernel.org)
2026-01-26 20:46             ` Ionut Nechita (Sunlight Linux)
2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
2026-01-27 11:35           ` Christian König
2026-02-01 19:05         ` Ionut Nechita (Sunlight Linux)
2026-02-02 10:39           ` Christian König
2026-03-24  8:07             ` Ionut Nechita (Sunlight Linux)
