[PATCH 0/1] drm/amdgpu: Fix TLB flush failures after hibernation resume

public inbox for linux-kernel@vger.kernel.org

* [PATCH 0/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
@ 2026-01-06 12:59 Ionut Nechita (Sunlight Linux)
  2026-01-06 12:59 ` [PATCH 1/1] " Ionut Nechita (Sunlight Linux)
  0 siblings, 1 reply; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-06 12:59 UTC
  To: Alex Deucher, Christian König, Mario Limonciello,
	Ionut Nechita
  Cc: amd-gfx, dri-devel, linux-kernel

From: Ionut Nechita <ionut_n2001@yahoo.com>

Hi,

This patch addresses critical TLB flush failures that occur during
hibernation resume on AMD GPUs, particularly affecting ROCm workloads.

Problem:
--------
After resuming from hibernation (S4), the amdgpu driver consistently
fails TLB invalidation operations with these errors:

  amdgpu: TLB flush failed for PASID xxxxx
  amdgpu: failed to write reg 28b4 wait reg 28c6
  amdgpu: failed to write reg 1a6f4 wait reg 1a706

These failures cause compute workloads to malfunction or crash, making
hibernation unreliable for systems running ROCm/OpenCL applications.

Root Cause:
-----------
During resume, the KIQ (Kernel Interface Queue) ring is marked as ready
(ring.sched.ready = true) before the GPU hardware has fully initialized.
When TLB invalidation attempts to use KIQ for register access during
this window, the commands fail because the GPU is not yet stable.

Solution:
---------
This patch introduces a resume_gpu_stable flag that:
- Starts as false during resume
- Forces TLB invalidation to use the reliable MMIO path initially
- Gets set to true after ring tests pass in gfx_v9_0_cp_resume()
- Allows switching to the faster KIQ path once GPU is confirmed stable

This ensures TLB flushes work correctly during early resume while still
benefiting from KIQ-based invalidation after the GPU is fully operational.

Testing:
--------
Tested on AMD Cezanne (Renoir) with ROCm workloads across multiple
hibernation cycles. The patch eliminates all TLB flush failures and
restores reliable hibernation support for compute workloads.

Impact:
-------
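Potentially affects AMD GPUs that use KIQ for TLB invalidation; the code
changes here cover only the GFX9/GMC9 paths (gfx_v9_0.c, gmc_v9_0.c) and
the common amdgpu_gmc.c helper. The failures are most visible on systems
with active compute workloads (ROCm, OpenCL).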

Ionut Nechita (1):
  drm/amdgpu: Fix TLB flush failures after hibernation resume

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
 5 files changed, 29 insertions(+), 3 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-06 12:59 [PATCH 0/1] drm/amdgpu: Fix TLB flush failures after hibernation resume Ionut Nechita (Sunlight Linux)
@ 2026-01-06 12:59 ` Ionut Nechita (Sunlight Linux)
  2026-01-06 16:26   ` Alex Deucher
  2026-01-08 12:36   ` Christian König
  0 siblings, 2 replies; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-06 12:59 UTC
  To: Alex Deucher, Christian König, Mario Limonciello,
	Ionut Nechita
  Cc: amd-gfx, dri-devel, linux-kernel

From: Ionut Nechita <ionut_n2001@yahoo.com>

After resume from hibernation, the amdgpu driver experiences TLB
flush failures with errors:

  amdgpu: TLB flush failed for PASID xxxxx
  amdgpu: failed to write reg 28b4 wait reg 28c6
  amdgpu: failed to write reg 1a6f4 wait reg 1a706

Root Cause:
-----------
The KIQ (Kernel Interface Queue) ring is marked as ready
(ring.sched.ready = true) during resume, but the hardware is not
fully functional yet. When TLB invalidation attempts to use KIQ
for register access, the commands fail because the GPU hasn't
completed initialization.

Solution:
---------
1. Add resume_gpu_stable flag (initially false on resume)
2. Force TLB invalidation to use direct MMIO path instead of KIQ
   when resume_gpu_stable is false
3. After ring tests pass in gfx_v9_0_cp_resume(), set
   resume_gpu_stable to true
4. From that point forward, use faster KIQ path for TLB invalidation

This ensures TLB flushes work correctly during early resume while
still benefiting from KIQ-based invalidation after the GPU is stable.

Changes:
--------
- amdgpu.h: Add resume_gpu_stable flag to amdgpu_device
- amdgpu_device.c: Initialize resume_gpu_stable to false on resume
- amdgpu_gmc.c: Check resume_gpu_stable in flush_gpu_tlb_pasid
- gfx_v9_0.c: Set resume_gpu_stable after ring tests pass
- gmc_v9_0.c: Check resume_gpu_stable before using KIQ path

Tested on AMD Cezanne (Renoir) with ROCm workloads after hibernation.
Result: Eliminates TLB flush failures on resume.

Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9f9774f58ce1c..6bf4f6c90164c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1225,6 +1225,7 @@ struct amdgpu_device {
 	bool				in_s4;
 	bool				in_s0ix;
 	suspend_state_t			last_suspend_state;
+	bool				resume_gpu_stable;
 
 	enum pp_mp1_state               mp1_state;
 	struct amdgpu_doorbell_index doorbell_index;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 12201b8e99b3f..440d86ed1e0d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5457,6 +5457,12 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
 		goto exit;
 	}
 
+	/*
+	 * Set resume_gpu_stable to false BEFORE KFD resume to ensure
+	 * extended timeouts are used for TLB flushes during hibernation recovery
+	 */
+	adev->resume_gpu_stable = false;
+
 	r = amdgpu_amdkfd_resume(adev, !amdgpu_sriov_vf(adev) && !adev->in_runpm);
 	if (r)
 		goto exit;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 869bceb0fe2c6..83fe30f0ada75 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -731,7 +731,12 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 	if (!down_read_trylock(&adev->reset_domain->sem))
 		return 0;
 
-	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready) {
+	/*
+	 * After hibernation resume, KIQ may report as ready but not be fully
+	 * functional. Use direct MMIO path until GPU is confirmed stable.
+	 */
+	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
+	    !adev->resume_gpu_stable) {
 		if (adev->gmc.flush_tlb_needs_extra_type_2)
 			adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
 								 2, all_hub,
@@ -835,9 +840,9 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device *adev,
 		goto failed_kiq;
 
 	might_sleep();
+
 	while (r < 1 && cnt++ < MAX_KIQ_REG_TRY &&
 	       !amdgpu_reset_pending(adev->reset_domain)) {
-
 		msleep(MAX_KIQ_REG_BAILOUT_INTERVAL);
 		r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 0148d7ff34d99..fbd07b455b915 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3985,6 +3985,16 @@ static int gfx_v9_0_cp_resume(struct amdgpu_device *adev)
 		amdgpu_ring_test_helper(ring);
 	}
 
+	/*
+	 * After successful ring tests, mark GPU as stable for resume.
+	 * This allows KIQ-based TLB invalidation to be used instead of
+	 * slower direct MMIO path.
+	 */
+	if (!adev->resume_gpu_stable) {
+		adev->resume_gpu_stable = true;
+		dev_info(adev->dev, "GPU rings verified, enabling KIQ path\n");
+	}
+
 	gfx_v9_0_enable_gui_idle_interrupt(adev, true);
 
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 8ad7519f7b581..8a0202f6b3e3c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -855,9 +855,13 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
 
 	/* This is necessary for SRIOV as well as for GFXOFF to function
 	 * properly under bare metal
+	 *
+	 * After hibernation resume, KIQ may report as ready but not be fully
+	 * functional. Use direct MMIO path until GPU is confirmed stable.
 	 */
 	if (adev->gfx.kiq[inst].ring.sched.ready &&
-	    (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
+	    (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) &&
+	    adev->resume_gpu_stable) {
 		uint32_t req = hub->vm_inv_eng0_req + hub->eng_distance * eng;
 		uint32_t ack = hub->vm_inv_eng0_ack + hub->eng_distance * eng;
 
-- 
2.52.0



* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-06 12:59 ` [PATCH 1/1] " Ionut Nechita (Sunlight Linux)
@ 2026-01-06 16:26   ` Alex Deucher
  2026-01-07 10:52     ` Ionut Nechita (Sunlight Linux)
  2026-01-08 12:36   ` Christian König
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Deucher @ 2026-01-06 16:26 UTC
  To: Ionut Nechita (Sunlight Linux)
  Cc: Alex Deucher, Christian König, Mario Limonciello,
	Ionut Nechita, amd-gfx, dri-devel, linux-kernel

On Tue, Jan 6, 2026 at 9:16 AM Ionut Nechita (Sunlight Linux)
<sunlightlinux@gmail.com> wrote:
>
> From: Ionut Nechita <ionut_n2001@yahoo.com>
>
> After resume from hibernation, the amdgpu driver experiences TLB
> flush failures with errors:
>
>   amdgpu: TLB flush failed for PASID xxxxx
>   amdgpu: failed to write reg 28b4 wait reg 28c6
>   amdgpu: failed to write reg 1a6f4 wait reg 1a706
>
> Root Cause:
> -----------
> The KIQ (Kernel Interface Queue) ring is marked as ready
> (ring.sched.ready = true) during resume, but the hardware is not
> fully functional yet. When TLB invalidation attempts to use KIQ
> for register access, the commands fail because the GPU hasn't
> completed initialization.
>
> Solution:
> ---------
> 1. Add resume_gpu_stable flag (initially false on resume)
> 2. Force TLB invalidation to use direct MMIO path instead of KIQ
>    when resume_gpu_stable is false
> 3. After ring tests pass in gfx_v9_0_cp_resume(), set
>    resume_gpu_stable to true
> 4. From that point forward, use faster KIQ path for TLB invalidation
>
> This ensures TLB flushes work correctly during early resume while
> still benefiting from KIQ-based invalidation after the GPU is stable.
>
> Changes:
> --------
> - amdgpu.h: Add resume_gpu_stable flag to amdgpu_device
> - amdgpu_device.c: Initialize resume_gpu_stable to false on resume
> - amdgpu_gmc.c: Check resume_gpu_stable in flush_gpu_tlb_pasid
> - gfx_v9_0.c: Set resume_gpu_stable after ring tests pass
> - gmc_v9_0.c: Check resume_gpu_stable before using KIQ path
>
> Tested on AMD Cezanne (Renoir) with ROCm workloads after hibernation.
> Result: Eliminates TLB flush failures on resume.
>
> Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c    |  9 +++++++--
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++++++++
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      |  6 +++++-
>  5 files changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 9f9774f58ce1c..6bf4f6c90164c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1225,6 +1225,7 @@ struct amdgpu_device {
>         bool                            in_s4;
>         bool                            in_s0ix;
>         suspend_state_t                 last_suspend_state;
> +       bool                            resume_gpu_stable;
>
>         enum pp_mp1_state               mp1_state;
>         struct amdgpu_doorbell_index doorbell_index;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 12201b8e99b3f..440d86ed1e0d3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5457,6 +5457,12 @@ int amdgpu_device_resume(struct drm_device *dev, bool notify_clients)
>                 goto exit;
>         }
>
> +       /*
> +        * Set resume_gpu_stable to false BEFORE KFD resume to ensure
> +        * extended timeouts are used for TLB flushes during hibernation recovery
> +        */
> +       adev->resume_gpu_stable = false;

This change disables TLB flushes via KIQ permanently.  This is called
after gfx_v9_0_cp_resume() so resume_gpu_stable is only set to true
between the calls to amdgpu_device_ip_resume() and here.

kiq sched.ready should be handled correctly.  kiq sched.ready gets set
to false in suspend via:

gfx_v9_0_suspend() -> gfx_v9_0_hw_fini() -> gfx_v9_0_cp_enable()

Then on resume, kiq sched.ready gets set to true again via:

gfx_v9_0_resume() -> gfx_v9_0_hw_init() -> gfx_v9_0_cp_resume() ->
gfx_v9_0_kcq_resume() -> amdgpu_gfx_enable_kcq() ->
amdgpu_ring_test_helper()

At that point the KIQ hardware is ready. If it weren't, then the above
sequence would not have worked.  The gfx and compute ring tests are
irrelevant.
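
The ordering problem described above can be illustrated with a small
userspace model. This is only a sketch, not driver code: struct toy_dev
and the toy_* functions are hypothetical stand-ins for the real
amdgpu_device fields and resume functions, and the model assumes the
call order stated in this reply (gfx_v9_0_cp_resume() runs inside
amdgpu_device_ip_resume(), before the flag is cleared).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields of interest;
 * this is NOT the real struct amdgpu_device. */
struct toy_dev {
	bool kiq_ready;         /* models the KIQ ring's sched.ready */
	bool resume_gpu_stable; /* flag added by the patch under review */
};

/* Models gfx_v9_0_cp_resume() with the patch applied: the KIQ ring
 * test marks the ring ready, then the flag is set to true. */
static void toy_cp_resume(struct toy_dev *d)
{
	d->kiq_ready = true;
	if (!d->resume_gpu_stable)
		d->resume_gpu_stable = true;
}

/* Models amdgpu_device_resume() with the patch applied: IP resume
 * (which reaches cp_resume) runs first, and only afterwards is the
 * flag cleared, just before KFD resume. */
static void toy_device_resume(struct toy_dev *d)
{
	assert(d != NULL);
	toy_cp_resume(d);             /* part of IP resume */
	d->resume_gpu_stable = false; /* line added to amdgpu_device.c */
}

/* Models the patched condition: the KIQ path is taken only when the
 * ring is ready AND the flag is set. */
static bool toy_flush_uses_kiq(const struct toy_dev *d)
{
	return d->kiq_ready && d->resume_gpu_stable;
}
```

In this model, calling toy_device_resume() always leaves
resume_gpu_stable false, so toy_flush_uses_kiq() returns false from then
on even though kiq_ready is true. Nothing sets the flag again until the
next resume, where it is cleared once more.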

Alex



* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-06 16:26   ` Alex Deucher
@ 2026-01-07 10:52     ` Ionut Nechita (Sunlight Linux)
  0 siblings, 0 replies; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-07 10:52 UTC
  To: alexdeucher
  Cc: alexander.deucher, amd-gfx, christian.koenig, dri-devel,
	ionut_n2001, linux-kernel, sunlightlinux, superm1

Hi Alex,

Thank you for the detailed review and for pointing out the ordering issue.

You're absolutely right - I misunderstood the call sequence. Setting
resume_gpu_stable to false in amdgpu_device_resume() happens after
gfx_v9_0_cp_resume(), which defeats the purpose and permanently
disables the KIQ path.

However, I'm still experiencing the TLB flush failures after hibernation
resume on AMD Cezanne (Renoir):

  amdgpu: TLB flush failed for PASID xxxxx
  amdgpu: failed to write reg 28b4 wait reg 28c6
  amdgpu: failed to write reg 1a6f4 wait reg 1a706

If kiq sched.ready is being handled correctly as you described, what
else could cause these failures during resume? Are there any known
issues with KIQ-based TLB invalidation after hibernation on GFX9?

Should I investigate:
- Timing issues with KIQ command submission during early resume?
- Power/clock gating states affecting KIQ functionality?
- Missing synchronization after KIQ initialization?

Any guidance on the correct direction to investigate would be appreciated.

Thanks,
Ionut


* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-06 12:59 ` [PATCH 1/1] " Ionut Nechita (Sunlight Linux)
  2026-01-06 16:26   ` Alex Deucher
@ 2026-01-08 12:36   ` Christian König
  2026-01-26 19:40     ` Ionut Nechita (Sunlight Linux)
  1 sibling, 1 reply; 13+ messages in thread
From: Christian König @ 2026-01-08 12:36 UTC
  To: Ionut Nechita (Sunlight Linux), Alex Deucher, Mario Limonciello,
	Ionut Nechita
  Cc: amd-gfx, dri-devel, linux-kernel

On 1/6/26 13:59, Ionut Nechita (Sunlight Linux) wrote:
> From: Ionut Nechita <ionut_n2001@yahoo.com>
> 
> After resume from hibernation, the amdgpu driver experiences TLB
> flush failures with errors:

In general we don't support hibernation with the driver.

> 
>   amdgpu: TLB flush failed for PASID xxxxx
>   amdgpu: failed to write reg 28b4 wait reg 28c6
>   amdgpu: failed to write reg 1a6f4 wait reg 1a706
> 
> Root Cause:
> -----------
> The KIQ (Kernel Interface Queue) ring is marked as ready
> (ring.sched.ready = true) during resume, but the hardware is not
> fully functional yet.

Well, since that is the root cause, we should probably fix that.

> When TLB invalidation attempts to use KIQ
> for register access, the commands fail because the GPU hasn't
> completed initialization.
> 
> Solution:
> ---------
> 1. Add resume_gpu_stable flag (initially false on resume)
> 2. Force TLB invalidation to use direct MMIO path instead of KIQ
>    when resume_gpu_stable is false

That is a really bad idea. So absolutely clear NAK to this patch here.

> 3. After ring tests pass in gfx_v9_0_cp_resume(), set
>    resume_gpu_stable to true
> 4. From that point forward, use faster KIQ path for TLB invalidation
> 
> This ensures TLB flushes work correctly during early resume while
> still benefiting from KIQ-based invalidation after the GPU is stable.

No, it doesn't. This patch only works by coincidence, not by proper engineering.

This only works because the TLB flush is most likely not necessary.

Question is why the KIQ is not up and running before we do anything with it?

Regards,
Christian.




* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-08 12:36   ` Christian König
@ 2026-01-26 19:40     ` Ionut Nechita (Sunlight Linux)
  2026-01-26 20:25       ` Alex Deucher
  0 siblings, 1 reply; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-26 19:40 UTC
  To: christian.koenig
  Cc: alexander.deucher, amd-gfx, dri-devel, ionut_n2001, linux-kernel,
	sunlightlinux, superm1

From: Ionut Nechita <sunlightlinux@gmail.com>

On Thu, Jan 8 2026 at 13:36, Christian König wrote:

> Question is why the KIQ is not up and running before we do anything with it?

Thank you for the feedback. I completely understand that my patch is
just a workaround and not proper engineering - you're absolutely right
that the real issue is KIQ being marked as ready before it's actually
functional.

I don't have experience with GPU drivers and video subsystems, so I'm
not familiar with the proper initialization sequence for KIQ. I'd prefer
not to keep a workaround for this issue in my tree.

Is there a proper fix available, or could you point me in the right
direction? I'm happy to test any patches on my AMD Cezanne (Renoir)
hardware where I can reliably reproduce the issue after hibernation.

Also, regarding hibernation support: you mentioned that hibernation is
not generally supported with the driver. Should I expect other issues
beyond this TLB flush problem, or is this the main blocker?

Thanks for your time,
Ionut


* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 19:40     ` Ionut Nechita (Sunlight Linux)
@ 2026-01-26 20:25       ` Alex Deucher
  2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Alex Deucher @ 2026-01-26 20:25 UTC
  To: Ionut Nechita (Sunlight Linux)
  Cc: christian.koenig, alexander.deucher, amd-gfx, dri-devel,
	ionut_n2001, linux-kernel, superm1

On Mon, Jan 26, 2026 at 2:52 PM Ionut Nechita (Sunlight Linux)
<sunlightlinux@gmail.com> wrote:
>
> From: Ionut Nechita <sunlightlinux@gmail.com>
>
> On Thu, Jan 8 2026 at 13:36, Christian König wrote:
>
> > Question is why the KIQ is not up and running before we do anything with it?
>
> Thank you for the feedback. I completely understand that my patch is
> just a workaround and not proper engineering - you're absolutely right
> that the real issue is KIQ being marked as ready before it's actually
> functional.
>
> I don't have experience with GPU drivers and video subsystems, so I'm
> not familiar with the proper initialization sequence for KIQ. I'd prefer
> not to keep a workaround for this issue in my tree.
>
> Is there a proper fix available, or could you point me in the right
> direction? I'm happy to test any patches on my AMD Cezanne (Renoir)
> hardware where I can reliably reproduce the issue after hibernation.

Can you get a stack trace when this happens so we can see the call chain?

>
> Also, regarding hibernation support: you mentioned that hibernation is
> not generally supported with the driver. Should I expect other issues
> beyond this TLB flush problem, or is this the main blocker?

The biggest issue with hibernation is that it's not compatible with
secure boot so most distros don't officially support it.  The other
issue is that when we go into hibernation, we need to evict the
contents of VRAM somewhere and at the point when that happens, swap is
already offline.  So in a lot of cases, we don't have enough memory to
back up the VRAM contents.  There were patches to the Linux PM core,
but I can't recall if they've all landed yet.  There's also the
possibility that the user's swap partition is too small.

Alex


* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:25       ` Alex Deucher
@ 2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
  2026-01-26 20:32           ` Mario Limonciello (AMD) (kernel.org)
  2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
  2026-02-01 19:05         ` Ionut Nechita (Sunlight Linux)
  2 siblings, 1 reply; 13+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2026-01-26 20:28 UTC
  To: Alex Deucher, Ionut Nechita (Sunlight Linux)
  Cc: christian.koenig, alexander.deucher, amd-gfx, dri-devel,
	ionut_n2001, linux-kernel



On 1/26/2026 2:25 PM, Alex Deucher wrote:
> On Mon, Jan 26, 2026 at 2:52 PM Ionut Nechita (Sunlight Linux)
> <sunlightlinux@gmail.com> wrote:
>>
>> From: Ionut Nechita <sunlightlinux@gmail.com>
>>
>> On Thu, Jan 8 2026 at 13:36, Christian König wrote:
>>
>>> Question is why the KIQ is not up and running before we do anything with it?
>>
>> Thank you for the feedback. I completely understand that my patch is
>> just a workaround and not proper engineering - you're absolutely right
>> that the real issue is KIQ being marked as ready before it's actually
>> functional.
>>
>> I don't have experience with GPU drivers and video subsystems, so I'm
>> not familiar with the proper initialization sequence for KIQ. I'd prefer
>> not to keep a workaround for this issue in my tree.
>>
>> Is there a proper fix available, or could you point me in the right
>> direction? I'm happy to test any patches on my AMD Cezanne (Renoir)
>> hardware where I can reliably reproduce the issue after hibernation.
> 
> Can you get a stack trace when this happens so we can see the call chain?
> 
>>
>> Also, regarding hibernation support: you mentioned that hibernation is
>> not generally supported with the driver. Should I expect other issues
>> beyond this TLB flush problem, or is this the main blocker?
> 
> The biggest issue with hibernation is that it's not compatible with
> secure boot so most distros don't officially support it.

And by extension of this it doesn't get as much testing as s2idle/s3 do.

>  The other
> issue is that when we go into hibernation, we need to evict the
> contents of VRAM somewhere and at the point when that happens, swap is
> already offline.  So in a lot of cases, we don't have enough memory to
> back up the VRAM contents.  There were patches to the Linux PM core,
> but I can't recall if they've all landed yet.  

Yeah everything should have landed now, so swap will still be enabled.

> There's also the
> possibility that the user's swap partition is too small.
> 
> Alex

I heard something about /sys/power/reserved_size being too small by 
default still, so if you're having problems you might increase that.


* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
@ 2026-01-26 20:32           ` Mario Limonciello (AMD) (kernel.org)
  2026-01-26 20:46             ` Ionut Nechita (Sunlight Linux)
  0 siblings, 1 reply; 13+ messages in thread
From: Mario Limonciello (AMD) (kernel.org) @ 2026-01-26 20:32 UTC (permalink / raw)
  To: Alex Deucher, Ionut Nechita (Sunlight Linux)
  Cc: christian.koenig, alexander.deucher, amd-gfx, dri-devel,
	ionut_n2001, linux-kernel



On 1/26/2026 2:28 PM, Mario Limonciello (AMD) (kernel.org) wrote:
> 
> 
> On 1/26/2026 2:25 PM, Alex Deucher wrote:
>> On Mon, Jan 26, 2026 at 2:52 PM Ionut Nechita (Sunlight Linux)
>> <sunlightlinux@gmail.com> wrote:
>>>
>>> From: Ionut Nechita <sunlightlinux@gmail.com>
>>>
>>> On Thu, Jan 8 2026 at 13:36, Christian König wrote:
>>>
>>>> Question is why the KIQ is not up and running before we do anything 
>>>> with it?
>>>
>>> Thank you for the feedback. I completely understand that my patch is
>>> just a workaround and not proper engineering - you're absolutely right
>>> that the real issue is KIQ being marked as ready before it's actually
>>> functional.
>>>
>>> I don't have experience with GPU drivers and video subsystems, so I'm
>>> not familiar with the proper initialization sequence for KIQ. I'd prefer
>>> not to keep a workaround for this issue in my tree.
>>>
>>> Is there a proper fix available, or could you point me in the right
>>> direction? I'm happy to test any patches on my AMD Cezanne (Renoir)
>>> hardware where I can reliably reproduce the issue after hibernation.
>>
>> Can you get a stack trace when this happens so we can see the call chain?
>>
>>>
>>> Also, regarding hibernation support: you mentioned that hibernation is
>>> not generally supported with the driver. Should I expect other issues
>>> beyond this TLB flush problem, or is this the main blocker?
>>
>> The biggest issue with hibernation is that it's not compatible with
>> secure boot so most distros don't officially support it.
> 
> And by extension of this it doesn't get as much testing as s2idle/s3 do.
> 
>>  The other
>> issue is that when we go into hibernation, we need to evict the
>> contents of VRAM somewhere and at the point when that happens, swap is
>> already offline.  So in a lot of cases, we don't have enough memory to
>> back up the VRAM contents.  There were patches to the Linux PM core,
>> but I can't recall if they've all landed yet. 
> 
> Yeah everything should have landed now, so swap will still be enabled.
> 
>> There's also the
>> possibility that the user's swap partition is too small.
>>
>> Alex
> 
> I heard something about /sys/power/reserved_size being too small by 
> default still, so if you're having problems you might increase that.
> 
Sorry not reserved_size, /sys/power/image_size.

Here's where it was mentioned.

https://gitlab.freedesktop.org/drm/amd/-/issues/4882#note_3287247

* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:25       ` Alex Deucher
  2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
@ 2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
  2026-01-27 11:35           ` Christian König
  2026-02-01 19:05         ` Ionut Nechita (Sunlight Linux)
  2 siblings, 1 reply; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-26 20:37 UTC (permalink / raw)
  To: alexdeucher
  Cc: alexander.deucher, amd-gfx, christian.koenig, dri-devel,
	ionut_n2001, linux-kernel, sunlightlinux, superm1

Hi Alex,

Thank you for the feedback and for taking the time to review this issue.

I'll add debug code to capture the full stack trace when the TLB flush
failures occur. I'll test this on my AMD Cezanne system over the next
few days when I have more time available, and will send you the complete
call chain information.

Regarding the hibernation limitations you mentioned - I understand the
challenges with secure boot compatibility and VRAM eviction. In my case,
I'm not using secure boot, and my system has sufficient RAM and swap
space to handle the VRAM backup, so those particular issues shouldn't
affect my setup.

I'll follow up with the stack traces and additional debugging information
in the next few days.

Thanks again,
Ionut

* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:32           ` Mario Limonciello (AMD) (kernel.org)
@ 2026-01-26 20:46             ` Ionut Nechita (Sunlight Linux)
  0 siblings, 0 replies; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-01-26 20:46 UTC (permalink / raw)
  To: superm1
  Cc: alexander.deucher, alexdeucher, amd-gfx, christian.koenig,
	dri-devel, ionut_n2001, linux-kernel, sunlightlinux

Hi Mario,

Thanks for the additional context about /sys/power/image_size and the
GitLab issue reference.

My platform configuration appears to be well set up for hibernation:

  - 30 GB RAM / 30 GB swap available
  - /sys/power/image_size: ~12 GB (13033648128 bytes)
  - Not using secure boot
  - Swap partition is adequately sized for VRAM eviction

Based on the GitLab discussion you referenced, my image_size should be
sufficient. So the known hibernation limitations shouldn't be blocking
factors on my system - the TLB flush failure appears to be the main
remaining issue.

I'll capture stack traces as Alex requested and follow up with the
debugging information in the next few days.

Thanks,
Ionut
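
A minimal sketch of the sanity check described above, comparing the hibernation image target against free swap. /sys/power/image_size and /proc/meminfo are standard Linux interfaces; the helper names (meminfo_bytes, hibernation_report, read_live_values) are hypothetical and chosen only for illustration. The comparison is a rough heuristic: image_size is a target the kernel tries to meet rather than a hard limit, so a passing check does not guarantee that hibernation succeeds.

```python
from pathlib import Path

GIB = 1024 ** 3


def meminfo_bytes(meminfo_text, field):
    """Return one /proc/meminfo field (reported in kB) converted to bytes."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0]) * 1024
    raise KeyError(field)


def hibernation_report(image_size_text, meminfo_text):
    """Compare the image size target with the currently free swap."""
    image_size = int(image_size_text.strip())
    swap_free = meminfo_bytes(meminfo_text, "SwapFree")
    return {
        "image_size_gib": round(image_size / GIB, 2),
        "swap_free_gib": round(swap_free / GIB, 2),
        # Heuristic only: the image must fit into the swap that is free.
        "fits_in_free_swap": image_size <= swap_free,
    }


def read_live_values():
    """Read the real files; only meaningful on a Linux system."""
    return hibernation_report(
        Path("/sys/power/image_size").read_text(),
        Path("/proc/meminfo").read_text(),
    )
```

Calling read_live_values() on the affected machine would report both sizes in GiB; the parsing helpers can also be fed saved copies of the two files from a bug report.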

* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
@ 2026-01-27 11:35           ` Christian König
  0 siblings, 0 replies; 13+ messages in thread
From: Christian König @ 2026-01-27 11:35 UTC (permalink / raw)
  To: Ionut Nechita (Sunlight Linux), alexdeucher
  Cc: alexander.deucher, amd-gfx, christian.koenig, dri-devel,
	ionut_n2001, linux-kernel, superm1

Hi Ionut,

On 1/26/26 21:37, Ionut Nechita (Sunlight Linux) wrote:
> Hi Alex,
> 
> Thank you for the feedback and for taking the time to review this issue.
> 
> I'll add debug code to capture the full stack trace when the TLB flush
> failures occur. I'll test this on my AMD Cezanne system over the next
> few days when I have more time available, and will send you the complete
> call chain information.
> 
> Regarding the hibernation limitations you mentioned - I understand the
> challenges with secure boot compatibility and VRAM eviction. In my case,
> I'm not using secure boot, and my system has sufficient RAM and swap
> space to handle the VRAM backup, so those particular issues shouldn't
> affect my setup.

the problem is that, since the AMD GPU doesn't officially support hibernation, you are pretty much the only one able to reproduce the bug.

In other words I can't go to my manager to get a Cezanne system plus time to work on this.

What Alex, Mario, I and all the other devs can do is to explain to you how the driver works and try to further debug what's going on here.

> I'll follow up with the stack traces and additional debugging information
> in the next few days.

That would certainly be helpful.

Thanks,
Christian.

> 
> Thanks again,
> Ionut


* Re: [PATCH 1/1] drm/amdgpu: Fix TLB flush failures after hibernation resume
  2026-01-26 20:25       ` Alex Deucher
  2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
  2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
@ 2026-02-01 19:05         ` Ionut Nechita (Sunlight Linux)
  2 siblings, 0 replies; 13+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-02-01 19:05 UTC (permalink / raw)
  To: alexdeucher
  Cc: alexander.deucher, amd-gfx, christian.koenig, dri-devel,
	ionut_n2001, linux-kernel, sunlightlinux, superm1

Hi Alex,

Thank you for the quick response and for the information about hibernation support.

Here's the stack trace showing the call chain when the TLB flush failures occur. The issue happens in two places:

1. During resume (hibernation restore):

Call Trace:
 dump_stack_lvl+0x5b/0x80
 amdgpu_gmc_fw_reg_write_reg_wait+0x1c7/0x1d0 [amdgpu]
 gmc_v9_0_hw_init+0x2e2/0x390 [amdgpu]
 gmc_v9_0_resume+0x26/0x70 [amdgpu]
 amdgpu_ip_block_resume+0x27/0x50 [amdgpu]
 amdgpu_device_ip_resume_phase1+0x55/0x90 [amdgpu]
 amdgpu_device_resume+0x69/0x380 [amdgpu]
 amdgpu_pmops_resume+0x46/0x80 [amdgpu]
 dpm_run_callback+0x4a/0x150
 device_resume+0x1df/0x2f0
 async_resume+0x21/0x30
 async_run_entry_fn+0x36/0x160
 process_one_work+0x193/0x350
 worker_thread+0x2d7/0x410

2. Subsequent failures during command submission:

Call Trace:
 dump_stack_lvl+0x5b/0x80
 amdgpu_gmc_fw_reg_write_reg_wait+0x1c7/0x1d0 [amdgpu]
 amdgpu_gmc_flush_gpu_tlb+0xd0/0x280 [amdgpu]
 amdgpu_gart_invalidate_tlb.part.0+0x59/0x90 [amdgpu]
 amdgpu_ttm_alloc_gart+0x146/0x180 [amdgpu]
 amdgpu_cs_parser_bos.isra.0+0x5d6/0x7d0 [amdgpu]
 amdgpu_cs_ioctl+0xbd0/0x1aa0 [amdgpu]
 drm_ioctl_kernel+0xa6/0x100
 drm_ioctl+0x262/0x520
 amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]

Error message: "amdgpu 0000:04:00.0: amdgpu: failed to write reg 1a6f4 wait reg 1a706"

Full dmesg log available at: https://gitlab.freedesktop.org/-/project/4522/uploads/6a285ad2e24f4807e5d75c3f4ed5a7a1/dmesg-dump-stack.txt

Regarding the hibernation support issues you mentioned - I understand the limitations with secure boot and VRAM eviction. In my case, I have secure boot disabled and sufficient swap space, so the primary issue I'm encountering is this TLB flush failure.

I'm happy to test any patches or help with further debugging if needed.

Thanks,
Ionut
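
A minimal sketch for comparing the two call chains above, assuming the frames keep the "symbol+0xoffset/0xsize [module]" layout shown in the traces. The names Frame, parse_frame, parse_trace and shared_leading_symbols are hypothetical helpers, not part of any kernel tooling. The idea is to find the frames both traces have in common at the top of the stack, which marks the function where the failure is reported.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as " gmc_v9_0_hw_init+0x2e2/0x390 [amdgpu]".
FRAME_RE = re.compile(
    r"^\s*(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


class Frame(NamedTuple):
    symbol: str
    offset: int
    size: int
    module: Optional[str]


def parse_frame(line):
    """Parse one stack frame line; return None for non-frame lines."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return Frame(
        m.group("symbol"),
        int(m.group("offset"), 16),
        int(m.group("size"), 16),
        m.group("module"),
    )


def parse_trace(text):
    """Return the frames of a trace, innermost first, skipping other lines."""
    frames = (parse_frame(line) for line in text.splitlines())
    return [f for f in frames if f is not None]


def shared_leading_symbols(trace_a, trace_b):
    """Return the symbols both traces share, starting from the innermost frame."""
    shared = []
    for a, b in zip(trace_a, trace_b):
        if a.symbol != b.symbol:
            break
        shared.append(a.symbol)
    return shared
```

Applied to the two traces in this message, the shared leading frames are dump_stack_lvl and amdgpu_gmc_fw_reg_write_reg_wait; the callers diverge after that (gmc_v9_0_hw_init during resume, amdgpu_gmc_flush_gpu_tlb during command submission).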

end of thread, other threads:[~2026-02-01 19:05 UTC | newest]

Thread overview: 13+ messages
2026-01-06 12:59 [PATCH 0/1] drm/amdgpu: Fix TLB flush failures after hibernation resume Ionut Nechita (Sunlight Linux)
2026-01-06 12:59 ` [PATCH 1/1] " Ionut Nechita (Sunlight Linux)
2026-01-06 16:26   ` Alex Deucher
2026-01-07 10:52     ` Ionut Nechita (Sunlight Linux)
2026-01-08 12:36   ` Christian König
2026-01-26 19:40     ` Ionut Nechita (Sunlight Linux)
2026-01-26 20:25       ` Alex Deucher
2026-01-26 20:28         ` Mario Limonciello (AMD) (kernel.org)
2026-01-26 20:32           ` Mario Limonciello (AMD) (kernel.org)
2026-01-26 20:46             ` Ionut Nechita (Sunlight Linux)
2026-01-26 20:37         ` Ionut Nechita (Sunlight Linux)
2026-01-27 11:35           ` Christian König
2026-02-01 19:05         ` Ionut Nechita (Sunlight Linux)
