[PATCH 0/6] drm/amdgpu: Improve retry fault handling

* [PATCH 0/6] drm/amdgpu: Improve retry fault handling
@ 2026-05-13 16:30 Timur Kristóf
  2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

Fix several issues in retry fault handling, including
enabling the retry fault interrupt, which is necessary
for retry faults to work.

Improve retry faults on Navi 3 dGPUs by enabling
the filter CAM, which can filter the repeated page
fault interrupts that happen when retry faults are
enabled, making the handling more efficient.

With this series, the kernel is able to mitigate
most page faults on Navi 3 without causing a hang
and without a need to reset the GPU, when the
amdgpu.noretry=0 module parameter is set.
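
For readers unfamiliar with the filter CAM, here is a rough conceptual
model in plain C. It is only a sketch of the idea of suppressing repeated
faults until they are acknowledged. It is not the hardware behaviour and
not driver code: the names retry_cam, cam_report_fault and cam_ack, the
table size, the key layout and the handling of a full table are all
assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAM_ENTRIES 8	/* arbitrary size chosen for this sketch */

struct cam_entry {
	bool     valid;
	uint64_t key;	/* assumed: faulting page address combined with an ID */
};

struct retry_cam {
	struct cam_entry e[CAM_ENTRIES];
};

/*
 * Returns true if an interrupt should be raised for this fault.
 * On true, *slot holds the entry to acknowledge later, or -1 when the
 * table was full and the fault could not be tracked.
 * Returns false when the same fault is already pending (a repeat).
 */
static inline bool cam_report_fault(struct retry_cam *cam, uint64_t key,
				    int *slot)
{
	int free_slot = -1;

	for (int i = 0; i < CAM_ENTRIES; i++) {
		if (cam->e[i].valid && cam->e[i].key == key)
			return false;	/* repeat of a pending fault */
		if (!cam->e[i].valid && free_slot < 0)
			free_slot = i;
	}

	if (free_slot >= 0) {
		cam->e[free_slot].valid = true;
		cam->e[free_slot].key = key;
	}
	*slot = free_slot;
	return true;
}

/* Release an entry once the fault has been handled. */
static inline void cam_ack(struct retry_cam *cam, int slot)
{
	if (slot >= 0 && slot < CAM_ENTRIES)
		cam->e[slot].valid = false;
}
```

In this model the first report of a fault raises one interrupt, every
repeat is dropped while the entry is pending, and the same fault can be
reported again only after the entry has been acknowledged.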

Timur Kristóf (6):
  drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
  drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
  drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
  drm/amdgpu/gmc: Don't compare page fault timestamps with other
    interrupts
  drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  drm/amdgpu: Enable retry CAM on Navi 3 dGPUs

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 17 ++++++++++-------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 17 ++++++++++-------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 19 +++++++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 15 +++++++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 15 +++++++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 15 +++++++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 15 +++++++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 17 ++++++++++-------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 17 ++++++++++-------
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
 22 files changed, 134 insertions(+), 71 deletions(-)

-- 
2.54.0



* [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

Whether retry faults are actually enabled is determined by
the amdgpu_gmc_noretry_set() function. The rest of the code
base should use gmc->noretry instead of the module parameter.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
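Note for reviewers: a minimal standalone sketch of why the raw parameter
and the resolved value can disagree. It assumes that the module parameter
uses -1 for "auto" and that the driver then picks a per-chip default. The
names param_noretry, gmc_state, gmc_noretry_set and chip_default_noretry
are stand-ins invented for this sketch, not the driver's symbols.

```c
#include <stdbool.h>

/*
 * Stand-in for the amdgpu.noretry module parameter (assumed encoding):
 * -1 = let the driver decide, 0 = retry enabled, 1 = retry disabled.
 */
static int param_noretry = -1;

/* Stand-in for the per-device state that holds the resolved value. */
struct gmc_state {
	bool noretry;
};

/*
 * Resolve the parameter once. chip_default_noretry is an assumed
 * per-chip default supplied by the caller.
 */
static inline void gmc_noretry_set(struct gmc_state *gmc,
				   bool chip_default_noretry)
{
	if (param_noretry == -1)
		gmc->noretry = chip_default_noretry;
	else
		gmc->noretry = param_noretry != 0;
}

/* Retry field derived from the raw parameter: -1 is non-zero, so
 * "auto" always yields 0 regardless of the chip default. */
static inline unsigned int retry_field_from_param(void)
{
	return !param_noretry;
}

/* Retry field derived from the resolved per-device value. */
static inline unsigned int retry_field_from_gmc(const struct gmc_state *gmc)
{
	return !gmc->noretry;
}
```

With the parameter left at "auto" and a chip default of retry enabled,
the first helper reports retry as disabled while the second reports it
as enabled, which is the kind of mismatch this patch avoids.
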
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   | 2 +-
 10 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index f9949fedfbb9..f845ba698b40 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -321,7 +321,7 @@ static void gfxhub_v11_5_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index 7609b9cecae8..ba78b5a1a7cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -326,7 +326,7 @@ static void gfxhub_v12_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index abe30c8bd2ba..631f99e3741a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -318,7 +318,7 @@ static void gfxhub_v3_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index b3ef6e71811f..8a87410ce016 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -323,7 +323,7 @@ static void gfxhub_v3_0_3_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
index 3d82cfa0f1b5..ab56dd15b3f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
@@ -340,7 +340,7 @@ static void mmhub_v3_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
index a1b0b7b39a42..6522a89379b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
@@ -333,7 +333,7 @@ static void mmhub_v3_0_1_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
index 34e8dbd47c0f..23cf95783264 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
@@ -339,7 +339,7 @@ static void mmhub_v3_0_2_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
index cfce7e1297d4..98568c72c2be 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
@@ -451,7 +451,7 @@ static void mmhub_v3_3_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
index bef75c4c48d3..c9fb48992a2d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
@@ -334,7 +334,7 @@ static void mmhub_v4_1_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
index 29f7ed466858..49b7f16a941f 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
@@ -518,7 +518,7 @@ static void mmhub_v4_2_0_mid_setup_vmid_config(struct amdgpu_device *adev,
 			/* Send no-retry XNACK on fault to suppress VM fault storm. */
 			tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 					    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-					    !amdgpu_noretry);
+					    !adev->gmc.noretry);
 			WREG32_SOC15_OFFSET(MMHUB, GET_INST(MMHUB, j), regMMVM_CONTEXT1_CNTL,
 					    i * hub->ctx_distance, tmp);
 			WREG32_SOC15_OFFSET(MMHUB, GET_INST(MMHUB, j), regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
-- 
2.54.0



* [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
  2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

Enable retry fault interrupts when retrying page faults is
enabled in amdgpu (i.e. amdgpu.noretry=0).

This needs to be done for every GFXHUB version at once,
because none of them currently enables this interrupt.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
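Note for reviewers: a standalone sketch of the read-modify-write that
each hunk below performs, in the spirit of REG_SET_FIELD(). The bit
position used here is hypothetical; the real mask and shift come from the
generated register headers. set_field and program_retry_irq are names
made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field layout, for illustration only. */
#define ENABLE_RETRY_FAULT_INTERRUPT_SHIFT 5
#define ENABLE_RETRY_FAULT_INTERRUPT_MASK \
	(1u << ENABLE_RETRY_FAULT_INTERRUPT_SHIFT)

/* Replace one field of a register value and leave the other bits alone. */
static inline uint32_t set_field(uint32_t reg, uint32_t mask,
				 unsigned int shift, uint32_t val)
{
	return (reg & ~mask) | ((val << shift) & mask);
}

/*
 * Compute the new register value: the interrupt is enabled only when
 * fault handling is enabled (value) and retry is not disabled (noretry).
 */
static inline uint32_t program_retry_irq(uint32_t cntl2, bool value,
					 bool noretry)
{
	return set_field(cntl2, ENABLE_RETRY_FAULT_INTERRUPT_MASK,
			 ENABLE_RETRY_FAULT_INTERRUPT_SHIFT,
			 value && !noretry);
}
```

Because the field is always written, the bit is also cleared again when
either condition stops being true.
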
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 5 +++++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 5 +++++
 9 files changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index f845ba698b40..8fdf66ad265c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -456,6 +456,11 @@ static void gfxhub_v11_5_0_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v11_5_0_vmhub_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index ba78b5a1a7cd..84344c67013a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -461,6 +461,11 @@ static void gfxhub_v12_0_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v12_0_vmhub_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
index 3544eb42dca6..e505aaf8b447 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
@@ -648,6 +648,11 @@ static void gfxhub_v12_1_xcc_set_fault_enable_default(struct amdgpu_device *adev
 					    CRASH_ON_RETRY_FAULT, 1);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL_HI32, tmp);
+
+		tmp = RREG32_SOC15(GC, GET_INST(GC, i), regGCVM_L2_PROTECTION_FAULT_CNTL2);
+		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+			ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+		WREG32_SOC15(GC, GET_INST(GC, i), regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index a7bfc9f41d0e..c8a615147904 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -410,6 +410,11 @@ static void gfxhub_v1_0_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static void gfxhub_v1_0_init(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index 6c03bf9f1ae8..afc8c6a6f1bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -523,6 +523,11 @@ static void gfxhub_v1_2_xcc_set_fault_enable_default(struct amdgpu_device *adev,
 					CRASH_ON_RETRY_FAULT, 1);
 		}
 		WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+		tmp = RREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL2);
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
+			ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+		WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 793faf62cb07..a27bb37b2a11 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -425,6 +425,11 @@ static void gfxhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v2_0_vmhub_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index aceb8447feac..db56f7a61d61 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -456,6 +456,11 @@ static void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v2_1_vmhub_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index 631f99e3741a..97585c7b879c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -453,6 +453,11 @@ static void gfxhub_v3_0_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v3_0_vmhub_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index 8a87410ce016..72f24372a4e8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -441,6 +441,11 @@ static void gfxhub_v3_0_3_set_fault_enable_default(struct amdgpu_device *adev,
 				CRASH_ON_RETRY_FAULT, 1);
 	}
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
+
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+		ENABLE_RETRY_FAULT_INTERRUPT, value && !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static const struct amdgpu_vmhub_funcs gfxhub_v3_0_3_vmhub_funcs = {
-- 
2.54.0



* [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
  2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
  2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
the CRASH_ON_*_FAULT bits should be programmed to 0.

Always program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT,
so that the bits are cleared when we don't want to crash.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
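Note for reviewers: a standalone sketch contrasting the old "only ever
set" pattern with the new "always program" pattern. The bit positions are
hypothetical and the helper names crash_bits_old and crash_bits_new are
made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define CRASH_ON_NO_RETRY_FAULT (1u << 0)
#define CRASH_ON_RETRY_FAULT    (1u << 1)
#define CRASH_BITS (CRASH_ON_NO_RETRY_FAULT | CRASH_ON_RETRY_FAULT)

/* Old pattern: the bits are set when !value but never cleared. */
static inline uint32_t crash_bits_old(uint32_t reg, bool value)
{
	if (!value)
		reg |= CRASH_BITS;
	return reg;
}

/* New pattern: the bits always end up equal to !value. */
static inline uint32_t crash_bits_new(uint32_t reg, bool value)
{
	reg &= ~CRASH_BITS;
	if (!value)
		reg |= CRASH_BITS;
	return reg;
}
```

If the register already has the bits set, for example from an earlier
call with value == false, the old pattern leaves them set when called
with value == true, while the new pattern clears them.
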
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 14 ++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 10 ++++------
 9 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index 8fdf66ad265c..3c6c20e529a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -449,12 +449,10 @@ static void gfxhub_v11_5_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index 84344c67013a..9234a66a439a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -454,12 +454,10 @@ static void gfxhub_v12_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
index e505aaf8b447..82ee96b5ef4b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
@@ -633,19 +633,17 @@ static void gfxhub_v12_1_xcc_set_fault_enable_default(struct amdgpu_device *adev
 		tmp = REG_SET_FIELD(tmp,
 				    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
 				    OTHER_CLIENT_ID_NO_RETRY_FAULT_INTERRUPT, value);
-		if (!value)
-			tmp = REG_SET_FIELD(tmp,
-					    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
-					    CRASH_ON_NO_RETRY_FAULT, 1);
+		tmp = REG_SET_FIELD(tmp,
+				    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
+				    CRASH_ON_NO_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL_LO32, tmp);
 
 		tmp = RREG32_SOC15(GC, GET_INST(GC, i),
 				   regGCVM_L2_PROTECTION_FAULT_CNTL_HI32);
-		if (!value)
-			tmp = REG_SET_FIELD(tmp,
-					    GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
-					    CRASH_ON_RETRY_FAULT, 1);
+		tmp = REG_SET_FIELD(tmp,
+				    GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
+				    CRASH_ON_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL_HI32, tmp);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index c8a615147904..2b20b86236be 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -403,12 +403,10 @@ static void gfxhub_v1_0_set_fault_enable_default(struct amdgpu_device *adev,
 			WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
 			EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+			CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+			CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index afc8c6a6f1bb..182cf3994512 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -516,12 +516,10 @@ static void gfxhub_v1_2_xcc_set_fault_enable_default(struct amdgpu_device *adev,
 				WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
 				EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-		if (!value) {
-			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-					CRASH_ON_NO_RETRY_FAULT, 1);
-			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-					CRASH_ON_RETRY_FAULT, 1);
-		}
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+				CRASH_ON_NO_RETRY_FAULT, !value);
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+				CRASH_ON_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 		tmp = RREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index a27bb37b2a11..35ef43137f1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -418,12 +418,10 @@ static void gfxhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index db56f7a61d61..c6b610c48540 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -449,12 +449,10 @@ static void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index 97585c7b879c..0d8b8980898e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -446,12 +446,10 @@ static void gfxhub_v3_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index 72f24372a4e8..766dc0ce738b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -434,12 +434,10 @@ static void gfxhub_v3_0_3_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 
 	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
-- 
2.54.0



* [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
                   ` (2 preceding siblings ...)
  2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

Different interrupts may have different timestamp sources,
which shouldn't be compared with each other.

If we compare the timestamps of retry faults to the timestamps
of other interrupts, all retry fault interrupts may end up
being filtered out because of the different timestamp source.

This issue was observed on Strix Halo.
Solve it by storing the timestamp of the last page fault interrupt
and comparing only against that.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..52258f1341c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
 	uint32_t hash;
 
 	/* Stale retry fault if timestamp goes backward */
-	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
+	if (timestamp == adev->gmc.processed_fault_timestamp ||
+		amdgpu_ih_ts_after(timestamp, adev->gmc.processed_fault_timestamp))
 		return true;
 
+	adev->gmc.processed_fault_timestamp = MAX(timestamp, adev->gmc.processed_fault_timestamp);
+
 	/* If we don't have space left in the ring buffer return immediately */
 	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
 		AMDGPU_GMC_FAULT_TIMEOUT;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 676e3aaa1f27..77eb15380284 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -361,6 +361,7 @@ struct amdgpu_gmc {
 
 	u64 noretry_flags;
 	u64 init_pte_flags;
+	u64 processed_fault_timestamp;
 
 	bool flush_tlb_needs_extra_type_0;
 	bool flush_tlb_needs_extra_type_2;
-- 
2.54.0
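
Editorial sketch, not part of the patch: the filtering rule from the change above can be tried out in isolation. This is a minimal user-space model under stated assumptions. The 48-bit wraparound comparison is modeled on what the upstream amdgpu_ih_ts_after() helper is understood to do, and the names ts_after, fault_filter and fault_is_stale are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: IH timestamps are 48 bits wide. Shifting both values
 * left by 16 moves them to the top of a 64-bit word, so a signed
 * difference stays correct across a 48-bit wraparound.
 * Returns true when t2 is newer than t1.
 */
static bool ts_after(uint64_t t1, uint64_t t2)
{
	return (int64_t)((t2 << 16) - (t1 << 16)) > 0;
}

/* Hypothetical stand-in for the new processed_fault_timestamp field. */
struct fault_filter {
	uint64_t last_fault_ts;
};

/*
 * Returns true when the fault is stale and should be dropped.
 * Only page fault timestamps are ever stored here, so timestamps of
 * unrelated interrupts cannot influence the decision.
 */
static bool fault_is_stale(struct fault_filter *f, uint64_t ts)
{
	assert((ts >> 48) == 0); /* the model only accepts 48-bit values */

	if (ts == f->last_fault_ts || ts_after(ts, f->last_fault_ts))
		return true;

	/* ts is strictly newer at this point, so it becomes the new mark. */
	f->last_fault_ts = ts;
	return false;
}
```

In this model a fault with the same or an older timestamp than the last accepted fault is dropped, and a newer one advances the stored timestamp.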



* [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
                   ` (3 preceding siblings ...)
  2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
  2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

Instead of writing the doorbell in amdgpu_gmc_handle_retry_fault()
directly, add an IH function pointer which can be defined in
a different way for different IH versions.

This is to allow implementing the filter CAM without a doorbell.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 1 +
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c    | 6 ++++++
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 8 +++++++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 52258f1341c2..d790b7619ccd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -565,7 +565,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 
 		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
 					     addr, entry->timestamp, write_fault);
-		WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
 		if (ret)
 			return 1;
 	} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
index 444437c30088..e6e34f6e86f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -97,6 +97,7 @@ struct amdgpu_ih_funcs {
 	const char *(*node_id_to_die_name)(struct amdgpu_device *adev,
 					   unsigned int node_id,
 					   char *buf, size_t size);
+	void (*retry_cam_ack)(struct amdgpu_device *adev, u32 cam_index);
 };
 
 #define amdgpu_ih_get_wptr(adev, ih) (adev)->irq.ih_funcs->get_wptr((adev), (ih))
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
index 6de9e87e04e1..c2431f4c2671 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
@@ -289,6 +289,11 @@ static uint32_t ih_v7_0_setup_retry_doorbell(u32 doorbell_index)
 	return val;
 }
 
+static void ih_v7_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+}
+
 #define regIH_RING1_CLIENT_CFG_INDEX_V7_1             0x122
 #define regIH_RING1_CLIENT_CFG_INDEX_V7_1_BASE_IDX    0
 #define regIH_RING1_CLIENT_CFG_DATA_V7_1              0x123
@@ -858,6 +863,7 @@ static const struct amdgpu_ih_funcs ih_v7_0_funcs = {
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
 	.set_rptr = ih_v7_0_set_rptr,
 	.node_id_to_die_name = ih_v7_0_node_id_to_die_name,
+	.retry_cam_ack = ih_v7_0_retry_cam_ack,
 };
 
 static void ih_v7_0_set_interrupt_funcs(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 85846fd08ce4..30a82fff3ff7 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -293,6 +293,11 @@ static uint32_t vega20_setup_retry_doorbell(u32 doorbell_index)
 	return val;
 }
 
+static void vega20_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+}
+
 /**
  * vega20_ih_irq_init - init and enable the interrupt ring
  *
@@ -738,7 +743,8 @@ static const struct amdgpu_ih_funcs vega20_ih_funcs = {
 	.get_wptr = vega20_ih_get_wptr,
 	.decode_iv = amdgpu_ih_decode_iv_helper,
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
-	.set_rptr = vega20_ih_set_rptr
+	.set_rptr = vega20_ih_set_rptr,
+	.retry_cam_ack = vega20_retry_cam_ack,
 };
 
 static void vega20_ih_set_interrupt_funcs(struct amdgpu_device *adev)
-- 
2.54.0
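
Editorial sketch, not part of the patch: the indirection added above is a plain ops-table dispatch. The following self-contained model shows the idea with two interchangeable acknowledge routines. All names (struct dev, ack_via_doorbell, ack_via_mmio, handle_retry_fault) are hypothetical, and the two integer fields only stand in for a doorbell slot and an MMIO register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dev;

/* Per-IH-version operations; only the CAM acknowledge is modeled. */
struct ih_funcs {
	void (*retry_cam_ack)(struct dev *d, uint32_t cam_index);
};

struct dev {
	const struct ih_funcs *ih_funcs;
	uint32_t doorbell; /* stand-in for a doorbell write */
	uint32_t mmio_ack; /* stand-in for an MMIO register write */
};

static void ack_via_doorbell(struct dev *d, uint32_t cam_index)
{
	d->doorbell = cam_index;
}

static void ack_via_mmio(struct dev *d, uint32_t cam_index)
{
	d->mmio_ack = cam_index;
}

static const struct ih_funcs doorbell_ih_funcs = {
	.retry_cam_ack = ack_via_doorbell,
};

static const struct ih_funcs mmio_ih_funcs = {
	.retry_cam_ack = ack_via_mmio,
};

/* The common fault path no longer needs to know how the ack is delivered. */
static void handle_retry_fault(struct dev *d, uint32_t cam_index)
{
	/* The check is part of this sketch only. */
	assert(d->ih_funcs != NULL && d->ih_funcs->retry_cam_ack != NULL);
	d->ih_funcs->retry_cam_ack(d, cam_index);
}
```

A device set up with doorbell_ih_funcs ends up with the index in its doorbell field, while one set up with mmio_ih_funcs ends up with it in mmio_ack. The caller is identical in both cases, which is the point of the function pointer.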



* [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
                   ` (4 preceding siblings ...)
  2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
@ 2026-05-13 16:30 ` Timur Kristóf
  2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
  6 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:30 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen
  Cc: Timur Kristóf

The retry CAM can filter interrupts which occur repeatedly,
such as page fault interrupts when retry faults are enabled.
This makes processing those interrupts much more efficient,
because the CPU won't have to deal with processing the same
interrupt repeatedly.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/ih_v6_0.c   | 18 +++++++++++++++++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 16388e3caea3..2a226b4c9e09 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -108,13 +108,16 @@ static int gmc_v11_0_process_interrupt(struct amdgpu_device *adev,
 	bool write_fault = !!(entry->src_data[1] &
 			      AMDGPU_GMC9_FAULT_SOURCE_DATA_WRITE);
 	uint32_t status = 0;
+	uint32_t cam_index;
 	u64 addr;
 
 	addr = (u64)entry->src_data[0] << 12;
 	addr |= ((u64)entry->src_data[1] & 0xf) << 44;
 
 	if (retry_fault) {
-		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
+		cam_index = entry->src_data[2] & 0x3ff;
+
+		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, cam_index, 0,
 							write_fault);
 		/* Returning 1 here also prevents sending the IV to the KFD */
 		if (ret == 1)
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
index 333e9c30c091..0a87c3126d1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
@@ -307,6 +307,11 @@ static int ih_v6_0_enable_ring(struct amdgpu_device *adev,
 	return 0;
 }
 
+static void ih_v6_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index);
+}
+
 /**
  * ih_v6_0_irq_init - init and enable the interrupt ring
  *
@@ -392,6 +397,16 @@ static int ih_v6_0_irq_init(struct amdgpu_device *adev)
 
 	pci_set_master(adev->pdev);
 
+	if (!(adev->flags & AMD_IS_APU)) {
+		/* Enable IH Retry CAM */
+		tmp = RREG32_SOC15(OSSSYS, 0, regIH_RETRY_INT_CAM_CNTL);
+		tmp = REG_SET_FIELD(tmp, IH_RETRY_INT_CAM_CNTL, ENABLE, 1);
+		tmp = REG_SET_FIELD(tmp, IH_RETRY_INT_CAM_CNTL, CAM_SIZE, 0xF);
+		WREG32_SOC15(OSSSYS, 0, regIH_RETRY_INT_CAM_CNTL, tmp);
+
+		adev->irq.retry_cam_enabled = true;
+	}
+
 	/* enable interrupts */
 	ret = ih_v6_0_toggle_interrupts(adev, true);
 	if (ret)
@@ -800,7 +815,8 @@ static const struct amdgpu_ih_funcs ih_v6_0_funcs = {
 	.get_wptr = ih_v6_0_get_wptr,
 	.decode_iv = amdgpu_ih_decode_iv_helper,
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
-	.set_rptr = ih_v6_0_set_rptr
+	.set_rptr = ih_v6_0_set_rptr,
+	.retry_cam_ack = ih_v6_0_retry_cam_ack,
 };
 
 static void ih_v6_0_set_interrupt_funcs(struct amdgpu_device *adev)
-- 
2.54.0
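
Editorial sketch, not part of the patch: the field extraction used in gmc_v11_0_process_interrupt() above can be checked with a small stand-alone decoder. The shifts and masks are taken from the diff. The struct and function names are hypothetical, and the IV entry is reduced to the one array that matters here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the driver's IV entry. */
struct iv_entry {
	uint32_t src_data[4];
};

struct decoded_fault {
	uint64_t addr;      /* faulting GPU virtual address */
	uint32_t cam_index; /* retry CAM slot to acknowledge */
};

static struct decoded_fault decode_fault(const struct iv_entry *e)
{
	struct decoded_fault f;

	assert(e != NULL);

	/* src_data[0] carries the page frame number: address bits 12..43. */
	f.addr = (uint64_t)e->src_data[0] << 12;
	/* The low 4 bits of src_data[1] extend the address to bits 44..47. */
	f.addr |= ((uint64_t)e->src_data[1] & 0xf) << 44;
	/* The low 10 bits of src_data[2] select the CAM entry (0..1023). */
	f.cam_index = e->src_data[2] & 0x3ff;

	return f;
}
```

The 10-bit mask means that any higher bits in src_data[2] are ignored when the CAM index is formed.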



* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
                   ` (5 preceding siblings ...)
  2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
@ 2026-05-13 16:36 ` Alex Deucher
  2026-05-13 16:43   ` Timur Kristóf
  6 siblings, 1 reply; 18+ messages in thread
From: Alex Deucher @ 2026-05-13 16:36 UTC (permalink / raw)
  To: Timur Kristóf, Amir Shetaia
  Cc: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen

+ Amir

Amir may have some insights on navi4x as he was looking at this recently.

Alex



* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
@ 2026-05-13 16:43   ` Timur Kristóf
  2026-05-13 17:28     ` Shetaia, Amir
  0 siblings, 1 reply; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 16:43 UTC (permalink / raw)
  To: Amir Shetaia, Alex Deucher
  Cc: amd-gfx, Alex Deucher, christian.koenig, Marek Olšák,
	Natalie Vock, Melissa Wen

On Wednesday, May 13, 2026 6:36:02 PM Central European Summer Time Alex 
Deucher wrote:
> + Amir
> 
> Amir may have some insights on navi4x as he was looking at this recently.
> 
> Alex

Hi Alex, Amir,

I think we are very close to enabling retry faults by default on Navi 3.
I'd be happy to receive feedback on the above series.

With regards to Navi 4:

I also attempted to get it working on Navi 48, and I managed to get retry 
faults enabled, but it seems that amdgpu_vm_handle_fault() can't actually 
resolve the page fault on Navi 48. It just keeps retrying until it times out. 
Christian suggested this may be due to an invalid page being stuck in the 
cache. I tried adding a TLB flush but unfortunately that just made it worse (it 
hangs irrecoverably).

Any insight is appreciated!

Thanks & best regards,
Timur



* RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 16:43   ` Timur Kristóf
@ 2026-05-13 17:28     ` Shetaia, Amir
  2026-05-13 17:32       ` Deucher, Alexander
  2026-05-13 17:51       ` Timur Kristóf
  0 siblings, 2 replies; 18+ messages in thread
From: Shetaia, Amir @ 2026-05-13 17:28 UTC (permalink / raw)
  To: Timur Kristóf, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Deucher,  Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

Hi Timur, Alex,

Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for the past few weeks, and what you're describing on NV48 lines up closely with what we've seen.

Quick highlights from my work:

1. IH retry CAM ACK doesn't actually free the slot when written via WDOORBELL on NV4; we have to use MMIO
(WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
You may want to check that, since "fault never resolves" is exactly the symptom you'd see if the CAM never gets cleared.

2. gfx12 needs its own retry-fault detection path: amdgpu_gmc_handle_retry_fault with the gfx9-era constants
(AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on gfx12. We added a gfx12-native handler that
reads from src_data[2] for NV4.

3. TLB flush making it worse is a known trap; we see the same on NV4. The flush adds more pressure on the same UTC L2
that is already saturated by the retry storm, so the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy on the
user VMID with SDMA parked on a GCR ack.

4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random hangs that we've isolated to an SDMA ->
GCR -> GC-cache deadlock when the BO-clear runs in ih_soft_work context.

Could you reply with your series? I tried searching the inbox but couldn't find it. Once I have it, I can diff against ours to see what overlaps and what's net-new on each side.

AMIR SHETAIA
Senior Software Development Engineer  |  AMD
Software Platform Architecture Team



* RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 17:28     ` Shetaia, Amir
@ 2026-05-13 17:32       ` Deucher, Alexander
  2026-05-13 17:51       ` Timur Kristóf
  1 sibling, 0 replies; 18+ messages in thread
From: Deucher, Alexander @ 2026-05-13 17:32 UTC (permalink / raw)
  To: Shetaia, Amir, Timur Kristóf, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Koenig,  Christian,
	Marek Olšák, Natalie Vock, Melissa Wen

> -----Original Message-----
> From: Shetaia, Amir <Amir.Shetaia@amd.com>
> Sent: Wednesday, May 13, 2026 1:29 PM
> To: Timur Kristóf <timur.kristof@gmail.com>; Alex Deucher
> <alexdeucher@gmail.com>
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie
> Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> Subject: RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> AMD General
>
> Hi Timur, Alex,
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen
>
> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).
> I think you may want to check that, since "fault never resolves" is exactly the
> symptom you'd see if the CAM never gets cleared.
>
> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never
> matches on gfx12. We added a gfx12-native handler that reads from
> src_data[2] for NV4.
>
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2  already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck
> busy on the user VMID with SDMA parked on a GCR ack.
>
> 4. Up to ~512 MiB our patches resolve faults cleanly; at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when the
> BO-clear runs in ih_soft_work context.
>
> Could you reply with your series? I tried searching the inbox but couldn't find
> it. Once I have it, I can diff against ours to see what overlaps and what's net-
> new on each side.
>

Here's the patch series:
https://patchwork.freedesktop.org/series/166522/

Alex



* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 17:28     ` Shetaia, Amir
  2026-05-13 17:32       ` Deucher, Alexander
@ 2026-05-13 17:51       ` Timur Kristóf
  2026-05-13 20:32         ` Shetaia, Amir
  1 sibling, 1 reply; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 17:51 UTC (permalink / raw)
  To: Alex Deucher, Shetaia, Amir
  Cc: amd-gfx@lists.freedesktop.org, Deucher, Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

Hi Amir,

Thanks for the quick response!
See my replies below.

On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia, 
Amir wrote:
> 
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK for
> the past few weeks and what you're describing on NV48 lines up closely with
> what we've seen

> Quick highlights from my work:
> 
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO
> (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).

I agree. That's my conclusion as well and that's exactly what I'm doing in my 
series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"

> "fault never resolves" is exactly the symptom you'd see if the
> CAM never gets cleared. 

Not exactly.

When the CAM never gets cleared, the first page fault is still resolved, but 
subsequent page faults (that belong to the same CAM entry) will cause a hang 
because the IRQ handler is not called (because the IRQ is filtered out).

That's not what I see on Navi 48. Instead what I see is that the IRQ is fired 
repeatedly and amdgpu_vm_handle_fault() is called repeatedly, but just doesn't 
resolve the fault.
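
To make the difference between the two symptoms concrete, here is a minimal
user-space sketch of a filter CAM as I understand it from the description
above. It is a toy model and not driver code: struct toy_cam,
toy_cam_filter() and toy_cam_ack() are hypothetical names, and the slot
count is arbitrary. A fault is delivered only when it allocates a new slot;
a repeat for a pending address is filtered until the slot is ACKed, and a
full CAM drops new faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CAM_SLOTS 16 /* arbitrary size, chosen only for this model */

/* Toy filter CAM: one pending fault address per used slot. */
struct toy_cam {
	uint64_t addr[TOY_CAM_SLOTS];
	bool     used[TOY_CAM_SLOTS];
};

/*
 * Returns the newly allocated slot index if the fault would be delivered
 * to the IRQ handler, or -1 if it is filtered (address already pending)
 * or dropped (no free slot left).
 */
int toy_cam_filter(struct toy_cam *cam, uint64_t fault_addr)
{
	int free_slot = -1;

	assert(cam != NULL);
	for (int i = 0; i < TOY_CAM_SLOTS; i++) {
		if (cam->used[i] && cam->addr[i] == fault_addr)
			return -1; /* repeat of a pending fault: filtered */
		if (!cam->used[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1; /* CAM full: the event is dropped */

	cam->used[free_slot] = true;
	cam->addr[free_slot] = fault_addr;
	return free_slot;
}

/* ACK frees the slot so that a later fault on that address is delivered. */
void toy_cam_ack(struct toy_cam *cam, int slot)
{
	assert(cam != NULL);
	if (slot >= 0 && slot < TOY_CAM_SLOTS)
		cam->used[slot] = false;
}
```

In this model a missing ACK means the second fault on the same address is
never delivered, which is the hang described above and not the endless
stream of interrupts I see on Navi 48.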

> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

Interesting. Could you share what bits you checked on src_data[2]?

The gfx9-era constants worked for me on both Navi 31 and 48 for detecting 
retry faults; however I needed to program some extra register fields in the 
gfxhub code to actually enable retry fault interrupts.

> 
> 3. TLB flush making it worse is a known trap .. on NV4 we see the same. The
> flush adds more pressure on the same UTC L2
> already saturated by the retry
> storm; the GCR can't drain. We have UMR captures showing GCVM_L2 stuck busy
> on the user VMID with SDMA parked on a GCR ack.

I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?

> 4. Up to ~512 MiB our patches resolve faults cleanly;

That's pretty impressive! Nice work!

> at 1 GiB we see random
> hangs that we've isolated to an SDMA ->
> GCR -> GC-cache deadlock when the
> BO-clear runs in ih_soft_work context. 

Actually something I forgot to ask: on Navi 4x is it possible to use the IH1 
ring? On my machine it seemed that the retry fault interrupts always come in 
on the IH0 ring even though the IH1 is enabled and configured upstream already.

> Could you reply with your series? I tried searching the inbox but couldn't
> find it. Once I have it, I can diff against ours to see what overlaps and
> what's net-new on each side.

You can view it on patchwork or the mailing list archives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500

Or, if that's more convenient for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults

Thanks & best regards,
Timur

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 17:51       ` Timur Kristóf
@ 2026-05-13 20:32         ` Shetaia, Amir
  2026-05-13 22:12           ` Timur Kristóf
  0 siblings, 1 reply; 18+ messages in thread
From: Shetaia, Amir @ 2026-05-13 20:32 UTC (permalink / raw)
  To: Timur Kristóf, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Deucher,  Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

AMD General

Hi Timur and Alex,

Thanks for sending the series.

Timur, you are right, I see your patch 6 already does the MMIO ACK for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's ih_v7_0 implementation, which still does WDOORBELL. That's where I'd suggest swapping in MMIO for NV4.

Some answers to your questions:

1. "Fault never resolves on NV48" different shape from our broken-CAM-ACK symptom.

You're right, those are different. Our cam-walk-monotonically symptom only shows up when CAM is enabled but the ACK is broken.
On your NV48 setup CAM probably isn't enabled at all (your patch 6 only enables it for ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init), so retries fire repeatedly on the IH ring instead of being deduped by CAM.
That matches what you're seeing .. amdgpu_vm_handle_fault keeps being called but each call is on a fresh IRQ for the same address.

Two things that could be happening underneath:
- The fault handler runs but the updated PTE never reaches UTC L0 (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to translate" in our UMR captures.
- Or amdgpu_vm_handle_fault is bailing early without actually fixing the mapping

Quickest discriminator: enable the CAM in ih_v7_0_irq_init (set IH_RETRY_INT_CAM_CNTL.ENABLE=1, CAM_SIZE=0xF, adev->irq.retry_cam_enabled=true), use MMIO ACK from gmc_v12_0, and see if the symptom changes from "infinite retries"
to "first batch of pages map, then it hangs after a few hundred."

2. What bits we check on src_data[2]:

Honestly, we don't use src_data[2] for retry detection. We use it only for the cam_index: cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */

For retry detection we initially used the gfx9 constant on src_data[1] like you, but observed the bit cleared on a lot of NV4 events that should have been retries (waves were hung in xnack-stall but no IH event matched).
So we just go through the retry path unconditionally on NV4 and let amdgpu_vm_handle_fault sort it out via SVM range migration. May be specific to gfx1201 / our test path
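
A minimal sketch of the decoding described in point 2, for illustration only. The 0x3ff mask and the use of src_data[1] / src_data[2] come from the text above; struct toy_iv_entry, the helper names and the TOY_RETRY_BIT value are hypothetical placeholders (the real constant is AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY in the driver headers, and its value is not reproduced here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CAM_INDEX_MASK 0x3ffu /* low 10 bits = CAM slot, as stated above */
#define TOY_RETRY_BIT      0x80u  /* placeholder value, NOT taken from the driver */

/* Hypothetical stand-in for the IV entry handed to the interrupt handler. */
struct toy_iv_entry {
	uint32_t src_data[4];
};

/* Extract the CAM slot that must later be ACKed for this fault. */
uint32_t toy_cam_index(const struct toy_iv_entry *entry)
{
	assert(entry != NULL);
	return entry->src_data[2] & TOY_CAM_INDEX_MASK;
}

/*
 * Decide whether to take the retry path. With assume_retry set, the bit in
 * src_data[1] is ignored, which mirrors the "unconditional" workaround.
 */
bool toy_is_retry_fault(const struct toy_iv_entry *entry, bool assume_retry)
{
	assert(entry != NULL);
	return assume_retry || (entry->src_data[1] & TOY_RETRY_BIT) != 0;
}
```

The assume_retry flag is only there to show the trade-off: with it set, every fault goes down the retry path regardless of what src_data[1] reports.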

3. TLB flush making it worse .. clue about what to do:

Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an architectural deadlock ... ih_soft_work blocks on a dma_fence for an SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
and the GC cache block isn't ACK'ing the GCR while UTC L2 is saturated by the user shader's XNACK retry storm. Adding a TLB flush adds another translation request to the same saturated UTC, which is why it makes things worse.

4. IH1 ring on NV4:

Same as you ... retry faults on NV4 always come in on IH0. We delegate from IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the SVM/migration path can sleep, but the original entry is on IH0. We haven't tried IH1 routing.

Re your branch: thanks for the gitlab link, easier than digging through patchwork.
I'll cherry-pick patches 1, 3, 4 into our test build to see if patch 4 cleans up the timestamp filter delta we're seeing (97k entered / 2.8k completed at 1 GiB might be partly explained by your Strix Halo bug).

AMIR SHETAIA

-----Original Message-----
From: Timur Kristóf <timur.kristof@gmail.com>
Sent: Wednesday, May 13, 2026 1:52 PM
To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir <Amir.Shetaia@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

Hi Amir,

Thanks for the quick response!
See my replies below.

On Wednesday, May 13, 2026 7:28:41 PM Central European Summer Time Shetaia,
>
> Thanks for looping me in. Yes, we've been deep in NV4 (gfx1201) XNACK
> for the past few weeks and what you're describing on NV48 lines up
> closely with what we've seen

> Quick highlights from my work:
>
> 1. IH retry CAM ACK doesn't actually free the slot when written via
> WDOORBELL on NV4 .. we have to use MMIO (WREG32_SOC15(OSSSYS, 0,
> regIH_RETRY_CAM_ACK, cam_index & 0x3ff)).

I agree. That's my conclusion as well and that's exactly what I'm doing in my series for Navi 31, see the following patch:
"drm/amdgpu: Enable retry CAM on Navi 3 dGPUs"

> "fault never resolves" is exactly the symptom you'd see if the CAM
> never gets cleared.

Not exactly.

When the CAM never gets cleared, the first page fault is still resolved, but subsequent page faults (that belong to the same CAM entry) will cause a hang because the IRQ handler is not called (because the IRQ is filtered out).

That's not what I see on Navi 48. Instead what I see is that the IRQ is fired repeatedly and amdgpu_vm_handle_fault() is called repeatedly, but just doesn't resolve the fault.

> 2. gfx12 needs its own retry-fault detection path ..
> amdgpu_gmc_handle_retry_fault on gfx9-era constants
> (AMDGPU_GMC9_FAULT_SOURCE_DATA_RETRY on src_data[1]) never matches on
> gfx12. We added a gfx12-native handler that reads from src_data[2] for NV4.

Interesting. Could you share what bits you checked on src_data[2]?

The gfx9-era constants worked for me on both Navi 31 and 48 for detecting retry faults; however I needed to program some extra register fields in the gfxhub code to actually enable retry fault interrupts.

>
> 3. TLB flush making it worse is a known trap .. on NV4 we see the
> same. The flush adds more pressure on the same UTC L2 already
> saturated by the retry storm; the GCR can't drain. We have UMR
> captures showing GCVM_L2 stuck busy on the user VMID with SDMA parked
> on a GCR ack.

I am pretty sure this is what I saw.
Do you have any clue about what can be done about this?

> 4. Up to ~512 MiB our patches resolve faults cleanly;

That's pretty impressive! Nice work!

> at 1 GiB we see random
> hangs that we've isolated to an SDMA -> GCR -> GC-cache deadlock when
> the BO-clear runs in ih_soft_work context.

Actually something I forgot to ask: on Navi 4x is it possible to use the IH1 ring? On my machine it seemed that the retry fault interrupts always come in on the IH0 ring even though the IH1 is enabled and configured upstream already.

> Could you reply with your series? I tried searching the inbox but
> couldn't find it. Once I have it, I can diff against ours to see what
> overlaps and what's net-new on each side.

You can view it on patchwork or the mailing list arcives:
https://patchwork.freedesktop.org/series/166522/
https://lists.freedesktop.org/archives/amd-gfx/2026-May/thread.html#144500

Or if that's more comfortable for you, here is my GitLab branch:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults

Thanks & best regards,
Timur

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 20:32         ` Shetaia, Amir
@ 2026-05-13 22:12           ` Timur Kristóf
  2026-05-14 15:04             ` Shetaia, Amir
  0 siblings, 1 reply; 18+ messages in thread
From: Timur Kristóf @ 2026-05-13 22:12 UTC (permalink / raw)
  To: Alex Deucher, Shetaia, Amir
  Cc: amd-gfx@lists.freedesktop.org, Deucher, Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

Hi Amir,

> Timur, you are right, I see your patch 6 already does the MMIO ACK for
> gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's ih_v7_0
> implementation, which still does WDOORBELL. that's where I'd suggest
> swapping in MMIO for NV4.

First, let me clear up a slight misunderstanding here. The patch series that I 
sent here only contains what I managed to get working on Navi 31, it does not 
contain any code from my attempt at Navi 48.

Patch 5 is just a slight refactor of the pre-existing code and 
(intentionally) doesn't contain any functional changes. It seems that IH 7.1 
relies on the doorbell, so I didn't want to remove it, although I have no means 
to verify whether that actually works.

If you like, I can push a few WIP patches to a different branch tomorrow to 
show you exactly what I did on Navi 48. However, I wouldn't want to submit 
that to the mailing list without first making sure that it works well.

> 1. "Fault never resolves on NV48" different shape from our broken-CAM-ACK
> symptom.

> You're right, those are different. Our cam-walk-monotonically symptom only
> shows up when CAM is enabled but the ACK is broken.
> On your NV48 setup CAM
> probably isn't enabled at all (your patch 6 only enables it for
> ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)

For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way I do for 
ih_v6_0 in the series. But, because I couldn't get it fully working, I didn't 
include any of that code in the series.

> so retries fire
> repeatedly on the IH ring instead of being deduped by CAM. That matches
> what you're seeing .. amdgpu_vm_handle_fault keeps being called but each
> call is on a fresh IRQ for the same address. 
> Two things that could be happening underneath:
> - The fault handler runs but the updated PTE never reaches UTC L0 (TLB
> invalidation gap). On NV4 we see this as "valid PTEs failing to translate"
> in our UMR captures.

I think this explanation may fit what I saw.
Why is it not reaching UTC L0?
Also, how do you inspect this stuff in umr?

> 2. What bits we check on src_data[2]:
> 
> Honestly, we don't use src_data[2] for retry detection. We use it only for
> the cam_index: cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits =
> CAM slot */

> For retry detection we initially used the gfx9 constant on src_data[1] like
> you, but observed the bit cleared on a lot of NV4 events that should have
> been retries (waves were hung in xnack-stall but no IH event matched).
> So
> we just go through the retry path unconditionally on NV4

I see. I think I needed to change some gfxhub registers to get those src_data 
bits that you are missing.

> and let
> amdgpu_vm_handle_fault sort it out via SVM range migration. May be specific
> to gfx1201 / our test path 

My test case is a simple Vulkan shader which I am executing with vkrunner. 
Each shader invocation does an out of bounds read from a different page. For 
Navi 31 (and Strix Halo), I started out with just 1 page fault, and once I 
could mitigate that reliably, I turned it up to several hundred faults.

> 3. TLB flush making it worse .. clue about what to do:
> 
> Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> architectural deadlock ... ih_soft_work blocks on a dma_fence for an SDMA
> BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
> and the
> GC cache block isn't ACK'ing the GCR while UTC L2 is saturated by the user
> shader's XNACK retry storm. Adding a TLB flush adds another translation
> request to the same saturated UTC, which is why it makes things worse.

This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
what the function does is first call amdgpu_vm_handle_fault() which kicks off an 
SDMA job to update page tables, and then it ACKs the filter CAM. However, at 
the moment when the CAM is ACKed, the SDMA job is not finished yet, so the CP 
sees that the page is still invalid and the page fault interrupt is fired 
again. I actually noticed that on Navi 31 too, but it's not fatal there; 
it basically just handles the same page fault twice. Once we solve this flaw, I 
would like to propose enabling retry faults by default on Navi 3.

Here is an idea for a solution:

Instead of ACKing the CAM right away, we should do it after the SDMA fence is 
signalled, i.e. when we are sure the page tables are updated. Maybe we can set 
a callback on the fence and do it there, though it would require some code 
churn to get that to work.
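
Roughly, the ordering I have in mind looks like the user-space sketch
below. This is only a model of the idea, not the actual patch: toy_fence,
toy_fault and all the functions are hypothetical stand-ins. In the kernel
the fence would be a struct dma_fence and the callback would presumably be
registered with dma_fence_add_callback().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence;
typedef void (*toy_fence_cb)(struct toy_fence *fence, void *data);

/* Stand-in for the fence that tracks the page table update job. */
struct toy_fence {
	bool         signaled;
	toy_fence_cb cb;
	void        *cb_data;
};

/* Stand-in for one pending retry fault and its CAM slot. */
struct toy_fault {
	int  cam_index;
	bool cam_acked;
};

/* Callback: runs only once the page table update has completed. */
void toy_ack_cam(struct toy_fence *fence, void *data)
{
	struct toy_fault *fault = data;

	(void)fence;
	fault->cam_acked = true; /* the real ACK would be a register write */
}

/* Run cb when the fence signals, or right away if it already has. */
void toy_fence_add_callback(struct toy_fence *fence, toy_fence_cb cb, void *data)
{
	assert(fence != NULL && cb != NULL);
	if (fence->signaled) {
		cb(fence, data);
		return;
	}
	fence->cb = cb;
	fence->cb_data = data;
}

/* Called when the update job finishes. */
void toy_fence_signal(struct toy_fence *fence)
{
	assert(fence != NULL);
	fence->signaled = true;
	if (fence->cb) {
		toy_fence_cb cb = fence->cb;

		fence->cb = NULL;
		cb(fence, fence->cb_data);
	}
}

/* Fault handler: defer the ACK instead of issuing it right after submission. */
void toy_handle_retry_fault(struct toy_fault *fault, struct toy_fence *update)
{
	assert(fault != NULL);
	toy_fence_add_callback(update, toy_ack_cam, fault);
}
```

The point is only the ordering: the CAM slot stays occupied (so repeats are
still filtered) until the update is known to be complete.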

What do you think?

> 4. IH1 ring on NV4:
> 
> Same as you ... retry faults on NV4 always come in on IH0. We delegate from
> IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> SVM/migration path can sleep, but the original entry is on IH0. We haven't
> tried IH1 routing.

Why, though? 

The ih_v7_0 code does set up the IH1 ring and configure it exactly the same as 
the ih_v6_0, so I don't see why it wouldn't work?

> Re your branch: thanks for the gitlab link, easier than digging through
> patchwork.
> I'll cherry-pick patches 1, 3, 4 into our test build to see if
> patch 4 cleans up the timestamp filter delta we're seeing (97k entered /
> 2.8k completed at 1 GiB might be partly explained by your Strix Halo bug). 

The timestamp issue actually gave me an endless headache when I first got into 
this topic in December. I hope the patch helps!

Best regards,
Timur

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-13 22:12           ` Timur Kristóf
@ 2026-05-14 15:04             ` Shetaia, Amir
  2026-05-14 17:02               ` Timur Kristóf
  0 siblings, 1 reply; 18+ messages in thread
From: Shetaia, Amir @ 2026-05-14 15:04 UTC (permalink / raw)
  To: Timur Kristóf, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Deucher,  Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

AMD General

Hi Timur,

Thanks for clarifying. Yes, please push the Navi 48 WIP branch when you have a chance; it would be useful to compare side-by-side.

Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that doorbell ACK does not free the CAM slot on the NV4 family
(gfx1200/1201). I verified empirically, slots walk monotonically until the CAM fills, then HW silently drops retry events.
MMIO write to regIH_RETRY_CAM_ACK does free it. Almost certainly required for your Navi 48 attempt.

Why PTE update doesn't reach UTC L0: Most likely a race between the SDMA PTE-update job completing and the CAM ACK firing the
IRQ for re-handling. Until SDMA finishes the write + invalidation propagates, UTC L0 keeps getting the stale "invalid" reply which is exactly your symptom.

To inspect with umr: clone https://gitlab.freedesktop.org/tomstdenis/umr and use scripts/diag/dump_all_cpc_info.sh (needs
halt_if_hws_hang=1 gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and
CONTEXT_DOMAIN_BUSY per-VMID bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy = translation timeout, not denial).
Page-table walk to confirm PTE landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.

Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that race because the BO-clear deadlock dominates on NV4, but
the ordering bug is almost certainly there too. Fence callback is the right shape, cleaner than blocking the worker. Would be a
generic fix benefiting both Navi 3 and 4. Happy to review if you take a swing.

IH1 routing: fair point. I just inherited the existing IH0 path and haven't tried IH1. On the list once CAM ACK and fence ordering are stable.

AMIR SHETAIA
Senior Software Development Engineer  |  AMD
Software Platform Architecture Team
1 Commerce Valley Drive, Markham, ON L3T 7X6

-----Original Message-----
From: Timur Kristóf <timur.kristof@gmail.com>
Sent: Wednesday, May 13, 2026 6:12 PM
To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir <Amir.Shetaia@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

Hi Amir,

> Timur, you are right, I see your patch 6 already does the MMIO ACK for
> gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's ih_v7_0
> implementation, which still does WDOORBELL. that's where I'd suggest
> swapping in MMIO for NV4.

First, let me clear up a slight misunderstanding here. The patch series that I sent here only contains what I managed to get working on Navi 31, it does not contain any code from my attempt at Navi 48.

The patch 5 is just a slight refactor of the pre-existing code and
(intentionally) doesn't contain any functional changes. It seems that IH 7.1 relies on the doorbell, so I didn't want to remove it, albeit I have no means to verify if that actually works or not.

If you like, I can push a few WIP patches to a different branch tomorrow to show you exactly what I did on Navi 48. However, I wouldn't want to submit that to the mailing list without first making sure that it works well.

> 1. "Fault never resolves on NV48" different shape from our
> broken-CAM-ACK symptom.

> You're right, those are different. Our cam-walk-monotonically symptom
> only shows up when CAM is enabled but the ACK is broken.
> On your NV48 setup CAM
> probably isn't enabled at all (your patch 6 only enables it for
> ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)

For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way I do for
ih_v6_0 in the series. But, because I couldn't get it fully working, I didn't include any of that code in the series.

> so retries fire
> repeatedly on the IH ring instead of being deduped by CAM. That
> matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> called but each call is on a fresh IRQ for the same address.
> Two things that could be happening underneath:
> - The fault handler runs but the updated PTE never reaches UTC L0 (TLB
> invalidation gap). On NV4 we see this as "valid PTEs failing to translate"
> in our UMR captures.

I think this explanation may fit what I saw.
Why is it not reaching UTC L0?
Also, how do you inspect this stuff in umr?

> 2. What bits we check on src_data[2]:
>
> Honestly, we don't use src_data[2] for retry detection. We use it only for
> the cam_index: cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits =
> CAM slot */

> For retry detection we initially used the gfx9 constant on src_data[1]
> like you, but observed the bit cleared on a lot of NV4 events that
> should have been retries (waves were hung in xnack-stall but no IH event matched).
> So
> we just go through the retry path unconditionally on NV4

I see. I think I needed to change some gfxhub registers to get those src_data bits that you are missing.

> and let
> amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> specific to gfx1201 / our test path

My test case is a simple Vulkan shader which I am executing with vkrunner.
Each shader invocation does an out of bounds read from a different page. For Navi 31 (and Strix Halo), I started out with just 1 page fault, and once I could mitigate that reliably, I turned it up to several hundred faults.

> 3. TLB flush making it worse .. clue about what to do:
>
> Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
> and the GC cache block isn't ACK'ing the GCR while UTC L2 is saturated
> by the user shader's XNACK retry storm. Adding a TLB flush adds
> another translation request to the same saturated UTC, which is why it
> makes things worse.

This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
what the function does is first call amdgpu_vm_handle_fault() which kicks off an SDMA job to update page tables, and then it ACKs the filter CAM. However, at the moment when the CAM is ACKed, the SDMA job is not finished yet, so the CP sees that the page is still invalid and the page fault interrupt is fired again. I actually noticed that on Navi 31 too, but it's just not fatal there.
It just basically handles the same page fault twice. Once we solve this flaw, I would like to propose to enable retry faults by default on Navi 3.

Here is an idea for a solution:

Instead of ACKing the CAM right away, we should do it after the SDMA fence is signalled, ie. when we are sure the page tables are updated. Maybe we can set a callback on the fence and do it there, though it would require a slight code churn to get that to work.

What do you think?

> 4. IH1 ring on NV4:
>
> Same as you ... retry faults on NV4 always come in on IH0. We delegate
> from
> IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> SVM/migration path can sleep, but the original entry is on IH0. We
> haven't tried IH1 routing.

Why, though?

The ih_v7_0 code does set up the IH1 ring and configure it exactly the same as the ih_v6_0, so I don't see why it wouldn't work?

> Re your branch: thanks for the gitlab link, easier than digging
> through patchwork.
> I'll cherry-pick patches 1, 3, 4 into our test build to see if patch 4
> cleans up the timestamp filter delta we're seeing (97k entered / 2.8k
> completed at 1 GiB might be partly explained by your Strix Halo bug).

The timestamp issue actually gave me an endless headache when I first got into this topic in December. I hope the patch helps!

Best regards,
Timur

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-14 15:04             ` Shetaia, Amir
@ 2026-05-14 17:02               ` Timur Kristóf
  2026-05-14 19:32                 ` Shetaia, Amir
  0 siblings, 1 reply; 18+ messages in thread
From: Timur Kristóf @ 2026-05-14 17:02 UTC (permalink / raw)
  To: Alex Deucher, Shetaia, Amir
  Cc: amd-gfx@lists.freedesktop.org, Deucher, Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

Hi Amir,

I pushed a work in progress branch here:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4

This contains the two series that I sent to the mailing list yesterday:
"Improve retry fault handling"
"Improve soft IH ring"

If we have to rely on the soft IH ring on Navi4, then we absolutely need the 
patches from "Improve soft IH ring", as it fixes a few bugs and switches the 
soft IH ring to use the system unbound workqueue. Otherwise it can happen that 
the soft IH ring fills up even before your CPU can process anything in it. 
(This is especially bad on HW that doesn't have the filter CAM enabled,
e.g. Strix Halo.)

On top of that, there is a patch to solve the race condition and make sure 
that the retry CAM is ACKed after the VM update is complete. On Navi 31, this 
works: the race is eliminated and each VM fault is only processed once. I'd 
like to submit these if you guys agree on the approach.

And finally, there is a very messy patch that contains what I tried to do to 
get it all working on Navi 48. This has a lot of comments with my notes during 
development, and some attempts I tried to fix the issue (without success).

Basically, what I did was:

- Enabled the retry CAM the same way as I did on IH 6.0, exactly as you suggested
- Added the cam_index to gmc_v12_0_process_interrupt
- Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry 
faults (without that, the faults don't have the retry bit in src_data[1])

Now, what I observe on Navi 48 is:

1. Retry fault interrupt is triggered
2. Dispatched on the soft IH ring
3. The interrupt handler is called again on the soft IH ring
4. The fault is mitigated by amdgpu_vm_handle_fault()
5. Finally, the retry CAM is ACKed
6. As soon as the CAM is ACKed, I get another interrupt for the same address, 
and it goes on infinitely or until GPU reset is triggered.

Attempts at fixing what I saw on Navi 48:

- Tried to add a TLB flush to various places. Now that the race is mitigated, 
the TLB flush doesn't make things worse anymore, but it's also not helping.
- Tried to set the FORCE_MISS bits on various cache related registers, hoping 
that it would work around the possibility of the PTE/PDE not being updated in 
the cache. This didn't help either.
- Tried to use amdgpu.vm_update_mode=3 to use the CPU to update the page 
tables, but that didn't help on Navi 48
- Tried to halt the CP using CP_ME_CNTL to see if it would stop sending 
interrupts, but it didn't.
- Using umr I tried to disable the L2 cache and reset it with the GRBM, but 
that just broke the whole system.

Other notes on the retry fault topic in general:

- On GPUs that don't have the retry CAM (e.g. Navi 1-2 and APUs) we'll need to 
consider disabling the VM fault interrupt while processing the retry, because 
the soft IH ring can fill up really quickly and the CPU may struggle to keep up 
with it.
- In general (for all GPUs), I think it would be better to update a larger VA 
range of subsequent pages rather than just a single page, in order to more 
efficiently mitigate VM faults to neighbouring pages.
- Currently all of the invalid accesses are redirected to the same dummy page, 
which is a security hole because it means that a process that had an out of 
bounds write can leak data to a process that does an out of bounds read. I 
already talked to Christian about this, he has a few ideas how to fix it.
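
To illustrate the second note (handling a larger VA range per fault), here 
is a small sketch of how a faulting address could be widened to an aligned 
window of neighbouring pages. It is only an illustration under my own 
assumptions: the 4 KiB page size, the struct and the function name are 
hypothetical, and the model ignores wrap-around at the very top of the 
address space.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumes 4 KiB pages */

struct toy_va_range {
	uint64_t start; /* inclusive, in bytes */
	uint64_t end;   /* exclusive, in bytes */
};

/* Expand a faulting address to the aligned window of 2^order pages around it. */
struct toy_va_range toy_fault_window(uint64_t fault_addr, unsigned int order)
{
	struct toy_va_range r;
	uint64_t size;

	assert(order < 64 - TOY_PAGE_SHIFT);
	size = (uint64_t)1 << (TOY_PAGE_SHIFT + order);

	r.start = fault_addr & ~(size - 1);
	r.end = r.start + size;
	return r;
}
```

With order = 4, one fault would cover 16 pages, so faults from neighbouring 
shader invocations that land in the same window need no separate handling.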

Thanks & best regards,
Timur


On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia, 
Amir wrote:
> AMD General
> 
> Hi Timur,
> 
> Thanks for clarifying. yes, please push the Navi 48 WIP branch when you have
> a chance, useful to compare side-by-side.
>
> Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that doorbell
> ACK does not free the CAM slot on the NV4 family (gfx1200/1201). I
> verified empirically, slots walk monotonically until the CAM fills, then HW
> silently drops retry events. MMIO write to regIH_RETRY_CAM_ACK does free
> it. Almost certainly required for your Navi 48 attempt.
>
> Why PTE update doesn't reach UTC L0: Most likely a race between the SDMA
> PTE-update job completing and the CAM ACK firing the IRQ for re-handling.
> Until SDMA finishes the write + invalidation propagates, UTC L0 keeps
> getting the stale "invalid" reply which is exactly your symptom.
>
> To inspect with umr: clone https://gitlab.freedesktop.org/tomstdenis/umr and
> use scripts/diag/dump_all_cpc_info.sh (needs halt_if_hws_hang=1
> gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY per-VMID
> bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy = translation
> timeout, not denial). Page-table walk to confirm PTE landed: sudo umr -i
> <gpu> -vm <vmid>@<va> 1.
>
> Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that race
> because the BO-clear deadlock dominates on NV4, but the ordering bug is
> almost certainly there too. Fence callback is the right shape, cleaner than
> blocking the worker. Would be a generic fix benefiting both Navi 3 and 4.
> Happy to review if you take a swing.
>
> IH1 routing: fair point. I just inherited the existing IH0 path and haven't
> tried IH1. On the list once CAM ACK and fence ordering are stable.
>
> AMIR SHETAIA
> Senior Software Development Engineer  |  AMD
> Software Platform Architecture Team
> 1 Commerce Valley Drive, Markham, ON L3T 7X6
> 
> 
> 
> 
> -----Original Message-----
> From: Timur Kristóf <timur.kristof@gmail.com>
> Sent: Wednesday, May 13, 2026 6:12 PM
> To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> <Amir.Shetaia@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock
> <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> Hi Amir,
> 
> 
> > Timur, you are right, I see your patch 6 already does the MMIO ACK for
> > gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's ih_v7_0
> > implementation, which still does WDOORBELL. that's where I'd suggest
> > swapping in MMIO for NV4.
> 
> 
> First, let me clear up a slight misunderstanding here. The patch series that
> I sent here only contains what I managed to get working on Navi 31, it does
> not contain any code from my attempt at Navi 48.
>
> The patch 5 is just a slight refactor of the pre-existing code and
> (intentionally) doesn't contain any functional changes. It seems that IH 7.1
> relies on the doorbell, so I didn't want to remove it, albeit I have no
> means to verify if that actually works or not.
>
> If you like, I can push a few WIP patches to a different branch tomorrow to
> show you exactly what I did on Navi 48. However, I wouldn't want to submit
> that to the mailing list without first making sure that it works well.
>
> 
> > 1. "Fault never resolves on NV48" different shape from our
> > broken-CAM-ACK symptom.
> 
> 
> 
> > You're right, those are different. Our cam-walk-monotonically symptom
> > only shows up when CAM is enabled but the ACK is broken.
> > On your NV48 setup CAM
> > probably isn't enabled at all (your patch 6 only enables it for
> > ih_v6_0_irq_init, no equivalent in ih_v7_0_irq_init)
>
>
> For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way I do
> for ih_v6_0 in the series. But, because I couldn't get it fully working, I
> didn't include any of that code in the series.
> 
> > so retries fire
> > repeatedly on the IH ring instead of being deduped by CAM. That
> > matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> > called but each call is on a fresh IRQ for the same address.
> > Two things that could be happening underneath:
> > - The fault handler runs but the updated PTE never reaches UTC L0 (TLB
> > invalidation gap). On NV4 we see this as "valid PTEs failing to
> > translate"
> > in our UMR captures.
> 
> 
> I think this explanation may fit what I saw.
> Why is it not reaching UTC L0?
> Also, how do you inspect this stuff in umr?
> 
> 
> > 2. What bits we check on src_data[2]:
> >
> >
> >
> > Honestly, we don't use src_data[2] for retry detection. We use it only
> > for the cam_index:
> > cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */
> 
> 
> 
> > For retry detection we initially used the gfx9 constant on src_data[1]
> > like you, but observed the bit cleared on a lot of NV4 events that
> > should have been retries (waves were hung in xnack-stall but no IH event
> > matched). So we just go through the retry path unconditionally on NV4
> 
> 
> I see. I think I needed to change some gfxhub registers to get those
> src_data bits that you are missing.
 
> 
> > and let
> > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > specific to gfx1201 / our test path
> 
> 
> My test case is a simple Vulkan shader which I am executing with vkrunner.
> Each shader invocation does an out of bounds read from a different page. For
> Navi 31 (and Strix Halo), I started out with just 1 page fault, and once I
> could mitigate that reliably, I turned it up to several hundred faults.
 
> 
> > 3. TLB flush making it worse .. clue about what to do:
> >
> >
> >
> > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush) request,
> > and the GC cache block isn't ACK'ing the GCR while UTC L2 is saturated
> > by the user shader's XNACK retry storm. Adding a TLB flush adds
> > another translation request to the same saturated UTC, which is why it
> > makes things worse.
> 
> 
> This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> what the function does is first call amdgpu_vm_handle_fault() which kicks
> off an SDMA job to update page tables, and then it ACKs the filter CAM.
> However, at the moment when the CAM is ACKed, the SDMA job is not finished
> yet, so the CP sees that the page is still invalid and the page fault
> interrupt is fired again. I actually noticed that on Navi 31 too, but it's
> just not fatal there. It just basically handles the same page fault twice.
> Once we solve this flaw, I would like to propose to enable retry faults by
> default on Navi 3.
> Here is an idea for a solution:
> 
> Instead of ACKing the CAM right away, we should do it after the SDMA fence
> is signalled, ie. when we are sure the page tables are updated. Maybe we
> can set a callback on the fence and do it there, though it would require a
> slight code churn to get that to work.
 
> What do you think?
> 
> 
> > 4. IH1 ring on NV4:
> >
> >
> >
> > Same as you ... retry faults on NV4 always come in on IH0. We delegate
> > from
> > IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > SVM/migration path can sleep, but the original entry is on IH0. We
> > haven't tried IH1 routing.
> 
> 
> Why, though?
> 
> The ih_v7_0 code does set up the IH1 ring and configure it exactly the same
> as the ih_v6_0, so I don't see why it wouldn't work?
 
> 
> > Re your branch: thanks for the gitlab link, easier than digging
> > through patchwork.
> > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch 4
> > cleans up the timestamp filter delta we're seeing (97k entered / 2.8k
> > completed at 1 GiB might be partly explained by your Strix Halo bug).
> 
> 
> The timestamp issue actually gave me an endless headache when I first got
> into this topic in December. I hope the patch helps!
 
> Best regards,
> Timur
> 
> 




^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-14 17:02               ` Timur Kristóf
@ 2026-05-14 19:32                 ` Shetaia, Amir
  2026-05-14 21:24                   ` Timur Kristóf
  0 siblings, 1 reply; 18+ messages in thread
From: Shetaia, Amir @ 2026-05-14 19:32 UTC (permalink / raw)
  To: Timur Kristóf, Alex Deucher
  Cc: amd-gfx@lists.freedesktop.org, Deucher,  Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

AMD General

Hi Timur,

I cherry-picked your soft-IH-ring + timestamp-filter patches onto our build and ran a malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):

                           PASS    1st hang   sR_r entered
  Baseline (no patches)    1/10    trial 6    97k
  With your patches        2/10    trial 8    43k

What I noticed on NV4:
- 2× trials before failure (soft-IH unbound workqueue helping)
- ~55% reduction in spurious svm_range_restore_pages (your timestamp-filter fix catches a real bug here too, not just Strix Halo)
- Soft IH worker visibly on events_unbound queue

The same dma_fence_wait_timeout signature is underneath, now on TWO workers (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete,
with TTM cleanup gated on the same fence chain). The central BO-clear/GCR deadlock is unchanged; your patches fix adjacent bugs, but that one needs the HW fix.

Also tested the UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY
(the TCP-level register moved to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA banks → 0/10 PASS, hung sooner.
So slowing UTCL0 retries doesn't help; the deadlock isn't pressure-driven on UTCL0.
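
As a quick check of the ~55% figure above: it follows from the "sR_r entered" counts in the table (97k baseline, 43k with the patches). A minimal sketch of that arithmetic; percent_reduction() is a made-up helper name for illustration, not something from the driver:

```c
#include <assert.h>

/*
 * Percentage reduction from `before` to `after`, rounded down to a whole
 * percent. Both counts must be in the same unit (here: thousands of
 * entries into the restore path).
 */
static unsigned int percent_reduction(unsigned int before, unsigned int after)
{
	assert(before > 0 && after <= before);
	return (before - after) * 100u / before;
}
```

Calling percent_reduction(97, 43) gives the whole-percent reduction between the two rows of the table.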


AMIR SHETAIA
Senior Software Development Engineer  |  AMD
Software Platform Architecture Team
----------------------------------------------------------------------------------------------------------------------------------
1 Commerce Valley Drive, Markham, ON L3T 7X6
LinkedIn  |  Instagram  |  X  |  amd.com




-----Original Message-----
From: Timur Kristóf <timur.kristof@gmail.com>
Sent: Thursday, May 14, 2026 1:03 PM
To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir <Amir.Shetaia@amd.com>
Cc: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling

Hi Amir,

I pushed a work in progress branch here:
https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4

This contains the two series that I sent to the mailing list yesterday:
"Improve retry fault handling"
"Improve soft IH ring"

If we have to rely on the soft IH ring on Navi4, then we absolutely need the patches from "Improve soft IH ring", as it fixes a few bugs and switches the soft IH ring to use the system unbound workqueue. Otherwise it can happen that the soft IH ring fills up even before your CPU can process anything in it.
(This is especially bad on HW that doesn't have the filter CAM enabled, eg. Strix Halo.)

On top of that, there is a patch to solve the race condition and make sure that the retry CAM is ACKed after the VM update is complete. On Navi 31, this works: the race is eliminated and each VM fault is only processed once. I'd like to submit these if you guys agree on the approach.
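
To illustrate the ordering described above, here is a small userspace model, not the actual patch. Everything prefixed mock_ is hypothetical: struct mock_fence is a stand-in for a fence with a single callback (it is not the kernel's dma_fence API), and the cam_acked flag stands in for the write that ACKs the retry CAM. The only point it tries to show is that the ACK happens from the completion callback, so it cannot run before the page table update has landed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence: one callback, fired once on signal. */
struct mock_fence {
	bool signaled;
	void (*cb)(void *data);
	void *cb_data;
};

/* Hypothetical per-fault state. */
struct mock_fault {
	unsigned int cam_index;
	bool pte_updated;           /* the modelled page table update landed */
	bool cam_acked;             /* the modelled retry CAM ACK happened   */
	bool acked_before_update;   /* set if the ordering was violated      */
};

/* Models the CAM ACK; records whether it ran before the PTE update. */
static void mock_cam_ack(void *data)
{
	struct mock_fault *f = data;

	if (!f->pte_updated)
		f->acked_before_update = true;
	f->cam_acked = true;
}

/* Register a callback; run it right away if the fence already signaled. */
static void mock_fence_add_callback(struct mock_fence *fence,
				    void (*cb)(void *), void *data)
{
	assert(fence && cb);
	if (fence->signaled) {
		cb(data);
		return;
	}
	fence->cb = cb;
	fence->cb_data = data;
}

/* Models completion of the page table update job. */
static void mock_fence_signal(struct mock_fence *fence, struct mock_fault *f)
{
	f->pte_updated = true;
	fence->signaled = true;
	if (fence->cb) {
		fence->cb(fence->cb_data);
		fence->cb = NULL;
	}
}

/*
 * Fault handler in this model: the update job itself is not modelled,
 * only the fact that the ACK is deferred to the fence callback instead
 * of being done immediately.
 */
static void mock_handle_retry_fault(struct mock_fence *fence,
				    struct mock_fault *f)
{
	mock_fence_add_callback(fence, mock_cam_ack, f);
}
```

In this model, calling mock_handle_retry_fault() leaves cam_acked false until mock_fence_signal() runs, and acked_before_update stays false. The real code would additionally have to deal with callback context (the callback may run in interrupt context) and with error paths, which this sketch ignores.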

And finally, there is a very messy patch that contains what I tried to do to get it all working on Navi 48. This has a lot of comments with my notes during development, and some attempts I tried to fix the issue (without success).

Basically, what I did was:

- Enabled the retry CAM the same way as I did on IH 6.0, exactly like you said
- Added the cam_index to gmc_v12_0_process_interrupt
- Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry faults (without that, the faults don't have the retry bit in src_data[1])
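
For reference, the decoding that the last two points rely on can be sketched as follows. This is a standalone illustration, not driver code: struct mock_iv_entry and the MOCK_ names are made up. The 0x3ff mask for the CAM index is the one quoted earlier in this thread; the 0x80 value for the retry bit is my assumption of what "the gfx9 constant on src_data[1]" refers to and should be checked against the actual gmc code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for an interrupt vector entry; only src_data is modelled. */
struct mock_iv_entry {
	uint32_t src_data[4];
};

#define MOCK_RETRY_BIT      0x80u   /* assumed: gfx9-style retry bit in src_data[1] */
#define MOCK_CAM_INDEX_MASK 0x3ffu  /* low 10 bits of src_data[2] = CAM slot */

/* True if the entry carries the (assumed) retry bit. */
static bool mock_is_retry_fault(const struct mock_iv_entry *entry)
{
	assert(entry);
	return (entry->src_data[1] & MOCK_RETRY_BIT) != 0;
}

/* Extracts the CAM slot that has to be ACKed for this entry. */
static uint32_t mock_cam_index(const struct mock_iv_entry *entry)
{
	assert(entry);
	return entry->src_data[2] & MOCK_CAM_INDEX_MASK;
}
```

If the retry bit is not set by the hardware (as described above when the gfxhub bits are missing), mock_is_retry_fault() returns false even for a fault that is being retried, which is why the register programming matters.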

Now, what I observe on Navi 48 is:

1. Retry fault interrupt is triggered
2. Dispatched on the soft IH ring
3. The interrupt handler is called again on the soft IH ring
4. The fault is mitigated by amdgpu_vm_handle_fault()
5. Finally, the retry CAM is ACKed
6. As soon as the CAM is ACKed, I get another interrupt for the same address, and it goes on infinitely or until GPU reset is triggered.

Attempts at fixing what I saw on Navi 48:

- Tried to add a TLB flush to various places. Now that the race is mitigated, the TLB flush doesn't make things worse anymore, but it's also not helping.
- Tried to set the FORCE_MISS bits on various cache related registers, hoping that it would work around the possibility of the PTE/PDE not being updated in the cache. This didn't help either.
- Tried to use amdgpu.vm_update_mode=3 to use the CPU to update the page tables, but that didn't help on Navi 48
- Tried to halt the CP using CP_ME_CNTL to see if it would stop sending interrupts, but it didn't.
- Using umr I tried to disable the L2 cache and reset it with the GRBM, but that just broke the whole system.

Other notes on the retry fault topic in general:

- On GPUs that don't have the retry CAM (eg. Navi 1-2 and APUs) we'll need to consider disabling the VM fault interrupt while processing the retry, because the soft IH ring can fill up really quickly and the CPU may struggle to keep up with it.
- In general (for all GPUs), I think it would be better to update a larger VA range of subsequent pages rather than just a single page, in order to more efficiently mitigate VM faults to neighbouring pages.
- Currently all of the invalid accesses are redirected to the same dummy page, which is a security hole because it means that a process that does an out-of-bounds write can leak data to a process that does an out-of-bounds read. I already talked to Christian about this, he has a few ideas on how to fix it.
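
To make the second note above a bit more concrete, this is roughly what "a larger VA range of subsequent pages" could mean. It is only a sketch under assumptions: 4 KiB pages, a range that starts at the faulting page and extends forward, and made-up names (struct mock_va_range, mock_fault_range()). How many pages to cover and whether the range should be clamped to a mapping are open questions that it does not address.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/* Half-open VA range [start, end). */
struct mock_va_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Range of `pages` pages beginning at the page that contains fault_addr.
 * Address overflow near the top of the address space is not handled.
 */
static struct mock_va_range mock_fault_range(uint64_t fault_addr, uint64_t pages)
{
	struct mock_va_range r;

	assert(pages > 0);
	r.start = fault_addr & ~((UINT64_C(1) << MOCK_PAGE_SHIFT) - 1);
	r.end = r.start + (pages << MOCK_PAGE_SHIFT);
	return r;
}
```

With pages = 1 this degenerates to the single faulting page; larger values would cover the pages immediately after the fault in one update.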

Thanks & best regards,
Timur


On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia, Amir wrote:
> AMD General
>
> Hi Timur,
>
> Thanks for clarifying. yes, please push the Navi 48 WIP branch when
> you have a chance, useful to compare side-by-side.

> Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that
> doorbell ACK does not free the CAM slot on the NV4 family
> (gfx1200/1201). I verified empirically, slots walk monotonically until
> the CAM fills, then HW silently drops retry events. MMIO write to
> regIH_RETRY_CAM_ACK does free it. Almost certainly required for your
> Navi 48 attempt.
>
> Why PTE update doesn't reach UTC L0: Most likely a race between the
> SDMA PTE-update job completing and the CAM ACK firing the IRQ for
> re-handling. Until SDMA finishes the write + invalidation propagates,
> UTC L0 keeps getting the stale "invalid" reply which is exactly your
> symptom.
>
> To inspect with umr: clone
> https://gitlab.freedesktop.org/tomstdenis/umr and use
> scripts/diag/dump_all_cpc_info.sh (needs halt_if_hws_hang=1
> gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY
> per-VMID bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> translation timeout, not denial). Page-table walk to confirm PTE
> landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.
>
> Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that
> race because the BO-clear deadlock dominates on NV4, but the ordering
> bug is almost certainly there too. Fence callback is the right shape,
> cleaner than blocking the worker. Would be a generic fix benefiting
> both Navi 3 and 4. Happy to review if you take a swing.
>
> IH1 routing: fair point. I just inherited the existing IH0 path and
> haven't tried IH1. On the list once CAM ACK and fence ordering are stable.

> AMIR SHETAIA
> Senior Software Development Engineer  |  AMD Software Platform
> Architecture Team
> ----------------------------------------------------------------------
> 1 Commerce Valley Drive, Markham, ON L3T 7X6
> LinkedIn  |  Instagram  |  X  |  amd.com
>
>
>
>
> -----Original Message-----
> From: Timur Kristóf <timur.kristof@gmail.com>
> Sent: Wednesday, May 13, 2026 6:12 PM
> To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> <Amir.Shetaia@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Deucher,
> Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie
> Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> Hi Amir,
>
>
> > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > ih_v7_0 implementation, which still does WDOORBELL. that's where I'd
> > suggest swapping in MMIO for NV4.
>
>
> First, let me clear up a slight misunderstanding here. The patch
> series that I sent here only contains what I managed to get working on
> Navi 31, it does not contain any code from my attempt at Navi 48.

> The patch 5 is just a slight refactor of the pre-existing code and
> (intentionally) doesn't contain any functional changes. It seems that
> IH 7.1 relies on the doorbell, so I didn't want to remove it, albeit I
> have no means to verify if that actually works or not.

> If you like, I can push a few WIP patches to a different branch
> tomorrow to show you exactly what I did on Navi 48. However, I
> wouldn't want to submit that to the mailing list without first making sure that it works well.

>
> > 1. "Fault never resolves on NV48" different shape from our
> > broken-CAM-ACK symptom.
>
>
>
> > You're right, those are different. Our cam-walk-monotonically
> > symptom only shows up when CAM is enabled but the ACK is broken.
> > On your NV48 setup CAM probably isn't enabled at all (your patch 6
> > only enables it for ih_v6_0_irq_init, no equivalent in
> > ih_v7_0_irq_init)
>
>
> For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way
> I do for ih_v6_0 in the series. But, because I couldn't get it fully
> working, I didn't include any of that code in the series.
>
> > so retries fire
> > repeatedly on the IH ring instead of being deduped by CAM. That
> > matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> > called but each call is on a fresh IRQ for the same address.
> > Two things that could be happening underneath:
> > - The fault handler runs but the updated PTE never reaches UTC L0
> > (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > translate"
> > in our UMR captures.
>
>
> I think this explanation may fit what I saw.
> Why is it not reaching UTC L0?
> Also, how do you inspect this stuff in umr?
>
>
> > 2. What bits we check on src_data[2]:
> >
> >
> >
> > Honestly, we don't use src_data[2] for retry detection. We use it
> > only for the cam_index:
> > cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */
>
>
>
> > For retry detection we initially used the gfx9 constant on
> > src_data[1] like you, but observed the bit cleared on a lot of NV4
> > events that should have been retries (waves were hung in xnack-stall
> > but no IH event matched). So we just go through the retry path
> > unconditionally on NV4
>
>
> I see. I think I needed to change some gfxhub registers to get those
> src_data bits that you are missing.

>
> > and let
> > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > specific to gfx1201 / our test path
>
>
> My test case is a simple Vulkan shader which I am executing with vkrunner.
> Each shader invocation does an out of bounds read from a different
> page. For Navi 31 (and Strix Halo), I started out with just 1 page
> fault, and once I could mitigate that reliably, I turned it up to several hundred faults.

>
> > 3. TLB flush making it worse .. clue about what to do:
> >
> >
> >
> > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush)
> > request, and the GC cache block isn't ACK'ing the GCR while UTC L2
> > is saturated by the user shader's XNACK retry storm. Adding a TLB
> > flush adds another translation request to the same saturated UTC,
> > which is why it makes things worse.
>
>
> This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> what the function does is first call amdgpu_vm_handle_fault() which
> kicks off an SDMA job to update page tables, and then it ACKs the filter CAM.
> However, at the moment when the CAM is ACKed, the SDMA job is not
> finished yet, so the CP sees that the page is still invalid and the
> page fault interrupt is fired again. I actually noticed that on Navi
> 31 too, but it's just not fatal there. It just basically handles the
> same page fault twice.
> Once we solve this flaw, I would like to propose to enable retry
> faults by default on Navi 3.
> Here is an idea for a solution:
>
> Instead of ACKing the CAM right away, we should do it after the SDMA
> fence is signalled, ie. when we are sure the page tables are updated.
> Maybe we can set a callback on the fence and do it there, though it
> would require a slight code churn to get that to work.

> What do you think?
>
>
> > 4. IH1 ring on NV4:
> >
> >
> >
> > Same as you ... retry faults on NV4 always come in on IH0. We
> > delegate from
> > IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8)) so the
> > SVM/migration path can sleep, but the original entry is on IH0. We
> > haven't tried IH1 routing.
>
>
> Why, though?
>
> The ih_v7_0 code does set up the IH1 ring and configure it exactly the
> same as the ih_v6_0, so I don't see why it wouldn't work?

>
> > Re your branch: thanks for the gitlab link, easier than digging
> > through patchwork.
> > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch
> > 4 cleans up the timestamp filter delta we're seeing (97k entered /
> > 2.8k completed at 1 GiB might be partly explained by your Strix Halo bug).
>
>
> The timestamp issue actually gave me an endless headache when I first
> got into this topic in December. I hope the patch helps!

> Best regards,
> Timur
>
>




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
  2026-05-14 19:32                 ` Shetaia, Amir
@ 2026-05-14 21:24                   ` Timur Kristóf
  0 siblings, 0 replies; 18+ messages in thread
From: Timur Kristóf @ 2026-05-14 21:24 UTC (permalink / raw)
  To: Alex Deucher, Shetaia, Amir
  Cc: amd-gfx@lists.freedesktop.org, Deucher, Alexander,
	Koenig, Christian, Marek Olšák, Natalie Vock,
	Melissa Wen

Hi Amir,

Thanks, glad to hear that!

I also recommend applying "Add fence argument to amdgpu_vm_handle_fault()" and
"ACK the retry CAM after VM update finishes", as these fix the race between the
CAM and the SDMA, and should improve the reliability of the whole mechanism in
my opinion.

In the meantime I think I figured out what I was missing.
Considering that the FORCE_MISS bits didn't change the behavior, I was
thinking that the culprit must be something else, so I started looking into
the PTE that is being added.

It all started working for me when I added some new PTE flags to 
amdgpu_vm_handle_fault(): AMDGPU_PTE_IS_PTE and AMDGPU_PTE_NOALLOC.
With that, I cleaned up my patches and force-pushed them to the same branch:

https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4

What do you think?

Best regards,
Timur


On Thursday, May 14, 2026 9:32:16 PM Central European Summer Time Shetaia, 
Amir wrote:
> AMD General
> 
> Hi Timur,
> 
> I cherry-picked your soft-IH-ring + timestamp-filter patches onto our build,
> ran malloc 1 GiB sweep (N=10) on Navi4 (gfx1201):
>
>                            PASS    1st hang   sR_r entered
>   Baseline (no patches)    1/10    trial 6    97k
>   With your patches        2/10    trial 8    43k
> 
> What I noticed on NV4:
> - 2× trials before failure (soft-IH unbound workqueue helping)
> - ~55% reduction in spurious svm_range_restore_pages (your timestamp-filter
> fix catches a real bug here too, not just Strix Halo)
> - Soft IH worker visibly on events_unbound queue
> 
> Same dma_fence_wait_timeout signature underneath, now on TWO workers
> (amdgpu_irq_handle_ih_soft + ttm_bo_delayed_delete, TTM cleanup gated on
> the same fence chain). Central BO-clear/GCR deadlock unchanged, your
> patches fix adjacent bugs but that one needs the HW fix.
>
> Also tested UTCL0 retry-rate knob. On NV4 the gfx12 equivalents are
> regGL1C_UTCL0_RETRY / regGL1XC_UTCL0_RETRY (the TCP-level register moved
> to GL1C-level on gfx12). Wrote INCR=0xff COUNT=0xf via umr to all SE/SA
> banks → 0/10 PASS, hung sooner. So, slowing UTCL0 retries doesn't help.
> deadlock isn't pressure-driven on UTCL0.
> 
> AMIR SHETAIA
> Senior Software Development Engineer  |  AMD
> Software Platform Architecture Team
> ----------------------------------------------------------------------------
> 1 Commerce Valley Drive, Markham, ON L3T 7X6
> LinkedIn  |  Instagram  |  X  |  amd.com
> 
> 
> 
> 
> -----Original Message-----
> From: Timur Kristóf <timur.kristof@gmail.com>
> Sent: Thursday, May 14, 2026 1:03 PM
> To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> <Amir.Shetaia@amd.com>
> Cc: amd-gfx@lists.freedesktop.org; Deucher,
> Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie Vock
> <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
>
> Hi Amir,
> 
> I pushed a work in progress branch here:
> https://gitlab.freedesktop.org/Venemo/linux/-/commits/ven_retry_faults_navi4
> 
 
> This contains the two series that I sent to the mailing list yesterday:
> "Improve retry fault handling"
> "Improve soft IH ring"
> 
> If we have to rely on the soft IH ring on Navi4, then we absolutely need the
> patches from "Improve soft IH ring", as it fixes a few bugs and switches
> the soft IH ring to use the system unbound workqueue. Otherwise it can
> happen that the soft IH ring fills up even before your CPU can process
> anything in it.
> (This is especially bad on HW that doesn't have the filter CAM enabled,
> eg. Strix Halo.)
>
> On top of that,
> 
> There is a patch to solve the race condition and make sure that the retry
> CAM is ACKed after the VM update is complete. On Navi 31, this works and
> successfully makes sure that the race is eliminated and each VM fault is
> only processed once. I'd like to submit these if you guys agree on the
> approach.
 
> And finally, there is a very messy patch that contains what I tried to do to
> get it all working on Navi 48. This has a lot of comments with my notes
> during development, and some attempts I tried to fix the issue (without
> success).
 
> Basically, what I did was:
>
> - Enable the retry CAM the same way as I did on IH 6.0 exactly like you
> said
> - Added the cam_index to gmc_v12_0_process_interrupt
> - Added a few bits to gfxhub_v12_0 and gfx_v12_0 to actually enable retry
> faults (without that, the faults don't have the retry bit in src_data[1])
 
> Now, what I observe on Navi 48 is:
> 
> 1. Retry fault interrupt is triggered
> 2. Dispatched on the soft IH ring
> 3. The interrupt handler is called again on the soft IRQ ring
> 4. The fault is mitigated by amdgpu_vm_handle_fault()
> 5. Finally, the retry CAM is ACKed
> 6. As soon as the CAM is ACKed, I get another interrupt for the same
> address, and it goes on infinitely or until GPU reset is triggered.
 
> Attempts at fixing what I saw on Navi 48:
>
> - Tried to add a TLB flush to various places. Now that the race is
> mitigated, the TLB flush doesn't make things worse anymore, but it's also
> not helping.
> - Tried to set the FORCE_MISS bits on various cache related
> registers, hoping that it would work around the possibility of the PTE/PDE
> not being updated in the cache. This didn't help either.
> - Tried to use amdgpu.vm_update_mode=3 to use the CPU to update the page
> tables, but that didn't help on Navi 48
> - Tried to halt the CP using CP_ME_CNTL to see if it would stop sending
> interrupts, but it didn't.
> - Using umr I tried to disable the L2 cache and reset it with the GRBM,
> but that just broke the whole system.
>
> Other notes on the retry fault topic in general:
>
> - On GPUs that don't have the retry CAM (eg. Navi 1-2 and APUs) we'll need
> to consider disabling the VM fault interrupt while processing the retry,
> because the soft IH ring can fill up really quickly and the CPU may
> struggle to keep up with it.
> - In general (for all GPUs), I think it would be better to update a larger
> VA range of subsequent pages rather than just a single page, in order to
> more efficiently mitigate VM faults to neighbouring pages.
> - Currently all of the invalid accesses are redirected to the same dummy
> page, which is a security hole because it means that a process that had an
> out of bounds write can leak data to a process that does an out of bounds
> read. I already talked to Christian about this, he has a few ideas how to
> fix it.
>
> Thanks & best regards,
> Timur
> 
> 
> On Thursday, May 14, 2026 5:04:00 PM Central European Summer Time Shetaia,
> Amir wrote:
> > AMD General
> >
> >
> >
> > Hi Timur,
> >
> >
> >
> > Thanks for clarifying. yes, please push the Navi 48 WIP branch when
> > you have a chance, useful to compare side-by-side.
> 
> 
> 
> > Patch 5 / WDOORBELL on ih_v7_0: I have been able to confirm that
> > doorbell ACK does not free the CAM slot on the NV4 family
> > (gfx1200/1201). I verified empirically, slots walk monotonically until
> > the CAM fills, then HW silently drops retry events. MMIO write to
> > regIH_RETRY_CAM_ACK does free it. Almost certainly required for your
> > Navi 48 attempt.
> >
> > Why PTE update doesn't reach UTC L0: Most likely a race between the
> > SDMA PTE-update job completing and the CAM ACK firing the IRQ for
> > re-handling. Until SDMA finishes the write + invalidation propagates,
> > UTC L0 keeps getting the stale "invalid" reply which is exactly your
> > symptom.
> >
> > To inspect with umr: clone
> > https://gitlab.freedesktop.org/tomstdenis/umr and use
> > scripts/diag/dump_all_cpc_info.sh (needs halt_if_hws_hang=1
> > gpu_recovery=0). For UTC specifically: umr -i <gpu> -O bits -r
> > '*.*.regGCVM_L2_STATUS' (look at L2_BUSY and CONTEXT_DOMAIN_BUSY
> > per-VMID bits) and regGCVM_L2_PROTECTION_FAULT_STATUS (empty + busy =
> > translation timeout, not denial). Page-table walk to confirm PTE
> > landed: sudo umr -i <gpu> -vm <vmid>@<va> 1.
> >
> > Your ACK-after-SDMA-fence idea: good catch. We hadn't dug into that
> > race because the BO-clear deadlock dominates on NV4, but the ordering
> > bug is almost certainly there too. Fence callback is the right shape,
> > cleaner than blocking the worker. Would be a generic fix benefiting
> > both Navi 3 and 4. Happy to review if you take a swing.
> >
> > IH1 routing: fair point. I just inherited the existing IH0 path and
> > haven't tried IH1. On the list once CAM ACK and fence ordering are
> > stable.
> 
> 
> 
> > AMIR SHETAIA
> > Senior Software Development Engineer  |  AMD Software Platform
> > Architecture Team
> > ----------------------------------------------------------------------
> > 1 Commerce Valley Drive, Markham, ON L3T 7X6
> > LinkedIn  |  Instagram  |  X  |  amd.com
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Timur Kristóf <timur.kristof@gmail.com>
> > Sent: Wednesday, May 13, 2026 6:12 PM
> > To: Alex Deucher <alexdeucher@gmail.com>; Shetaia, Amir
> > <Amir.Shetaia@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org; Deucher,
> > Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
> > <Christian.Koenig@amd.com>; Marek Olšák <maraeo@gmail.com>; Natalie
> > Vock <natalie.vock@gmx.de>; Melissa Wen <mwen@igalia.com>
> > Subject: Re: [PATCH 0/6] drm/amdgpu: Improve retry fault handling
> >
> > Hi Amir,
> >
> >
> >
> >
> > > Timur, you are right, I see your patch 6 already does the MMIO ACK
> > > for gmc_v11_0/ih_v6_0. I missed that. The gap is only in patch 5's
> > > ih_v7_0 implementation, which still does WDOORBELL. that's where I'd
> > > suggest swapping in MMIO for NV4.
> >
> >
> >
> >
> > First, let me clear up a slight misunderstanding here. The patch
> > series that I sent here only contains what I managed to get working on
> > Navi 31, it does not contain any code from my attempt at Navi 48.
> 
> 
> 
> > The patch 5 is just a slight refactor of the pre-existing code and
> > (intentionally) doesn't contain any functional changes. It seems that
> > IH 7.1 relies on the doorbell, so I didn't want to remove it, albeit I
> > have no means to verify if that actually works or not.
> 
> 
> 
> > If you like, I can push a few WIP patches to a different branch
> > tomorrow to show you exactly what I did on Navi 48. However, I
> > wouldn't want to submit that to the mailing list without first making sure
> > that it works well.
> 
> 
> >
> >
> > > 1. "Fault never resolves on NV48" different shape from our
> > > broken-CAM-ACK symptom.
> >
> >
> >
> >
> >
> > > You're right, those are different. Our cam-walk-monotonically
> > > symptom only shows up when CAM is enabled but the ACK is broken.
> > > On your NV48 setup CAM probably isn't enabled at all (your patch 6
> > > only enables it for ih_v6_0_irq_init, no equivalent in
> > > ih_v7_0_irq_init)
> >
> >
> >
> >
> > For the attempt on Navi 48, I enabled the CAM in ih_v7_0 the same way
> > I do for ih_v6_0 in the series. But, because I couldn't get it fully
> > working, I didn't include any of that code in the series.
> >
> >
> >
> > > so retries fire
> > > repeatedly on the IH ring instead of being deduped by CAM. That
> > > matches what you're seeing .. amdgpu_vm_handle_fault keeps being
> > > called but each call is on a fresh IRQ for the same address.
> > > Two things that could be happening underneath:
> > > - The fault handler runs but the updated PTE never reaches UTC L0
> > > (TLB invalidation gap). On NV4 we see this as "valid PTEs failing to
> > > translate"
> > > in our UMR captures.
> >
> >
> >
> >
> > I think this explanation may fit what I saw.
> > Why is it not reaching UTC L0?
> > Also, how do you inspect this stuff in umr?
> >
> >
> >
> >
> > > 2. What bits we check on src_data[2]:
> > >
> > >
> > >
> > >
> > >
> > > Honestly, we don't use src_data[2] for retry detection. We use it
> > > only for the cam_index:
> > > cam_index = entry->src_data[2] & 0x3ff;   /* low 10 bits = CAM slot */
> >
> >
> >
> >
> >
> > > For retry detection we initially used the gfx9 constant on
> > > src_data[1] like you, but observed the bit cleared on a lot of NV4
> > > events that should have been retries (waves were hung in xnack-stall
> > > but no IH event matched). So we just go through the retry path
> > > unconditionally on NV4
> >
> >
> >
> >
> > I see. I think I needed to change some gfxhub registers to get those
> > src_data bits that you are missing.
> 
> 
> 
> >
> >
> > > and let
> > > amdgpu_vm_handle_fault sort it out via SVM range migration. May be
> > > specific to gfx1201 / our test path
> >
> >
> >
> >
> > My test case is a simple Vulkan shader which I am executing with
> > vkrunner.
> > Each shader invocation does an out of bounds read from a different
> > page. For Navi 31 (and Strix Halo), I started out with just 1 page
> > fault, and once I could mitigate that reliably, I turned it up to several
> > hundred faults.
> >
> > > 3. TLB flush making it worse .. clue about what to do:
> > >
> > >
> > >
> > >
> > >
> > > Honest answer: not really, not a SW-only fix. Our 1 GiB hang is an
> > > architectural deadlock ... ih_soft_work blocks on a dma_fence for an
> > > SDMA BO-clear, the BO-clear is stalled on a GCR (cache flush)
> > > request, and the GC cache block isn't ACK'ing the GCR while UTC L2
> > > is saturated by the user shader's XNACK retry storm. Adding a TLB
> > > flush adds another translation request to the same saturated UTC,
> > > which is why it makes things worse.
> >
> >
> >
> >
> > This may be related to a flaw in amdgpu_gmc_handle_retry_fault():
> > what the function does is first call amdgpu_vm_handle_fault() which
> > kicks off an SDMA job to update page tables, and then it ACKs the filter
> > CAM. However, at the moment when the CAM is ACKed, the SDMA job is not
> > finished yet, so the CP sees that the page is still invalid and the page
> > fault interrupt is fired again. I actually noticed that on Navi 31 too,
> > but it's just not fatal there. It just basically handles the same page
> > fault twice.
> >
> > Once we solve this flaw, I would like to propose to enable retry
> > faults by default on Navi 3.
> > Here is an idea for a solution:
> >
> >
> >
> > Instead of ACKing the CAM right away, we should do it after the SDMA
> > fence is signalled, i.e. when we are sure the page tables are updated.
> > Maybe we can set a callback on the fence and do it there, though it
> > would require some code churn to get that to work.
> >
> > What do you think?
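The ordering proposed above (ACK the CAM only once the page table update has completed) can be modeled in plain C as below. This is a sketch under stated assumptions, not the driver's implementation: `struct model_fence`, `struct retry_fault_model` and all functions here are hypothetical user-space stand-ins. In the kernel the equivalent hook would presumably be dma_fence_add_callback(), whose callbacks can run in the fence signalling context, so a real ACK path would have to be safe to call from there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy fence: a user-space model, NOT the kernel's struct dma_fence. */
struct model_fence;
typedef void (*model_fence_cb)(struct model_fence *fence, void *data);

struct model_fence {
	bool signalled;
	model_fence_cb cb;	/* at most one callback in this sketch */
	void *cb_data;
};

/* Hypothetical per-fault state: which CAM slot to ACK, and whether we did. */
struct retry_fault_model {
	unsigned int cam_index;
	bool cam_acked;
};

/* Register a callback. Returns false if the fence has already signalled,
 * in which case the caller has to run the completion step itself. */
static bool model_fence_add_callback(struct model_fence *f,
				     model_fence_cb cb, void *data)
{
	if (f->signalled)
		return false;
	f->cb = cb;
	f->cb_data = data;
	return true;
}

/* Mark the fence as signalled and run the pending callback, if any. */
static void model_fence_signal(struct model_fence *f)
{
	f->signalled = true;
	if (f->cb) {
		model_fence_cb cb = f->cb;

		f->cb = NULL;
		cb(f, f->cb_data);
	}
}

/* Completion step: stands in for ACKing the filter CAM slot. */
static void ack_cam_cb(struct model_fence *f, void *data)
{
	struct retry_fault_model *fault = data;

	(void)f;
	fault->cam_acked = true;
}

/* Defer the CAM ACK until the page table update fence has signalled. */
static void handle_retry_fault_model(struct model_fence *pt_update,
				     struct retry_fault_model *fault)
{
	if (!model_fence_add_callback(pt_update, ack_cam_cb, fault))
		ack_cam_cb(pt_update, fault);	/* update already finished */
}
```

With this ordering the ACK can never precede the page table update, which is the condition that currently lets the same fault fire a second time. The already-signalled branch matters because the update may finish before the callback is registered.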
> >
> >
> >
> >
> > > 4. IH1 ring on NV4:
> > >
> > >
> > >
> > >
> > >
> > > Same as you ... retry faults on NV4 always come in on IH0. We
> > > delegate from IH0 to ih.ring_soft (amdgpu_irq_delegate(adev, entry, 8))
> > > so the SVM/migration path can sleep, but the original entry is on IH0.
> > > We haven't tried IH1 routing.
> >
> >
> >
> >
> > Why, though?
> >
> >
> >
> > The ih_v7_0 code does set up the IH1 ring and configure it exactly the
> > same as the ih_v6_0, so I don't see why it wouldn't work?
> >
> > > Re your branch: thanks for the gitlab link, easier than digging
> > > through patchwork.
> > > I'll cherry-pick patches 1, 3, 4 into our test build to see if patch
> > > 4 cleans up the timestamp filter delta we're seeing (97k entered /
> > > 2.8k completed at 1 GiB might be partly explained by your Strix Halo
> > > bug).
> >
> >
> >
> >
> > The timestamp issue actually gave me an endless headache when I first
> > got into this topic in December. I hope the patch helps!
> >
> > Best regards,
> > Timur
> >
> >
> >

end of thread, other threads:[~2026-05-14 21:24 UTC | newest]

Thread overview: 18+ messages
2026-05-13 16:30 [PATCH 0/6] drm/amdgpu: Improve retry fault handling Timur Kristóf
2026-05-13 16:30 ` [PATCH 1/6] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-13 16:30 ` [PATCH 2/6] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 3/6] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-13 16:30 ` [PATCH 4/6] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-05-13 16:30 ` [PATCH 5/6] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-05-13 16:30 ` [PATCH 6/6] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
2026-05-13 16:36 ` [PATCH 0/6] drm/amdgpu: Improve retry fault handling Alex Deucher
2026-05-13 16:43   ` Timur Kristóf
2026-05-13 17:28     ` Shetaia, Amir
2026-05-13 17:32       ` Deucher, Alexander
2026-05-13 17:51       ` Timur Kristóf
2026-05-13 20:32         ` Shetaia, Amir
2026-05-13 22:12           ` Timur Kristóf
2026-05-14 15:04             ` Shetaia, Amir
2026-05-14 17:02               ` Timur Kristóf
2026-05-14 19:32                 ` Shetaia, Amir
2026-05-14 21:24                   ` Timur Kristóf
