[PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2)
@ 2026-05-25 11:45 Timur Kristóf
  2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

Fix some issues regarding retry fault handling,
such as enabling the retry fault interrupt (necessary
for retry faults to work) and similar.

Improve retry faults on Navi 3 dGPUs by enabling
the filter CAM, which can filter the repeated page
fault interrupts that happen when retry faults are
enabled, making the handling more efficient.

With this series, the kernel is able to mitigate
most page faults on Navi 3 without causing a hang
and without a need to reset the GPU, when the
amdgpu.noretry=0 module parameter is set.

Changes in v2:

* Reordered patches in the series to put bug fixes first
* Enable retry fault interrupt in init_system_aperture_regs()
  instead of in set_fault_enable_default()
* Added a patch to respect the noretry flag on GFX12.1 too

Timur Kristóf (7):
  drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
  drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
  drm/amdgpu/gmc: Don't compare page fault timestamps with other
    interrupts
  drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
  drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1
  drm/amdgpu: Enable retry CAM on Navi 3 dGPUs

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     |  7 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h      |  1 +
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 21 ++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 21 ++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 16 +++++++---------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 19 +++++++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 19 +++++++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 19 +++++++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 21 ++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 21 ++++++++++++---------
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c      |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/ih_v6_0.c        | 18 +++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c      |  8 +++++++-
 22 files changed, 141 insertions(+), 86 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

Whether retry faults are actually enabled, is determined by
the amdgpu_gmc_noretry_set() function. The rest of the code
base should use gmc->noretry instead of the module parameter.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c     | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c     | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c   | 2 +-
 10 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index f9949fedfbb9..f845ba698b40 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -321,7 +321,7 @@ static void gfxhub_v11_5_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index 7609b9cecae8..ba78b5a1a7cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -326,7 +326,7 @@ static void gfxhub_v12_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index abe30c8bd2ba..631f99e3741a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -318,7 +318,7 @@ static void gfxhub_v3_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index b3ef6e71811f..8a87410ce016 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -323,7 +323,7 @@ static void gfxhub_v3_0_3_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, GCVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(GC, 0, regGCVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
index 3d82cfa0f1b5..ab56dd15b3f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0.c
@@ -340,7 +340,7 @@ static void mmhub_v3_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
index a1b0b7b39a42..6522a89379b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_1.c
@@ -333,7 +333,7 @@ static void mmhub_v3_0_1_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
index 34e8dbd47c0f..23cf95783264 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_0_2.c
@@ -339,7 +339,7 @@ static void mmhub_v3_0_2_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
index cfce7e1297d4..98568c72c2be 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v3_3.c
@@ -451,7 +451,7 @@ static void mmhub_v3_3_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
index bef75c4c48d3..c9fb48992a2d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_1_0.c
@@ -334,7 +334,7 @@ static void mmhub_v4_1_0_setup_vmid_config(struct amdgpu_device *adev)
 		/* Send no-retry XNACK on fault to suppress VM fault storm. */
 		tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 				    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-				    !amdgpu_noretry);
+				    !adev->gmc.noretry);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_CNTL,
 				    i * hub->ctx_distance, tmp);
 		WREG32_SOC15_OFFSET(MMHUB, 0, regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
diff --git a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
index 29f7ed466858..49b7f16a941f 100644
--- a/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mmhub_v4_2_0.c
@@ -518,7 +518,7 @@ static void mmhub_v4_2_0_mid_setup_vmid_config(struct amdgpu_device *adev,
 			/* Send no-retry XNACK on fault to suppress VM fault storm. */
 			tmp = REG_SET_FIELD(tmp, MMVM_CONTEXT1_CNTL,
 					    RETRY_PERMISSION_OR_INVALID_PAGE_FAULT,
-					    !amdgpu_noretry);
+					    !adev->gmc.noretry);
 			WREG32_SOC15_OFFSET(MMHUB, GET_INST(MMHUB, j), regMMVM_CONTEXT1_CNTL,
 					    i * hub->ctx_distance, tmp);
 			WREG32_SOC15_OFFSET(MMHUB, GET_INST(MMHUB, j), regMMVM_CONTEXT1_PAGE_TABLE_START_ADDR_LO32,
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
  2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-05-26 15:00   ` Alex Deucher
  2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
these bits should be programmed to 0.

Program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT
always, to make sure to clear the bits when we don't want
to crash.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 14 ++++++--------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 10 ++++------
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 10 ++++------
 9 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index f845ba698b40..652eea6eae4a 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -449,12 +449,10 @@ static void gfxhub_v11_5_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index ba78b5a1a7cd..6cbf837d50dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -454,12 +454,10 @@ static void gfxhub_v12_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
index 3544eb42dca6..4c2fd1e6616e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
@@ -633,19 +633,17 @@ static void gfxhub_v12_1_xcc_set_fault_enable_default(struct amdgpu_device *adev
 		tmp = REG_SET_FIELD(tmp,
 				    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
 				    OTHER_CLIENT_ID_NO_RETRY_FAULT_INTERRUPT, value);
-		if (!value)
-			tmp = REG_SET_FIELD(tmp,
-					    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
-					    CRASH_ON_NO_RETRY_FAULT, 1);
+		tmp = REG_SET_FIELD(tmp,
+				    GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
+				    CRASH_ON_NO_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL_LO32, tmp);
 
 		tmp = RREG32_SOC15(GC, GET_INST(GC, i),
 				   regGCVM_L2_PROTECTION_FAULT_CNTL_HI32);
-		if (!value)
-			tmp = REG_SET_FIELD(tmp,
-					    GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
-					    CRASH_ON_RETRY_FAULT, 1);
+		tmp = REG_SET_FIELD(tmp,
+				    GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
+				    CRASH_ON_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL_HI32, tmp);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index a7bfc9f41d0e..bfe247b1a333 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -403,12 +403,10 @@ static void gfxhub_v1_0_set_fault_enable_default(struct amdgpu_device *adev,
 			WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
 			EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+			CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+			CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index 6c03bf9f1ae8..fbdf46070b38 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -516,12 +516,10 @@ static void gfxhub_v1_2_xcc_set_fault_enable_default(struct amdgpu_device *adev,
 				WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
 				EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-		if (!value) {
-			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-					CRASH_ON_NO_RETRY_FAULT, 1);
-			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
-					CRASH_ON_RETRY_FAULT, 1);
-		}
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+				CRASH_ON_NO_RETRY_FAULT, !value);
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
+				CRASH_ON_RETRY_FAULT, !value);
 		WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL, tmp);
 	}
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 793faf62cb07..9ea593e2c719 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -418,12 +418,10 @@ static void gfxhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index aceb8447feac..30b90d35abd0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -449,12 +449,10 @@ static void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index 631f99e3741a..9e6a6e13dec0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -446,12 +446,10 @@ static void gfxhub_v3_0_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index 8a87410ce016..b3b1085c7cd3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -434,12 +434,10 @@ static void gfxhub_v3_0_3_set_fault_enable_default(struct amdgpu_device *adev,
 			    WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
 	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
 			    EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
-	if (!value) {
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_NO_RETRY_FAULT, 1);
-		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
-				CRASH_ON_RETRY_FAULT, 1);
-	}
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_NO_RETRY_FAULT, !value);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
+			    CRASH_ON_RETRY_FAULT, !value);
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
 }
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
  2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
  2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-06-15 14:32   ` Tvrtko Ursulin
  2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

Different interrupts may have different timestamp sources,
which shouldn't be compared.

If we compare the timestamps of retry faults to timestamps
of other interrupts, it may result in all retry fault
interrupts being filtered out, because of the different
time stamp source.

This issue was observed on Strix Halo.
Solved by storing the timestamp of the last page fault interrupt.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 13bec8461cde..52258f1341c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
 	uint32_t hash;
 
 	/* Stale retry fault if timestamp goes backward */
-	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
+	if (timestamp == adev->gmc.processed_fault_timestamp ||
+		amdgpu_ih_ts_after(timestamp, adev->gmc.processed_fault_timestamp))
 		return true;
 
+	adev->gmc.processed_fault_timestamp = MAX(timestamp, adev->gmc.processed_fault_timestamp);
+
 	/* If we don't have space left in the ring buffer return immediately */
 	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
 		AMDGPU_GMC_FAULT_TIMEOUT;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 676e3aaa1f27..77eb15380284 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -361,6 +361,7 @@ struct amdgpu_gmc {
 
 	u64 noretry_flags;
 	u64 init_pte_flags;
+	u64 processed_fault_timestamp;
 
 	bool flush_tlb_needs_extra_type_0;
 	bool flush_tlb_needs_extra_type_2;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
                   ` (2 preceding siblings ...)
  2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-06-15 14:44   ` Tvrtko Ursulin
  2026-05-25 11:45 ` [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

Instead of writing the doorbell in amdgpu_gmc_handle_retry_fault()
directly, add an IH function pointer which can be defined in
a different way for different IH versions.

This is to allow implementing the filter CAM without a doorbell.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 1 +
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c    | 6 ++++++
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 8 +++++++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 52258f1341c2..d790b7619ccd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -565,7 +565,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 
 		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
 					     addr, entry->timestamp, write_fault);
-		WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
 		if (ret)
 			return 1;
 	} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
index 444437c30088..e6e34f6e86f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -97,6 +97,7 @@ struct amdgpu_ih_funcs {
 	const char *(*node_id_to_die_name)(struct amdgpu_device *adev,
 					   unsigned int node_id,
 					   char *buf, size_t size);
+	void (*retry_cam_ack)(struct amdgpu_device *adev, u32 cam_index);
 };
 
 #define amdgpu_ih_get_wptr(adev, ih) (adev)->irq.ih_funcs->get_wptr((adev), (ih))
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
index 6de9e87e04e1..c2431f4c2671 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
@@ -289,6 +289,11 @@ static uint32_t ih_v7_0_setup_retry_doorbell(u32 doorbell_index)
 	return val;
 }
 
+static void ih_v7_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+}
+
 #define regIH_RING1_CLIENT_CFG_INDEX_V7_1             0x122
 #define regIH_RING1_CLIENT_CFG_INDEX_V7_1_BASE_IDX    0
 #define regIH_RING1_CLIENT_CFG_DATA_V7_1              0x123
@@ -858,6 +863,7 @@ static const struct amdgpu_ih_funcs ih_v7_0_funcs = {
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
 	.set_rptr = ih_v7_0_set_rptr,
 	.node_id_to_die_name = ih_v7_0_node_id_to_die_name,
+	.retry_cam_ack = ih_v7_0_retry_cam_ack,
 };
 
 static void ih_v7_0_set_interrupt_funcs(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 85846fd08ce4..30a82fff3ff7 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -293,6 +293,11 @@ static uint32_t vega20_setup_retry_doorbell(u32 doorbell_index)
 	return val;
 }
 
+static void vega20_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+}
+
 /**
  * vega20_ih_irq_init - init and enable the interrupt ring
  *
@@ -738,7 +743,8 @@ static const struct amdgpu_ih_funcs vega20_ih_funcs = {
 	.get_wptr = vega20_ih_get_wptr,
 	.decode_iv = amdgpu_ih_decode_iv_helper,
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
-	.set_rptr = vega20_ih_set_rptr
+	.set_rptr = vega20_ih_set_rptr,
+	.retry_cam_ack = vega20_retry_cam_ack,
 };
 
 static void vega20_ih_set_interrupt_funcs(struct amdgpu_device *adev)
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
                   ` (3 preceding siblings ...)
  2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-05-25 11:45 ` [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1 Timur Kristóf
  2026-05-25 11:45 ` [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
  6 siblings, 0 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

Enable retry fault interrupts when initializing the GFXHUB
system aperture registers according to whether retrying
page faults is enabled in amdgpu (ie. amdgpu.noretry=0).

Needs to be done for each GFXHUB version at once,
because none of them actually enabled this interrupt.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 2 ++
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 9 +++++++--
 drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 9 +++++++--
 8 files changed, 51 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
index 652eea6eae4a..ef20eafd59ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
@@ -155,6 +155,7 @@ static void gfxhub_v11_5_0_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v11_5_0_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	WREG32_SOC15(GC, 0, regGCMC_VM_AGP_BASE, 0);
 	WREG32_SOC15(GC, 0, regGCMC_VM_AGP_BOT, adev->gmc.agp_start >> 24);
@@ -180,8 +181,12 @@ static void gfxhub_v11_5_0_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15_PREREG(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 static void gfxhub_v11_5_0_init_tlb_regs(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
index 6cbf837d50dd..ec3ff4dec674 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
@@ -158,6 +158,7 @@ static void gfxhub_v12_0_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v12_0_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	/* Program the AGP BAR */
 	WREG32_SOC15(GC, 0, regGCMC_VM_AGP_BASE, 0);
@@ -184,8 +185,12 @@ static void gfxhub_v12_0_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15_PREREG(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
index bfe247b1a333..27d7f7cb903f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
@@ -91,6 +91,7 @@ static void gfxhub_v1_0_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v1_0_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	if (!amdgpu_sriov_vf(adev) || adev->asic_type <= CHIP_VEGA10) {
 		/* Program the AGP BAR */
@@ -134,8 +135,12 @@ static void gfxhub_v1_0_init_system_aperture_regs(struct amdgpu_device *adev)
 		WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 			     (u32)((u64)adev->dummy_page_addr >> 44));
 
-		WREG32_FIELD15(GC, 0, VM_L2_PROTECTION_FAULT_CNTL2,
-			       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+		tmp = RREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL2);
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
+					 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+		tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
+					 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+		WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 	}
 
 	/* In the case squeezing vram into GART aperture, we don't use
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
index fbdf46070b38..ed9a64bc5aaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
@@ -176,6 +176,8 @@ gfxhub_v1_2_xcc_init_system_aperture_regs(struct amdgpu_device *adev,
 			tmp = RREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL2);
 			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
 					    ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+			tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL2,
+					    ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
 			WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 		}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
index 9ea593e2c719..152b2735d360 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
@@ -151,6 +151,7 @@ static void gfxhub_v2_0_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v2_0_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	if (!amdgpu_sriov_vf(adev)) {
 		/* Program the AGP BAR */
@@ -178,8 +179,12 @@ static void gfxhub_v2_0_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
index 30b90d35abd0..83c2ddbbd292 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
@@ -154,6 +154,7 @@ static void gfxhub_v2_1_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v2_1_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	if (amdgpu_sriov_vf(adev))
 		return;
@@ -182,8 +183,12 @@ static void gfxhub_v2_1_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
index 9e6a6e13dec0..90bbb2fe4884 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
@@ -150,6 +150,7 @@ static void gfxhub_v3_0_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v3_0_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	/* Program the AGP BAR */
 	WREG32_SOC15(GC, 0, regGCMC_VM_AGP_BASE, 0);
@@ -176,8 +177,12 @@ static void gfxhub_v3_0_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15_PREREG(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
index b3b1085c7cd3..1b3c067ab48c 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
@@ -153,6 +153,7 @@ static void gfxhub_v3_0_3_init_gart_aperture_regs(struct amdgpu_device *adev)
 static void gfxhub_v3_0_3_init_system_aperture_regs(struct amdgpu_device *adev)
 {
 	uint64_t value;
+	u32 tmp;
 
 	if (amdgpu_sriov_vf(adev))
 		return;
@@ -181,8 +182,12 @@ static void gfxhub_v3_0_3_init_system_aperture_regs(struct amdgpu_device *adev)
 	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_DEFAULT_ADDR_HI32,
 		     (u32)((u64)adev->dummy_page_addr >> 44));
 
-	WREG32_FIELD15_PREREG(GC, 0, GCVM_L2_PROTECTION_FAULT_CNTL2,
-		       ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = RREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
+	tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
+				 ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
+	WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 }
 
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
                   ` (4 preceding siblings ...)
  2026-05-25 11:45 ` [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  2026-05-25 11:45 ` [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf
  6 siblings, 0 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

When retry faults are disabled (amdgpu.noretry=1),
the ENABLE_RETRY_FAULT_INTERRUPT bit should be programmed to 0.

Note that retry faults are enabled by default on GFX12.1
so this just fixes the case when they are explicitly disabled.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
index 4c2fd1e6616e..d2edfe037da8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
@@ -243,7 +243,7 @@ static void gfxhub_v12_1_xcc_init_system_aperture_regs(struct amdgpu_device *ade
 		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
 				    ACTIVE_PAGE_MIGRATION_PTE_READ_RETRY, 1);
 		tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL2,
-				    ENABLE_RETRY_FAULT_INTERRUPT, 0x1);
+				    ENABLE_RETRY_FAULT_INTERRUPT, !adev->gmc.noretry);
 		WREG32_SOC15(GC, GET_INST(GC, i),
 			     regGCVM_L2_PROTECTION_FAULT_CNTL2, tmp);
 	}
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs
  2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
                   ` (5 preceding siblings ...)
  2026-05-25 11:45 ` [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1 Timur Kristóf
@ 2026-05-25 11:45 ` Timur Kristóf
  6 siblings, 0 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-05-25 11:45 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák
  Cc: Timur Kristóf

The retry CAM can filter interrupts which occur repeatedly,
such as page fault interrupts when retry faults are enabled.
This makes processing those interrupts much more efficient,
because the CPU won't have to deal with processing the same
interrupt repeatedly.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c |  5 ++++-
 drivers/gpu/drm/amd/amdgpu/ih_v6_0.c   | 18 +++++++++++++++++-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 16388e3caea3..2a226b4c9e09 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -108,13 +108,16 @@ static int gmc_v11_0_process_interrupt(struct amdgpu_device *adev,
 	bool write_fault = !!(entry->src_data[1] &
 			      AMDGPU_GMC9_FAULT_SOURCE_DATA_WRITE);
 	uint32_t status = 0;
+	uint32_t cam_index;
 	u64 addr;
 
 	addr = (u64)entry->src_data[0] << 12;
 	addr |= ((u64)entry->src_data[1] & 0xf) << 44;
 
 	if (retry_fault) {
-		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
+		cam_index = entry->src_data[2] & 0x3ff;
+
+		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, cam_index, 0,
 							write_fault);
 		/* Returning 1 here also prevents sending the IV to the KFD */
 		if (ret == 1)
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
index 333e9c30c091..0a87c3126d1e 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
@@ -307,6 +307,11 @@ static int ih_v6_0_enable_ring(struct amdgpu_device *adev,
 	return 0;
 }
 
+static void ih_v6_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
+{
+	WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index);
+}
+
 /**
  * ih_v6_0_irq_init - init and enable the interrupt ring
  *
@@ -392,6 +397,16 @@ static int ih_v6_0_irq_init(struct amdgpu_device *adev)
 
 	pci_set_master(adev->pdev);
 
+	if (!(adev->flags & AMD_IS_APU)) {
+		/* Enable IH Retry CAM */
+		tmp = RREG32_SOC15(OSSSYS, 0, regIH_RETRY_INT_CAM_CNTL);
+		tmp = REG_SET_FIELD(tmp, IH_RETRY_INT_CAM_CNTL, ENABLE, 1);
+		tmp = REG_SET_FIELD(tmp, IH_RETRY_INT_CAM_CNTL, CAM_SIZE, 0xF);
+		WREG32_SOC15(OSSSYS, 0, regIH_RETRY_INT_CAM_CNTL, tmp);
+
+		adev->irq.retry_cam_enabled = true;
+	}
+
 	/* enable interrupts */
 	ret = ih_v6_0_toggle_interrupts(adev, true);
 	if (ret)
@@ -800,7 +815,8 @@ static const struct amdgpu_ih_funcs ih_v6_0_funcs = {
 	.get_wptr = ih_v6_0_get_wptr,
 	.decode_iv = amdgpu_ih_decode_iv_helper,
 	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
-	.set_rptr = ih_v6_0_set_rptr
+	.set_rptr = ih_v6_0_set_rptr,
+	.retry_cam_ack = ih_v6_0_retry_cam_ack,
 };
 
 static void ih_v6_0_set_interrupt_funcs(struct amdgpu_device *adev)
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
  2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
@ 2026-05-26 15:00   ` Alex Deucher
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Deucher @ 2026-05-26 15:00 UTC (permalink / raw)
  To: Timur Kristóf
  Cc: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák

Applied patches 1 and 2.

Alex

On Mon, May 25, 2026 at 8:04 AM Timur Kristóf <timur.kristof@gmail.com> wrote:
>
> When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
> these bits should be programmed to 0.
>
> Program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT
> always, to make sure to clear the bits when we don't want
> to crash.
>
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c   | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c   | 14 ++++++--------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c    | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c    | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c    | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c    | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c    | 10 ++++------
>  drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c  | 10 ++++------
>  9 files changed, 38 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
> index f845ba698b40..652eea6eae4a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v11_5_0.c
> @@ -449,12 +449,10 @@ static void gfxhub_v11_5_0_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
> index ba78b5a1a7cd..6cbf837d50dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_0.c
> @@ -454,12 +454,10 @@ static void gfxhub_v12_0_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
> index 3544eb42dca6..4c2fd1e6616e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v12_1.c
> @@ -633,19 +633,17 @@ static void gfxhub_v12_1_xcc_set_fault_enable_default(struct amdgpu_device *adev
>                 tmp = REG_SET_FIELD(tmp,
>                                     GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
>                                     OTHER_CLIENT_ID_NO_RETRY_FAULT_INTERRUPT, value);
> -               if (!value)
> -                       tmp = REG_SET_FIELD(tmp,
> -                                           GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
> -                                           CRASH_ON_NO_RETRY_FAULT, 1);
> +               tmp = REG_SET_FIELD(tmp,
> +                                   GCVM_L2_PROTECTION_FAULT_CNTL_LO32,
> +                                   CRASH_ON_NO_RETRY_FAULT, !value);
>                 WREG32_SOC15(GC, GET_INST(GC, i),
>                              regGCVM_L2_PROTECTION_FAULT_CNTL_LO32, tmp);
>
>                 tmp = RREG32_SOC15(GC, GET_INST(GC, i),
>                                    regGCVM_L2_PROTECTION_FAULT_CNTL_HI32);
> -               if (!value)
> -                       tmp = REG_SET_FIELD(tmp,
> -                                           GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
> -                                           CRASH_ON_RETRY_FAULT, 1);
> +               tmp = REG_SET_FIELD(tmp,
> +                                   GCVM_L2_PROTECTION_FAULT_CNTL_HI32,
> +                                   CRASH_ON_RETRY_FAULT, !value);
>                 WREG32_SOC15(GC, GET_INST(GC, i),
>                              regGCVM_L2_PROTECTION_FAULT_CNTL_HI32, tmp);
>         }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> index a7bfc9f41d0e..bfe247b1a333 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c
> @@ -403,12 +403,10 @@ static void gfxhub_v1_0_set_fault_enable_default(struct amdgpu_device *adev,
>                         WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
>                         EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> +                       CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> +                       CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, mmVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> index 6c03bf9f1ae8..fbdf46070b38 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v1_2.c
> @@ -516,12 +516,10 @@ static void gfxhub_v1_2_xcc_set_fault_enable_default(struct amdgpu_device *adev,
>                                 WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>                 tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
>                                 EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -               if (!value) {
> -                       tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> -                                       CRASH_ON_NO_RETRY_FAULT, 1);
> -                       tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> -                                       CRASH_ON_RETRY_FAULT, 1);
> -               }
> +               tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> +                               CRASH_ON_NO_RETRY_FAULT, !value);
> +               tmp = REG_SET_FIELD(tmp, VM_L2_PROTECTION_FAULT_CNTL,
> +                               CRASH_ON_RETRY_FAULT, !value);
>                 WREG32_SOC15(GC, GET_INST(GC, i), regVM_L2_PROTECTION_FAULT_CNTL, tmp);
>         }
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> index 793faf62cb07..9ea593e2c719 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_0.c
> @@ -418,12 +418,10 @@ static void gfxhub_v2_0_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> index aceb8447feac..30b90d35abd0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v2_1.c
> @@ -449,12 +449,10 @@ static void gfxhub_v2_1_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, mmGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
> index 631f99e3741a..9e6a6e13dec0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0.c
> @@ -446,12 +446,10 @@ static void gfxhub_v3_0_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
> index 8a87410ce016..b3b1085c7cd3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfxhub_v3_0_3.c
> @@ -434,12 +434,10 @@ static void gfxhub_v3_0_3_set_fault_enable_default(struct amdgpu_device *adev,
>                             WRITE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
>         tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
>                             EXECUTE_PROTECTION_FAULT_ENABLE_DEFAULT, value);
> -       if (!value) {
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_NO_RETRY_FAULT, 1);
> -               tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> -                               CRASH_ON_RETRY_FAULT, 1);
> -       }
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_NO_RETRY_FAULT, !value);
> +       tmp = REG_SET_FIELD(tmp, GCVM_L2_PROTECTION_FAULT_CNTL,
> +                           CRASH_ON_RETRY_FAULT, !value);
>         WREG32_SOC15(GC, 0, regGCVM_L2_PROTECTION_FAULT_CNTL, tmp);
>  }
>
> --
> 2.54.0
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
@ 2026-06-15 14:32   ` Tvrtko Ursulin
  2026-06-15 14:52     ` Timur Kristóf
  0 siblings, 1 reply; 16+ messages in thread
From: Tvrtko Ursulin @ 2026-06-15 14:32 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alex Deucher, christian.koenig,
	Natalie Vock, Mario Limonciello, Amir Shetaia,
	Marek Olšák


On 25/05/2026 12:45, Timur Kristóf wrote:
> Different interrupts may have different timestamp sources,
> which shouldn't be compared.
> 
> If we compare the timestamps of retry faults to timestamps
> of other interrupts, it may result in all retry fault
> interrupts being filtered out, because of the different
> time stamp source.
> 
> This issue was observed on Strix Halo.
> Solved by storing the timestamp of the last page fault interrupt.

This one may require access to AMD docs to review. For example I am 
immediately curious as to how many different clock sources on a single 
IH there are, how does that relate to the timestamp_src field, and if 
there are indeed multiple clock domains should the patch perhaps be 
generalized to something like 
ih->processed_timestamp[entry->timestamp_src] or something?

Regards,

Tvrtko

> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
>   2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 13bec8461cde..52258f1341c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
>   	uint32_t hash;
>   
>   	/* Stale retry fault if timestamp goes backward */
> -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
> +	if (timestamp == adev->gmc.processed_fault_timestamp ||
> +		amdgpu_ih_ts_after(timestamp, adev->gmc.processed_fault_timestamp))
>   		return true;
>   
> +	adev->gmc.processed_fault_timestamp = MAX(timestamp, adev->gmc.processed_fault_timestamp);
> +
>   	/* If we don't have space left in the ring buffer return immediately */
>   	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
>   		AMDGPU_GMC_FAULT_TIMEOUT;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index 676e3aaa1f27..77eb15380284 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
>   
>   	u64 noretry_flags;
>   	u64 init_pte_flags;
> +	u64 processed_fault_timestamp;
>   
>   	bool flush_tlb_needs_extra_type_0;
>   	bool flush_tlb_needs_extra_type_2;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
@ 2026-06-15 14:44   ` Tvrtko Ursulin
  2026-06-15 15:02     ` Timur Kristóf
  0 siblings, 1 reply; 16+ messages in thread
From: Tvrtko Ursulin @ 2026-06-15 14:44 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alex Deucher, christian.koenig,
	Natalie Vock, Mario Limonciello, Amir Shetaia,
	Marek Olšák


On 25/05/2026 12:45, Timur Kristóf wrote:
> Instead of writing the doorbell in amdgpu_gmc_handle_retry_fault()
> directly, add an IH function pointer which can be defined in
> a different way for different IH versions.
> 
> This is to allow implementing the filter CAM without a doorbell.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 1 +
>   drivers/gpu/drm/amd/amdgpu/ih_v7_0.c    | 6 ++++++
>   drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 8 +++++++-
>   4 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 52258f1341c2..d790b7619ccd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -565,7 +565,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   
>   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
>   					     addr, entry->timestamp, write_fault);
> -		WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> +		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);

How does not map which IP generations can end up calling it? Presumably 
your selection of ih_v7_0 and vega20_ih.c is an insightful one, but for 
me I see amdgpu_gmc_handle_retry_fault() is called from 
gmc_v9_0_process_interrupt, gmc_v10_0_process_interrupt, 
gmc_v11_0_process_interrupt and gmc_v12_0_process_interrupt(). Is there 
a map somewhere which shows which GMC versions go with which IH blocks?

Regards,

Tvrtko

>   		if (ret)
>   			return 1;
>   	} else {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> index 444437c30088..e6e34f6e86f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> @@ -97,6 +97,7 @@ struct amdgpu_ih_funcs {
>   	const char *(*node_id_to_die_name)(struct amdgpu_device *adev,
>   					   unsigned int node_id,
>   					   char *buf, size_t size);
> +	void (*retry_cam_ack)(struct amdgpu_device *adev, u32 cam_index);
>   };
>   
>   #define amdgpu_ih_get_wptr(adev, ih) (adev)->irq.ih_funcs->get_wptr((adev), (ih))
> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> index 6de9e87e04e1..c2431f4c2671 100644
> --- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> @@ -289,6 +289,11 @@ static uint32_t ih_v7_0_setup_retry_doorbell(u32 doorbell_index)
>   	return val;
>   }
>   
> +static void ih_v7_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
> +{
> +	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> +}
> +
>   #define regIH_RING1_CLIENT_CFG_INDEX_V7_1             0x122
>   #define regIH_RING1_CLIENT_CFG_INDEX_V7_1_BASE_IDX    0
>   #define regIH_RING1_CLIENT_CFG_DATA_V7_1              0x123
> @@ -858,6 +863,7 @@ static const struct amdgpu_ih_funcs ih_v7_0_funcs = {
>   	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
>   	.set_rptr = ih_v7_0_set_rptr,
>   	.node_id_to_die_name = ih_v7_0_node_id_to_die_name,
> +	.retry_cam_ack = ih_v7_0_retry_cam_ack,
>   };
>   
>   static void ih_v7_0_set_interrupt_funcs(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> index 85846fd08ce4..30a82fff3ff7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> @@ -293,6 +293,11 @@ static uint32_t vega20_setup_retry_doorbell(u32 doorbell_index)
>   	return val;
>   }
>   
> +static void vega20_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
> +{
> +	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> +}
> +
>   /**
>    * vega20_ih_irq_init - init and enable the interrupt ring
>    *
> @@ -738,7 +743,8 @@ static const struct amdgpu_ih_funcs vega20_ih_funcs = {
>   	.get_wptr = vega20_ih_get_wptr,
>   	.decode_iv = amdgpu_ih_decode_iv_helper,
>   	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
> -	.set_rptr = vega20_ih_set_rptr
> +	.set_rptr = vega20_ih_set_rptr,
> +	.retry_cam_ack = vega20_retry_cam_ack,
>   };
>   
>   static void vega20_ih_set_interrupt_funcs(struct amdgpu_device *adev)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-06-15 14:32   ` Tvrtko Ursulin
@ 2026-06-15 14:52     ` Timur Kristóf
  2026-06-15 15:23       ` Tvrtko Ursulin
  0 siblings, 1 reply; 16+ messages in thread
From: Timur Kristóf @ 2026-06-15 14:52 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák,
	Tvrtko Ursulin

On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko 
Ursulin wrote:
> On 25/05/2026 12:45, Timur Kristóf wrote:
> > Different interrupts may have different timestamp sources,
> > which shouldn't be compared.
> > 
> > If we compare the timestamps of retry faults to timestamps
> > of other interrupts, it may result in all retry fault
> > interrupts being filtered out, because of the different
> > time stamp source.
> > 
> > This issue was observed on Strix Halo.
> > Solved by storing the timestamp of the last page fault interrupt.
> 

Hi,

> This one may require access to AMD docs to review. For example I am
> immediately curious as to how many different clock sources on a single
> IH there are

As far as I know there are various timestamp sources in the GPU and some 
interrupts use different ones. I am not aware of any documentation on this 
topic, unfortunately.

> how does that relate to the timestamp_src field

The timestamp_src field is set differently when the timestamp source is 
different. So, it could happen that we accidentally filter out all page faults 
when we shouldn't.

> and if there are indeed multiple clock domains should the patch perhaps be
> generalized to something like
> ih->processed_timestamp[entry->timestamp_src] or something?

For the context of this patch, I think it doesn't matter how many different 
kinds of time stamps there are. What's important is that we just shouldn't 
compare timestamps of page faults with time stamps of other interrupts.

As far as I see the timestamp doesn't really matter for other interrupts as we 
only use it to filter out page faults and nothing else.

Hope this helps,
Timur

> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
> >   2 files changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> > 13bec8461cde..52258f1341c2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
> > *adev,> 
> >   	uint32_t hash;
> >   	
> >   	/* Stale retry fault if timestamp goes backward */
> > 
> > -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
> > +	if (timestamp == adev->gmc.processed_fault_timestamp ||
> > +		amdgpu_ih_ts_after(timestamp, adev-
>gmc.processed_fault_timestamp))
> > 
> >   		return true;
> > 
> > +	adev->gmc.processed_fault_timestamp = MAX(timestamp,
> > adev->gmc.processed_fault_timestamp); +
> > 
> >   	/* If we don't have space left in the ring buffer return 
immediately */
> >   	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
> >   	
> >   		AMDGPU_GMC_FAULT_TIMEOUT;
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
> > 676e3aaa1f27..77eb15380284 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > @@ -361,6 +361,7 @@ struct amdgpu_gmc {
> > 
> >   	u64 noretry_flags;
> >   	u64 init_pte_flags;
> > 
> > +	u64 processed_fault_timestamp;
> > 
> >   	bool flush_tlb_needs_extra_type_0;
> >   	bool flush_tlb_needs_extra_type_2;





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer
  2026-06-15 14:44   ` Tvrtko Ursulin
@ 2026-06-15 15:02     ` Timur Kristóf
  0 siblings, 0 replies; 16+ messages in thread
From: Timur Kristóf @ 2026-06-15 15:02 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák,
	Tvrtko Ursulin

On Monday, June 15, 2026 4:44:22 PM Central European Summer Time Tvrtko 
Ursulin wrote:
> On 25/05/2026 12:45, Timur Kristóf wrote:
> > Instead of writing the doorbell in amdgpu_gmc_handle_retry_fault()
> > directly, add an IH function pointer which can be defined in
> > a different way for different IH versions.
> > 
> > This is to allow implementing the filter CAM without a doorbell.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 1 +
> >   drivers/gpu/drm/amd/amdgpu/ih_v7_0.c    | 6 ++++++
> >   drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 8 +++++++-
> >   4 files changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> > 52258f1341c2..d790b7619ccd 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -565,7 +565,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device
> > *adev,> 
> >   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, 
entry->vmid, node_id,
> >   		
> >   					     addr, entry-
>timestamp, write_fault);
> > 
> > -		WDOORBELL32(adev->irq.retry_cam_doorbell_index, 
cam_index);
> > +		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> 
> How does not map which IP generations can end up calling it? Presumably
> your selection of ih_v7_0 and vega20_ih.c is an insightful one, but for
> me I see amdgpu_gmc_handle_retry_fault() is called from
> gmc_v9_0_process_interrupt, gmc_v10_0_process_interrupt,
> gmc_v11_0_process_interrupt and gmc_v12_0_process_interrupt(). Is there
> a map somewhere which shows which GMC versions go with which IH blocks?

The hardware writes interrupt data into a so-called IH ring (interrupt handler 
ring):
- Old GPUs only have one IH ring
- Newer dedicated GPUs have two IH rings
- APUs only have one IH ring
- Additionally there is a soft IH ring, which is implemented entirely in the 
driver and is used to make the processing more reliable.

Specifically for retry faults, it is beneficial to configure page fault 
interrupts on the second IH ring (when available) to avoid them competing with 
other interrupts for space in the ring. In the upstream code this is only done 
on Vega dGPUs. My series implements that for GFX11 and a subsequent series 
also for GFX12 dGPUs.

To answer your actual question, it all depends on the IH IP block for any 
given generation. For some GPUs (but not all), the IH code configures the 
hardware to use the second IH ring for page fault interrupts. For everything 
else, the first IH ring is used. On GFX12 it seems the interrupts are handled 
on the first IH ring even though the second ring is configured; it's unclear if 
that's a bug or missing code in the kernel.

Hope this helps,
Timur




> 
> >   		if (ret)
> >   		
> >   			return 1;
> >   	
> >   	} else {
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h index 444437c30088..e6e34f6e86f4
> > 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> > @@ -97,6 +97,7 @@ struct amdgpu_ih_funcs {
> > 
> >   	const char *(*node_id_to_die_name)(struct amdgpu_device *adev,
> >   	
> >   					   unsigned int 
node_id,
> >   					   char *buf, size_t 
size);
> > 
> > +	void (*retry_cam_ack)(struct amdgpu_device *adev, u32 cam_index);
> > 
> >   };
> >   
> >   #define amdgpu_ih_get_wptr(adev, ih)
> >   (adev)->irq.ih_funcs->get_wptr((adev), (ih))> 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c index 6de9e87e04e1..c2431f4c2671
> > 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > @@ -289,6 +289,11 @@ static uint32_t ih_v7_0_setup_retry_doorbell(u32
> > doorbell_index)> 
> >   	return val;
> >   
> >   }
> > 
> > +static void ih_v7_0_retry_cam_ack(struct amdgpu_device *adev, u32
> > cam_index) +{
> > +	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> > +}
> > +
> > 
> >   #define regIH_RING1_CLIENT_CFG_INDEX_V7_1             0x122
> >   #define regIH_RING1_CLIENT_CFG_INDEX_V7_1_BASE_IDX    0
> >   #define regIH_RING1_CLIENT_CFG_DATA_V7_1              0x123
> > 
> > @@ -858,6 +863,7 @@ static const struct amdgpu_ih_funcs ih_v7_0_funcs = {
> > 
> >   	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
> >   	.set_rptr = ih_v7_0_set_rptr,
> >   	.node_id_to_die_name = ih_v7_0_node_id_to_die_name,
> > 
> > +	.retry_cam_ack = ih_v7_0_retry_cam_ack,
> > 
> >   };
> >   
> >   static void ih_v7_0_set_interrupt_funcs(struct amdgpu_device *adev)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> > b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c index 85846fd08ce4..30a82fff3ff7
> > 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> > @@ -293,6 +293,11 @@ static uint32_t vega20_setup_retry_doorbell(u32
> > doorbell_index)> 
> >   	return val;
> >   
> >   }
> > 
> > +static void vega20_retry_cam_ack(struct amdgpu_device *adev, u32
> > cam_index) +{
> > +	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> > +}
> > +
> > 
> >   /**
> >   
> >    * vega20_ih_irq_init - init and enable the interrupt ring
> >    *
> > 
> > @@ -738,7 +743,8 @@ static const struct amdgpu_ih_funcs vega20_ih_funcs =
> > {
> > 
> >   	.get_wptr = vega20_ih_get_wptr,
> >   	.decode_iv = amdgpu_ih_decode_iv_helper,
> >   	.decode_iv_ts = amdgpu_ih_decode_iv_ts_helper,
> > 
> > -	.set_rptr = vega20_ih_set_rptr
> > +	.set_rptr = vega20_ih_set_rptr,
> > +	.retry_cam_ack = vega20_retry_cam_ack,
> > 
> >   };
> >   
> >   static void vega20_ih_set_interrupt_funcs(struct amdgpu_device *adev)





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-06-15 14:52     ` Timur Kristóf
@ 2026-06-15 15:23       ` Tvrtko Ursulin
  2026-06-15 15:32         ` Timur Kristóf
  0 siblings, 1 reply; 16+ messages in thread
From: Tvrtko Ursulin @ 2026-06-15 15:23 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alex Deucher, christian.koenig,
	Natalie Vock, Mario Limonciello, Amir Shetaia,
	Marek Olšák


On 15/06/2026 15:52, Timur Kristóf wrote:
> On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko
> Ursulin wrote:
>> On 25/05/2026 12:45, Timur Kristóf wrote:
>>> Different interrupts may have different timestamp sources,
>>> which shouldn't be compared.
>>>
>>> If we compare the timestamps of retry faults to timestamps
>>> of other interrupts, it may result in all retry fault
>>> interrupts being filtered out, because of the different
>>> time stamp source.
>>>
>>> This issue was observed on Strix Halo.
>>> Solved by storing the timestamp of the last page fault interrupt.
>>
> 
> Hi,
> 
>> This one may require access to AMD docs to review. For example I am
>> immediately curious as to how many different clock sources on a single
>> IH there are
> 
> As far as I know there are various timestamp sources in the GPU and some
> interrupts use different ones. I am not aware of any documentation on this
> topic, unfortunately.
> 
>> how does that relate to the timestamp_src field
> 
> The timestamp_src field is set differently when the timestamp source is
> different. So, it could happen that we accidentally filter out all page faults
> when we shouldn't.
> 
>> and if there are indeed multiple clock domains should the patch perhaps be
>> generalized to something like
>> ih->processed_timestamp[entry->timestamp_src] or something?
> 
> For the context of this patch, I think it doesn't matter how many different
> kinds of time stamps there are. What's important is that we just shouldn't
> compare timestamps of page faults with time stamps of other interrupts.

True, thank you!

Another question is why the backward timestamp check is needed only for 
fault interrupts? I do not see it elsewhere.

Let me also ask two more things below.

> As far as I see the timestamp doesn't really matter for other interrupts as we
> only use it to filter out page faults and nothing else.
> 
> Hope this helps,
> Timur
> 
>>> ---
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
>>>    2 files changed, 5 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
>>> 13bec8461cde..52258f1341c2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
>>> *adev,>
>>>    	uint32_t hash;
>>>    	
>>>    	/* Stale retry fault if timestamp goes backward */
>>>
>>> -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
>>> +	if (timestamp == adev->gmc.processed_fault_timestamp ||
>>> +		amdgpu_ih_ts_after(timestamp, adev-
>> gmc.processed_fault_timestamp))

First thing is whether you are confident the equality check is either 
safe or required?

For example can two blocks fault with the same timestamp on different 
addresses?

Or from a different angle, is the clock granularity good enough to not 
coalesce two separate faults to a single timestamp?

>>>
>>>    		return true;
>>>
>>> +	adev->gmc.processed_fault_timestamp = MAX(timestamp,
>>> adev->gmc.processed_fault_timestamp); +

Doesn't a plain assign work here? The if above has already verified new 
timestamp is larger than the old.

Regards,

Tvrtko

>>>
>>>    	/* If we don't have space left in the ring buffer return
> immediately */
>>>    	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
>>>    	
>>>    		AMDGPU_GMC_FAULT_TIMEOUT;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
>>> 676e3aaa1f27..77eb15380284 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
>>>
>>>    	u64 noretry_flags;
>>>    	u64 init_pte_flags;
>>>
>>> +	u64 processed_fault_timestamp;
>>>
>>>    	bool flush_tlb_needs_extra_type_0;
>>>    	bool flush_tlb_needs_extra_type_2;
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-06-15 15:23       ` Tvrtko Ursulin
@ 2026-06-15 15:32         ` Timur Kristóf
  2026-06-15 15:48           ` Tvrtko Ursulin
  0 siblings, 1 reply; 16+ messages in thread
From: Timur Kristóf @ 2026-06-15 15:32 UTC (permalink / raw)
  To: amd-gfx, Alex Deucher, christian.koenig, Natalie Vock,
	Mario Limonciello, Amir Shetaia, Marek Olšák,
	Tvrtko Ursulin

On Monday, June 15, 2026 5:23:52 PM Central European Summer Time Tvrtko 
Ursulin wrote:
> On 15/06/2026 15:52, Timur Kristóf wrote:
> > On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko
> > 
> > Ursulin wrote:
> >> On 25/05/2026 12:45, Timur Kristóf wrote:
> >>> Different interrupts may have different timestamp sources,
> >>> which shouldn't be compared.
> >>> 
> >>> If we compare the timestamps of retry faults to timestamps
> >>> of other interrupts, it may result in all retry fault
> >>> interrupts being filtered out, because of the different
> >>> time stamp source.
> >>> 
> >>> This issue was observed on Strix Halo.
> >>> Solved by storing the timestamp of the last page fault interrupt.
> > 
> > Hi,
> > 
> >> This one may require access to AMD docs to review. For example I am
> >> immediately curious as to how many different clock sources on a single
> >> IH there are
> > 
> > As far as I know there are various timestamp sources in the GPU and some
> > interrupts use different ones. I am not aware of any documentation on this
> > topic, unfortunately.
> > 
> >> how does that relate to the timestamp_src field
> > 
> > The timestamp_src field is set differently when the timestamp source is
> > different. So, it could happen that we accidentally filter out all page
> > faults when we shouldn't.
> > 
> >> and if there are indeed multiple clock domains should the patch perhaps
> >> be
> >> generalized to something like
> >> ih->processed_timestamp[entry->timestamp_src] or something?
> > 
> > For the context of this patch, I think it doesn't matter how many
> > different
> > kinds of time stamps there are. What's important is that we just shouldn't
> > compare timestamps of page faults with time stamps of other interrupts.
> 
> True, thank you!
> 
> Another question is why the backward timestamp check is needed only for
> fault interrupts? I do not see it elsewhere.

Correct, this is only used for retry fault interrupts and only when they are 
dispatched to the soft IH ring.

The reason this was added is because when retry faults are enabled and the GPU 
hits a VM fault, it keeps spamming the CPU with many interrupts for the same 
fault until the fault is resolved. The CPU needs to filter out the faults which 
it is already handling, otherwise we would end up handling the same fault 
multiple times.

(As a side note, I should also probably look into how to reduce the frequency 
of how often these interrupts are repeated.)

> 
> Let me also ask two more things below.
> 
> > As far as I see the timestamp doesn't really matter for other interrupts
> > as we only use it to filter out page faults and nothing else.
> > 
> > Hope this helps,
> > Timur
> > 
> >>> ---
> >>> 
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
> >>>    2 files changed, 5 insertions(+), 1 deletion(-)
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> >>> 13bec8461cde..52258f1341c2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
> >>> *adev,>
> >>> 
> >>>    	uint32_t hash;
> >>>    	
> >>>    	/* Stale retry fault if timestamp goes backward */
> >>> 
> >>> -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
> >>> +	if (timestamp == adev->gmc.processed_fault_timestamp ||
> >>> +		amdgpu_ih_ts_after(timestamp, adev-
> >> 
> >> gmc.processed_fault_timestamp))
> 
> First thing is whether you are confident the equality check is either
> safe or required?

I don't see why it wouldn't be safe. But maybe it isn't required.
What do you suggest instead?

> For example can two blocks fault with the same timestamp on different
> addresses?

They might. But keep in mind that the GFX block just keeps spamming the 
interrupts until the fault is handled. So, if we filter one out by mistake, we 
know we will just receive the same fault again very soon.

> Or from a different angle, is the clock granularity good enough to not
> coalesce two separate faults to a single timestamp?

I am not sure about that.

> 
> >>>    		return true;
> >>> 
> >>> +	adev->gmc.processed_fault_timestamp = MAX(timestamp,
> >>> adev->gmc.processed_fault_timestamp); +
> 
> Doesn't a plain assign work here? The if above has already verified new
> timestamp is larger than the old.
> 
> Regards,
> 
> Tvrtko
> 
> >>>    	/* If we don't have space left in the ring buffer return
> > 
> > immediately */
> > 
> >>>    	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
> >>>    	
> >>>    		AMDGPU_GMC_FAULT_TIMEOUT;
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
> >>> 676e3aaa1f27..77eb15380284 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
> >>> 
> >>>    	u64 noretry_flags;
> >>>    	u64 init_pte_flags;
> >>> 
> >>> +	u64 processed_fault_timestamp;
> >>> 
> >>>    	bool flush_tlb_needs_extra_type_0;
> >>>    	bool flush_tlb_needs_extra_type_2;





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts
  2026-06-15 15:32         ` Timur Kristóf
@ 2026-06-15 15:48           ` Tvrtko Ursulin
  0 siblings, 0 replies; 16+ messages in thread
From: Tvrtko Ursulin @ 2026-06-15 15:48 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alex Deucher, christian.koenig,
	Natalie Vock, Mario Limonciello, Amir Shetaia,
	Marek Olšák


On 15/06/2026 16:32, Timur Kristóf wrote:
> On Monday, June 15, 2026 5:23:52 PM Central European Summer Time Tvrtko
> Ursulin wrote:
>> On 15/06/2026 15:52, Timur Kristóf wrote:
>>> On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko
>>>
>>> Ursulin wrote:
>>>> On 25/05/2026 12:45, Timur Kristóf wrote:
>>>>> Different interrupts may have different timestamp sources,
>>>>> which shouldn't be compared.
>>>>>
>>>>> If we compare the timestamps of retry faults to timestamps
>>>>> of other interrupts, it may result in all retry fault
>>>>> interrupts being filtered out, because of the different
>>>>> time stamp source.
>>>>>
>>>>> This issue was observed on Strix Halo.
>>>>> Solved by storing the timestamp of the last page fault interrupt.
>>>
>>> Hi,
>>>
>>>> This one may require access to AMD docs to review. For example I am
>>>> immediately curious as to how many different clock sources on a single
>>>> IH there are
>>>
>>> As far as I know there are various timestamp sources in the GPU and some
>>> interrupts use different ones. I am not aware of any documentation on this
>>> topic, unfortunately.
>>>
>>>> how does that relate to the timestamp_src field
>>>
>>> The timestamp_src field is set differently when the timestamp source is
>>> different. So, it could happen that we accidentally filter out all page
>>> faults when we shouldn't.
>>>
>>>> and if there are indeed multiple clock domains should the patch perhaps
>>>> be
>>>> generalized to something like
>>>> ih->processed_timestamp[entry->timestamp_src] or something?
>>>
>>> For the context of this patch, I think it doesn't matter how many
>>> different
>>> kinds of time stamps there are. What's important is that we just shouldn't
>>> compare timestamps of page faults with time stamps of other interrupts.
>>
>> True, thank you!
>>
>> Another question is why the backward timestamp check is needed only for
>> fault interrupts? I do not see it elsewhere.
> 
> Correct, this is only used for retry fault interrupts and only when they are
> dispatched to the soft IH ring.
> 
> The reason this was added is because when retry faults are enabled and the GPU
> hits a VM fault, it keeps spamming the CPU with many interrupts for the same
> fault until the fault is resolved. The CPU needs to filter out the faults which
> it is already handling, otherwise we would end up handling the same fault
> multiple times.

Got it, thank you!

> (As a side note, I should also probably look into how to reduce the frequency
> of how often these interrupts are repeated.)
> 
>>
>> Let me also ask two more things below.
>>
>>> As far as I see the timestamp doesn't really matter for other interrupts
>>> as we only use it to filter out page faults and nothing else.
>>>
>>> Hope this helps,
>>> Timur
>>>
>>>>> ---
>>>>>
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
>>>>>     2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
>>>>> 13bec8461cde..52258f1341c2 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> @@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
>>>>> *adev,>
>>>>>
>>>>>     	uint32_t hash;
>>>>>     	
>>>>>     	/* Stale retry fault if timestamp goes backward */
>>>>>
>>>>> -	if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
>>>>> +	if (timestamp == adev->gmc.processed_fault_timestamp ||
>>>>> +		amdgpu_ih_ts_after(timestamp, adev-
>>>>
>>>> gmc.processed_fault_timestamp))
>>
>> First thing is whether you are confident the equality check is either
>> safe or required?
> 
> I don't see why it wouldn't be safe. But maybe it isn't required.
> What do you suggest instead?

Safe as is whether it has potential to swallow a legitimate unseen faults.

Looking at amdgpu_gmc_filter_faults() a bit lower down, it does appear 
to filter out repeated faults on the same address. Would it be safe to 
rely on that instead of the timestamp equality check?

I appreciate that may cause a transient interrupt handling storm if the 
clock granularity is poor, but maybe that is better than losing a fault.

>> For example can two blocks fault with the same timestamp on different
>> addresses?
> 
> They might. But keep in mind that the GFX block just keeps spamming the
> interrupts until the fault is handled. So, if we filter one out by mistake, we
> know we will just receive the same fault again very soon.
> 
>> Or from a different angle, is the clock granularity good enough to not
>> coalesce two separate faults to a single timestamp?
> 
> I am not sure about that.

I guess if the equality filter can be removed then this concern also 
goes away.

Regards,

Tvrtko

>>
>>>>>     		return true;
>>>>>
>>>>> +	adev->gmc.processed_fault_timestamp = MAX(timestamp,
>>>>> adev->gmc.processed_fault_timestamp); +
>>
>> Doesn't a plain assign work here? The if above has already verified new
>> timestamp is larger than the old.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>>>     	/* If we don't have space left in the ring buffer return
>>>
>>> immediately */
>>>
>>>>>     	stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
>>>>>     	
>>>>>     		AMDGPU_GMC_FAULT_TIMEOUT;
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
>>>>> 676e3aaa1f27..77eb15380284 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> @@ -361,6 +361,7 @@ struct amdgpu_gmc {
>>>>>
>>>>>     	u64 noretry_flags;
>>>>>     	u64 init_pte_flags;
>>>>>
>>>>> +	u64 processed_fault_timestamp;
>>>>>
>>>>>     	bool flush_tlb_needs_extra_type_0;
>>>>>     	bool flush_tlb_needs_extra_type_2;
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-06-15 15:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25 11:45 [PATCH 0/7] drm/amdgpu: Improve retry fault handling (v2) Timur Kristóf
2026-05-25 11:45 ` [PATCH 1/7] drm/amdgpu: Use gmc->noretry instead of amdgpu_noretry directly Timur Kristóf
2026-05-25 11:45 ` [PATCH 2/7] drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed Timur Kristóf
2026-05-26 15:00   ` Alex Deucher
2026-05-25 11:45 ` [PATCH 3/7] drm/amdgpu/gmc: Don't compare page fault timestamps with other interrupts Timur Kristóf
2026-06-15 14:32   ` Tvrtko Ursulin
2026-06-15 14:52     ` Timur Kristóf
2026-06-15 15:23       ` Tvrtko Ursulin
2026-06-15 15:32         ` Timur Kristóf
2026-06-15 15:48           ` Tvrtko Ursulin
2026-05-25 11:45 ` [PATCH 4/7] drm/amdgpu/ih: Add retry_cam_ack IH function pointer Timur Kristóf
2026-06-15 14:44   ` Tvrtko Ursulin
2026-06-15 15:02     ` Timur Kristóf
2026-05-25 11:45 ` [PATCH 5/7] drm/amdgpu/gfxhub: Enable retry fault interrupts when needed Timur Kristóf
2026-05-25 11:45 ` [PATCH 6/7] drm/amdgpu/gfxhub: Respect noretry flag for retry faults on GFX12.1 Timur Kristóf
2026-05-25 11:45 ` [PATCH 7/7] drm/amdgpu: Enable retry CAM on Navi 3 dGPUs Timur Kristóf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.