[PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4

* [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4
@ 2026-05-29 10:30 Timur Kristóf
  2026-05-29 10:30 ` [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault() Timur Kristóf
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

Implement retry faults on Navi 4 in order to mitigate VM faults.
Based on my previous series (required for correct operation):

* Improve retry fault handling (v2)
* Improve soft IH ring

Solve a race condition between the VM update
performed by amdgpu_vm_handle_fault() and retry_cam_ack()
to make sure the ACK is always done after the VM update.
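
As a rough illustration of the ordering described above, here is a
minimal userspace sketch. It is not the driver code: every toy_* name
is hypothetical, and the fence is reduced to a flag plus one callback.
It only assumes that adding a callback returns nonzero when the fence
has already signaled, which is the convention dma_fence_add_callback()
follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's dma_fence. */
struct toy_fence;
typedef void (*toy_fence_cb_fn)(struct toy_fence *fence, void *data);

struct toy_fence {
	bool signaled;
	toy_fence_cb_fn cb;	/* at most one callback in this toy model */
	void *cb_data;
};

/* Returns nonzero if the fence already signaled, so no callback is queued. */
static int toy_fence_add_callback(struct toy_fence *fence,
				  toy_fence_cb_fn fn, void *data)
{
	if (fence->signaled)
		return -1;
	fence->cb = fn;
	fence->cb_data = data;
	return 0;
}

/* Marks the update as finished and runs the queued callback, if any. */
static void toy_fence_signal(struct toy_fence *fence)
{
	fence->signaled = true;
	if (fence->cb)
		fence->cb(fence, fence->cb_data);
}

#define TOY_CAM_ENTRIES 16
static unsigned int toy_acks[TOY_CAM_ENTRIES];	/* ACKs seen per CAM slot */

static void toy_cam_ack(unsigned int cam_index)
{
	assert(cam_index < TOY_CAM_ENTRIES);
	toy_acks[cam_index]++;
}

/* Runs once the update is done; recovers the slot index from the pointer. */
static void toy_update_done(struct toy_fence *fence, void *data)
{
	unsigned int *slot = data;

	(void)fence;
	toy_cam_ack((unsigned int)(slot - &toy_acks[0]));
}

/* ACK now only if no update is pending, otherwise when it completes. */
static void toy_handle_retry_fault(struct toy_fence *update,
				   unsigned int cam_index)
{
	if (!update ||
	    toy_fence_add_callback(update, toy_update_done, &toy_acks[cam_index]))
		toy_cam_ack(cam_index);
}
```

Calling toy_handle_retry_fault() with a fence that has not signaled
leaves the slot un-ACKed until toy_fence_signal() runs, so in this
model the ACK cannot be sent before the update is done.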

Adjust soft IH ring size on Navi 4. Note that Navi 4
seems to send the retry fault interrupts on the first
IH ring so they end up being dispatched on the soft
IH ring.

Adjust the PTE flags to make the VM update work correctly
on Navi 4. Without that, the update seems to be stuck in
a cache and can't resolve the fault.

Enable the retry CAM on Navi 4 as well in order to filter
the retry fault interrupts. Change the IH v7.0 code to
use the MMIO based ACK rather than a doorbell.
The doorbell does not seem to work at all on Navi 4,
just as it doesn't work on Navi 3.

With this series, the kernel is able to mitigate VM faults
when amdgpu.noretry=0 is specified on the kernel command line.

Timur Kristóf (7):
  drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault()
  drm/amdgpu: ACK the retry CAM after VM update finishes
  drm/amdgpu/ih7.0: Use MMIO ACK instead of doorbell for retry CAM on IH
    7.0
  drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of
    PAGE_SIZE
  drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on
    GFX12.0
  drm/amdgpu/vm: Use init PTE flags, and NOALLOC in
    amdgpu_vm_handle_fault()
  drm/amdgpu/gmc12: Pass cam_index to retry fault handler

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 30 ++++++++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      | 10 +++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c      |  8 ++++--
 drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c      |  4 +--
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c        | 25 +++--------------
 8 files changed, 57 insertions(+), 32 deletions(-)

-- 
2.53.0

* [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault()
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 13:54   ` Tvrtko Ursulin
  2026-05-29 10:30 ` [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes Timur Kristóf
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

Allow the caller to respond when the VM update has finished.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 5 ++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c  | 4 ++--
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index d790b7619ccd4..26aea960e2759 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -564,7 +564,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 		}
 
 		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
-					     addr, entry->timestamp, write_fault);
+					     addr, entry->timestamp, write_fault, NULL);
 		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
 		if (ret)
 			return 1;
@@ -587,7 +587,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 		 * tables
 		 */
 		if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
-					   addr, entry->timestamp, write_fault))
+					   addr, entry->timestamp, write_fault, NULL))
 			return 1;
 	}
 	return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index b523a7b97d6f1..8c3ba7213eb22 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2962,13 +2962,14 @@ struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
  *           GFX 9.4.3.
  * @addr: Address of the fault
  * @write_fault: true is write fault, false is read fault
+ * @fence: optional resulting fence, signaled after update is done
  *
  * Try to gracefully handle a VM fault. Return true if the fault was handled and
  * shouldn't be reported any more.
  */
 bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 			    u32 vmid, u32 node_id, uint64_t addr,
-			    uint64_t ts, bool write_fault)
+			    uint64_t ts, bool write_fault, struct dma_fence **fence)
 {
 	bool is_compute_context = false;
 	struct amdgpu_bo *root;
@@ -3034,6 +3035,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 
 	r = amdgpu_vm_update_pdes(adev, vm, true);
 
+	*fence = vm->last_update;
+
 error_unlock:
 	amdgpu_bo_unreserve(root);
 	if (r < 0)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index cc096c005e348..72da6b3d98c70 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -589,7 +589,7 @@ void amdgpu_vm_put_task_info(struct amdgpu_task_info *task_info);
 
 bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 			    u32 vmid, u32 node_id, uint64_t addr, uint64_t ts,
-			    bool write_fault);
+			    bool write_fault, struct dma_fence **fence);
 
 struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
 					  struct amdgpu_bo **root, u32 pasid);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
index 855cd29cbffaa..da18c02013966 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
@@ -155,7 +155,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
 			cam_index = entry->src_data[3] & 0x3ff;
 
 			ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
-							addr, entry->timestamp, write_fault);
+							addr, entry->timestamp, write_fault, NULL);
 			WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
 			if (ret)
 				return 1;
@@ -178,7 +178,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
 			 * tables
 			 */
 			if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
-						   addr, entry->timestamp, write_fault))
+						   addr, entry->timestamp, write_fault, NULL))
 				return 1;
 		}
 	}
-- 
2.53.0


* [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
  2026-05-29 10:30 ` [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault() Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 14:31   ` Tvrtko Ursulin
  2026-05-29 10:30 ` [PATCH 3/7] drm/amdgpu/ih7.0: Use MMIO ACK instead of doorbell for retry CAM on IH 7.0 Timur Kristóf
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

Add a fence callback to the VM update and ACK the retry CAM
after the VM update is finished. Previously, we would ACK it
immediately after calling amdgpu_vm_handle_fault() which
caused a race condition that was likely to trigger the same
interrupt again, causing the same fault to be handled
multiple times.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 26aea960e2759..21c8d87477448 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
 	} while (fault->timestamp < tmp);
 }
 
+static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
+					   struct dma_fence_cb *cb)
+{
+	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
+	struct amdgpu_device *adev = afc->adev;
+
+	/* CAM index is the array index of the current callback struct */
+	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);
+}
+
 int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 				  struct amdgpu_iv_entry *entry,
 				  u64 addr,
@@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 				  u32 node_id,
 				  bool write_fault)
 {
+	struct dma_fence *fence = NULL;
 	int ret;
 
 	if (adev->irq.retry_cam_enabled) {
@@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
 		}
 
 		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
-					     addr, entry->timestamp, write_fault, NULL);
-		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
+					     addr, entry->timestamp, write_fault, &fence);
+
+		/* If the update is already done, ACK now, otherwise when it's done. */
+		if (fence) {
+			adev->gmc.retry_cb[cam_index].adev = adev;
+
+			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
+						   amdgpu_gmc_retry_fault_handled))
+				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
+
+			dma_fence_put(fence);
+		} else {
+			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
+		}
+
 		if (ret)
 			return 1;
 	} else {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 77eb153802845..3bfb06e011a86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -27,6 +27,7 @@
 #define __AMDGPU_GMC_H__
 
 #include <linux/types.h>
+#include <linux/dma-fence.h>
 
 #include "amdgpu_irq.h"
 #include "amdgpu_xgmi.h"
@@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
 	int nid_mask;
 };
 
+struct amdgpu_fence_cb {
+	struct amdgpu_device *adev;
+	struct dma_fence_cb cb;
+};
+
 enum amdgpu_gart_placement {
 	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
 	AMDGPU_GART_PLACEMENT_HIGH,
@@ -305,6 +311,8 @@ struct amdgpu_gmc {
 	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
 	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
 
+	struct amdgpu_fence_cb retry_cb[16];
+
 	bool tmz_enabled;
 	bool is_app_apu;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 8c3ba7213eb22..f5e9b97e92a8c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 
 	r = amdgpu_vm_update_pdes(adev, vm, true);
 
-	*fence = vm->last_update;
+	*fence = dma_fence_get(vm->last_update);
 
 error_unlock:
 	amdgpu_bo_unreserve(root);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 2eb64df6daa94..6e28f0e435bf5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 				   DMA_RESV_USAGE_BOOKKEEP);
 	}
 
-	if (fence && !p->immediate) {
+	if (fence) {
 		/*
 		 * Most hw generations now have a separate queue for page table
 		 * updates, but when the queue is shared with userspace we need
-- 
2.53.0


* [PATCH 3/7] drm/amdgpu/ih7.0: Use MMIO ACK instead of doorbell for retry CAM on IH 7.0
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
  2026-05-29 10:30 ` [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault() Timur Kristóf
  2026-05-29 10:30 ` [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-05-29 10:30 ` [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE Timur Kristóf
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

The doorbell is not working on Navi 48.
Use the IH_RETRY_CAM_ACK register just like on Navi 3x.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c | 20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
index 24be9d726428e..4e6d14facf93a 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
@@ -279,19 +279,9 @@ static int ih_v7_0_enable_ring(struct amdgpu_device *adev,
 	return 0;
 }
 
-static uint32_t ih_v7_0_setup_retry_doorbell(u32 doorbell_index)
-{
-	u32 val = 0;
-
-	val = REG_SET_FIELD(val, IH_DOORBELL_RPTR, OFFSET, doorbell_index);
-	val = REG_SET_FIELD(val, IH_DOORBELL_RPTR, ENABLE, 1);
-
-	return val;
-}
-
 static void ih_v7_0_retry_cam_ack(struct amdgpu_device *adev, u32 cam_index)
 {
-	WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
+	WREG32_SOC15(OSSSYS, 0, regIH_RETRY_CAM_ACK, cam_index);
 }
 
 #define regIH_RING1_CLIENT_CFG_INDEX_V7_1             0x122
@@ -400,12 +390,8 @@ static int ih_v7_0_irq_init(struct amdgpu_device *adev)
 
 	pci_set_master(adev->pdev);
 
-	if (amdgpu_ip_version(adev, OSSSYS_HWIP, 0) == IP_VERSION(7, 1, 0)) {
-		/* Allocate the doorbell for IH Retry CAM */
-		adev->irq.retry_cam_doorbell_index = (adev->doorbell_index.ih + 2) << 1;
-		WREG32_SOC15(OSSSYS, 0, regIH_DOORBELL_RETRY_CAM,
-				ih_v7_0_setup_retry_doorbell(adev->irq.retry_cam_doorbell_index));
-
+	if (amdgpu_ip_version(adev, OSSSYS_HWIP, 0) == IP_VERSION(7, 1, 0) ||
+	    !(adev->flags & AMD_IS_APU)) {
 		/* Enable IH Retry CAM */
 		tmp = RREG32_SOC15(OSSSYS, 0, regIH_RETRY_INT_CAM_CNTL);
 		tmp = REG_SET_FIELD(tmp, IH_RETRY_INT_CAM_CNTL, ENABLE, 1);
-- 
2.53.0


* [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
                   ` (2 preceding siblings ...)
  2026-05-29 10:30 ` [PATCH 3/7] drm/amdgpu/ih7.0: Use MMIO ACK instead of doorbell for retry CAM on IH 7.0 Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 14:37   ` Tvrtko Ursulin
  2026-05-29 10:30 ` [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0 Timur Kristóf
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

When a lot of retry faults happen, the soft IH ring can fill up
very quickly and possibly overflow. PAGE_SIZE was too small, so
use IH_SW_RING_SIZE to match what other GPU generations are doing.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/ih_v7_0.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
index 4e6d14facf93a..ec0919fa82540 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
@@ -587,7 +587,6 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
 	int r;
 	struct amdgpu_device *adev = ip_block->adev;
 	bool use_bus_addr;
-	unsigned int sw_ring_size;
 
 	r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_IH, 0,
 			      &adev->irq.self_irq);
@@ -619,9 +618,7 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
 	/* initialize ih control register offset */
 	ih_v7_0_init_register_offset(adev);
 
-	sw_ring_size = (amdgpu_ip_version(adev, OSSSYS_HWIP, 0) == IP_VERSION(7, 1, 0)) ?
-			IH_SW_RING_SIZE : PAGE_SIZE;
-	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, sw_ring_size, true);
+	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, IH_SW_RING_SIZE, true);
 	if (r)
 		return r;
 
-- 
2.53.0


* [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
                   ` (3 preceding siblings ...)
  2026-05-29 10:30 ` [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 14:54   ` Tvrtko Ursulin
  2026-05-29 10:30 ` [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault() Timur Kristóf
  2026-05-29 10:30 ` [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler Timur Kristóf
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

According to some SVM code, this flag is also necessary on
GFX12.0, not just GFX12.1.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 586703ec0dfa0..7bbf5f848ce1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -641,11 +641,12 @@ static int gmc_v12_0_early_init(struct amdgpu_ip_block *ip_block)
 		adev->gmc.xgmi.connected_to_cpu =
 			adev->smuio.funcs->is_host_gpu_xgmi_supported(adev);
 
+	adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
+
 	switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
 	case IP_VERSION(12, 1, 0):
 		gmc_v12_1_set_gmc_funcs(adev);
 		gmc_v12_1_set_irq_funcs(adev);
-		adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
 		break;
 	default:
 		gmc_v12_0_set_gmc_funcs(adev);
-- 
2.53.0


* [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault()
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
                   ` (4 preceding siblings ...)
  2026-05-29 10:30 ` [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0 Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 14:56   ` Tvrtko Ursulin
  2026-05-29 10:30 ` [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler Timur Kristóf
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

These flags seem to be necessary for retry faults to work.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index f5e9b97e92a8c..80c07abc81565 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -3002,7 +3002,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 	}
 
 	addr /= AMDGPU_GPU_PAGE_SIZE;
-	flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
+	flags = adev->gmc.init_pte_flags |
+		AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
 		AMDGPU_PTE_SYSTEM;
 
 	if (is_compute_context) {
@@ -3017,6 +3018,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
 		flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE |
 			AMDGPU_PTE_WRITEABLE;
 
+		if (adev->family >= AMDGPU_FAMILY_NV)
+			flags |= AMDGPU_PTE_NOALLOC;
 	} else {
 		/* Let the hw retry silently on the PTE */
 		value = 0;
-- 
2.53.0


* [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler
  2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
                   ` (5 preceding siblings ...)
  2026-05-29 10:30 ` [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault() Timur Kristóf
@ 2026-05-29 10:30 ` Timur Kristóf
  2026-06-24 14:59   ` Tvrtko Ursulin
  6 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-05-29 10:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello
  Cc: Timur Kristóf

This is necessary if we want to make use of the filter CAM.

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index 7bbf5f848ce1b..90568d8ea0cf6 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -99,6 +99,7 @@ static int gmc_v12_0_process_interrupt(struct amdgpu_device *adev,
 	bool write_fault = !!(entry->src_data[1] &
 			      AMDGPU_GMC9_FAULT_SOURCE_DATA_WRITE);
 	uint32_t status = 0;
+	uint32_t cam_index;
 	u64 addr;
 
 	addr = (u64)entry->src_data[0] << 12;
@@ -110,7 +111,9 @@ static int gmc_v12_0_process_interrupt(struct amdgpu_device *adev,
 		hub = &adev->vmhub[AMDGPU_GFXHUB(0)];
 
 	if (retry_fault) {
-		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
+		cam_index = entry->src_data[2] & 0x3ff;
+
+		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, cam_index, 0,
 							write_fault);
 		/* Returning 1 here also prevents sending the IV to the KFD */
 		if (ret == 1)
-- 
2.53.0


* Re: [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault()
  2026-05-29 10:30 ` [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault() Timur Kristóf
@ 2026-06-24 13:54   ` Tvrtko Ursulin
  2026-06-24 14:09     ` Timur Kristóf
  0 siblings, 1 reply; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 13:54 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 29/05/2026 11:30, Timur Kristóf wrote:
> Allow the caller to respond to when the VM update is finished.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 5 ++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c  | 4 ++--
>   4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index d790b7619ccd4..26aea960e2759 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -564,7 +564,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   		}
>   
>   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> -					     addr, entry->timestamp, write_fault);
> +					     addr, entry->timestamp, write_fault, NULL);
>   		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>   		if (ret)
>   			return 1;
> @@ -587,7 +587,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   		 * tables
>   		 */
>   		if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> -					   addr, entry->timestamp, write_fault))
> +					   addr, entry->timestamp, write_fault, NULL))
>   			return 1;
>   	}
>   	return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index b523a7b97d6f1..8c3ba7213eb22 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2962,13 +2962,14 @@ struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
>    *           GFX 9.4.3.
>    * @addr: Address of the fault
>    * @write_fault: true is write fault, false is read fault
> + * @fence: optional resulting fence, signaled after update is done
>    *
>    * Try to gracefully handle a VM fault. Return true if the fault was handled and
>    * shouldn't be reported any more.
>    */
>   bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   			    u32 vmid, u32 node_id, uint64_t addr,
> -			    uint64_t ts, bool write_fault)
> +			    uint64_t ts, bool write_fault, struct dma_fence **fence)
>   {
>   	bool is_compute_context = false;
>   	struct amdgpu_bo *root;
> @@ -3034,6 +3035,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   
>   	r = amdgpu_vm_update_pdes(adev, vm, true);
>   
> +	*fence = vm->last_update;

Unless the heat wave is severely interfering with my ability to read 
code, fence here is mostly NULL and who owns the reference is suspect. 
Did you mean like this:

if (fence)
	*fence = dma_fence_get(vm->last_update);

Kernel doc should perhaps clarify along the lines of:

"@fence: If non-null, returns a fence with an extra reference for the 
caller, which is signaled after update is done.".
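
To make the suggested contract concrete, here is a small userspace
sketch of an optional out-parameter that hands the caller its own
reference. It is only a model under stated assumptions: the ref_*
names are hypothetical, the reference count is a plain int rather
than a kref, and the get/put helpers accept NULL the way
dma_fence_get() and dma_fence_put() do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a fence. */
struct ref_fence {
	int refcount;
};

/* Takes a reference; NULL is passed through unchanged. */
static struct ref_fence *ref_fence_get(struct ref_fence *fence)
{
	if (fence)
		fence->refcount++;
	return fence;
}

/* Drops a reference; NULL is ignored. */
static void ref_fence_put(struct ref_fence *fence)
{
	if (fence)
		fence->refcount--;
}

/* Hypothetical VM that remembers the fence of its last update. */
struct ref_vm {
	struct ref_fence *last_update;
};

/*
 * @fence: if non-NULL, receives the last update fence with an extra
 * reference that the caller must drop with ref_fence_put().
 */
static int ref_handle_fault(struct ref_vm *vm, struct ref_fence **fence)
{
	/* ... the page table update would happen here ... */
	if (fence)
		*fence = ref_fence_get(vm->last_update);
	return 1;
}
```

Callers that do not care about completion pass NULL and nothing is
dereferenced. Callers that pass a pointer own exactly one reference
and drop it once they are done with the fence.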

Regards,

Tvrtko

> +
>   error_unlock:
>   	amdgpu_bo_unreserve(root);
>   	if (r < 0)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index cc096c005e348..72da6b3d98c70 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -589,7 +589,7 @@ void amdgpu_vm_put_task_info(struct amdgpu_task_info *task_info);
>   
>   bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   			    u32 vmid, u32 node_id, uint64_t addr, uint64_t ts,
> -			    bool write_fault);
> +			    bool write_fault, struct dma_fence **fence);
>   
>   struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
>   					  struct amdgpu_bo **root, u32 pasid);
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> index 855cd29cbffaa..da18c02013966 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> @@ -155,7 +155,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
>   			cam_index = entry->src_data[3] & 0x3ff;
>   
>   			ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> -							addr, entry->timestamp, write_fault);
> +							addr, entry->timestamp, write_fault, NULL);
>   			WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
>   			if (ret)
>   				return 1;
> @@ -178,7 +178,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
>   			 * tables
>   			 */
>   			if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> -						   addr, entry->timestamp, write_fault))
> +						   addr, entry->timestamp, write_fault, NULL))
>   				return 1;
>   		}
>   	}


* Re: [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault()
  2026-06-24 13:54   ` Tvrtko Ursulin
@ 2026-06-24 14:09     ` Timur Kristóf
  0 siblings, 0 replies; 21+ messages in thread
From: Timur Kristóf @ 2026-06-24 14:09 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello,
	Tvrtko Ursulin

On Wednesday, 24 June 2026, 15:54:30 Central European Summer Time, Tvrtko Ursulin
wrote:
> On 29/05/2026 11:30, Timur Kristóf wrote:
> > Allow the caller to respond to when the VM update is finished.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 ++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 5 ++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  | 2 +-
> >   drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c  | 4 ++--
> >   4 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index d790b7619ccd4..26aea960e2759 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -564,7 +564,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   		}
> >   
> >   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> > -					     addr, entry->timestamp, write_fault);
> > +					     addr, entry->timestamp, write_fault, NULL);
> >   		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> >   		if (ret)
> >   			return 1;
> > @@ -587,7 +587,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   		 * tables
> >   		 */
> >   		if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> > -					   addr, entry->timestamp, write_fault))
> > +					   addr, entry->timestamp, write_fault, NULL))
> >   			return 1;
> >   	}
> >   	return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index b523a7b97d6f1..8c3ba7213eb22 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -2962,13 +2962,14 @@ struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
> >    *           GFX 9.4.3.
> >    * @addr: Address of the fault
> >    * @write_fault: true is write fault, false is read fault
> > + * @fence: optional resulting fence, signaled after update is done
> >    *
> >    * Try to gracefully handle a VM fault. Return true if the fault was handled and
> >    * shouldn't be reported any more.
> >    */
> >   bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >   			    u32 vmid, u32 node_id, uint64_t addr,
> > -			    uint64_t ts, bool write_fault)
> > +			    uint64_t ts, bool write_fault, struct dma_fence **fence)
> >   {
> >   	bool is_compute_context = false;
> >   	struct amdgpu_bo *root;
> > @@ -3034,6 +3035,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >   
> >   	r = amdgpu_vm_update_pdes(adev, vm, true);
> >   
> > +	*fence = vm->last_update;
> 
> Unless the heat wave is severely interfering with my ability to read
> code, fence here is mostly NULL and who owns the reference is suspect.
> Did you mean like this:
> 
> if (fence)
> 	*fence = dma_fence_get(vm->last_update);
> 
> Kernel doc should perhaps clarify along the lines of:
> 
> "@fence: If non-null, returns a fence with an extra reference for the
> caller, which is signaled after update is done.".
> 
> Regards,
> 
> Tvrtko

Thank you!
Yes, that's a valid point. I will fix this.

Note that I will likely drop this patch from the next version of the series 
and submit it separately, because it conflicts with Christian's series.
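
For reference, the pattern suggested above (a NULL-checked out-parameter that hands the caller its own reference) can be sketched in plain userspace C. This is only an illustration: struct toy_fence, toy_fence_get(), toy_fence_put() and toy_handle_fault() are hypothetical stand-ins, not the kernel's dma_fence API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted fence; not struct dma_fence. */
struct toy_fence {
	int refcount;
};

/* Take an extra reference and return the same pointer (NULL-safe). */
static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	if (f)
		f->refcount++;
	return f;
}

/* Drop a reference (NULL-safe). */
static void toy_fence_put(struct toy_fence *f)
{
	if (f) {
		assert(f->refcount > 0);
		f->refcount--;
	}
}

/*
 * Toy fault handler: @fence is optional. When it is non-NULL, the caller
 * receives its own reference to the last update and must put it later.
 */
static int toy_handle_fault(struct toy_fence *last_update,
			    struct toy_fence **fence)
{
	if (fence)
		*fence = toy_fence_get(last_update);
	return 1;
}
```

Callers that pass NULL are unaffected, and callers that pass a pointer own exactly one reference that they must drop with toy_fence_put().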
> 
> > +
> > 
> >   error_unlock:
> >   	amdgpu_bo_unreserve(root);
> >   	if (r < 0)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > index cc096c005e348..72da6b3d98c70 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > @@ -589,7 +589,7 @@ void amdgpu_vm_put_task_info(struct amdgpu_task_info *task_info);
> >   bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >   
> >   			    u32 vmid, u32 node_id, uint64_t addr, uint64_t ts,
> > 
> > -			    bool write_fault);
> > +			    bool write_fault, struct dma_fence **fence);
> > 
> >   struct amdgpu_vm *amdgpu_vm_lock_by_pasid(struct amdgpu_device *adev,
> >   
> >   					  struct amdgpu_bo **root, u32 pasid);
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> > index 855cd29cbffaa..da18c02013966 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_1.c
> > @@ -155,7 +155,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
> >   			cam_index = entry->src_data[3] & 0x3ff;
> >   			
> >   			ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> > 
> > -							addr, entry->timestamp, write_fault);
> > +							addr, entry->timestamp, write_fault, NULL);
> > 
> >   			WDOORBELL32(adev->irq.retry_cam_doorbell_index, cam_index);
> >   			if (ret)
> >   			
> >   				return 1;
> > 
> > @@ -178,7 +178,7 @@ static int gmc_v12_1_process_interrupt(struct amdgpu_device *adev,
> >   			 * tables
> >   			 */
> >   			
> >   			if (amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> > 
> > -						   addr, entry->timestamp, write_fault))
> > +						   addr, entry->timestamp, write_fault, NULL))
> > 
> >   				return 1;
> >   		
> >   		}
> >   	
> >   	}





* Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-05-29 10:30 ` [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes Timur Kristóf
@ 2026-06-24 14:31   ` Tvrtko Ursulin
  2026-06-24 14:52     ` Timur Kristóf
  0 siblings, 1 reply; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 14:31 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 29/05/2026 11:30, Timur Kristóf wrote:
> Add a fence callback to the VM update and ACK the retry CAM
> after the VM update is finished. Previously, we would ACK it
> immediately after calling amdgpu_vm_handle_fault() which
> caused a race condition that was likely to trigger the same
> interrupt again, causing the same fault to be handled
> multiple times.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
>   4 files changed, 36 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 26aea960e2759..21c8d87477448 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>   	} while (fault->timestamp < tmp);
>   }
>   
> +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
> +					   struct dma_fence_cb *cb)
> +{
> +	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
> +	struct amdgpu_device *adev = afc->adev;
> +
> +	/* CAM index is the array index of the current callback struct */
> +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);

Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index 
of the array element, while ->retry_cam_ack() expects the content of 
that element, no?

> +}
> +
>   int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   				  struct amdgpu_iv_entry *entry,
>   				  u64 addr,
> @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   				  u32 node_id,
>   				  bool write_fault)
>   {
> +	struct dma_fence *fence = NULL;
>   	int ret;
>   
>   	if (adev->irq.retry_cam_enabled) {
> @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>   		}
>   
>   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> -					     addr, entry->timestamp, write_fault, NULL);
> -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> +					     addr, entry->timestamp, write_fault, &fence);
> +
> +		/* If the update is already done, ACK now, otherwise when it's done. */
> +		if (fence) {
> +			adev->gmc.retry_cb[cam_index].adev = adev;

Why are 16 retry_cb elements enough? In the code I see cam_index extracted 
from the IV entry with a mask such as 0x3ff.
> +
> +			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
> +						   amdgpu_gmc_retry_fault_handled))
> +				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> +
> +			dma_fence_put(fence);
> +		} else {
> +			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> +		}
> +
>   		if (ret)
>   			return 1;
>   	} else {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> index 77eb153802845..3bfb06e011a86 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> @@ -27,6 +27,7 @@
>   #define __AMDGPU_GMC_H__
>   
>   #include <linux/types.h>
> +#include <linux/dma-fence.h>
>   
>   #include "amdgpu_irq.h"
>   #include "amdgpu_xgmi.h"
> @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
>   	int nid_mask;
>   };
>   
> +struct amdgpu_fence_cb {
> +	struct amdgpu_device *adev;
> +	struct dma_fence_cb cb;
> +};
> +
>   enum amdgpu_gart_placement {
>   	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
>   	AMDGPU_GART_PLACEMENT_HIGH,
> @@ -305,6 +311,8 @@ struct amdgpu_gmc {
>   	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
>   	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
>   
> +	struct amdgpu_fence_cb retry_cb[16];
> +
>   	bool tmz_enabled;
>   	bool is_app_apu;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 8c3ba7213eb22..f5e9b97e92a8c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   
>   	r = amdgpu_vm_update_pdes(adev, vm, true);
>   
> -	*fence = vm->last_update;
> +	*fence = dma_fence_get(vm->last_update);

Ah! But passing over since you said you are dropping that patch anyway.

>   error_unlock:
>   	amdgpu_bo_unreserve(root);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 2eb64df6daa94..6e28f0e435bf5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   				   DMA_RESV_USAGE_BOOKKEEP);
>   	}
>   
> -	if (fence && !p->immediate) {
> +	if (fence) {

Is this deliberate, and if so, what is it about? The commit message should 
explain it as well.

Regards,

Tvrtko

>   		/*
>   		 * Most hw generations now have a separate queue for page table
>   		 * updates, but when the queue is shared with userspace we need



* Re: [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE
  2026-05-29 10:30 ` [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE Timur Kristóf
@ 2026-06-24 14:37   ` Tvrtko Ursulin
  2026-06-24 15:16     ` Timur Kristóf
  0 siblings, 1 reply; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 14:37 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 29/05/2026 11:30, Timur Kristóf wrote:
> When there are a lot of retry faults happening, the soft IH ring
> can fill up really quickly and possibly overflow. PAGE_SIZE was
> too small, use IH_SW_RING_SIZE to match what other GPU generations
> are doing.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/ih_v7_0.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> index 4e6d14facf93a..ec0919fa82540 100644
> --- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> @@ -587,7 +587,6 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
>   	int r;
>   	struct amdgpu_device *adev = ip_block->adev;
>   	bool use_bus_addr;
> -	unsigned int sw_ring_size;
>   
>   	r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_IH, 0,
>   			      &adev->irq.self_irq);
> @@ -619,9 +618,7 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
>   	/* initialize ih control register offset */
>   	ih_v7_0_init_register_offset(adev);
>   
> -	sw_ring_size = (amdgpu_ip_version(adev, OSSSYS_HWIP, 0) == IP_VERSION(7, 1, 0)) ?
> -			IH_SW_RING_SIZE : PAGE_SIZE;
> -	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, sw_ring_size, true);
> +	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, IH_SW_RING_SIZE, true);
>   	if (r)
>   		return r;
>   

Looks plausible to me.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

The only other file that uses PAGE_SIZE for the soft IH ring is ih_v6_1.c, so 
maybe that one needs tweaking too?

Regards,

Tvrtko



* Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-06-24 14:31   ` Tvrtko Ursulin
@ 2026-06-24 14:52     ` Timur Kristóf
  2026-06-24 15:14       ` Tvrtko Ursulin
  0 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-06-24 14:52 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello,
	Tvrtko Ursulin

On Wednesday, 24 June 2026 16:31:20 Central European Summer Time, Tvrtko Ursulin 
wrote:
> On 29/05/2026 11:30, Timur Kristóf wrote:
> > Add a fence callback to the VM update and ACK the retry CAM
> > after the VM update is finished. Previously, we would ACK it
> > immediately after calling amdgpu_vm_handle_fault() which
> > caused a race condition that was likely to trigger the same
> > interrupt again, causing the same fault to be handled
> > multiple times.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
> >   4 files changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index 26aea960e2759..21c8d87477448 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
> >   	} while (fault->timestamp < tmp);
> >   
> >   }
> > 
> > +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
> > +					   struct dma_fence_cb *cb)
> > +{
> > +	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
> > +	struct amdgpu_device *adev = afc->adev;
> > +
> > +	/* CAM index is the array index of the current callback struct */
> > +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);
> 
> Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index
> of the array element, while ->retry_cam_ack() expects the content of
> that element, no?

Like the comment says, the CAM index is the array index.
We just need the CAM index in order to tell the CAM to ACK the current entry.
The contents of the array are just there to make dma_fence_add_callback() work 
with this callback function.
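
To illustrate the index recovery described above, here is a minimal userspace sketch. It assumes simplified stand-in types (struct toy_cb, struct toy_fence_cb) and a local toy_container_of() macro instead of the kernel's container_of(); none of these names come from the patch.

```c
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct dma_fence_cb and struct amdgpu_fence_cb. */
struct toy_cb {
	int unused;
};

struct toy_fence_cb {
	void *adev;
	struct toy_cb cb;
};

#define TOY_NUM_SLOTS 16

static struct toy_fence_cb toy_slots[TOY_NUM_SLOTS];

/* Recover the slot index from a pointer to the embedded callback member. */
static ptrdiff_t toy_slot_index(struct toy_cb *cb)
{
	struct toy_fence_cb *afc = toy_container_of(cb, struct toy_fence_cb, cb);

	/* Pointer subtraction yields a count of elements, not bytes. */
	return afc - &toy_slots[0];
}
```

The result is only meaningful when the callback really is embedded in an element of toy_slots; a pointer from anywhere else gives an undefined result.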

> 
> > +}
> > +
> > 
> >   int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   
> >   				  struct amdgpu_iv_entry *entry,
> >   				  u64 addr,
> > 
> > @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   				  u32 node_id,
> >   				  bool write_fault)
> >   
> >   {
> > 
> > +	struct dma_fence *fence = NULL;
> > 
> >   	int ret;
> >   	
> >   	if (adev->irq.retry_cam_enabled) {
> > 
> > @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   		}
> >   		
> >   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> > 
> > -					     addr, entry->timestamp, write_fault, NULL);
> > -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> > +					     addr, entry->timestamp, write_fault, &fence);
> > +
> > +		/* If the update is already done, ACK now, otherwise when it's done. */
> > +		if (fence) {
> > +			adev->gmc.retry_cb[cam_index].adev = adev;
> 
> > Why are 16 retry_cb elements enough? In the code I see cam_index extracted
> > from the IV entry with a mask such as 0x3ff.

I think this came up in a conversation after I had already submitted the 
patch. The maximum number of CAM entries is specified by the 
IH_RETRY_INT_CAM_CNTL.CAM_SIZE field. The content of the field needs to be 
interpreted as something like this:
((CAM_SIZE + 1) * 64) = (15 + 1) * 64 = 1024

It is a good question whether we actually want to statically allocate that 
many items. We should very much avoid doing dynamic allocation in the page 
fault handler. I'm open to suggestions on how to move forward with this.
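
As a rough sketch of the arithmetic above, assuming the "(CAM_SIZE + 1) * 64" decoding holds (it is described only as "something like this"), the entry count and the 0x3ff index mask can be compared as follows. The function and macro names are hypothetical.

```c
/* Mask applied to the IV entry's source data in the quoted patches. */
#define TOY_CAM_INDEX_MASK 0x3ffu

/*
 * Assumed decoding of the CAM_SIZE field, following the
 * "(CAM_SIZE + 1) * 64" formula given above.
 */
static unsigned int toy_cam_entries(unsigned int cam_size_field)
{
	return (cam_size_field + 1u) * 64u;
}

/* Returns 1 if every index the mask can produce fits in the table. */
static int toy_mask_fits(unsigned int cam_size_field)
{
	return TOY_CAM_INDEX_MASK < toy_cam_entries(cam_size_field);
}
```

With a field value of 15 this gives 1024 entries, which covers every index that the 0x3ff mask can produce; a 16-element array does not.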

> 
> > +
> > +			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
> > +						   amdgpu_gmc_retry_fault_handled))
> > +				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> > +
> > +			dma_fence_put(fence);
> > +		} else {
> > +			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> > +		}
> > +
> > 
> >   		if (ret)
> >   		
> >   			return 1;
> >   	
> >   	} else {
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > index 77eb153802845..3bfb06e011a86 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > @@ -27,6 +27,7 @@
> > 
> >   #define __AMDGPU_GMC_H__
> >   
> >   #include <linux/types.h>
> > 
> > +#include <linux/dma-fence.h>
> > 
> >   #include "amdgpu_irq.h"
> >   #include "amdgpu_xgmi.h"
> > 
> > @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
> > 
> >   	int nid_mask;
> >   
> >   };
> > 
> > +struct amdgpu_fence_cb {
> > +	struct amdgpu_device *adev;
> > +	struct dma_fence_cb cb;
> > +};
> > +
> > 
> >   enum amdgpu_gart_placement {
> >   
> >   	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
> >   	AMDGPU_GART_PLACEMENT_HIGH,
> > 
> > @@ -305,6 +311,8 @@ struct amdgpu_gmc {
> > 
> >   	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
> >   	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
> > 
> > +	struct amdgpu_fence_cb retry_cb[16];
> > +
> > 
> >   	bool tmz_enabled;
> >   	bool is_app_apu;
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > index 8c3ba7213eb22..f5e9b97e92a8c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >   	r = amdgpu_vm_update_pdes(adev, vm, true);
> > 
> > -	*fence = vm->last_update;
> > +	*fence = dma_fence_get(vm->last_update);
> 
> Ah! But passing over since you said you are dropping that patch anyway.

That line should have gone to the previous patch and was added to this one by 
mistake.

> 
> >   error_unlock:
> >   	amdgpu_bo_unreserve(root);
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > index 2eb64df6daa94..6e28f0e435bf5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> >   				   DMA_RESV_USAGE_BOOKKEEP);
> >   	
> >   	}
> > 
> > -	if (fence && !p->immediate) {
> > +	if (fence) {
> 
> > Is this deliberate, and if so, what is it about? The commit message should
> > explain it as well.

The reason it is changed is because previously it wouldn't return a fence in 
immediate mode. This line also should have gone to the previous patch and was 
added to this one by mistake. 

Thanks & best regards,
Timur





* Re: [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0
  2026-05-29 10:30 ` [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0 Timur Kristóf
@ 2026-06-24 14:54   ` Tvrtko Ursulin
  2026-06-24 15:30     ` Timur Kristóf
  0 siblings, 1 reply; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 14:54 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello, Siwei He, Philip Yang,
	Mukul Joshi


On 29/05/2026 11:30, Timur Kristóf wrote:
> According to some SVM code, this flag is also necessary on
> GFX12.0, not just GFX12.1.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> index 586703ec0dfa0..7bbf5f848ce1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> @@ -641,11 +641,12 @@ static int gmc_v12_0_early_init(struct amdgpu_ip_block *ip_block)
>   		adev->gmc.xgmi.connected_to_cpu =
>   			adev->smuio.funcs->is_host_gpu_xgmi_supported(adev);
>   
> +	adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
> +
>   	switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
>   	case IP_VERSION(12, 1, 0):
>   		gmc_v12_1_set_gmc_funcs(adev);
>   		gmc_v12_1_set_irq_funcs(adev);
> -		adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
>   		break;
>   	default:
>   		gmc_v12_0_set_gmc_funcs(adev);

The code is fine, but I don't have the inside knowledge to comment on the 
GFX12.0 vs GFX12.1 situation. Where is the SVM code that the commit message 
mentions?

Otherwise, maybe the people who added this workaround could review? These 
two are the relevant patches AFAICT:

commit db29ddf6505f3e831e000c95ae013b18a37f70bc
Author: Mukul Joshi <mukul.joshi@amd.com>
Date:   Thu Apr 24 21:51:23 2025 -0400

     drm/amdgpu: Add per-ASIC PTE init flag


commit 9d47b2c36b9a6c6b844c33cab407a5d7ad102234
Author: Siwei He <siwei.he@amd.com>
Date:   Tue Apr 14 14:46:54 2026 -0400

     drm/amdgpu: OR init_pte_flags into invalid leaf PTE update

I took the liberty to add some CCs.

Regards,

Tvrtko



* Re: [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault()
  2026-05-29 10:30 ` [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault() Timur Kristóf
@ 2026-06-24 14:56   ` Tvrtko Ursulin
  0 siblings, 0 replies; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 14:56 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 29/05/2026 11:30, Timur Kristóf wrote:
> These flags seem to be necessary for retry faults to work.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index f5e9b97e92a8c..80c07abc81565 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -3002,7 +3002,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   	}
>   
>   	addr /= AMDGPU_GPU_PAGE_SIZE;
> -	flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
> +	flags = adev->gmc.init_pte_flags |
> +		AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
>   		AMDGPU_PTE_SYSTEM;
>   
>   	if (is_compute_context) {
> @@ -3017,6 +3018,8 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>   		flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE |
>   			AMDGPU_PTE_WRITEABLE;
>   
> +		if (adev->family >= AMDGPU_FAMILY_NV)
> +			flags |= AMDGPU_PTE_NOALLOC;
>   	} else {
>   		/* Let the hw retry silently on the PTE */
>   		value = 0;

Same as the previous one - I'm afraid someone from AMD will need to r-b 
this one too.

Regards,

Tvrtko



* Re: [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler
  2026-05-29 10:30 ` [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler Timur Kristóf
@ 2026-06-24 14:59   ` Tvrtko Ursulin
  0 siblings, 0 replies; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 14:59 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 29/05/2026 11:30, Timur Kristóf wrote:
> This is necessary if we want to make use of the filter CAM.
> 
> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> index 7bbf5f848ce1b..90568d8ea0cf6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> @@ -99,6 +99,7 @@ static int gmc_v12_0_process_interrupt(struct amdgpu_device *adev,
>   	bool write_fault = !!(entry->src_data[1] &
>   			      AMDGPU_GMC9_FAULT_SOURCE_DATA_WRITE);
>   	uint32_t status = 0;
> +	uint32_t cam_index;
>   	u64 addr;
>   
>   	addr = (u64)entry->src_data[0] << 12;
> @@ -110,7 +111,9 @@ static int gmc_v12_0_process_interrupt(struct amdgpu_device *adev,
>   		hub = &adev->vmhub[AMDGPU_GFXHUB(0)];
>   
>   	if (retry_fault) {
> -		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, 0, 0,
> +		cam_index = entry->src_data[2] & 0x3ff;
> +
> +		int ret = amdgpu_gmc_handle_retry_fault(adev, entry, addr, cam_index, 0,
>   							write_fault);
>   		/* Returning 1 here also prevents sending the IV to the KFD */
>   		if (ret == 1)

Looks the same as gmc_v9 so I feel more confident to give r-b.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

The only tweak I would potentially make is to move the cam_index local into 
the scope of the if branch, and to apply the kernel coding style of 
leaving a blank line between the declaration block and the code.

Regards,

Tvrtko



* Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-06-24 14:52     ` Timur Kristóf
@ 2026-06-24 15:14       ` Tvrtko Ursulin
  2026-06-24 15:42         ` Timur Kristóf
  0 siblings, 1 reply; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 15:14 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 24/06/2026 15:52, Timur Kristóf wrote:
> On Wednesday, 24 June 2026 16:31:20 Central European Summer Time, Tvrtko Ursulin
> wrote:
>> On 29/05/2026 11:30, Timur Kristóf wrote:
>>> Add a fence callback to the VM update and ACK the retry CAM
>>> after the VM update is finished. Previously, we would ACK it
>>> immediately after calling amdgpu_vm_handle_fault() which
>>> caused a race condition that was likely to trigger the same
>>> interrupt again, causing the same fault to be handled
>>> multiple times.
>>>
>>> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
>>> ---
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
>>>    4 files changed, 36 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> index 26aea960e2759..21c8d87477448 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>>>    	} while (fault->timestamp < tmp);
>>>    
>>>    }
>>>
>>> +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
>>> +					   struct dma_fence_cb *cb)
>>> +{
>>> +	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
>>> +	struct amdgpu_device *adev = afc->adev;
>>> +
>>> +	/* CAM index is the array index of the current callback struct */
>>> +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);
>>
>> Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index
>> of the array element, while ->retry_cam_ack() expects the content of
>> that element, no?
> 
> Like the comment says, the CAM index is the array index.
> We just need the CAM index in order to tell the CAM to ACK the current entry.
> The contents of the array are just there to make dma_fence_add_callback() work
> with this callback function.

Ah, you are right, I got confused. But it is also a bit fragile, and I 
don't just mean the array sizing dilemma from lower in the email. The 
cam_index comes from the hardware, and then below we blindly do:

	if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
				   amdgpu_gmc_retry_fault_handled))

Should the hardware manage to send two faults with the same cam_index 
before the previous one has been handled, that is, while the very same 
callback is already installed and unsignaled (expect the unexpected), we 
have just upgraded a hardware bug to a kernel crash.

If I now understand it right, you want to "remember" the cam_index 
received so that the callback knows what to handle. Hmm... Allocating 
memory does seem to be allowed, if I follow correctly that 
amdgpu_vm_handle_fault() calls dma_resv_reserve_fences(). So unless I am 
missing something, perhaps a kmalloc of struct amdgpu_fence_cb would be 
fine after all? If so, you should probably also rename it to something 
less generic, along the lines of amdgpu_retry_fault_cb. Workable?
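
A rough userspace sketch of that idea, with one context allocated per fault so that two faults reporting the same cam_index never share a callback node. All names (struct toy_cb, struct toy_retry_fault_cb, toy_retry_fault_prepare()) are hypothetical, malloc() and free() merely stand in for kernel allocation, and whether allocating is actually acceptable in this path remains the open question above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dma_fence_cb. */
struct toy_cb {
	void (*func)(struct toy_cb *cb);
};

/* Per-fault context that remembers which CAM entry to ACK. */
struct toy_retry_fault_cb {
	unsigned int cam_index;
	struct toy_cb cb;
};

/* Records the last ACKed index; stands in for ->retry_cam_ack(). */
static unsigned int toy_last_acked = ~0u;

/* Callback: ACK the remembered index, then release the context. */
static void toy_retry_fault_handled(struct toy_cb *cb)
{
	struct toy_retry_fault_cb *rfc = (struct toy_retry_fault_cb *)
		((char *)cb - offsetof(struct toy_retry_fault_cb, cb));

	toy_last_acked = rfc->cam_index;
	free(rfc);
}

/* Allocate one context per fault; returns NULL on allocation failure. */
static struct toy_cb *toy_retry_fault_prepare(unsigned int cam_index)
{
	struct toy_retry_fault_cb *rfc = malloc(sizeof(*rfc));

	if (!rfc)
		return NULL;
	rfc->cam_index = cam_index;
	rfc->cb.func = toy_retry_fault_handled;
	return &rfc->cb;
}
```

On allocation failure a caller would presumably fall back to ACKing immediately, as the existing code already does when adding the callback fails.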

Regards,

Tvrtko

>>
>>> +}
>>> +
>>>
>>>    int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>    
>>>    				  struct amdgpu_iv_entry *entry,
>>>    				  u64 addr,
>>>
>>> @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>    				  u32 node_id,
>>>    				  bool write_fault)
>>>    
>>>    {
>>>
>>> +	struct dma_fence *fence = NULL;
>>>
>>>    	int ret;
>>>    	
>>>    	if (adev->irq.retry_cam_enabled) {
>>>
>>> @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>    		}
>>>    		
>>>    		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
>>>
>>> -					     addr, entry->timestamp, write_fault, NULL);
>>> -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>> +					     addr, entry->timestamp, write_fault, &fence);
>>> +
>>> +		/* If the update is already done, ACK now, otherwise when it's done. */
>>> +		if (fence) {
>>> +			adev->gmc.retry_cb[cam_index].adev = adev;
>>
>> Why are 16 retry_cb elements enough? In the code I see cam_index extracted
>> from the IV entry with a mask such as 0x3ff.
> 
> I think this came up in a conversation after I had already submitted the
> patch. The maximum number of CAM entries is specified by the
> IH_RETRY_INT_CAM_CNTL.CAM_SIZE field. The content of the field needs to be
> interpreted as something like this:
> ((CAM_SIZE + 1) * 64) = (15 + 1) * 64 = 1024
> 
> It is a good question whether we actually want to statically allocate that
> many items. We should very much avoid doing dynamic allocation in the page
> fault handler. I'm open to suggestions on how to move forward with this.
> 
>>
>>> +
>>> +			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
>>> +						   amdgpu_gmc_retry_fault_handled))
>>> +				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>> +
>>> +			dma_fence_put(fence);
>>> +		} else {
>>> +			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>> +		}
>>> +
>>>
>>>    		if (ret)
>>>    		
>>>    			return 1;
>>>    	
>>>    	} else {
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> index 77eb153802845..3bfb06e011a86 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>> @@ -27,6 +27,7 @@
>>>
>>>    #define __AMDGPU_GMC_H__
>>>    
>>>    #include <linux/types.h>
>>>
>>> +#include <linux/dma-fence.h>
>>>
>>>    #include "amdgpu_irq.h"
>>>    #include "amdgpu_xgmi.h"
>>>
>>> @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
>>>
>>>    	int nid_mask;
>>>    
>>>    };
>>>
>>> +struct amdgpu_fence_cb {
>>> +	struct amdgpu_device *adev;
>>> +	struct dma_fence_cb cb;
>>> +};
>>> +
>>>
>>>    enum amdgpu_gart_placement {
>>>    
>>>    	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
>>>    	AMDGPU_GART_PLACEMENT_HIGH,
>>>
>>> @@ -305,6 +311,8 @@ struct amdgpu_gmc {
>>>
>>>    	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
>>>    	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
>>>
>>> +	struct amdgpu_fence_cb retry_cb[16];
>>> +
>>>
>>>    	bool tmz_enabled;
>>>    	bool is_app_apu;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 8c3ba7213eb22..f5e9b97e92a8c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>>>    	r = amdgpu_vm_update_pdes(adev, vm, true);
>>>
>>> -	*fence = vm->last_update;
>>> +	*fence = dma_fence_get(vm->last_update);
>>
>> Ah! But passing over since you said you are dropping that patch anyway.
> 
> That line should have gone to the previous patch and was added to this one by
> mistake.
> 
>>
>>>    error_unlock:
>>>    	amdgpu_bo_unreserve(root);
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index 2eb64df6daa94..6e28f0e435bf5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>>    				   DMA_RESV_USAGE_BOOKKEEP);
>>>    	
>>>    	}
>>>
>>> -	if (fence && !p->immediate) {
>>> +	if (fence) {
>>
>> Is this deliberate, and if so, what is it about? The commit message should
>> explain it as well.
> 
> The reason it is changed is because previously it wouldn't return a fence in
> immediate mode. This line also should have gone to the previous patch and was
> added to this one by mistake.
> 
> Thanks & best regards,
> Timur
> 
> 
> 



* Re: [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE
  2026-06-24 14:37   ` Tvrtko Ursulin
@ 2026-06-24 15:16     ` Timur Kristóf
  0 siblings, 0 replies; 21+ messages in thread
From: Timur Kristóf @ 2026-06-24 15:16 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello,
	Tvrtko Ursulin

On Wednesday, 24 June 2026 16:37:11 Central European Summer Time, Tvrtko Ursulin
wrote:
> On 29/05/2026 11:30, Timur Kristóf wrote:
> > When there are a lot of retry faults happening, the soft IH ring
> > can fill up really quickly and possibly overflow. PAGE_SIZE was
> > too small, use IH_SW_RING_SIZE to match what other GPU generations
> > are doing.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/ih_v7_0.c | 5 +----
> >   1 file changed, 1 insertion(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > index 4e6d14facf93a..ec0919fa82540 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/ih_v7_0.c
> > @@ -587,7 +587,6 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
> >   	int r;
> >   	struct amdgpu_device *adev = ip_block->adev;
> >   	bool use_bus_addr;
> > 
> > -	unsigned int sw_ring_size;
> > 
> >   	r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_IH, 0,
> >   	
> >   			      &adev->irq.self_irq);
> > 
> > @@ -619,9 +618,7 @@ static int ih_v7_0_sw_init(struct amdgpu_ip_block *ip_block)
> >   	/* initialize ih control register offset */
> >   	ih_v7_0_init_register_offset(adev);
> > 
> > -	sw_ring_size = (amdgpu_ip_version(adev, OSSSYS_HWIP, 0) == IP_VERSION(7, 1, 0)) ?
> > -			IH_SW_RING_SIZE : PAGE_SIZE;
> > -	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, sw_ring_size, true);
> > +	r = amdgpu_ih_ring_init(adev, &adev->irq.ih_soft, IH_SW_RING_SIZE, true);
> > 
> >   	if (r)
> >   	
> >   		return r;
> 
> Looks plausible to me.
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> The only other file that uses PAGE_SIZE for the soft IH ring is ih_v6_1.c,
> so maybe that one needs tweaking too?

I can tweak that too for sure.

As a side note, though, IH 6.1 is used by RDNA 3.5 APUs, which need further
work to make use of retry faults. The APUs don't have the retry CAM, so we'll
need to come up with a different way to filter the interrupts in software, as
we already discussed in a thread on the previous series. The same goes for
RDNA1-2 dGPUs, which also don't seem to have the retry CAM.

Thanks & best regards,
Timur






* Re: [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0
  2026-06-24 14:54   ` Tvrtko Ursulin
@ 2026-06-24 15:30     ` Timur Kristóf
  0 siblings, 0 replies; 21+ messages in thread
From: Timur Kristóf @ 2026-06-24 15:30 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello, Siwei He,
	Philip Yang, Mukul Joshi, Tvrtko Ursulin

On Wednesday, 24 June 2026 16:54:04 Central European Summer Time, Tvrtko Ursulin
wrote:
> On 29/05/2026 11:30, Timur Kristóf wrote:
> > According to some SVM code, this flag is necessary on
> > GFX12.0 as well, not just GFX12.1.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > index 586703ec0dfa0..7bbf5f848ce1b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
> > @@ -641,11 +641,12 @@ static int gmc_v12_0_early_init(struct amdgpu_ip_block *ip_block)
> >   		adev->gmc.xgmi.connected_to_cpu =
> >   			adev->smuio.funcs->is_host_gpu_xgmi_supported(adev);
> > 
> > +	adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
> > +
> > 
> >   	switch (amdgpu_ip_version(adev, GC_HWIP, 0)) {
> >   	
> >   	case IP_VERSION(12, 1, 0):
> >   		gmc_v12_1_set_gmc_funcs(adev);
> >   		gmc_v12_1_set_irq_funcs(adev);
> > 
> > -		adev->gmc.init_pte_flags = AMDGPU_PTE_IS_PTE;
> > 
> >   		break;
> >   	
> >   	default:
> >   		gmc_v12_0_set_gmc_funcs(adev);
> 
> Code is fine but I don't have the inside knowledge to comment on the
> GFX12.0 vs GFX12.1 situation. Where is the SVM code that the commit
> message mentions?

Hi,

amdgpu_vm_pte_update_flags() has the following comment:

/* Workaround for fault priority problem on GMC9 and GFX12,
 * EXECUTABLE for GMC9 fault priority and init_pte_flags
 * (e.g. AMDGPU_PTE_IS_PTE on GFX12)
 */

svm_range_get_pte_flags() unconditionally uses this PTE flag on GFX12.x, added 
by this commit: a8a4615ba0fa3ee1248ae2184e7e848d7b644e70
which says: "This resolves the issues related to SVM enablement in GFX12."

While they don't elaborate on what exactly the problem being worked around is,
it is pretty clear that the flag is necessary for proper fault handling on
GFX12.x, and indeed I can confirm that it solves the problem for me. I think
the reason it was missing here is that the authors just fixed the issue for
SVM and forgot to fix it for graphics as well.

Timur

> 
> Otherwise, maybe people who added this workaround could review? Those
> two are the relevant patches AFAICT:
> 
> commit db29ddf6505f3e831e000c95ae013b18a37f70bc
> Author: Mukul Joshi <mukul.joshi@amd.com>
> Date:   Thu Apr 24 21:51:23 2025 -0400
> 
>      drm/amdgpu: Add per-ASIC PTE init flag
> 
> 
> commit 9d47b2c36b9a6c6b844c33cab407a5d7ad102234
> Author: Siwei He <siwei.he@amd.com>
> Date:   Tue Apr 14 14:46:54 2026 -0400
> 
>      drm/amdgpu: OR init_pte_flags into invalid leaf PTE update
> 
> I took the liberty to add some CCs.
> 
> Regards,
> 
> Tvrtko






* Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-06-24 15:14       ` Tvrtko Ursulin
@ 2026-06-24 15:42         ` Timur Kristóf
  2026-06-24 15:52           ` Tvrtko Ursulin
  0 siblings, 1 reply; 21+ messages in thread
From: Timur Kristóf @ 2026-06-24 15:42 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Christian König, Natalie Vock,
	Amir Shetaia, Marek Olšák, Mario Limonciello,
	Tvrtko Ursulin

On Wednesday, 24 June 2026 17:14:59 Central European Summer Time, Tvrtko Ursulin
wrote:
> On 24/06/2026 15:52, Timur Kristóf wrote:
> > On Wednesday, 24 June 2026 16:31:20 Central European Summer Time, Tvrtko
> > Ursulin wrote:
> >> On 29/05/2026 11:30, Timur Kristóf wrote:
> >>> Add a fence callback to the VM update and ACK the retry CAM
> >>> after the VM update is finished. Previously, we would ACK it
> >>> immediately after calling amdgpu_vm_handle_fault() which
> >>> caused a race condition that was likely to trigger the same
> >>> interrupt again, causing the same fault to be handled
> >>> multiple times.
> >>> 
> >>> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> >>> ---
> >>> 
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
> >>>    4 files changed, 36 insertions(+), 4 deletions(-)
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> index 26aea960e2759..21c8d87477448 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> >>> @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
> >>>    	} while (fault->timestamp < tmp);
> >>>    }
> >>> 
> >>> +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
> >>> +					   struct dma_fence_cb *cb)
> >>> +{
> >>> +	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
> >>> +	struct amdgpu_device *adev = afc->adev;
> >>> +
> >>> +	/* CAM index is the array index of the current callback struct */
> >>> +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);
> >> 
> >> Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index
> >> of the array element, while ->retry_cam_ack() expects the content of
> >> that element, no?
> > 
> > Like the comment says, the CAM index is the array index.
> > We just need the CAM index in order to tell the CAM to ACK the current
> > entry. The contents of the array are just there to make
> > dma_fence_add_callback() work with this callback function.
> 
> Ah you are right, I got confused. But it is also a bit bad, and I mean
> not just the array sizing dilemma from lower in the email. But since the
> cam_index comes from the hardware and then below we blindly do:
> 
> 	if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
> 				   amdgpu_gmc_retry_fault_handled))
> 
> Should hardware manage to send two faults with the same cam_index when
> the previous one hasn't been handled

The retry CAM exists to filter page fault interrupts and prevent sending
multiple interrupts for the same fault. It won't send an interrupt with the
same cam_index until we ACK the previous one.

> that is the very same callback is
> already installed and unsignaled (expect the unexpected), we have just
> upgraded the hardware bug to a kernel crash.
> 
> If I now understand it right, you want to "remember" the cam_index
> received so callback knows what to handle. Hmm.. Allocating memory does
> seem allowed if I follow correctly that amdgpu_vm_handle_fault() is
> calling dma_resv_reserve_fences(). So unless I am missing something
> perhaps kmalloc of struct amdgpu_fence_cb would be fine after all?

It may be fine, but I'd very much prefer to avoid it if possible.

> And
> if so you should also probably rename it to a less generic name along
> the lines of amdgpu_retry_fault_cb or so. Workable?

I'm OK to rename it for sure.

Timur

> 
> >>> +}
> >>> +
> >>> 
> >>>    int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >>>    
> >>>    				  struct amdgpu_iv_entry *entry,
> >>>    				  u64 addr,
> >>> 
> >>> @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >>> 
> >>>    				  u32 node_id,
> >>>    				  bool write_fault)
> >>>    
> >>>    {
> >>> 
> >>> +	struct dma_fence *fence = NULL;
> >>> 
> >>>    	int ret;
> >>>    	
> >>>    	if (adev->irq.retry_cam_enabled) {
> >>> 
> >>> @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >>>    		}
> >>>    		
> >>>    		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
> >>> -					     addr, entry->timestamp, write_fault, NULL);
> >>> -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> >>> +					     addr, entry->timestamp, write_fault, &fence);
> >>> +
> >>> +		/* If the update is already done, ACK now, otherwise when it's done. */
> >>> +		if (fence) {
> >>> +			adev->gmc.retry_cb[cam_index].adev = adev;
> >> 
> >> Why is 16 retry_cb elements enough? I see in the code cam_index extracted
> >> from the IV entry with a mask such as 0x3ff.
> > 
> > I think this came up in a conversation after I had already submitted the
> > patch. The maximum number of CAM entries is specified by the
> > IH_RETRY_INT_CAM_CNTL.CAM_SIZE field.  The content of the field will need
> > to be interpreted as something like this:
> > ((CAM_SIZE + 1) * 64) = (15 + 1) * 64 = 1024
> > 
> > It is a good question whether we actually want to statically allocate that
> > many items. We should very much avoid doing dynamic allocation in the page
> > fault handler. I'm open to suggestions on how to move forward with this.
> > 
> >>> +
> >>> +			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
> >>> +						   amdgpu_gmc_retry_fault_handled))
> >>> +				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> >>> +
> >>> +			dma_fence_put(fence);
> >>> +		} else {
> >>> +			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> >>> +		}
> >>> +
> >>> 
> >>>    		if (ret)
> >>>    		
> >>>    			return 1;
> >>>    	
> >>>    	} else {
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> index 77eb153802845..3bfb06e011a86 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> >>> @@ -27,6 +27,7 @@
> >>> 
> >>>    #define __AMDGPU_GMC_H__
> >>>    
> >>>    #include <linux/types.h>
> >>> 
> >>> +#include <linux/dma-fence.h>
> >>> 
> >>>    #include "amdgpu_irq.h"
> >>>    #include "amdgpu_xgmi.h"
> >>> 
> >>> @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
> >>> 
> >>>    	int nid_mask;
> >>>    
> >>>    };
> >>> 
> >>> +struct amdgpu_fence_cb {
> >>> +	struct amdgpu_device *adev;
> >>> +	struct dma_fence_cb cb;
> >>> +};
> >>> +
> >>> 
> >>>    enum amdgpu_gart_placement {
> >>>    
> >>>    	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
> >>>    	AMDGPU_GART_PLACEMENT_HIGH,
> >>> 
> >>> @@ -305,6 +311,8 @@ struct amdgpu_gmc {
> >>> 
> >>>    	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
> >>>    	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
> >>> 
> >>> +	struct amdgpu_fence_cb retry_cb[16];
> >>> +
> >>> 
> >>>    	bool tmz_enabled;
> >>>    	bool is_app_apu;
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> index 8c3ba7213eb22..f5e9b97e92a8c 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> >>> @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
> >>> 
> >>>    	r = amdgpu_vm_update_pdes(adev, vm, true);
> >>> 
> >>> -	*fence = vm->last_update;
> >>> +	*fence = dma_fence_get(vm->last_update);
> >> 
> >> Ah! But passing over since you said you are dropping that patch anyway.
> > 
> > That line should have gone to the previous patch and was added to this one
> > by mistake.
> > 
> >>>    error_unlock:
> >>>    	amdgpu_bo_unreserve(root);
> >>> 
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> index 2eb64df6daa94..6e28f0e435bf5 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> >>> @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> >>> 
> >>>    				   DMA_RESV_USAGE_BOOKKEEP);
> >>>    	
> >>>    	}
> >>> 
> >>> -	if (fence && !p->immediate) {
> >>> +	if (fence) {
> >> 
> >> Is this deliberate, and if so, what is it about? The commit message should
> >> explain it as well.
> > 
> > The reason it is changed is because previously it wouldn't return a fence
> > in immediate mode. This line also should have gone to the previous patch
> > and was added to this one by mistake.
> > 
> > Thanks & best regards,
> > Timur






* Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
  2026-06-24 15:42         ` Timur Kristóf
@ 2026-06-24 15:52           ` Tvrtko Ursulin
  0 siblings, 0 replies; 21+ messages in thread
From: Tvrtko Ursulin @ 2026-06-24 15:52 UTC (permalink / raw)
  To: Timur Kristóf, amd-gfx, Alexander.Deucher,
	Christian König, Natalie Vock, Amir Shetaia,
	Marek Olšák, Mario Limonciello


On 24/06/2026 16:42, Timur Kristóf wrote:
> On Wednesday, 24 June 2026 17:14:59 Central European Summer Time, Tvrtko Ursulin
> wrote:
>> On 24/06/2026 15:52, Timur Kristóf wrote:
>>> On Wednesday, 24 June 2026 16:31:20 Central European Summer Time, Tvrtko
>>> Ursulin wrote:
>>>> On 29/05/2026 11:30, Timur Kristóf wrote:
>>>>> Add a fence callback to the VM update and ACK the retry CAM
>>>>> after the VM update is finished. Previously, we would ACK it
>>>>> immediately after calling amdgpu_vm_handle_fault() which
>>>>> caused a race condition that was likely to trigger the same
>>>>> interrupt again, causing the same fault to be handled
>>>>> multiple times.
>>>>>
>>>>> Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
>>>>> ---
>>>>>
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
>>>>>     4 files changed, 36 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> index 26aea960e2759..21c8d87477448 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>>>> @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t addr,
>>>>>     	} while (fault->timestamp < tmp);
>>>>>     }
>>>>>
>>>>> +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
>>>>> +					   struct dma_fence_cb *cb)
>>>>> +{
>>>>> +	struct amdgpu_fence_cb *afc = container_of(cb, struct amdgpu_fence_cb, cb);
>>>>> +	struct amdgpu_device *adev = afc->adev;
>>>>> +
>>>>> +	/* CAM index is the array index of the current callback struct */
>>>>> +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev->gmc.retry_cb[0]);
>>>>
>>>> Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index
>>>> of the array element, while ->retry_cam_ack() expects the content of
>>>> that element, no?
>>>
>>> Like the comment says, the CAM index is the array index.
>>> We just need the CAM index in order to tell the CAM to ACK the current
>>> entry. The contents of the array are just there to make
>>> dma_fence_add_callback() work with this callback function.
>>
>> Ah you are right, I got confused. But it is also a bit bad, and I mean
>> not just the array sizing dilemma from lower in the email. But since the
>> cam_index comes from the hardware and then below we blindly do:
>>
>> 	if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
>> 				   amdgpu_gmc_retry_fault_handled))
>>
>> Should hardware manage to send two faults with the same cam_index when
>> the previous one hasn't been handled
> 
> The retry CAM exists to filter page fault interrupts and prevent sending
> multiple interrupts for the same fault. It won't send an interrupt with the
> same cam_index until we ACK the previous one.

Okay, but the kernel at least needs to be defensive and check, to avoid a crash.
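
One way such a defensive check could look, as a minimal userspace sketch rather than driver code: a per-index busy flag is claimed before the callback is installed and released when the callback runs. The names retry_slot_claim(), retry_slot_release() and RETRY_CB_SLOTS are hypothetical, the slot count of 1024 is taken from the CAM_SIZE arithmetic quoted in this thread, and C11 atomics stand in for the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: (CAM_SIZE + 1) * 64 with CAM_SIZE = 15, as quoted in the thread. */
#define RETRY_CB_SLOTS 1024

/* Hypothetical per-CAM-index state; not the driver's real structure. */
struct retry_cb_slot {
	atomic_bool in_flight; /* true while a fence callback is pending */
};

/* Static storage is zero-initialized, so every slot starts out free. */
static struct retry_cb_slot retry_slots[RETRY_CB_SLOTS];

/*
 * Try to take ownership of the slot for cam_index.
 * Returns false if the index is out of range or the slot is still busy,
 * i.e. the hardware reported the same index again before the previous
 * fault was acknowledged.
 */
static bool retry_slot_claim(unsigned int cam_index)
{
	if (cam_index >= RETRY_CB_SLOTS)
		return false;

	/* atomic_exchange returns the old value: false means it was free. */
	return !atomic_exchange(&retry_slots[cam_index].in_flight, true);
}

/* Called from the fence callback once the ACK has been sent. */
static void retry_slot_release(unsigned int cam_index)
{
	if (cam_index < RETRY_CB_SLOTS)
		atomic_store(&retry_slots[cam_index].in_flight, false);
}
```

If the claim fails, the handler could log the unexpected duplicate and skip installing the callback, instead of reusing a callback struct that is still linked into an unsignaled fence.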

>> that is the very same callback is
>> already installed and unsignaled (expect the unexpected), we have just
>> upgraded the hardware bug to a kernel crash.
>>
>> If I now understand it right, you want to "remember" the cam_index
>> received so callback knows what to handle. Hmm.. Allocating memory does
>> seem allowed if I follow correctly that amdgpu_vm_handle_fault() is
>> calling dma_resv_reserve_fences(). So unless I am missing something
>> perhaps kmalloc of struct amdgpu_fence_cb would be fine after all?
> 
> It may be fine, but I'd very much prefer to avoid it if possible.

One option is to not free the callback struct upon handling it, but 
stash it somewhere so the following interrupt can simply take it. You 
can coordinate using cmpxchg for example.

So on the interrupt arriving the flow would be:

if "grab previously stashed unused callback via cmpxchg"
else
	"allocate a new one"

On signalling:

cmpxchg to store the executed callback into the slot and free what was 
in the slot, if anything.

Slot possibly goes into struct gmc.

If you want to optimise for multiple parallel cam indices you would need
multiple slots. Maybe hash the index read from hardware to limit the
number of slot placeholders to fewer than the 1024 that you mentioned. It
depends on how many parallel interrupts you want to handle on the fast path
(no allocations).

And free the used slot(s) on driver exit of course.
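
The single-slot flow above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: struct retry_fault_cb, stash_slot and the three helpers are hypothetical names, malloc()/free() stand in for the kernel allocator, and C11 atomics stand in for the kernel's cmpxchg()/xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback record; stands in for the patch's struct amdgpu_fence_cb. */
struct retry_fault_cb {
	unsigned int cam_index;
};

/* One stash slot; the mail suggests it could live in the gmc struct. */
static _Atomic(struct retry_fault_cb *) stash_slot;

/* Interrupt arrival: grab the stashed record if there is one, else allocate. */
static struct retry_fault_cb *retry_cb_get(unsigned int cam_index)
{
	struct retry_fault_cb *cb = atomic_load(&stash_slot);

	/* On failure the CAS reloads cb with the current slot content. */
	while (cb && !atomic_compare_exchange_weak(&stash_slot, &cb,
						   (struct retry_fault_cb *)NULL))
		;

	if (!cb)
		cb = malloc(sizeof(*cb)); /* slow path: nothing was stashed */
	if (cb)
		cb->cam_index = cam_index;
	return cb;
}

/* Signalling: store the executed record in the slot, free what was there. */
static void retry_cb_put(struct retry_fault_cb *cb)
{
	free(atomic_exchange(&stash_slot, cb));
}

/* Driver exit: release whatever is still stashed. */
static void retry_cb_drain(void)
{
	free(atomic_exchange(&stash_slot, (struct retry_fault_cb *)NULL));
}
```

With one slot, only one fault at a time avoids the allocation; handling several CAM indices in parallel on the fast path would need an array of such slots, as described above.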

Regards,

Tvrtko

>> And
>> if so you should also probably rename it to a less generic name along
>> the lines of amdgpu_retry_fault_cb or so. Workable?
> 
> I'm OK to rename it for sure.
> 
> Timur
> 
>>
>>>>> +}
>>>>> +
>>>>>
>>>>>     int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>>>     
>>>>>     				  struct amdgpu_iv_entry *entry,
>>>>>     				  u64 addr,
>>>>>
>>>>> @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>>>
>>>>>     				  u32 node_id,
>>>>>     				  bool write_fault)
>>>>>     
>>>>>     {
>>>>>
>>>>> +	struct dma_fence *fence = NULL;
>>>>>
>>>>>     	int ret;
>>>>>     	
>>>>>     	if (adev->irq.retry_cam_enabled) {
>>>>>
>>>>> @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
>>>>>     		}
>>>>>     		
>>>>>     		ret = amdgpu_vm_handle_fault(adev, entry->pasid, entry->vmid, node_id,
>>>>> -					     addr, entry->timestamp, write_fault, NULL);
>>>>> -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>>>> +					     addr, entry->timestamp, write_fault, &fence);
>>>>> +
>>>>> +		/* If the update is already done, ACK now, otherwise when it's done. */
>>>>> +		if (fence) {
>>>>> +			adev->gmc.retry_cb[cam_index].adev = adev;
>>>>
>>>> Why is 16 retry_cb elements enough? I see in the code cam_index extracted
>>>> from the IV entry with a mask such as 0x3ff.
>>>
>>> I think this came up in a conversation after I had already submitted the
>>> patch. The maximum number of CAM entries is specified by the
>>> IH_RETRY_INT_CAM_CNTL.CAM_SIZE field.  The content of the field will need
>>> to be interpreted as something like this:
>>> ((CAM_SIZE + 1) * 64) = (15 + 1) * 64 = 1024
>>>
>>> It is a good question whether we actually want to statically allocate that
>>> many items. We should very much avoid doing dynamic allocation in the page
>>> fault handler. I'm open to suggestions on how to move forward with this.
>>>
>>>>> +
>>>>> +			if (dma_fence_add_callback(fence, &adev->gmc.retry_cb[cam_index].cb,
>>>>> +						   amdgpu_gmc_retry_fault_handled))
>>>>> +				adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>>>> +
>>>>> +			dma_fence_put(fence);
>>>>> +		} else {
>>>>> +			adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
>>>>> +		}
>>>>> +
>>>>>
>>>>>     		if (ret)
>>>>>     		
>>>>>     			return 1;
>>>>>     	
>>>>>     	} else {
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> index 77eb153802845..3bfb06e011a86 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>>>>> @@ -27,6 +27,7 @@
>>>>>
>>>>>     #define __AMDGPU_GMC_H__
>>>>>     
>>>>>     #include <linux/types.h>
>>>>>
>>>>> +#include <linux/dma-fence.h>
>>>>>
>>>>>     #include "amdgpu_irq.h"
>>>>>     #include "amdgpu_xgmi.h"
>>>>>
>>>>> @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
>>>>>
>>>>>     	int nid_mask;
>>>>>     
>>>>>     };
>>>>>
>>>>> +struct amdgpu_fence_cb {
>>>>> +	struct amdgpu_device *adev;
>>>>> +	struct dma_fence_cb cb;
>>>>> +};
>>>>> +
>>>>>
>>>>>     enum amdgpu_gart_placement {
>>>>>     
>>>>>     	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
>>>>>     	AMDGPU_GART_PLACEMENT_HIGH,
>>>>>
>>>>> @@ -305,6 +311,8 @@ struct amdgpu_gmc {
>>>>>
>>>>>     	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
>>>>>     	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
>>>>>
>>>>> +	struct amdgpu_fence_cb retry_cb[16];
>>>>> +
>>>>>
>>>>>     	bool tmz_enabled;
>>>>>     	bool is_app_apu;
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> index 8c3ba7213eb22..f5e9b97e92a8c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>> @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, u32 pasid,
>>>>>
>>>>>     	r = amdgpu_vm_update_pdes(adev, vm, true);
>>>>>
>>>>> -	*fence = vm->last_update;
>>>>> +	*fence = dma_fence_get(vm->last_update);
>>>>
>>>> Ah! But passing over since you said you are dropping that patch anyway.
>>>
>>> That line should have gone to the previous patch and was added to this one
>>> by mistake.
>>>
>>>>>     error_unlock:
>>>>>     	amdgpu_bo_unreserve(root);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> index 2eb64df6daa94..6e28f0e435bf5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>>> @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>>>>
>>>>>     				   DMA_RESV_USAGE_BOOKKEEP);
>>>>>     	
>>>>>     	}
>>>>>
>>>>> -	if (fence && !p->immediate) {
>>>>> +	if (fence) {
>>>>
>>>> Is this deliberate, and if so, what is it about? The commit message should
>>>> explain it as well.
>>>
>>> The reason it is changed is because previously it wouldn't return a fence
>>> in immediate mode. This line also should have gone to the previous patch
>>> and was added to this one by mistake.
>>>
>>> Thanks & best regards,
>>> Timur
> 
> 
> 
> 



