Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Timur Kristóf" <timur.kristof@gmail.com>
To: amd-gfx@lists.freedesktop.org, Alexander.Deucher@amd.com,
	"Christian König" <christian.koenig@amd.com>,
	"Natalie Vock" <natalie.vock@gmx.de>,
	"Amir Shetaia" <Amir.Shetaia@amd.com>,
	"Marek Olšák" <maraeo@gmail.com>,
	"Mario Limonciello" <mario.limonciello@amd.com>,
	"Tvrtko Ursulin" <tursulin@ursulin.net>
Subject: Re: [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes
Date: Wed, 24 Jun 2026 16:52:30 +0200	[thread overview]
Message-ID: <2746166.fDdHjke4Dd@timur-max> (raw)
In-Reply-To: <55bb616f-343f-4189-8ef8-e9f20429cdbb@ursulin.net>

On 2026. június 24., szerda 16:31:20 közép-európai nyári idő Tvrtko Ursulin 
wrote:
> On 29/05/2026 11:30, Timur Kristóf wrote:
> > Add a fence callback to the VM update and ACK the retry CAM
> > after the VM update is finished. Previously, we would ACK it
> > immediately after calling amdgpu_vm_handle_fault() which
> > caused a race condition that was likely to trigger the same
> > interrupt again, causing the same fault to be handled
> > multiple times.
> > 
> > Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
> > ---
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c     | 28 +++++++++++++++++++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h     |  8 ++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      |  2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  2 +-
> >   4 files changed, 36 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
> > 26aea960e2759..21c8d87477448 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -545,6 +545,16 @@ void amdgpu_gmc_filter_faults_remove(struct
> > amdgpu_device *adev, uint64_t addr,> 
> >   	} while (fault->timestamp < tmp);
> >   
> >   }
> > 
> > +static void amdgpu_gmc_retry_fault_handled(struct dma_fence *fence,
> > +					   struct dma_fence_cb 
*cb)
> > +{
> > +	struct amdgpu_fence_cb *afc = container_of(cb, struct 
amdgpu_fence_cb,
> > cb); +	struct amdgpu_device *adev = afc->adev;
> > +
> > +	/* CAM index is the array index of the current callback struct */
> > +	adev->irq.ih_funcs->retry_cam_ack(adev, afc - &adev-
>gmc.retry_cb[0]);
> 
> Is the "afc - &adev->gmc.retry_cb[0]" part correct? It will be the index
> of the array element, while ->retry_cam_ack() expects the content of
> that element, no?

Like the comment says, the CAM index is the array index.
We just need the CAM index in order to tell the CAM to ACK the current entry.
The contents of the array are just there to make dma_fence_add_callback() work 
with this callback function.

> 
> > +}
> > +
> > 
> >   int amdgpu_gmc_handle_retry_fault(struct amdgpu_device *adev,
> >   
> >   				  struct amdgpu_iv_entry *entry,
> >   				  u64 addr,
> > 
> > @@ -552,6 +562,7 @@ int amdgpu_gmc_handle_retry_fault(struct amdgpu_device
> > *adev,> 
> >   				  u32 node_id,
> >   				  bool write_fault)
> >   
> >   {
> > 
> > +	struct dma_fence *fence = NULL;
> > 
> >   	int ret;
> >   	
> >   	if (adev->irq.retry_cam_enabled) {
> > 
> > @@ -564,8 +575,21 @@ int amdgpu_gmc_handle_retry_fault(struct
> > amdgpu_device *adev,> 
> >   		}
> >   		
> >   		ret = amdgpu_vm_handle_fault(adev, entry->pasid, 
entry->vmid, node_id,
> > 
> > -					     addr, entry-
>timestamp, write_fault, NULL);
> > -		adev->irq.ih_funcs->retry_cam_ack(adev, cam_index);
> > +					     addr, entry-
>timestamp, write_fault, &fence);
> > +
> > +		/* If the update is already done, ACK now, otherwise 
when it's done. */
> > +		if (fence) {
> > +			adev->gmc.retry_cb[cam_index].adev = adev;
> 
> Why is 16 retry_cb elements enough? I see in the code cam_index extraced
> from the IV entry with a mask such as 0x3ff.

I think this came up in a conversation after I had already submitted the 
patch. The maximum amount of CAM entries are specified by the 
IH_RETRY_INT_CAM_CNTL.CAM_SIZE field.  The content of the field will need to be 
interpreted as something like this:
((CAM_SIZE + 1) * 64) = (15 + 1) * 64 = 1024

It is a good question whether we actually want to statically allocate that 
many items. We should very much avoid doing dynamic allocation in the page 
fault handler. I'm open to suggestions on how to move forward with this.

> 
> > +
> > +			if (dma_fence_add_callback(fence, &adev-
>gmc.retry_cb[cam_index].cb,
> > +						   
amdgpu_gmc_retry_fault_handled))
> > +				adev->irq.ih_funcs-
>retry_cam_ack(adev, cam_index);
> > +
> > +			dma_fence_put(fence);
> > +		} else {
> > +			adev->irq.ih_funcs->retry_cam_ack(adev, 
cam_index);
> > +		}
> > +
> > 
> >   		if (ret)
> >   		
> >   			return 1;
> >   	
> >   	} else {
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
> > 77eb153802845..3bfb06e011a86 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
> > @@ -27,6 +27,7 @@
> > 
> >   #define __AMDGPU_GMC_H__
> >   
> >   #include <linux/types.h>
> > 
> > +#include <linux/dma-fence.h>
> > 
> >   #include "amdgpu_irq.h"
> >   #include "amdgpu_xgmi.h"
> > 
> > @@ -214,6 +215,11 @@ struct amdgpu_gmc_memrange {
> > 
> >   	int nid_mask;
> >   
> >   };
> > 
> > +struct amdgpu_fence_cb {
> > +	struct amdgpu_device *adev;
> > +	struct dma_fence_cb cb;
> > +};
> > +
> > 
> >   enum amdgpu_gart_placement {
> >   
> >   	AMDGPU_GART_PLACEMENT_BEST_FIT = 0,
> >   	AMDGPU_GART_PLACEMENT_HIGH,
> > 
> > @@ -305,6 +311,8 @@ struct amdgpu_gmc {
> > 
> >   	} fault_hash[AMDGPU_GMC_FAULT_HASH_SIZE];
> >   	uint64_t		last_fault:AMDGPU_GMC_FAULT_RING_ORDER;
> > 
> > +	struct amdgpu_fence_cb retry_cb[16];
> > +
> > 
> >   	bool tmz_enabled;
> >   	bool is_app_apu;
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index
> > 8c3ba7213eb22..f5e9b97e92a8c 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> > @@ -3035,7 +3035,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device
> > *adev, u32 pasid,> 
> >   	r = amdgpu_vm_update_pdes(adev, vm, true);
> > 
> > -	*fence = vm->last_update;
> > +	*fence = dma_fence_get(vm->last_update);
> 
> Ah! But passing over since you said you are dropping that patch anyway.

That line should have gone to the previous patch and was added to this one by 
mistake.

> 
> >   error_unlock:
> >   	amdgpu_bo_unreserve(root);
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c index
> > 2eb64df6daa94..6e28f0e435bf5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -132,7 +132,7 @@ static int amdgpu_vm_sdma_commit(struct
> > amdgpu_vm_update_params *p,> 
> >   				   DMA_RESV_USAGE_BOOKKEEP);
> >   	
> >   	}
> > 
> > -	if (fence && !p->immediate) {
> > +	if (fence) {
> 
> Is this deliberate and if so what it is about? Commit message should
> explain it as well.

The reason it is changed is because previously it wouldn't return a fence in 
immediate mode. This line also should have gone to the previous patch and was 
added to this one by mistake. 

Thanks & best regards,
Timur

next prev parent reply	other threads:[~2026-06-24 14:52 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-29 10:30 [PATCH 0/7] drm/amdgpu: Implement retry faults on Navi 4 Timur Kristóf
2026-05-29 10:30 ` [PATCH 1/7] drm/amdgpu/vm: Add fence argument to amdgpu_vm_handle_fault() Timur Kristóf
2026-06-24 13:54   ` Tvrtko Ursulin
2026-06-24 14:09     ` Timur Kristóf
2026-05-29 10:30 ` [PATCH 2/7] drm/amdgpu: ACK the retry CAM after VM update finishes Timur Kristóf
2026-06-24 14:31   ` Tvrtko Ursulin
2026-06-24 14:52     ` Timur Kristóf [this message]
2026-06-24 15:14       ` Tvrtko Ursulin
2026-06-24 15:42         ` Timur Kristóf
2026-06-24 15:52           ` Tvrtko Ursulin
2026-05-29 10:30 ` [PATCH 3/7] drm/amdgpu/ih7.0: Use MMIO ACK instead of doorbell for retry CAM on IH 7.0 Timur Kristóf
2026-05-29 10:30 ` [PATCH 4/7] drm/amdgpu/ih7.0: Use IH_SW_RING_SIZE for soft IH ring instead of PAGE_SIZE Timur Kristóf
2026-06-24 14:37   ` Tvrtko Ursulin
2026-06-24 15:16     ` Timur Kristóf
2026-05-29 10:30 ` [PATCH 5/7] drm/amdgpu/gmc12.0: Use AMDGPU_PTE_IS_PTE flag for init_pte_flags on GFX12.0 Timur Kristóf
2026-06-24 14:54   ` Tvrtko Ursulin
2026-06-24 15:30     ` Timur Kristóf
2026-05-29 10:30 ` [PATCH 6/7] drm/amdgpu/vm: Use init PTE flags, and NOALLOC in amdgpu_vm_handle_fault() Timur Kristóf
2026-06-24 14:56   ` Tvrtko Ursulin
2026-05-29 10:30 ` [PATCH 7/7] drm/amdgpu/gmc12: Pass cam_index to retry fault handler Timur Kristóf
2026-06-24 14:59   ` Tvrtko Ursulin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2746166.fDdHjke4Dd@timur-max \
    --to=timur.kristof@gmail.com \
    --cc=Alexander.Deucher@amd.com \
    --cc=Amir.Shetaia@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=maraeo@gmail.com \
    --cc=mario.limonciello@amd.com \
    --cc=natalie.vock@gmx.de \
    --cc=tursulin@ursulin.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.