Re: [PATCH v4] drm/amdgpu: Fix the race condition for draining retry fault

From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: Emily Deng <Emily.Deng@amd.com>, amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH v4] drm/amdgpu: Fix the race condition for draining retry fault
Date: Thu, 6 Mar 2025 11:00:46 -0600
Message-ID: <b6c5ca92-b76e-4565-b96b-eaea9c6296f6@amd.com>
In-Reply-To: <20250306060309.3222693-1-Emily.Deng@amd.com>

Thanks for catching and fixing this race condition. It looks good to me.
One minor thing below:

On 3/6/2025 12:03 AM, Emily Deng wrote:
> Issue:
> In the scenario where svm_range_restore_pages is called, but svms->checkpoint_ts
> has not been set and the retry fault has not been drained, svm_range_unmap_from_cpu
> is triggered and calls svm_range_free. Meanwhile, svm_range_restore_pages
> continues execution and reaches svm_range_from_addr. This results in
> a "failed to find prange..." error, causing the page recovery to fail.
>
> How to fix:
> Move the timestamp check code under the protection of svms->lock.
>
> v2:
> Make sure all the right locks are released before exiting.
>
> v3:
> Directly goto out_unlock_svms, and return -EAGAIN.
>
> v4:
> Refine code.
>
> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 +++++++++++++++-------------
>   1 file changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index d04725583f19..83ac14bf7a7a 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3008,19 +3008,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>   		goto out;
>   	}
>   
> -	/* check if this page fault time stamp is before svms->checkpoint_ts */
> -	if (svms->checkpoint_ts[gpuidx] != 0) {
> -		if (amdgpu_ih_ts_after_or_equal(ts,  svms->checkpoint_ts[gpuidx])) {
> -			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
> -			r = 0;
> -			goto out;
> -		} else
> -			/* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
> -			 * to zero to avoid following ts wrap around give wrong comparing
> -			 */
> -			svms->checkpoint_ts[gpuidx] = 0;
> -	}
> -
>   	if (!p->xnack_enabled) {
>   		pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>   		r = -EFAULT;
> @@ -3040,6 +3027,20 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>   	mmap_read_lock(mm);
>   retry_write_locked:
>   	mutex_lock(&svms->lock);
> +
> +	/* check if this page fault time stamp is before svms->checkpoint_ts */
> +	if (svms->checkpoint_ts[gpuidx] != 0) {
> +		if (amdgpu_ih_ts_after_or_equal(ts,  svms->checkpoint_ts[gpuidx])) {
> +			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
> +			r = -EAGAIN;

We drop the page fault because it is stale, not because we mean to
handle it again. If we return -EAGAIN, amdgpu_gmc_filter_faults_remove
is called. If the user maps the same range again after the unmap, a
page fault in that range should be treated as a new one.
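
To make the return-code point concrete, here is a minimal sketch of how
the two choices differ for the fault filter. It is a toy model, not the
driver code: the one-entry filter and the names filter_fault,
filter_remove, restore_pages, handle_fault and entry_survives are all
hypothetical, and the only behavior assumed from the driver is that a
-EAGAIN result leads to the filter entry being removed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-entry stand-in for the retry-fault filter. */
struct fault_filter {
	bool     valid;
	uint64_t addr;
};

/* Return true if addr is already tracked; otherwise start tracking it. */
static bool filter_fault(struct fault_filter *f, uint64_t addr)
{
	if (f->valid && f->addr == addr)
		return true;
	f->valid = true;
	f->addr = addr;
	return false;
}

/* Stop tracking addr so a later fault at addr is seen as new. */
static void filter_remove(struct fault_filter *f, uint64_t addr)
{
	if (f->valid && f->addr == addr)
		f->valid = false;
}

/* Model of the restore path: a stale fault returns stale_rc. */
static int restore_pages(bool stale, int stale_rc)
{
	return stale ? stale_rc : 0;
}

/* Model of the caller: -EAGAIN removes the filter entry (assumption). */
static int handle_fault(struct fault_filter *f, uint64_t addr,
			bool stale, int stale_rc)
{
	int r;

	if (filter_fault(f, addr))
		return 0;	/* duplicate of a tracked fault, ignored */

	r = restore_pages(stale, stale_rc);
	if (r == -EAGAIN) {
		filter_remove(f, addr);
		r = 0;
	}
	return r;
}

/* Does the filter entry remain after one stale fault is dropped? */
static bool entry_survives(int stale_rc)
{
	struct fault_filter f = { false, 0 };

	handle_fault(&f, 0x1000, true, stale_rc);
	return f.valid;
}
```

In this model entry_survives(0) is true and entry_survives(-EAGAIN) is
false: returning 0 leaves the entry in place, so a later fault at the
same address is filtered as a duplicate, while -EAGAIN removes it and
the next fault at that address reaches the handler again.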

Regards

Xiaogang

> +			goto out_unlock_svms;
> +		} else
> +			/* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
> +			 * to zero to avoid following ts wrap around give wrong comparing
> +			 */
> +			svms->checkpoint_ts[gpuidx] = 0;
> +	}
> +
>   	prange = svm_range_from_addr(svms, addr, NULL);
>   	if (!prange) {
>   		pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
> @@ -3165,7 +3166,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>   	mutex_unlock(&svms->lock);
>   	mmap_read_unlock(mm);
>   
> -	svm_range_count_fault(node, p, gpuidx);
> +	if (r != -EAGAIN)
> +		svm_range_count_fault(node, p, gpuidx);
>   
>   	mmput(mm);
>   out:
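
For reference, the wrap-around concern in the quoted comment can be
sketched as follows. This is a minimal illustration that assumes 48-bit
timestamps; ts_after_or_equal and fault_is_stale are hypothetical names
for this example, not the amdgpu_ih helpers themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical wrap-safe comparison for 48-bit timestamps (the width is
 * an assumption). Returns true when t2 is at or after t1.
 */
static bool ts_after_or_equal(uint64_t t1, uint64_t t2)
{
	/*
	 * Shift the 48-bit values to the top of a 64-bit word so the
	 * unsigned difference wraps where the counter wraps; the sign of
	 * the difference then gives the ordering.
	 */
	return (int64_t)((t2 << 16) - (t1 << 16)) >= 0;
}

/* A fault is treated as stale when the checkpoint is at or after it. */
static bool fault_is_stale(uint64_t fault_ts, uint64_t checkpoint_ts)
{
	if (checkpoint_ts == 0)
		return false;	/* no drain checkpoint recorded */
	return ts_after_or_equal(fault_ts, checkpoint_ts);
}
```

With this comparison, a fault that is more than half the counter range
newer than a lingering checkpoint, for example
fault_is_stale(100 + (1ULL << 47) + 5, 100), is misjudged as stale.
Clearing the checkpoint to zero once a newer fault has been seen avoids
that case in this sketch.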
