From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: Philip Yang <Philip.Yang@amd.com>, amd-gfx@lists.freedesktop.org
Cc: Felix.Kuehling@amd.com
Subject: Re: [PATCH] drm/amdkfd: Wait vm update fence after retry fault recovered
Date: Tue, 26 Sep 2023 15:43:14 -0500
Message-ID: <1d6af500-2a17-0f95-3c86-024cdded0fa9@amd.com>
In-Reply-To: <20230922213759.20904-1-Philip.Yang@amd.com>
On 9/22/2023 4:37 PM, Philip Yang wrote:
> Otherwise kfd flush tlb does nothing if vm update fence callback doesn't
> update vm->tlb_seq. H/W will generate retry fault again.
>
> This works now because retry fault keep coming, recover will update page
> table again after AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING timeout and flush
> tlb.
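To illustrate the mechanism described above: the flush is gated on vm->tlb_seq, so if the vm update fence callback has not run yet, the flush degenerates to a no-op. A minimal stand-alone model of that gating (the struct and function names below are invented for illustration, not the real amdgpu types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: the flush path is skipped when the VM's tlb_seq has not
 * advanced since the last flush this device saw. */
struct vm_model { uint64_t tlb_seq; };
struct pdd_model { uint64_t last_flushed_seq; };

/* The vm update fence callback bumps tlb_seq once the page table
 * update has actually landed. */
static void vm_update_fence_signaled(struct vm_model *vm)
{
	vm->tlb_seq++;
}

/* Returns true only if a real flush was issued. */
static bool kfd_flush_tlb_model(struct pdd_model *pdd, struct vm_model *vm)
{
	if (pdd->last_flushed_seq == vm->tlb_seq)
		return false;	/* seq unchanged: flush is a no-op */
	pdd->last_flushed_seq = vm->tlb_seq;
	return true;
}
```

In this model, flushing before the fence signals does nothing; waiting for the fence first makes the subsequent flush real, which is the ordering the patch is after.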
I think what this patch does is wait on the vm->last_update fence in the gpu page fault retry handler, and I do not see what bug that fixes: the h/w will keep generating retry faults as long as the vm page table is not set up correctly, whether or not the kfd driver waits on the fence, and the page table will eventually be set up either way.

There is also a consequence I saw: waiting on the vm page table update fence delays the gpu page fault handler's exit. More h/w interrupt vectors are then queued on the sw ring in the meantime, potentially causing the ring to overflow.
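To make that concern concrete, here is a toy model of a fixed-capacity sw ring with a delayed consumer (all names invented for illustration; this is not the real interrupt-handling code):

```c
#include <assert.h>

/* Toy model of the sw interrupt ring: entries arrive while the fault
 * handler is busy; if the handler drains too slowly, entries are lost. */
struct sw_ring {
	int pending;	/* entries currently queued */
	int capacity;	/* fixed ring size */
	int dropped;	/* entries lost to overflow */
};

static void ring_push(struct sw_ring *r)
{
	if (r->pending == r->capacity)
		r->dropped++;	/* overflow: interrupt vector lost */
	else
		r->pending++;
}

static void ring_pop(struct sw_ring *r)
{
	if (r->pending)
		r->pending--;
}

/* Deliver n interrupts; the handler drains one entry only every
 * handler_delay deliveries. Returns how many entries were dropped. */
static int simulate_ring(int n, int handler_delay, int capacity)
{
	struct sw_ring r = { 0, capacity, 0 };

	for (int i = 1; i <= n; i++) {
		ring_push(&r);
		if (i % handler_delay == 0)
			ring_pop(&r);
	}
	return r.dropped;
}
```

With a fast handler (drain after every delivery) nothing is dropped; stretch the handler's exit out by a factor of three and a small ring overflows almost immediately, which is the failure mode I am worried about here.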
Regards
Xiaogang
> Remove wait parameter in svm_range_validate_and_map because it is
> always called with true.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 70aa882636ab..61f4de1633a8 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1447,7 +1447,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct svm_range *prange,
> static int
> svm_range_map_to_gpus(struct svm_range *prange, unsigned long offset,
> unsigned long npages, bool readonly,
> - unsigned long *bitmap, bool wait, bool flush_tlb)
> + unsigned long *bitmap, bool flush_tlb)
> {
> struct kfd_process_device *pdd;
> struct amdgpu_device *bo_adev = NULL;
> @@ -1480,8 +1480,7 @@ svm_range_map_to_gpus(struct svm_range *prange, unsigned long offset,
>
> r = svm_range_map_to_gpu(pdd, prange, offset, npages, readonly,
> prange->dma_addr[gpuidx],
> - bo_adev, wait ? &fence : NULL,
> - flush_tlb);
> + bo_adev, &fence, flush_tlb);
> if (r)
> break;
>
> @@ -1605,7 +1604,7 @@ static void *kfd_svm_page_owner(struct kfd_process *p, int32_t gpuidx)
> */
> static int svm_range_validate_and_map(struct mm_struct *mm,
> struct svm_range *prange, int32_t gpuidx,
> - bool intr, bool wait, bool flush_tlb)
> + bool intr, bool flush_tlb)
> {
> struct svm_validate_context *ctx;
> unsigned long start, end, addr;
> @@ -1729,7 +1728,7 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>
> if (!r)
> r = svm_range_map_to_gpus(prange, offset, npages, readonly,
> - ctx->bitmap, wait, flush_tlb);
> + ctx->bitmap, flush_tlb);
>
> if (!r && next == end)
> prange->mapped_to_gpu = true;
> @@ -1823,7 +1822,7 @@ static void svm_range_restore_work(struct work_struct *work)
> mutex_lock(&prange->migrate_mutex);
>
> r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
> - false, true, false);
> + false, false);
> if (r)
> pr_debug("failed %d to map 0x%lx to gpus\n", r,
> prange->start);
> @@ -3064,7 +3063,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
> }
> }
>
> - r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false);
> + r = svm_range_validate_and_map(mm, prange, gpuidx, false, false);
> if (r)
> pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n",
> r, svms, prange->start, prange->last);
> @@ -3603,7 +3602,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
> flush_tlb = !migrated && update_mapping && prange->mapped_to_gpu;
>
> r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
> - true, true, flush_tlb);
> + true, flush_tlb);
> if (r)
> pr_debug("failed %d to map svm range\n", r);
>
> --
> 2.35.1
>
Thread overview: 5+ messages
2023-09-22 21:37 [PATCH] drm/amdkfd: Wait vm update fence after retry fault recovered Philip Yang
2023-09-26 20:43 ` Chen, Xiaogang [this message]
2023-09-27 14:29 ` Philip Yang
2023-09-27 18:22 ` Chen, Xiaogang
2023-09-28 21:50 ` Felix Kuehling