Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "Chen, Xiaogang" <xiaogang.chen@amd.com>
To: Philip Yang <yangp@amd.com>, Philip Yang <Philip.Yang@amd.com>,
	amd-gfx@lists.freedesktop.org
Cc: alex.sierra@amd.com, Felix.Kuehling@amd.com, james.zhu@amd.com
Subject: Re: [PATCH v2] drm/amdkfd: handle errors from svm validate and map
Date: Fri, 15 Sep 2023 16:33:33 -0500	[thread overview]
Message-ID: <a01a8dd1-c4b2-b632-1116-40836d2a5007@amd.com> (raw)
In-Reply-To: <5ee60075-10a2-e3d5-4e5a-538fca90bc58@amd.com>


On 9/15/2023 4:20 PM, Philip Yang wrote:
>
>
> On 2023-09-15 17:06, Chen, Xiaogang wrote:
>>
>> On 9/15/2023 8:28 AM, Philip Yang wrote:
>>> Caution: This message originated from an External Source. Use proper 
>>> caution when opening attachments, clicking links, or responding.
>>>
>>>
>>> If new range is splited to multiple pranges with max_svm_range_pages
>>> alignment and added to update_list, svm validate and map should keep
>>> going after error to make sure maps_to_gpu flag is up to date for the
>>> whole range.
>>>
>>> svm validate and map update set prange->mapped_to_gpu after mapping to
>>> GPUs successfully, otherwise clear prange->mapped_to_gpu flag (for
>>> update mapping case) instead of setting error flag, we can remove
>>> the redundant error flag to simpliy code.
>>>
>>> Update prange->mapped_to_gpu flag inside svm_range_lock, to guarant we
>>> always set prange invalid flag to evict queues or unmap from GPUs 
>>> before
>>> the system memory is moved.
>>>
>>> After svm validate and map return error, the caller retry will update
>>> the mapping for the whole range.
>>>
>>> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from 
>>> svm validate and map")
>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 18 ++++++++----------
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
>>>   2 files changed, 8 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 5d7ba7dbf6ce..26018b1d6138 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -818,7 +818,7 @@ svm_range_is_same_attrs(struct kfd_process *p, 
>>> struct svm_range *prange,
>>>                  }
>>>          }
>>>
>>> -       return !prange->is_error_flag;
>>> +       return true;
>>>   }
>>>
>>>   /**
>>> @@ -1724,20 +1724,17 @@ static int svm_range_validate_and_map(struct 
>>> mm_struct *mm,
>>>                                            ctx->bitmap, wait, 
>>> flush_tlb);
>>>
>>>   unlock_out:
>>> +               prange->mapped_to_gpu = !r;
>>
>> I do not understand why set prange->mapped_to_gpu here? It handles 
>> one vma, not for all prange. If there are multiple vma and first vma 
>> handle is ok, and second vma handle fail at 
>> amdgpu_hmm_range_get_pages or svm_range_dma_map, you would get 
>> prange->mapped_to_gpu be true, but only part of pragne got mapped, 
>> right?
>
> If all vmas map to gpu successfully, prange->mapped_to_gpu set to 
> true, otherwise, prange->mapped_to_gpu set to false, and then svm 
> validate map to gpu return failed, the caller will retry if error code 
> is -EAGAIN.
>
if second vma handle got failed at amdgpu_hmm_range_get_pages 
prange->mapped_to_gpu would be true since the code would jump out of the 
loop, right?

Regards

Xiaogang

> Regards,
>
> Philip
>
>>
>>
>> Regards
>>
>> Xiaogang
>>
>>> svm_range_unlock(prange);
>>>
>>>                  addr = next;
>>>          }
>>>
>>> -       if (addr == end) {
>>> +       if (addr == end)
>>>                  prange->validated_once = true;
>>> -               prange->mapped_to_gpu = true;
>>> -       }
>>>
>>>   unreserve_out:
>>>          svm_range_unreserve_bos(ctx);
>>> -
>>> -       prange->is_error_flag = !!r;
>>>          if (!r)
>>>                  prange->validate_timestamp = ktime_get_boottime();
>>>
>>> @@ -2106,7 +2103,8 @@ svm_range_add(struct kfd_process *p, uint64_t 
>>> start, uint64_t size,
>>>                  next = interval_tree_iter_next(node, start, last);
>>>                  next_start = min(node->last, last) + 1;
>>>
>>> -               if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>> +               if (svm_range_is_same_attrs(p, prange, nattr, attrs) &&
>>> +                   prange->mapped_to_gpu) {
>>>                          /* nothing to do */
>>>                  } else if (node->start < start || node->last > last) {
>>>                          /* node intersects the update range and its 
>>> attributes
>>> @@ -3519,7 +3517,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>          struct svm_range *next;
>>>          bool update_mapping = false;
>>>          bool flush_tlb;
>>> -       int r = 0;
>>> +       int r, ret = 0;
>>>
>>>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 
>>> 0x%llx\n",
>>>                   p->pasid, &p->svms, start, start + size - 1, size);
>>> @@ -3607,7 +3605,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>   out_unlock_range:
>>>                  mutex_unlock(&prange->migrate_mutex);
>>>                  if (r)
>>> -                       break;
>>> +                       ret = r;
>>>          }
>>>
>>>          dynamic_svm_range_dump(svms);
>>> @@ -3620,7 +3618,7 @@ svm_range_set_attr(struct kfd_process *p, 
>>> struct mm_struct *mm,
>>>          pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done, 
>>> r=%d\n", p->pasid,
>>>                   &p->svms, start, start + size - 1, r);
>>>
>>> -       return r;
>>> +       return ret ? ret : r;
>>>   }
>>>
>>>   static int
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> index 9e668eeefb32..91715bf3233c 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>> @@ -134,7 +134,6 @@ struct svm_range {
>>>          DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
>>>          bool                            validated_once;
>>>          bool                            mapped_to_gpu;
>>> -       bool                            is_error_flag;
>>>   };
>>>
>>>   static inline void svm_range_lock(struct svm_range *prange)
>>> -- 
>>> 2.35.1
>>>

next prev parent reply	other threads:[~2023-09-15 21:33 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-13 15:16 [PATCH] drm/amdkfd: handle errors from svm validate and map Philip Yang
2023-09-13 16:14 ` Felix Kuehling
2023-09-13 17:24   ` Philip Yang
2023-09-13 17:33   ` Philip Yang
2023-09-13 18:27     ` Felix Kuehling
2023-09-15 13:28 ` [PATCH v2] " Philip Yang
2023-09-15 21:06   ` Chen, Xiaogang
2023-09-15 21:20     ` Philip Yang
2023-09-15 21:33       ` Chen, Xiaogang [this message]
2023-09-18 13:27         ` Philip Yang
2023-09-19 14:21 ` [PATCH v3] drm/amdkfd: Handle " Philip Yang
2023-09-19 21:15   ` Felix Kuehling
2023-09-20 14:20     ` Philip Yang
2023-09-20 14:35       ` Felix Kuehling
2023-09-20 15:38         ` Philip Yang
2023-09-20 15:45 ` [PATCH v4] " Philip Yang
2023-09-21 19:32   ` Felix Kuehling
2023-09-25 13:39   ` James Zhu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a01a8dd1-c4b2-b632-1116-40836d2a5007@amd.com \
    --to=xiaogang.chen@amd.com \
    --cc=Felix.Kuehling@amd.com \
    --cc=Philip.Yang@amd.com \
    --cc=alex.sierra@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=james.zhu@amd.com \
    --cc=yangp@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox