From: "Yang, Philip" <Philip.Yang@amd.com>
To: Jason Gunthorpe <jgg@mellanox.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
Jerome Glisse <jglisse@redhat.com>,
Ralph Campbell <rcampbell@nvidia.com>,
John Hubbard <jhubbard@nvidia.com>,
"Kuehling, Felix" <Felix.Kuehling@amd.com>,
Juergen Gross <jgross@suse.com>,
"Zhou, David(ChunMing)" <David1.Zhou@amd.com>,
Mike Marciniszyn <mike.marciniszyn@intel.com>,
Stefano Stabellini <sstabellini@kernel.org>,
Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"nouveau@lists.freedesktop.org" <nouveau@lists.freedesktop.org>,
Dennis Dalessandro <dennis.dalessandro@intel.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>,
Christoph Hellwig <hch@infradead.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>,
Petr Cvek <petrcvekcz@gmail.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
Ben Skeggs <bskeggs@redhat.com>
Subject: Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
Date: Fri, 1 Nov 2019 15:59:26 +0000
Message-ID: <8280fb65-a897-3d71-79f9-9f80d9e474e9@amd.com>
In-Reply-To: <20191101151222.GN22766@mellanox.com>
On 2019-11-01 11:12 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote:
>>
>>
>> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
>>> On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
>>>> Hi Jason,
>>>>
>>>> I did quick test after merging amd-staging-drm-next with the
>>>> mmu_notifier branch, which includes this set changes. The test result
>>>> has different failures, app stuck intermittently, GUI no display etc. I
>>>> am understanding the changes and will try to figure out the cause.
>>>
>>> Thanks! I'm not surprised by this given how difficult this patch was
>>> to make. Let me know if I can assist in any way
>>>
>>> Please ensure to run with lockdep enabled.. Your symptops sounds sort
>>> of like deadlocking?
>>>
>> Hi Jason,
>>
>> Attached patch fix several issues in amdgpu driver, maybe you can squash
>> this into patch 14. With this is done, patch 12, 13, 14 is Reviewed-by
>> and Tested-by Philip Yang <philip.yang@amd.com>
>
> Wow, this is great thanks! Can you clarify what the problems you found
> were? Was the bug the 'return !r' below?
>
Yes. The 'return !r' fix is the critical one, and retrying when
hmm_range_fault() returns -EBUSY is needed too.
> I'll also add your signed off by
>
> Here are some remarks:
>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> index cb718a064eb4..c8bbd06f1009 100644
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> @@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
>> struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>> long r;
>>
>> - /*
>> - * FIXME: Must hold some lock shared with
>> - * amdgpu_ttm_tt_get_user_pages_done()
>> - */
>> - mmu_range_set_seq(mrn, cur_seq);
>> + mutex_lock(&adev->notifier_lock);
>>
>> - /* FIXME: Is this necessary? */
>> - if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
>> - range->end))
>> - return true;
>> + mmu_range_set_seq(mrn, cur_seq);
>>
>> - if (!mmu_notifier_range_blockable(range))
>> + if (!mmu_notifier_range_blockable(range)) {
>> + mutex_unlock(&adev->notifier_lock);
>> return false;
>
> This test for range_blockable should be before mutex_lock, I can move
> it up
>
yes, thanks.
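
A minimal user-space sketch of the ordering agreed on above: test for
blockability first, and only take the lock once the callback is allowed
to sleep, so the early return has nothing to unlock. This is not driver
code; struct fake_dev, fake_set_seq() and fake_invalidate() are
hypothetical stand-ins, and lock_depth merely models
adev->notifier_lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state touched by the callback. */
struct fake_dev {
	int lock_depth;        /* models adev->notifier_lock (0 = unlocked) */
	unsigned long seq;     /* models the sequence stored by the notifier */
};

/* Models mmu_range_set_seq(): only legal while the lock is held. */
static void fake_set_seq(struct fake_dev *d, unsigned long cur_seq)
{
	assert(d->lock_depth == 1);
	d->seq = cur_seq;
}

/* Models the invalidate callback: false means "cannot block, try later". */
static bool fake_invalidate(struct fake_dev *d, bool blockable,
			    unsigned long cur_seq)
{
	/* Checked before locking, so this path has nothing to release. */
	if (!blockable)
		return false;

	d->lock_depth++;               /* mutex_lock() in the real callback */
	fake_set_seq(d, cur_seq);
	/* The real callback would wait for the GPU to go idle here. */
	d->lock_depth--;               /* mutex_unlock() */
	return true;
}
```

With the check moved up, the function has a single lock/unlock pair and
no unlock in the early-return path.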
> Also, do you know if notifier_lock is held while calling
> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> to amdgpu_ttm_tt_get_user_pages_done()?
>
The GPU side holds notifier_lock, but the kfd side doesn't. The kfd side
doesn't check the return value of
amdgpu_ttm_tt_get_user_pages_done()/mmu_range_read_retry(); instead it
checks the mem->invalid flag, which is updated from the invalidate
callback. Changing that will take more time, so I will fix it in a
separate patch later.
>> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>> r = -EPERM;
>> goto out_unlock;
>> }
>> + up_read(&mm->mmap_sem);
>> + timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>> +
>> +retry:
>> + range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>>
>> + down_read(&mm->mmap_sem);
>> r = hmm_range_fault(range, 0);
>> up_read(&mm->mmap_sem);
>> -
>> - if (unlikely(r < 0))
>> + if (unlikely(r <= 0)) {
>> + if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>> + goto retry;
>> goto out_free_pfns;
>> + }
>
> This isn't really right, a retry loop like this needs to go all the
> way to mmu_range_read_retry() and done under the notifier_lock. ie
> mmu_range_read_retry() can fail just as likely as hmm_range_fault()
> can, and drivers are supposed to retry in both cases, with a single
> timeout.
>
For the GPU side, the mmu_range_read_retry() return value is checked
under the notifier_lock, and the retry is done in a separate location,
not in the same retry loop.
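
For illustration, here is a minimal user-space sketch of the pattern
Jason describes: one loop that covers both failure points (the fault
and the sequence check) with a single shared budget. It is not driver
code. All names (struct fake_notifier, fake_read_begin(), fake_fault(),
fake_read_retry(), get_pages_with_retry()) are hypothetical stand-ins,
the attempt budget stands in for the single jiffies timeout, and the
fault helper is simplified to return 0 on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the notifier state. */
struct fake_notifier {
	unsigned long seq;        /* bumped by every simulated invalidation */
	bool lock_held;           /* models adev->notifier_lock */
	int busy_faults;          /* how many faults still return -EBUSY */
	int late_invalidations;   /* invalidations that race with a fault */
};

/* Models mmu_range_read_begin(): snapshot the current sequence. */
static unsigned long fake_read_begin(const struct fake_notifier *n)
{
	return n->seq;
}

/* Models hmm_range_fault(): -EBUSY while busy, else 0 (simplified). */
static int fake_fault(struct fake_notifier *n)
{
	if (n->busy_faults > 0) {
		n->busy_faults--;
		return -EBUSY;
	}
	/* Simulate an invalidation racing in after the fault completed. */
	if (n->late_invalidations > 0) {
		n->late_invalidations--;
		n->seq++;
	}
	return 0;
}

/* Models mmu_range_read_retry(): true means the snapshot is stale. */
static bool fake_read_retry(const struct fake_notifier *n, unsigned long seq)
{
	assert(n->lock_held);     /* the "lock assert held" idea */
	return n->seq != seq;
}

/* Returns 0 on success, -EBUSY once the shared budget is used up. */
static int get_pages_with_retry(struct fake_notifier *n, int budget)
{
	while (budget-- > 0) {
		unsigned long seq = fake_read_begin(n);
		int r = fake_fault(n);
		bool stale;

		if (r == -EBUSY)
			continue;         /* fault failed: same loop, same budget */
		if (r < 0)
			return r;

		n->lock_held = true;
		stale = fake_read_retry(n, seq);
		/* A real driver would commit the pages here, still locked. */
		n->lock_held = false;

		if (!stale)
			return 0;
		/* Sequence moved: start over from fake_read_begin(). */
	}
	return -EBUSY;
}
```

The point of the single loop is that a stale sequence is handled exactly
like an -EBUSY fault: both go back to the snapshot step and both draw
from the same budget.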
> AFAICT it is a major bug that many places ignore the return code of
> amdgpu_ttm_tt_get_user_pages_done() ???
>
For kfd, see my explanation above.
> However, this is all pre-existing bugs, so I'm OK go ahead with this
> patch as modified. I advise AMD to make a followup patch ..
>
yes, I will.
> I'll add a FIXME note to this effect.
>
>> for (i = 0; i < ttm->num_pages; i++) {
>> pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>> @@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>> gtt->range = NULL;
>> }
>>
>> - return r;
>> + return !r;
>
> Ah is this the major error? hmm_range_valid() is inverted vs
> mmu_range_read_retry()?
>
yes.
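
A small user-space sketch of why the negation is needed, assuming (as
the diff implies) that amdgpu_ttm_tt_get_user_pages_done() keeps its
old meaning of "true if the pages are still valid". The types and
helpers below (struct model_range, model_range_valid(),
model_read_retry(), model_get_user_pages_done()) are hypothetical
models, not the kernel functions themselves.

```c
#include <stdbool.h>

/* Hypothetical model: pages are stale once the sequence has moved. */
struct model_range {
	unsigned long cur_seq;    /* current notifier sequence */
	unsigned long snap_seq;   /* sequence sampled before faulting */
};

/* Old sense, like hmm_range_valid(): true means "still valid". */
static bool model_range_valid(const struct model_range *r)
{
	return r->cur_seq == r->snap_seq;
}

/* New sense, like mmu_range_read_retry(): true means "stale, retry". */
static bool model_read_retry(const struct model_range *r)
{
	return r->cur_seq != r->snap_seq;
}

/* Keeps the old caller contract: true means the pages are still valid. */
static bool model_get_user_pages_done(const struct model_range *r)
{
	bool retry = model_read_retry(r);

	return !retry;            /* the 'return !r' from the patch */
}
```

For any state, model_get_user_pages_done() agrees with
model_range_valid(); returning the retry value unnegated would report
valid pages as invalid and vice versa.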
>> }
>> #endif
>>
>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>> sg_free_table(ttm->sg);
>>
>> #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>> - if (gtt->range &&
>> - ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>> - gtt->range->pfns[0]))
>> - WARN_ONCE(1, "Missing get_user_page_done\n");
>> + if (gtt->range) {
>> + unsigned long i;
>> +
>> + for (i = 0; i < ttm->num_pages; i++) {
>> + if (ttm->pages[i] !=
>> + hmm_device_entry_to_page(gtt->range,
>> + gtt->range->pfns[i]))
>> + break;
>> + }
>> +
>> + WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>> + }
>
> Is this related/necessary? I can put it in another patch if it is just
> debugging improvement? Please advise
>
I see this WARN backtrace now, but I didn't see it before. This is
somehow related.
> Thanks a lot,
> Jason
>