All of lore.kernel.org
 help / color / mirror / Atom feed
From: zhoucm1 <david1.zhou-5C7GfCeVMHo@public.gmane.org>
To: "Nicolai Hähnle"
	<nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"Samuel Pitoiset"
	<samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()
Date: Tue, 14 Feb 2017 10:56:32 +0800	[thread overview]
Message-ID: <58A271E0.9090807@amd.com> (raw)
In-Reply-To: <60486a5c-0ac3-675e-86f6-67f67db1213c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>



On 2017年02月14日 03:03, Nicolai Hähnle wrote:
> On 13.02.2017 19:58, Nicolai Hähnle wrote:
>> On 13.02.2017 19:38, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 07:09 PM, Nicolai Hähnle wrote:
>>>> On 13.02.2017 19:04, Nicolai Hähnle wrote:
>>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>>
>>>>>>
>>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>>> When ttm_bo_init() fails, the reservation mutex should be 
>>>>>>>> unlocked.
>>>>>>>>
>>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>>> mutex was locked as expected.
>>>>>>>>
>>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>>> (around ~5k) and deadlock.
>>>>>>>>
>>>>>>>> This regression has been introduced pretty recently.
>>>>>>>>
>>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>>
>>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>>> -    if (unlikely(r != 0))
>>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>>> +        if (!resv)
>>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>>          return r;
>>>>>>>> +    }
>>>>>>>
>>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>>> sure
>>>>>>> I had this exact same patch just to realize that it's actually
>>>>>>> incorrect.
>>>>>>>
>>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>>> function
>>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>>> freed.
>>>>>>>
>>>>>>> This code is a huge mess. I'm surprised though: have you verified
>>>>>>> that
>>>>>>> this patch actually fixes a hang?
>>>>>>
>>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>>
>>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>>> problems I'm seeing as well.
>>>>>
>>>>> This means we need a real fix for this; I still think the current 
>>>>> patch
>>>>> is broken.
>>>>>
>>>>>
>>>>>> This fixes a deadlock, here's the report:
>>>>>> https://hastebin.com/durodivoma.xml
>>>>>>
>>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I 
>>>>>> checked
>>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy
>>>>>> function
>>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and
>>>>>> resv
>>>>>> is not NULL, the mutex is not unlocked.
>>>>>
>>>>> On which code path is the destroy function not called? If that is the
>>>>> case, we're leaking memory.
>>>>>
>>>>> With the patch as-is, the error paths are either leaking memory (if
>>>>> you're right) or accessing memory after it's freed (otherwise).
>>>>> Obviously, neither is good.
>>>>
>>>> Actually, I find it extremely suspicious that this patch resolves 
>>>> hangs.
>>>> By all rights, no other task should have a pointer to this bo left. It
>>>> points at problems elsewhere in the code, possibly the precise problem
>>>> I've been trying to track down.
>>>
>>> Well, maybe we are just lucky but as I said, I checked many times to
>>> reproduce the issue with that patch applied without any success, you 
>>> can
>>> trust me. Although I'm also starting to think that's not the right
>>> solution (and could introduce other ones).
>>>
>>>>
>>>> Could you please revert the patch, reproduce the hang, and report
>>>> /proc/$pid/stack for all the hung tasks?
>>>
>>> Sure. The thing is: Hitman's branch has been updated during the weekend
>>> and my local installation is broken. I need to re-download the whole
>>> game (will take a while).
>>>
>>> I will let you know when I'm able to grab that report.
>>
>> Hmm, so I thought about this some more, and I'm no longer so sure that
>> your bug and mine are the same. If it was related, I'd somehow expect
>> you to get an error about a mutex being destroyed while it's held (at
>> least with lock debugging enabled).
>>
>> Anyway... we need to change the contract of ttm_bo_init, I'm just not
>> yet sure how, because there are two points of failure: one quite early
>> on, and the second rather late which gets cleaned up by ttm_bo_unref.
>
> Maybe it would actually be best to split ttm_bo_init into two parts: 
> the initial bulk of structure initialization as the first half, and 
> the ttm_bo_validate in the second half.
Agreed. Have you gone ahead with your proposal?
Although Samuel's patch isn't best way, it indeed fix a OCL bug which is 
trying to allocate multiple big buffers.

Thanks,
David Zhou
>
> Cheers,
> Nicolai
>
>>
>> Cheers,
>> Nicolai
>>
>>> Thanks Nicolai.
>>>>
>>>> Thanks,
>>>> Nicolai
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2017-02-14  2:56 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-09 10:33 [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Samuel Pitoiset
     [not found] ` <20170209103337.1522-1-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-09 10:33   ` [PATCH v2 2/2] drm/amdgpu: report the number of bytes moved at buffer creation Samuel Pitoiset
     [not found]     ` <20170209103337.1522-2-samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 16:28       ` Nicolai Hähnle
     [not found]         ` <758a0adf-3e8c-1860-7528-0099a82e79da-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 16:39           ` Christian König
2017-02-13 17:50           ` Samuel Pitoiset
2017-02-09 16:08   ` [PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted() Christian König
2017-02-13 16:25   ` Nicolai Hähnle
     [not found]     ` <ef49a53a-6011-0de1-9ff6-3755851daf10-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 17:49       ` Samuel Pitoiset
     [not found]         ` <3fc6e61c-cc70-7b10-9abb-45a4f22b3909-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 17:52           ` Samuel Pitoiset
2017-02-13 18:04           ` Nicolai Hähnle
     [not found]             ` <7c0ece9f-8983-317b-0eae-adaf535c6d06-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:09               ` Nicolai Hähnle
     [not found]                 ` <ed01ca29-e255-62f8-ca99-f079de294eda-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:38                   ` Samuel Pitoiset
     [not found]                     ` <aaa5ce51-e2e0-f368-c3f0-b94b5afe3881-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:58                       ` Nicolai Hähnle
     [not found]                         ` <4185541c-a3a4-7681-4539-54c5daddd0ac-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 19:03                           ` Nicolai Hähnle
     [not found]                             ` <60486a5c-0ac3-675e-86f6-67f67db1213c-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-14  2:56                               ` zhoucm1 [this message]
     [not found]                                 ` <58A271E0.9090807-5C7GfCeVMHo@public.gmane.org>
2017-02-14  8:30                                   ` Nicolai Hähnle
2017-02-14 17:31                                   ` Alex Deucher
2017-02-13 22:13                           ` Samuel Pitoiset
2017-02-13 18:11               ` Samuel Pitoiset
     [not found]                 ` <e8a51ac0-8319-810a-0fb8-992c3afd4f34-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:19                   ` Nicolai Hähnle
     [not found]                     ` <eeafbad0-ca3b-ba03-7068-f54dd2119267-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:32                       ` Samuel Pitoiset
     [not found]                         ` <4d8bb8a9-d2b7-2ade-0dd0-8762e1dcc7a5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-13 18:41                           ` Christian König
     [not found]                             ` <39ea0637-7794-865e-dbf7-6109d295064e-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-02-13 18:45                               ` Samuel Pitoiset

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=58A271E0.9090807@amd.com \
    --to=david1.zhou-5c7gfcevmho@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    --cc=nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=samuel.pitoiset-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.