From: Andrey Grodzovsky <Andrey.Grodzovsky-5C7GfCeVMHo@public.gmane.org>
To: "Deng, Emily" <Emily.Deng-5C7GfCeVMHo@public.gmane.org>,
"Koenig, Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Date: Mon, 11 Nov 2019 16:35:25 -0500
Message-ID: <53130d01-da16-7cc0-55df-ea2532e6b3d0@amd.com>
In-Reply-To: <MN2PR12MB2975B736A666D9EEC5E5DB158F740-rweVpJHSKToFlvJWC7EAqwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
Emily - is there a particular scenario that reproduces this? I am trying
the libdrm deadlock test and artificially delaying the GPU reset logic
until after the guilty job has signaled, and indeed nothing bad happens:
drm_sched_cleanup_jobs returns early because a reset is in progress, so
the bad job is not released while the GPU reset is running.
Can you provide event tracing for timer, dma_fence and gpu_scheduler for
when the problem happens?
Andrey
On 11/11/19 4:05 AM, Deng, Emily wrote:
> Hi Christian and Andrey,
> The issue I encountered is that the bad job is being freed after amdgpu_device_gpu_recover has been entered. I don't know why, since, as Christian said, drm_sched_cleanup_jobs calls cancel_delayed_work.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Deng,
>> Emily
>> Sent: Monday, November 11, 2019 3:19 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Andrey,
>> I don’t think your patch will help here, because drm_sched_cleanup_jobs
>> may call kthread_should_park first, with kcl_kthread_park being called
>> only after that. So there is still a race between the two threads.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Sent: Saturday, November 9, 2019 3:01 AM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily
>>> <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>>
>>> On 11/8/19 5:35 AM, Koenig, Christian wrote:
>>>> Hi Emily,
>>>>
>>>> exactly that can't happen. See here:
>>>>
>>>>> /* Don't destroy jobs while the timeout worker is running
>>>>> */
>>>>> if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>> return NULL;
>>>> We never free jobs while the timeout worker is running, to prevent
>>>> exactly that issue.
>>>
>>> I don't think this protects us if drm_sched_cleanup_jobs is called for a
>>> scheduler which didn't experience a timeout. In
>>> amdgpu_device_gpu_recover we access sched->ring_mirror_list for all the
>>> schedulers on a device, so the condition above won't protect us. What
>>> could in fact help is my recent patch
>>> 541c521 drm/sched: Avoid job cleanup if sched thread is parked.
>>> because we park each of the scheduler threads during the tdr job before
>>> trying to access sched->ring_mirror_list.
>>>
>>> Emily - did you see this problem with that patch in place ? I only
>>> pushed it yesterday.
>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>> Hi Christian,
>>>>> drm_sched_job_timedout -> amdgpu_job_timedout calls
>>>>> amdgpu_device_gpu_recover. I mean that the main scheduler frees the jobs
>>>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is called.
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Hi Emily,
>>>>>>
>>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>
>>>>>> When it's not the scheduler we shouldn't have a guilty job in the first
>>>>>> place.
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>> Hi Christian,
>>>>>>> No, I am on the new branch and it also has the patch. Even if
>>>>>>> jobs are freed by the main scheduler, how can we prevent the main
>>>>>>> scheduler from freeing jobs once amdgpu_device_gpu_recover has been
>>>>>>> entered?
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Hi Emily,
>>>>>>>>
>>>>>>>> in this case you are on an old code branch.
>>>>>>>>
>>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>>> timeout handler is running.
>>>>>>>>
>>>>>>>> See this patch here:
>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>> Author: Christian König <christian.koenig@amd.com>
>>>>>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>
>>>>>>>>> drm/scheduler: rework job destruction
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>> Hi Christian,
>>>>>>>>> Please refer to the following log. When amdgpu_device_gpu_recover is
>>>>>>>>> entered, the bad job 000000005086879e is being freed in
>>>>>>>>> amdgpu_job_free_cb at the same time, because the hardware fence
>>>>>>>>> signaled. But amdgpu_device_gpu_recover runs faster; in this case the
>>>>>>>>> s_fence is already freed, but the job is not yet freed. Then this
>>>>>>>>> issue occurs.
>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>>> [ 449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>>> [ 449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:000000005086879e
>>>>>>>>> [ 449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>>>> [ 449.794222] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>>>> [ 449.794255] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>>>> [ 449.794257] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>>>> [ 449.794287] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>>>> [ 449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>>> [ 449.800818] PGD 0 P4D 0
>>>>>>>>> [ 449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>> [ 449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>>> [ 449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>>> [ 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>>> [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>>> [ 449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>>> [ 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>>> [ 449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>>> [ 449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>>> [ 449.809674] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [ 449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>>> [ 449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [ 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> [ 449.811937] Call Trace:
>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>>> [ 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> [ 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>>> [ 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> [ 449.814077] process_one_work+0x1fd/0x3f0
>>>>>>>>> [ 449.814417] worker_thread+0x34/0x410
>>>>>>>>> [ 449.814728] kthread+0x121/0x140
>>>>>>>>> [ 449.815004] ? process_one_work+0x3f0/0x3f0
>>>>>>>>> [ 449.815374] ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>>> tdr
>>>>>>>>>>
>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>
>>>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>
>>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>> for tdr
>>>>>>>>>>>>
>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf
>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>;
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover then hits a NULL pointer.
>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is destroyed.
>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in
>>>>>>>>>>>>>> one case, when amdgpu_device_gpu_recover is entered, the job
>>>>>>>>>>>>>> is already in drm_sched_job_cleanup and is about to be freed.
>>>>>>>>>>>>>> But amdgpu_device_gpu_recover is sometimes faster. At
>>>>>>>>>>>>>> that time, the job is not freed yet, but s_fence is already NULL.
>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>
>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>
>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>> The job itself is freed directly after the reference to the
>>>>>>>>>>>> s_fence is dropped.
>>>>>>>>>>>> So you are just papering over a much bigger problem here.
>>>>>>>>>>>> This patch is a clear NAK.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>          *
>>>>>>>>>>>>>>>>          * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>>          */
>>>>>>>>>>>>>>>> -       if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>> +       if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>             dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>>                 job_signaled = true;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                         spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>                         list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>>> -                               if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> -                                   entity->fence_context) {
>>>>>>>>>>>>>>>> +                               if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> +                                   entity->fence_context)) {
>>>>>>>>>>>>>>>>                                         if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>>                                             bad->sched->hang_limit)
>>>>>>>>>>>>>>>>                                                 if (entity->guilty)
>>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>          * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>>          */
>>>>>>>>>>>>>>>>         list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> -               if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>> +               if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>                     dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>>                                               &s_job->cb)) {
>>>>>>>>>>>>>>>>                         atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>                          *
>>>>>>>>>>>>>>>>                          * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>>                          */
>>>>>>>>>>>>>>>> -                       dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>> +                       if (s_job->s_fence)
>>>>>>>>>>>>>>>> +                               dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                         /*
>>>>>>>>>>>>>>>>                          * We must keep bad job alive for later use during
>>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>>>          * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>          */
>>>>>>>>>>>>>>>>         list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> -               struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>>>> +               struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                 atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
WARNING: multiple messages have this Message-ID (diff)
From: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
To: "Deng, Emily" <Emily.Deng@amd.com>,
"Koenig, Christian" <Christian.Koenig@amd.com>,
"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Date: Mon, 11 Nov 2019 16:35:25 -0500 [thread overview]
Message-ID: <53130d01-da16-7cc0-55df-ea2532e6b3d0@amd.com> (raw)
Message-ID: <20191111213525.SfvbgJcq7dRIN9phwnqroQUzuQbKcixNFeDlWa348Os@z> (raw)
In-Reply-To: <MN2PR12MB2975B736A666D9EEC5E5DB158F740@MN2PR12MB2975.namprd12.prod.outlook.com>
Emily - is there a particular scenario to reproduce this ? I am trying
with libdrm deadlock test and artificially delaying the GPU reset logic
until after the guilty job is signaling but indeed nothing bad happens
as drm_sched_cleanup_jobs returns early because there is a reset in
progress and so the bad job is not getting released while GPU reset is
running.
Can you provide event tracing for timer, dma_fence and gpu_scheduler for
when the problem happens ?
Andrey
On 11/11/19 4:05 AM, Deng, Emily wrote:
> Hi Christian and Andrey,
> The issue I encountered is the bad job is freeing after entering to the amdgpu_device_gpu_recover. Don't know why, as per Christian said, it will call cancel_delayed_work in drm_sched_cleanup_jobs.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Deng,
>> Emily
>> Sent: Monday, November 11, 2019 3:19 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Andrey,
>> I don’t think your patch will help for this. As it will may call
>> kthread_should_park in drm_sched_cleanup_jobs first, and then call
>> kcl_kthread_park. And then it still has a race between the 2 threads.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Sent: Saturday, November 9, 2019 3:01 AM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily
>>> <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>>
>>> On 11/8/19 5:35 AM, Koenig, Christian wrote:
>>>> Hi Emily,
>>>>
>>>> exactly that can't happen. See here:
>>>>
>>>>> /* Don't destroy jobs while the timeout worker is running
>>>>> */
>>>>> if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>> !cancel_delayed_work(&sched->work_tdr))
>>>>> return NULL;
>>>> We never free jobs while the timeout working is running to prevent
>>>> exactly that issue.
>>>
>>> I don't think this protects us if drm_sched_cleanup_jobs is called for
>>> scheduler which didn't experience a timeout, in
>>> amdgpu_device_gpu_recover we access
>>> sched->ring_mirror_list for all the schedulers on a device so this
>>> sched->condition
>>> above won't protect us. What in fact could help maybe is my recent
>>> patch
>>> 541c521 drm/sched: Avoid job cleanup if sched thread is parked. because
>>> we do park each of the scheduler threads during tdr job before trying
>>> to access
>>> sched->ring_mirror_list.
>>>
>>> Emily - did you see this problem with that patch in place ? I only
>>> pushed it yesterday.
>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>> Hi Christian,
>>>>> The drm_sched_job_timedout-> amdgpu_job_timedout call
>>> amdgpu_device_gpu_recover. I mean the main scheduler free the jobs
>>> while in amdgpu_device_gpu_recover, and before calling drm_sched_stop.
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-
>> gfx@lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Hi Emily,
>>>>>>
>>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>
>>>>>> When it's not the scheduler we shouldn't have a guilty job in the first
>> place.
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>> Hi Chrisitan,
>>>>>>> No, I am with the new branch and also has the patch. Even
>>>>>>> it are freed by
>>>>>> main scheduler, how we could avoid main scheduler to free jobs
>>>>>> while enter to function amdgpu_device_gpu_recover?
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Hi Emily,
>>>>>>>>
>>>>>>>> in this case you are on an old code branch.
>>>>>>>>
>>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>>> timeout handler is running.
>>>>>>>>
>>>>>>>> See this patch here:
>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>> Author: Christian König <christian.koenig@amd.com>
>>>>>>>>> Date: Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>
>>>>>>>>> drm/scheduler: rework job destruction
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>> Hi Christian,
>>>>>>>>> Please refer to follow log, when it enter to
>>>>>>>>> amdgpu_device_gpu_recover
>>>>>>>> function, the bad job 000000005086879e is freeing in function
>>>>>>>> amdgpu_job_free_cb at the same time, because of the hardware
>>>>>>>> fence
>>>>>> signal.
>>>>>>>> But amdgpu_device_gpu_recover goes faster, at this case, the
>>>>>>>> s_fence is already freed, but job is not freed in time. Then this
>>>>>>>> issue
>>> occurs.
>>>>>>>>> [ 449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>>>>>> sdma0
>>>>>>>>> timeout, signaled seq=2481, emitted seq=2483 [ 449.793202]
>>>>>>>>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>> information:
>>>>>>>> process pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>> 449.794163] amdgpu
>>>>>>>> 0000:00:08.0: GPU reset begin!
>>>>>>>>> [ 449.794175] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process pid 0 thread pid 0, s_job:000000005086879e [
>>>>>>>>> 449.794221] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process pid 0 thread pid 0, s_job:0000000066eb74ab [
>>>>>>>>> 449.794222] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process pid 0 thread pid 0, s_job:00000000d4438ad9 [
>>>>>>>>> 449.794255] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process pid 0 thread pid 0, s_job:00000000b6d69c65 [
>>>>>>>>> 449.794257] Emily:amdgpu_job_free_cb,Process information:
>>>>>>>>> process pid 0 thread pid 0,
>>>>>>>> s_job:00000000ea85e922 [ 449.794287]
>>>>>>>> Emily:amdgpu_job_free_cb,Process
>>>>>>>> information: process pid 0 thread pid 0, s_job:00000000ed3a5ac6
>>>>>>>> [ 449.794366] BUG: unable to handle kernel NULL pointer
>>>>>>>> dereference at
>>>>>>>> 00000000000000c0 [ 449.800818] PGD 0 P4D 0 [ 449.801040] Oops:
>>>>>>>> 0000 [#1] SMP PTI
>>>>>>>>> [ 449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G
>> OE
>>>>>>>> 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>> [ 449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX,
>>>>>>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 449.802944]
>>>>>>>>> Workqueue: events drm_sched_job_timedout [amd_sched] [
>>>>>>>>> 449.803488]
>>>>>> RIP:
>>>>>>>> 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>> [ 449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff
>>>>>>>>> ff ff
>>>>>>>>> 45 85 e4 0f
>>>>>>>> 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10
>>>>>>>> <48> 8b
>>>>>> 98
>>>>>>>> c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>>> [ 449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286 [
>>>>>>>>> 449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>>>>>>> 0000000000000000 [ 449.806625] RDX: ffffb4c7c08f5ac0 RSI:
>>>>>>>>> 0000000fffffffe0 RDI: 0000000000000246 [ 449.807224] RBP:
>>>>>>>>> ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>> [
>>>>>>>>> 449.807818] R10: 0000000000000000 R11: 0000000000000148 R12:
>>>>>>>>> 0000000000000000 [ 449.808411] R13: ffffb4c7c08f7da0 R14:
>>>>>>>>> ffff8d82b8525d40 R15: ffff8d82b8525d40 [ 449.809004] FS:
>>>>>>>>> 0000000000000000(0000) GS:ffff8d82bfd80000(0000)
>>>>>>>>> knlGS:0000000000000000 [ 449.809674] CS: 0010 DS: 0000 ES:
>>>>>>>>> 0000
>>> CR0:
>>>>>>>>> 0000000080050033 [ 449.810153] CR2: 00000000000000c0 CR3:
>>>>>>>>> 000000003cc0a001 CR4: 00000000003606e0 [ 449.810747] DR0:
>>>>>>>> 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>>> [
>>>>>>>> 449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>>>>>> 0000000000000400 [ 449.811937] Call Trace:
>>>>>>>>> [ 449.812206] amdgpu_job_timedout+0x114/0x140 [amdgpu] [
>>>>>>>>> 449.812635] drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>>>>> 449.813139] ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu] [
>>>>>>>>> 449.813609] ? drm_sched_job_timedout+0x44/0x90 [amd_sched] [
>>>>>>>>> 449.814077] process_one_work+0x1fd/0x3f0 [ 449.814417]
>>>>>>>>> worker_thread+0x34/0x410 [ 449.814728] kthread+0x121/0x140 [
>>>>>>>>> 449.815004] ? process_one_work+0x3f0/0x3f0 [ 449.815374] ?
>>>>>>>>> kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>> [ 449.815799] ret_from_fork+0x35/0x40
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-
>>>>>> gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>>> tdr
>>>>>>>>>>
>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>
>>>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>
>>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>; amd-
>>>>>>>> gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>> for tdr
>>>>>>>>>>>>
>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On
>>>>>> Behalf
>>>>>>>>>>>>>> Of Deng, Emily
>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-
>>>>>>>>>>>>>> gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>>> Then it will has null pointer in amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is
>> destroyed.
>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in
>>>>>>>>>>>>>> one case, when it enter into the amdgpu_device_gpu_recover,
>>>>>>>>>>>>>> it already in drm_sched_job_cleanup, and at this time, it
>>>>>>>>>>>>>> will go to free
>>>>>>>> job.
>>>>>>>>>>>>>> But the amdgpu_device_gpu_recover sometimes is faster. At
>>>>>>>>>>>>>> that time, job is not freed, but s_fence is already NULL.
>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>
>>>>>>>>>>>>> drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>
>>>>>>>>>>>>> amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>> dma_fence_put(job->fence);
>>>>>>>>>>>>> amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>> amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>> kfree(job);
>>>>>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>>>>> reference to the
>>>>>>>> s_fence.
>>>>>>>>>>>> So you are just papering over a much bigger problem here.
>>>>>>>>>>>> This patch is a clear NAK.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>>>>>>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>>>>>>>>>>>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>> * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> - if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>> + if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>> dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>> job_signaled = true;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> spin_lock(&rq->lock);
>>>>>>>>>>>>>>>> list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>>> - if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> - entity->fence_context) {
>>>>>>>>>>>>>>>> + if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> + entity->fence_context)) {
>>>>>>>>>>>>>>>> if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>> bad->sched->hang_limit)
>>>>>>>>>>>>>>>> if (entity->guilty)
>>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>> * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> - if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>> + if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>> dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>> &s_job->cb)) {
>>>>>>>>>>>>>>>> atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>> *
>>>>>>>>>>>>>>>> * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> - dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>> + if (s_job->s_fence)
>>>>>>>>>>>>>>>> + dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>> /*
>>>>>>>>>>>>>>>> * We must keep bad job alive for later use during
>>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>>> * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>> list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> - struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>>>> + struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>
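A minimal userspace sketch of the ordering argument in the quoted reply above (the job is freed directly after its s_fence reference is dropped). This is not kernel code: `model_job`, `model_fence`, `model_free_job_steps` and `model_recover` are hypothetical names, and the `freed` flag only stands in for `kfree(job)` so that the stale state can be inspected safely. Under these assumptions, a NULL check on `s_fence` only catches the short window between the two steps of the free path:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler structures, not the real ones. */
struct model_fence {
    bool signaled;
};

struct model_job {
    struct model_fence *s_fence; /* cleared when the job is cleaned up */
    bool freed;                  /* models kfree(job) without freeing memory */
};

/*
 * Models the free path quoted above: step 1 drops the s_fence reference
 * (as drm_sched_job_cleanup() does), step 2 releases the job itself.
 * `steps` says how far the free path has progressed when it is observed.
 */
static void model_free_job_steps(struct model_job *job, int steps)
{
    if (steps >= 1)
        job->s_fence = NULL;
    if (steps >= 2)
        job->freed = true;
}

enum recover_result {
    RECOVER_OK,             /* job and fence are both still valid */
    RECOVER_NULL_FENCE,     /* the added NULL check fires */
    RECOVER_USE_AFTER_FREE  /* job memory is already gone */
};

/*
 * Models the recover path with the proposed NULL check. Real code has no
 * `freed` flag to consult; the model uses one only to label the outcome.
 */
static enum recover_result model_recover(const struct model_job *job)
{
    if (job->freed)
        return RECOVER_USE_AFTER_FREE;
    if (job->s_fence == NULL)
        return RECOVER_NULL_FENCE;
    return RECOVER_OK;
}
```

Observing the job after one step of the free path yields `RECOVER_NULL_FENCE`, and after two steps yields `RECOVER_USE_AFTER_FREE`. In this model the NULL check does not make the second outcome any safer, which is the concern raised about the patch.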
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx