Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr

From: Andrey Grodzovsky <Andrey.Grodzovsky-5C7GfCeVMHo@public.gmane.org>
To: "Deng, Emily" <Emily.Deng-5C7GfCeVMHo@public.gmane.org>,
	"Koenig, Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Date: Mon, 11 Nov 2019 16:35:25 -0500
Message-ID: <53130d01-da16-7cc0-55df-ea2532e6b3d0@amd.com>
In-Reply-To: <MN2PR12MB2975B736A666D9EEC5E5DB158F740-rweVpJHSKToFlvJWC7EAqwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

Emily - is there a particular scenario that reproduces this? I am trying
with the libdrm deadlock test and artificially delaying the GPU reset logic
until after the guilty job has signaled, but nothing bad happens:
drm_sched_cleanup_jobs returns early because a reset is in progress, so
the bad job is not released while the GPU reset is running.
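
To spell out the early return I mean, here is a minimal userspace sketch of
that guard. It is an illustration only, not the kernel code: struct
model_sched, struct model_job and the model_* helpers are hypothetical
stand-ins, and the booleans stand for kthread_should_park(), the
sched->timeout != MAX_SCHEDULE_TIMEOUT check and the result of
cancel_delayed_work(&sched->work_tdr).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler job on the mirror list. */
struct model_job {
	int id;
	bool finished;		/* models a signaled finished fence */
};

/* Hypothetical stand-in for the scheduler state relevant to cleanup. */
struct model_sched {
	bool thread_parked;	/* models kthread_should_park() during reset */
	bool timeout_enabled;	/* models sched->timeout != MAX_SCHEDULE_TIMEOUT */
	bool tdr_pending;	/* timeout work queued and not yet running */
	struct model_job *head;	/* first mirror list entry, or NULL */
};

/*
 * Models cancel_delayed_work(): it only succeeds while the work is still
 * pending. If the timeout handler is already running, there is nothing
 * left to cancel and it returns false.
 */
static bool model_cancel_tdr(struct model_sched *s)
{
	bool was_pending = s->tdr_pending;

	s->tdr_pending = false;
	return was_pending;
}

/* Returns a job that may be freed, or NULL when cleanup has to be skipped. */
static struct model_job *model_get_cleanup_job(struct model_sched *s)
{
	struct model_job *job;

	assert(s != NULL);

	/* The reset path parked the scheduler thread: leave all jobs alone. */
	if (s->thread_parked)
		return NULL;

	/* The timeout handler may be running: do not destroy jobs now. */
	if (s->timeout_enabled && !model_cancel_tdr(s))
		return NULL;

	job = (s->head && s->head->finished) ? s->head : NULL;

	/* Simplification: always re-arm the timeout we just cancelled. */
	if (s->timeout_enabled)
		s->tdr_pending = true;

	return job;
}
```

With thread_parked set, or with no timeout work left pending, the model
returns NULL and no job is freed, which matches what I see in my test.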

Can you provide event traces for timer, dma_fence and gpu_scheduler from
when the problem happens?

Andrey

On 11/11/19 4:05 AM, Deng, Emily wrote:
> Hi Christian and Andrey,
>       The issue I encountered is that the bad job is freed after entering amdgpu_device_gpu_recover. I don't know why; as Christian said, drm_sched_cleanup_jobs will call cancel_delayed_work.
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Deng,
>> Emily
>> Sent: Monday, November 11, 2019 3:19 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>> Hi Andrey,
>>     I don’t think your patch will help with this. It may call
>> kthread_should_park in drm_sched_cleanup_jobs first, and
>> kcl_kthread_park gets called only afterwards, so there is still a race
>> between the 2 threads.
>>
>> Best wishes
>> Emily Deng
>>
>>
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>>> Sent: Saturday, November 9, 2019 3:01 AM
>>> To: Koenig, Christian <Christian.Koenig@amd.com>; Deng, Emily
>>> <Emily.Deng@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>>
>>> On 11/8/19 5:35 AM, Koenig, Christian wrote:
>>>> Hi Emily,
>>>>
>>>> exactly that can't happen. See here:
>>>>
>>>>>           /* Don't destroy jobs while the timeout worker is running */
>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>               !cancel_delayed_work(&sched->work_tdr))
>>>>>                   return NULL;
>>>> We never free jobs while the timeout worker is running to prevent
>>>> exactly that issue.
>>>
>>> I don't think this protects us if drm_sched_cleanup_jobs is called for a
>>> scheduler which didn't experience a timeout. In
>>> amdgpu_device_gpu_recover we access sched->ring_mirror_list for all the
>>> schedulers on a device, so the condition above won't protect us. What
>>> might in fact help is my recent patch
>>> 541c521 drm/sched: Avoid job cleanup if sched thread is parked.
>>> because we park each of the scheduler threads during the tdr job before
>>> trying to access sched->ring_mirror_list.
>>>
>>> Emily - did you see this problem with that patch in place ? I only
>>> pushed it yesterday.
>>>
>>> Andrey
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 08.11.19 um 11:32 schrieb Deng, Emily:
>>>>> Hi Christian,
>>>>>         drm_sched_job_timedout -> amdgpu_job_timedout calls
>>>>> amdgpu_device_gpu_recover. I mean the main scheduler frees the jobs
>>>>> while we are in amdgpu_device_gpu_recover, before drm_sched_stop is
>>>>> called.
>>>>> Best wishes
>>>>> Emily Deng
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Friday, November 8, 2019 6:26 PM
>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>>>>
>>>>>> Hi Emily,
>>>>>>
>>>>>> well who is calling amdgpu_device_gpu_recover() in this case?
>>>>>>
>>>>>> When it's not the scheduler we shouldn't have a guilty job in the
>>>>>> first place.
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 08.11.19 um 11:22 schrieb Deng, Emily:
>>>>>>> Hi Christian,
>>>>>>>          No, I am on the new branch and it also has the patch. Even
>>>>>>> if the jobs are freed by the main scheduler, how can we keep the
>>>>>>> main scheduler from freeing jobs while we are entering
>>>>>>> amdgpu_device_gpu_recover?
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>> Sent: Friday, November 8, 2019 6:15 PM
>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>> tdr
>>>>>>>>
>>>>>>>> Hi Emily,
>>>>>>>>
>>>>>>>> in this case you are on an old code branch.
>>>>>>>>
>>>>>>>> Jobs are freed now by the main scheduler thread and only if no
>>>>>>>> timeout handler is running.
>>>>>>>>
>>>>>>>> See this patch here:
>>>>>>>>> commit 5918045c4ed492fb5813f980dcf89a90fefd0a4e
>>>>>>>>> Author: Christian König <christian.koenig@amd.com>
>>>>>>>>> Date:   Thu Apr 18 11:00:21 2019 -0400
>>>>>>>>>
>>>>>>>>>         drm/scheduler: rework job destruction
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 08.11.19 um 11:11 schrieb Deng, Emily:
>>>>>>>>> Hi Christian,
>>>>>>>>>           Please refer to the following log. When it enters the
>>>>>>>>> amdgpu_device_gpu_recover function, the bad job 000000005086879e
>>>>>>>>> is being freed in amdgpu_job_free_cb at the same time, because the
>>>>>>>>> hardware fence signaled. But amdgpu_device_gpu_recover runs
>>>>>>>>> faster; in this case the s_fence is already freed, but the job is
>>>>>>>>> not freed yet. Then this issue occurs.
>>>>>>>>> [  449.792189] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=2481, emitted seq=2483
>>>>>>>>> [  449.793202] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0, s_job:000000005086879e
>>>>>>>>> [  449.794163] amdgpu 0000:00:08.0: GPU reset begin!
>>>>>>>>> [  449.794175] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread  pid 0, s_job:000000005086879e
>>>>>>>>> [  449.794221] Emily:amdgpu_job_free_cb,Process information: process pid 0 thread pid 0, s_job:0000000066eb74ab
>>>>>>>>> [  449.794222] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000d4438ad9
>>>>>>>>> [  449.794255] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000b6d69c65
>>>>>>>>> [  449.794257] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread pid 0, s_job:00000000ea85e922
>>>>>>>>> [  449.794287] Emily:amdgpu_job_free_cb,Process information: process  pid 0 thread  pid 0, s_job:00000000ed3a5ac6
>>>>>>>>> [  449.794366] BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0
>>>>>>>>> [  449.800818] PGD 0 P4D 0
>>>>>>>>> [  449.801040] Oops: 0000 [#1] SMP PTI
>>>>>>>>> [  449.801338] CPU: 3 PID: 55 Comm: kworker/3:1 Tainted: G OE 4.18.0-15-generic #16~18.04.1-Ubuntu
>>>>>>>>> [  449.802157] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>>>>>>> [  449.802944] Workqueue: events drm_sched_job_timedout [amd_sched]
>>>>>>>>> [  449.803488] RIP: 0010:amdgpu_device_gpu_recover+0x1da/0xb60 [amdgpu]
>>>>>>>>> [  449.804020] Code: dd ff ff 49 39 c5 48 89 55 a8 0f 85 56 ff ff ff 45 85 e4 0f 85 a1 00 00 00 48 8b 45 b0 48 85 c0 0f 84 60 01 00 00 48 8b 40 10 <48> 8b 98 c0 00 00 00 48 85 db 0f 84 4c 01 00 00 48 8b 43 48 a8 01
>>>>>>>>> [  449.805593] RSP: 0018:ffffb4c7c08f7d68 EFLAGS: 00010286
>>>>>>>>> [  449.806032] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>>> [  449.806625] RDX: ffffb4c7c08f5ac0 RSI: 0000000fffffffe0 RDI: 0000000000000246
>>>>>>>>> [  449.807224] RBP: ffffb4c7c08f7de0 R08: 00000068b9d54000 R09: 0000000000000000
>>>>>>>>> [  449.807818] R10: 0000000000000000 R11: 0000000000000148 R12: 0000000000000000
>>>>>>>>> [  449.808411] R13: ffffb4c7c08f7da0 R14: ffff8d82b8525d40 R15: ffff8d82b8525d40
>>>>>>>>> [  449.809004] FS: 0000000000000000(0000) GS:ffff8d82bfd80000(0000) knlGS:0000000000000000
>>>>>>>>> [  449.809674] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [  449.810153] CR2: 00000000000000c0 CR3: 000000003cc0a001 CR4: 00000000003606e0
>>>>>>>>> [  449.810747] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [  449.811344] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> [  449.811937] Call Trace:
>>>>>>>>> [  449.812206]  amdgpu_job_timedout+0x114/0x140 [amdgpu]
>>>>>>>>> [  449.812635]  drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> [  449.813139]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
>>>>>>>>> [  449.813609]  ? drm_sched_job_timedout+0x44/0x90 [amd_sched]
>>>>>>>>> [  449.814077]  process_one_work+0x1fd/0x3f0
>>>>>>>>> [  449.814417]  worker_thread+0x34/0x410
>>>>>>>>> [  449.814728]  kthread+0x121/0x140
>>>>>>>>> [  449.815004]  ? process_one_work+0x3f0/0x3f0
>>>>>>>>> [  449.815374]  ? kthread_create_worker_on_cpu+0x70/0x70
>>>>>>>>> [  449.815799]  ret_from_fork+0x35/0x40
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>> Sent: Friday, November 8, 2019 5:43 PM
>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for
>>>>>>>>>> tdr
>>>>>>>>>>
>>>>>>>>>> Am 08.11.19 um 10:39 schrieb Deng, Emily:
>>>>>>>>>>> Sorry, please take your time.
>>>>>>>>>> Have you seen my other response a bit below?
>>>>>>>>>>
>>>>>>>>>> I can't follow how it would be possible for job->s_fence to be
>>>>>>>>>> NULL without the job also being freed.
>>>>>>>>>>
>>>>>>>>>> So it looks like this patch is just papering over some bigger issues.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>> Emily Deng
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>>>>>>> Sent: Friday, November 8, 2019 5:08 PM
>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>> for tdr
>>>>>>>>>>>>
>>>>>>>>>>>> Am 08.11.19 um 09:52 schrieb Deng, Emily:
>>>>>>>>>>>>> Ping.....
>>>>>>>>>>>> You need to give me at least enough time to wake up :)
>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On
>>>>>>>>>>>>>> Behalf Of Deng, Emily
>>>>>>>>>>>>>> Sent: Friday, November 8, 2019 10:56 AM
>>>>>>>>>>>>>> To: Koenig, Christian <Christian.Koenig@amd.com>;
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue
>>>>>>>>>>>>>> for tdr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Christian König <ckoenig.leichtzumerken@gmail.com>
>>>>>>>>>>>>>>> Sent: Thursday, November 7, 2019 7:28 PM
>>>>>>>>>>>>>>> To: Deng, Emily <Emily.Deng@amd.com>;
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer
>>>>>>>>>>>>>>> issue for tdr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am 07.11.19 um 11:25 schrieb Emily Deng:
>>>>>>>>>>>>>>>> When the job is already signaled, the s_fence is freed.
>>>>>>>>>>>>>>>> This then causes a NULL pointer dereference in
>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover.
>>>>>>>>>>>>>>> NAK, the s_fence is only set to NULL when the job is
>>>>>>>>>>>>>>> destroyed.
>>>>>>>>>>>>>>> See drm_sched_job_cleanup().
>>>>>>>>>>>>>> I know it is set to NULL in drm_sched_job_cleanup. But in
>>>>>>>>>>>>>> one case, when amdgpu_device_gpu_recover is entered, the job
>>>>>>>>>>>>>> is already in drm_sched_job_cleanup and is about to be
>>>>>>>>>>>>>> freed. But amdgpu_device_gpu_recover is sometimes faster. At
>>>>>>>>>>>>>> that point the job is not freed yet, but s_fence is already
>>>>>>>>>>>>>> NULL.
>>>>>>>>>>>> No, that case can't happen. See here:
>>>>>>>>>>>>
>>>>>>>>>>>>>               drm_sched_job_cleanup(s_job);
>>>>>>>>>>>>>
>>>>>>>>>>>>>               amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>>>>>>>>>>               dma_fence_put(job->fence);
>>>>>>>>>>>>>               amdgpu_sync_free(&job->sync);
>>>>>>>>>>>>>               amdgpu_sync_free(&job->sched_sync);
>>>>>>>>>>>>>               kfree(job);
>>>>>>>>>>>> The job itself is freed up directly after freeing the
>>>>>>>>>>>> reference to the s_fence.
>>>>>>>>>>>> So you are just papering over a much bigger problem here.
>>>>>>>>>>>> This patch is a clear NAK.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you see a job without an s_fence then that means the
>>>>>>>>>>>>>>> problem is somewhere else.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Emily Deng <Emily.Deng@amd.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>>>>>>>>>>>>>>  drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>>>>>>>>>>>>>>  2 files changed, 7 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> index e6ce949..5a8f08e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>>  	 *
>>>>>>>>>>>>>>>>  	 * job->base holds a reference to parent fence
>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>> -	if (job && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>>>>>>>>>>>>>>  	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>>>>>>>>>>>>>>  		job_signaled = true;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> index 31809ca..56cc10e 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>>>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  			spin_lock(&rq->lock);
>>>>>>>>>>>>>>>>  			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>>>>>>>>>>>>>> -				if (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> -				    entity->fence_context) {
>>>>>>>>>>>>>>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>>>>>>>>>>>>>> +				    entity->fence_context)) {
>>>>>>>>>>>>>>>>  					if (atomic_read(&bad->karma) >
>>>>>>>>>>>>>>>>  					    bad->sched->hang_limit)
>>>>>>>>>>>>>>>>  						if (entity->guilty)
>>>>>>>>>>>>>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>  	 * This iteration is thread safe as sched thread is stopped.
>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>  	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> -		if (s_job->s_fence->parent &&
>>>>>>>>>>>>>>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>>>>>>>>>>>>>>  		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>>>>>>>>>>>>>>  					      &s_job->cb)) {
>>>>>>>>>>>>>>>>  			atomic_dec(&sched->hw_rq_count);
>>>>>>>>>>>>>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>>>>>>  			 *
>>>>>>>>>>>>>>>>  			 * Job is still alive so fence refcount at least 1
>>>>>>>>>>>>>>>>  			 */
>>>>>>>>>>>>>>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>> +			if (s_job->s_fence)
>>>>>>>>>>>>>>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  			/*
>>>>>>>>>>>>>>>>  			 * We must keep bad job alive for later use during
>>>>>>>>>>>>>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>>>>>>>>>>>>>>  	 * GPU recovers can't run in parallel.
>>>>>>>>>>>>>>>>  	 */
>>>>>>>>>>>>>>>>  	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>>>>>>>>>>>>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>>>>>>>>>>>>>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		atomic_inc(&sched->hw_rq_count);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx