Re: [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

From: "Christian König" <ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: "Grodzovsky,
	Andrey" <Andrey.Grodzovsky-5C7GfCeVMHo@public.gmane.org>,
	"Koenig,
	Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
	"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
	"dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
	<dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Deucher, Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
Subject: Re: [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset
Date: Fri, 19 Oct 2018 18:28:25 +0200	[thread overview]
Message-ID: <732e8d99-e681-e2a4-506d-6aed0d7fc594@gmail.com> (raw)
In-Reply-To: <0928be66-d606-84b8-4fa8-399e5615f75c-5C7GfCeVMHo@public.gmane.org>

Am 19.10.18 um 17:06 schrieb Grodzovsky, Andrey:
>
> On 10/19/2018 03:08 AM, Koenig, Christian wrote:
>> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>>> A ring might become unusable after reset, if that the case
>>> drm_sched_entity_select_rq will choose another, working rq
>>> to run the job if there is one.
>>> Also, skip recovery of ring which is not ready.
>> Well that is not even remotely sufficient.
>>
>> If we can't bring up a ring any more after a reset we would need to move
>> all jobs which where previously scheduled on it and all of its entities
>> to a different instance.
> That one should be easy to add inside amdgpu_device_gpu_recover in case
> ring is dead, we just do the same for all the jobs in
> sched->ring_mirror_list of the dead ring
> before doing recovery for them, no?

Correct, you need to execute all jobs from the ring_mirror_list as well 
as move all entities to other rqs.

But there are some problem with that see below.

>> What you do here breaks dependencies between jobs and can result in
>> unforeseen consequences including random memory writes.
> Can you explain this a bit more ? AFAIK any job dependent on this job
> will still wait for it's completion
> before running regardless of did this job moved to a different ring.
> What am I missing ?

The stuff dependent on the moving jobs should indeed work correctly.

But you don't necessary have space to execute the ring_mirror_list on 
another scheduler. And to that the jobs are prepared to run on a 
specific ring, e.g. UVD jobs are patches, VMIDs assigned etc etc...

So that most likely won't work correctly.

>> As far as I can see that can't be done correctly with the current
>> scheduler design.
>>
>> Additional to that when we can't restart one instance of a ring we
>> usually can't restart all others either. So the whole approach is rather
>> pointless.
>   From my testing looks like we can, compute ring 0 is dead but IB tests
> pass on other compute rings.

Interesting, but I would rather investigate why compute ring 0 is dead 
while other still work.

At least some compute rings should be handled by the same engine, so if 
one is dead all other should be dead as well.

Christian.

>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>>     1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d11489e..3124ca1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>     	else
>>>     		r = amdgpu_device_reset(adev);
>>>     
>>> +	/*
>>> +	 * After reboot a ring might fail in which case this will
>>> +	 * move the job to different rq if possible
>>> +	 */
>>> +	if (job) {
>>> +		drm_sched_entity_select_rq(job->base.entity);
>>> +		if (job->base.entity->rq) {
>>> +			job->base.sched = job->base.entity->rq->sched;
>>> +		} else {
>>> +			job->base.sched = NULL;
>>> +			r = -ENOENT;
>>> +		}
>>> +	}
>>> +
>>>     	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>     		struct amdgpu_ring *ring = adev->rings[i];
>>>     
>>> -		if (!ring || !ring->sched.thread)
>>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>>     			continue;
>>>     
>>>     		/* only need recovery sched of the given job's ring
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx