From: "Christian König" <ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: "Grodzovsky,
Andrey" <Andrey.Grodzovsky-5C7GfCeVMHo@public.gmane.org>,
"Koenig,
Christian" <Christian.Koenig-5C7GfCeVMHo@public.gmane.org>,
"amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>,
"dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org"
<dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Cc: "Deucher, Alexander" <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org>
Subject: Re: [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset
Date: Fri, 19 Oct 2018 18:28:25 +0200 [thread overview]
Message-ID: <732e8d99-e681-e2a4-506d-6aed0d7fc594@gmail.com> (raw)
In-Reply-To: <0928be66-d606-84b8-4fa8-399e5615f75c-5C7GfCeVMHo@public.gmane.org>
Am 19.10.18 um 17:06 schrieb Grodzovsky, Andrey:
>
> On 10/19/2018 03:08 AM, Koenig, Christian wrote:
>> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>>> A ring might become unusable after reset, if that the case
>>> drm_sched_entity_select_rq will choose another, working rq
>>> to run the job if there is one.
>>> Also, skip recovery of ring which is not ready.
>> Well that is not even remotely sufficient.
>>
>> If we can't bring up a ring any more after a reset we would need to move
>> all jobs which where previously scheduled on it and all of its entities
>> to a different instance.
> That one should be easy to add inside amdgpu_device_gpu_recover in case
> ring is dead, we just do the same for all the jobs in
> sched->ring_mirror_list of the dead ring
> before doing recovery for them, no?
Correct, you need to execute all jobs from the ring_mirror_list as well
as move all entities to other rqs.
But there are some problem with that see below.
>> What you do here breaks dependencies between jobs and can result in
>> unforeseen consequences including random memory writes.
> Can you explain this a bit more ? AFAIK any job dependent on this job
> will still wait for it's completion
> before running regardless of did this job moved to a different ring.
> What am I missing ?
The stuff dependent on the moving jobs should indeed work correctly.
But you don't necessary have space to execute the ring_mirror_list on
another scheduler. And to that the jobs are prepared to run on a
specific ring, e.g. UVD jobs are patches, VMIDs assigned etc etc...
So that most likely won't work correctly.
>> As far as I can see that can't be done correctly with the current
>> scheduler design.
>>
>> Additional to that when we can't restart one instance of a ring we
>> usually can't restart all others either. So the whole approach is rather
>> pointless.
> From my testing looks like we can, compute ring 0 is dead but IB tests
> pass on other compute rings.
Interesting, but I would rather investigate why compute ring 0 is dead
while other still work.
At least some compute rings should be handled by the same engine, so if
one is dead all other should be dead as well.
Christian.
>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d11489e..3124ca1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> else
>>> r = amdgpu_device_reset(adev);
>>>
>>> + /*
>>> + * After reboot a ring might fail in which case this will
>>> + * move the job to different rq if possible
>>> + */
>>> + if (job) {
>>> + drm_sched_entity_select_rq(job->base.entity);
>>> + if (job->base.entity->rq) {
>>> + job->base.sched = job->base.entity->rq->sched;
>>> + } else {
>>> + job->base.sched = NULL;
>>> + r = -ENOENT;
>>> + }
>>> + }
>>> +
>>> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> struct amdgpu_ring *ring = adev->rings[i];
>>>
>>> - if (!ring || !ring->sched.thread)
>>> + if (!ring || !ring->ready || !ring->sched.thread)
>>> continue;
>>>
>>> /* only need recovery sched of the given job's ring
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
next prev parent reply other threads:[~2018-10-19 16:28 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-10-18 18:44 [PATCH 1/3] drm/sched: Add callback to mark if sched is ready to work Andrey Grodzovsky
2018-10-18 18:44 ` [PATCH 2/3] drm/sched: Expose drm_sched_entity_select_rq to use in drivers Andrey Grodzovsky
[not found] ` <1539888261-331-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2018-10-18 18:44 ` [PATCH 3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset Andrey Grodzovsky
[not found] ` <1539888261-331-3-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2018-10-19 7:08 ` Koenig, Christian
[not found] ` <92efce5c-f9ad-847b-09b3-e7c6f7f13ff6-5C7GfCeVMHo@public.gmane.org>
2018-10-19 15:06 ` Grodzovsky, Andrey
[not found] ` <0928be66-d606-84b8-4fa8-399e5615f75c-5C7GfCeVMHo@public.gmane.org>
2018-10-19 16:28 ` Christian König [this message]
2018-10-19 16:50 ` Grodzovsky, Andrey
2018-10-19 7:13 ` [PATCH 1/3] drm/sched: Add callback to mark if sched is ready to work Michel Dänzer
[not found] ` <567b553a-eef7-9560-a789-f1c8bffe3206-otUistvHUpPR7s880joybQ@public.gmane.org>
2018-10-19 14:45 ` Grodzovsky, Andrey
2018-10-19 7:03 ` Koenig, Christian
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=732e8d99-e681-e2a4-506d-6aed0d7fc594@gmail.com \
--to=ckoenig.leichtzumerken-re5jqeeqqe8avxtiumwx3w@public.gmane.org \
--cc=Alexander.Deucher-5C7GfCeVMHo@public.gmane.org \
--cc=Andrey.Grodzovsky-5C7GfCeVMHo@public.gmane.org \
--cc=Christian.Koenig-5C7GfCeVMHo@public.gmane.org \
--cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
--cc=dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.