From: zhoucm1 <david1.zhou-5C7GfCeVMHo@public.gmane.org>
To: "Christian König"
	<deathsimple-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH 00/11] add recovery entity
Date: Thu, 4 Aug 2016 17:04:42 +0800
Message-ID: <57A3052A.5090001@amd.com>
In-Reply-To: <a8ade743-309a-6982-7bad-a7c5648bd5e2-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>



On 2016-08-04 16:39, Christian König wrote:
> On 04.08.2016 at 05:10, zhoucm1 wrote:
>>
>>
>> On 2016-08-03 21:43, Christian König wrote:
>>> Well that is a clear NAK to this whole approach.
>>>
>>> Submitting the recovery jobs to the scheduler is reentrant because
>>> the scheduler is the one that originally signaled the timeout to us.
>> We have reset all recovery jobs, right? Could we treat those jobs the
>> same as the others?
>
> No they aren't. For recovery jobs you don't want a timeout which 
> triggers another GPU reset while your first one is still under way.
>
>>
>>>
>>> Why not submit the recovery jobs to the hardware ring directly?
>> Yeah, this is also what I did at the beginning.
>> The main reasons are:
>> 0. recovery jobs need to wait at least until the recovery of their
>> own page table has completed.
>
> Well, as noted in the other thread we need to recover the GART table 
> with the CPU anyway.
>
>> 1. direct submission uses run_job, which the scheduler uses as
>> well, so it could introduce conflicts.
>
> The scheduler should be completely stopped during the GPU reset, so 
> there shouldn't be any other processing.
>
>> 2. if all VM clients use one SDMA engine, restoring is slow. If each
>> VM can use its own PTE ring, then we will use all SDMA engines for
>> them.
>
> A single SDMA engine should be able to max out the PCIe speed in one 
> direction, no need to offload that to both engines. If we really need 
> both engines we could also simply handle that in the recovery code as 
> well.
>
>>
>> 3. if just one entity recovers all VM page tables, then their
>> recovery jobs will have a potential dependency, each later job
>> waiting for the one in front. If each VM has its own entity, there
>> will be no dependency between them.
>> 4. if the recovery entity is based on the kernel run queue, then the
>> recovery jobs could be executed at the same time as pt jobs.
>
> Well that's exactly the reason why I don't want to push those jobs 
> through the scheduler. The scheduler should be stopped during the GPU 
> reset so that nothing else happens with the hardware.
>
> E.g. when other jobs run concurrently with the recovery jobs you can
> have all kinds of problems, like one SDMA engine doing a recovery
> while the other one does a backup on the same BO, etc.
OK, I see what you mean. That means all recovery pt jobs of the pt
scheduler must be completed before directly submitting the recovery
jobs, which indeed simplifies many problems, especially the various
kinds of fence sync.
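
The ordering agreed on above can be sketched as a toy model. This is only an illustration in Python, not amdgpu code: the Scheduler and Ring classes, gpu_reset(), and the job names are all hypothetical, and the CPU restore of the GART table is modelled as nothing more than a log entry.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy stand-in for a GPU scheduler (hypothetical, not the amdgpu struct)."""
    running: bool = True
    pending: list = field(default_factory=list)

    def submit(self, job):
        # While a reset is in progress nothing may go through the scheduler.
        if not self.running:
            raise RuntimeError("scheduler is stopped during GPU reset")
        self.pending.append(job)


@dataclass
class Ring:
    """Toy hardware ring that records jobs in the order they were run."""
    executed: list = field(default_factory=list)

    def run_direct(self, job):
        self.executed.append(job)


def gpu_reset(schedulers, ring, recovery_jobs, log):
    # 1. Stop every scheduler so nothing else happens with the hardware.
    for s in schedulers:
        s.running = False
    log.append("schedulers stopped")

    # 2. Restore the GART table with the CPU (assumption: log entry only).
    log.append("GART restored by CPU")

    # 3. Run the recovery jobs directly on the ring, bypassing the
    #    scheduler, so no scheduler timeout can start a nested reset.
    for job in recovery_jobs:
        ring.run_direct(job)
    log.append("recovery jobs done")

    # 4. Restart the schedulers; ordinary jobs may run again.
    for s in schedulers:
        s.running = True
    log.append("schedulers restarted")


scheds = [Scheduler(), Scheduler()]
ring = Ring()
log = []
gpu_reset(scheds, ring, ["restore-vm0-pt", "restore-vm1-pt"], log)
```

Because the schedulers stay stopped for the whole of steps 2 and 3, the recovery jobs cannot race with ordinary jobs, which is the fence-sync simplification mentioned above.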

Regards,
David Zhou
>
> Regards,
> Christian.
>
>>
>> The above is why I introduced the recovery entity and recovery run queue.
>>
>> Regards,
>> David Zhou
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 28.07.2016 at 12:13, Chunming Zhou wrote:
>>>> Every VM has its own recovery entity, which is used to recover its
>>>> page table from its shadow.
>>>> They don't need to wait for the VM in front to complete.
>>>> Also, using all PTE rings can speed up recovery.
>>>>
>>>> Every scheduler has its own recovery entity, which is used to save
>>>> hw jobs and resubmit them, which solves the conflicts between the
>>>> reset thread and the scheduler thread when running jobs.
>>>>
>>>> And some fixes made along with this improvement.
>>>>
>>>> Chunming Zhou (11):
>>>>    drm/amdgpu: hw ring should be empty when gpu reset
>>>>    drm/amdgpu: specify entity to amdgpu_copy_buffer
>>>>    drm/amd: add recover run queue for scheduler
>>>>    drm/amdgpu: fix vm init error path
>>>>    drm/amdgpu: add vm recover entity
>>>>    drm/amdgpu: use all pte rings to recover page table
>>>>    drm/amd: add recover entity for every scheduler
>>>>    drm/amd: use scheduler to recover hw jobs
>>>>    drm/amd: hw job list should be exact
>>>>    drm/amd: reset jobs to recover entity
>>>>    drm/amdgpu: no need fence wait every time
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c |   3 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |  35 +++++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c      |  11 +++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_test.c      |   8 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |  26 ++++--
>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 +++++++++++++-------------
>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   4 +-
>>>>   9 files changed, 134 insertions(+), 92 deletions(-)
>>>>
>>>
>>
>
