From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?windows-1252?Q?Christian_K=F6nig?= <deathsimple@vodafone.de>
Subject: Re: Fence, timeline and android sync points
Date: Wed, 13 Aug 2014 16:08:14 +0200
Message-ID: <53EB714E.1060102@vodafone.de>
References: <20140812221340.GB5746@gmail.com> <53EB1ADE.5060104@vodafone.de>
 <20140813134145.GB2666@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
Return-path: <dri-devel-bounces@lists.freedesktop.org>
Received: from pegasos-out.vodafone.de (pegasos-out.vodafone.de [80.84.1.38])
 by gabe.freedesktop.org (Postfix) with ESMTP id 580306E0C7
 for <dri-devel@lists.freedesktop.org>; Wed, 13 Aug 2014 07:09:02 -0700 (PDT)
In-Reply-To: <20140813134145.GB2666@gmail.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
To: Jerome Glisse <j.glisse@gmail.com>
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com, dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

> The whole issue is that today the cs ioctl assumes implied synchronization. So
> this can not change, so for now anything that goes through the cs ioctl would
> need to use an implied timeline and have all rings that use a common buffer
> synchronize on it. As long as those rings use different buffers there is no
> need for sync.
Exactly my thoughts.

> Buffer objects are what link hw timelines.
A couple of people at AMD have a problem with that and I'm currently
working full time on a solution. But solving this and keeping 100%
backward compatibility at the same time is not an easy task.

> Of course there might be a way to be more flexible if timelines are exposed to
> userspace and userspace can create several of them for a single process.
Concurrent execution is mostly used for temporary things, e.g. copying a
result to a userspace buffer while VCE is decoding into the ring buffer
at a different location. Creating an extra timeline just to
tell the kernel that two commands are allowed to run in parallel sounds
like too much overhead to me.

Cheers,
Christian.

Am 13.08.2014 um 15:41 schrieb Jerome Glisse:
> On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
>> Hi Jerome,
>>
>> first of all that finally sounds like somebody starts to draw the whole
>> picture for me.
>>
>> So far all I have seen was a bunch of specialized requirements and some not
>> so obvious design decisions based on those requirements.
>>
>> So thanks a lot for finally summarizing the requirements from a top-level
>> view. I fully agree with your analysis of the current fence design
>> and the downsides of that API.
>>
>> Apart from that I also have some comments / requirements that hopefully can
>> be taken into account as well:
>>
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline points.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>> In the long term a requirement for the driver for AMD GFX hardware is that
>> instead of a fixed pipeline timeline we need a bit more flexible model where
>> concurrent execution on different hardware engines is possible as well.
>>
>> So the requirement is that you can do things like submitting a 3D job A, a
>> DMA job B, a VCE job C and another 3D job D that are executed like this:
>>      A
>>     /  \
>>    B  C
>>     \  /
>>      D
>>
>> (Let's just hope that looks as good on your mail client as it looked for
>> me).
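A minimal sketch of the A/B/C/D case above, assuming userspace names its dependencies explicitly per job. Everything here (`struct job`, `job_add_dep`, `job_ready`, `MAX_DEPS`) is hypothetical and only illustrates the dependency shape; it is not an existing driver interface, and completion is modelled as a plain flag rather than a real fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4  /* hypothetical limit, chosen only for this sketch */

/* Hypothetical job descriptor: a job may start once every job it
 * depends on has completed. */
struct job {
    const char       *name;
    bool              done;
    size_t            num_deps;
    const struct job *deps[MAX_DEPS];
};

/* Record an explicit dependency, as userspace would when naming a fence. */
static void job_add_dep(struct job *j, const struct job *dep)
{
    assert(j->num_deps < MAX_DEPS);
    j->deps[j->num_deps++] = dep;
}

/* A job is ready when it has not run yet and all its dependencies are done. */
static bool job_ready(const struct job *j)
{
    if (j->done)
        return false;
    for (size_t i = 0; i < j->num_deps; i++)
        if (!j->deps[i]->done)
            return false;
    return true;
}
```

With B and C each depending on A, and D depending on both B and C, B and C become ready together once A is done, so they can run concurrently on the DMA and VCE engines, while D stays blocked until both have finished.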
> My thinking of hw timeline is that a gpu like amd or nvidia would have several
> different hw timelines. They are per block/engine, so one for the dma ring, one for
> gfx, one for vce, ....
>
>> My current thinking is that we avoid having a pipeline object in the kernel
>> and instead let userspace specify explicitly which fence we want to synchronize
>> to, as long as everything stays within the same client. As soon as
>> any buffer is shared between clients, the kernel would need to fall back
>> to implicit synchronization to allow backward compatibility with DRI2/3.
> The whole issue is that today the cs ioctl assumes implied synchronization. So
> this can not change, so for now anything that goes through the cs ioctl would
> need to use an implied timeline and have all rings that use a common buffer
> synchronize on it. As long as those rings use different buffers there is no
> need for sync.
>
> Buffer objects are what link hw timelines.
>
> Of course there might be a way to be more flexible if timelines are exposed to
> userspace and userspace can create several of them for a single process.
>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value) with cop one
>>> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matters can do that. Those that can not should fix
>>> their command ring processor.
>> At least for some engines on AMD hardware that isn't possible (UVD, VCE and
>> to some extent DMA as well), but I don't see any reason why we shouldn't be
>> able to use software based scheduling on those engines by default. So this
>> isn't really a problem, but just an additional comment to keep in mind.
> Yes, not everything can do that, but as it's a simple memory access with a simple
> comparison it's easy to do on the cpu for limited hardware. But this really
> sounds like something so easy to add to hw ring execution that it is a shame
> hw designers have not already added such a thing.
>
>> Regards,
>> Christian.
>>
>> Am 13.08.2014 um 00:13 schrieb Jerome Glisse:
>>> Hi,
>>>
>>> So i went over the whole fence and sync point stuff as it's becoming a pressing
>>> issue. I think we first need to agree on what is the problem we want to solve
>>> and what would be the requirements to solve it.
>>>
>>> Problem :
>>>    Explicit synchronization between different hardware blocks over a buffer object.
>>>
>>> Requirements :
>>>    Share common infrastructure.
>>>    Allow optimal hardware command stream scheduling across hardware blocks.
>>>    Allow android sync point to be implemented on top of it.
>>>    Handle/acknowledge exception (like good old gpu lockup).
>>>    Minimize driver changes.
>>>
>>> Glossary :
>>>    hardware timeline: timeline bound to a specific hardware block.
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline points.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>>>    fence: specific point in a timeline where synchronization needs to happen.
>>>
>>>
>>> So now, the current include/linux/fence.h implementation is i believe missing the
>>> objective by confusing hardware and pipeline timelines and by bolting fences to
>>> buffer objects, while what is really needed is a true and proper timeline for both
>>> hardware and pipeline. But before going further down that road let me look at
>>> things and explain how i see them.
>>>
>>> Current ttm fences have one sole purpose: allow synchronization for buffer
>>> object moves, even though some drivers like radeon slightly abuse them and use them
>>> for things like lockup detection.
>>>
>>> The new fence wants to expose an api that would allow some implementation of a
>>> timeline. For that it introduces callbacks and some hard requirements on what the
>>> driver has to expose :
>>>    enable_signaling
>>>    [signaled]
>>>    wait
>>>
>>> Each of those has to do work inside the driver to which the fence belongs, and
>>> each of those can be called from more or less unexpected contexts (with restrictions
>>> like outside irq). So we end up with things like :
>>>
>>>   Process 1              Process 2                   Process 3
>>>   I_A_schedule(fence0)
>>>                          CI_A_F_B_signaled(fence0)
>>>                                                      I_A_signal(fence0)
>>>                                                      CI_B_F_A_callback(fence0)
>>>                          CI_A_F_B_wait(fence0)
>>> Lexicon:
>>> I_x  in driver x (I_A == in driver A)
>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>>>
>>> So this is a happy mess, everyone calls everyone, and this is bound to get messy.
>>> Yes i know there are all kinds of requirements on what happens once a fence is
>>> signaled. But those requirements only look like they are trying to atone for any
>>> mess that can happen from the whole callback dance.
>>>
>>> While i too was seduced by the whole callback idea a long time ago, i think it is
>>> a highly dangerous path to take, where the combinations of what could happen
>>> are bound to explode with the increase in the number of players.
>>>
>>>
>>> So now back to how to solve the problem we are trying to address. First i want
>>> to make an observation: almost all GPUs that exist today have a command ring
>>> on which userspace command buffers are executed, and inside the command ring
>>> you can do something like :
>>>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value) with cop one
>>> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matters can do that. Those that can not should fix
>>> their command ring processor.
>>>
>>>
>>> With that in mind, i think the proper solution is implementing timelines and having
>>> the fence be a timeline object with a way simpler api. For each hardware timeline
>>> the driver provides a system memory address at which the latest signaled fence
>>> sequence number can be read. Each fence object is uniquely associated with
>>> both a hardware and a pipeline timeline. Each pipeline timeline has a wait
>>> queue.
>>>
>>> When scheduling something that requires synchronization on a hardware timeline
>>> a fence is created and associated with the pipeline timeline and the hardware
>>> timeline. Other hardware blocks that need to wait on a fence can use their
>>> command ring conditional execution to directly check the fence sequence from
>>> the other hw block, so you do optimistic scheduling. If optimistic scheduling
>>> fails (which would be reported by a hw block specific solution and hidden) then
>>> things can fall back to a software cpu wait inside what could be considered the
>>> kernel thread of the pipeline timeline.
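A minimal sketch of the sequence number check described above, assuming each hardware timeline publishes its latest signaled sequence number at a readable memory address. The `struct fence` here is a reduced, hypothetical version with only the two fields needed for the check; `signaled_seq`, `fence_signaled`, `schedule_path` and the `SCHED_*` values are illustrative names, not an existing API. The check is done on the CPU here, standing in for the comparison the command ring would perform itself, and wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware timeline: the hw block writes the latest signaled
 * sequence number to system memory, where anyone may read it. */
struct hw_timeline {
    const volatile uint64_t *signaled_seq;
};

/* Reduced fence: just the timeline it belongs to and its position on it. */
struct fence {
    struct hw_timeline *hw_timeline;
    uint64_t            seq_num;
};

/* A fence is signaled once its timeline has reached its sequence number. */
static bool fence_signaled(const struct fence *f)
{
    assert(f != NULL && f->hw_timeline != NULL);
    assert(f->hw_timeline->signaled_seq != NULL);
    return *f->hw_timeline->signaled_seq >= f->seq_num;
}

enum sched_path {
    SCHED_RUN_NOW,   /* all dependencies already signaled           */
    SCHED_CPU_WAIT   /* fall back to a software wait on the CPU     */
};

/* Decide which path a job with n dependencies takes. */
static enum sched_path schedule_path(const struct fence *deps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!fence_signaled(&deps[i]))
            return SCHED_CPU_WAIT;
    return SCHED_RUN_NOW;
}
```

The point of the design is that checking a dependency is a plain memory read and comparison, with no call into the driver that owns the other timeline.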
>>>
>>>
>>> From the api point of view there is no inter-driver call. All the driver needs to
>>> do is wake up the pipeline timeline wait_queue when things are signaled or
>>> when things go sideways (gpu lockup).
>>>
>>>
>>> So how to implement that with current drivers ? Well, easy. Currently we assume
>>> implicit synchronization, so all we need is an implicit pipeline timeline per
>>> userspace process (note this does not prevent inter process synchronization).
>>> Every time a command buffer is submitted it is added to the implicit timeline
>>> with the simple fence object :
>>>
>>> struct fence {
>>>    struct list_head   list_hwtimeline;
>>>    struct list_head   list_pipetimeline;
>>>    struct hw_timeline *hw_timeline;
>>>    uint64_t           seq_num;
>>>    work_t             timedout_work;
>>>    void               *csdata;
>>> };
>>>
>>> So with a set of helper functions called by each driver's command execution
>>> ioctl you have the implicit timeline properly populated, and each
>>> driver command execution gets its dependencies from the implicit timeline.
>>>
>>>
>>> Of course to take full advantage of all the flexibility this could offer we
>>> would need to allow userspace to create pipeline timelines and to schedule
>>> against the pipeline timeline of their choice. We could create a file for
>>> each of the pipeline timelines and have file operations to wait/query
>>> progress.
>>>
>>> Note that gpu lockups are considered exceptional events; the implicit
>>> timeline will probably want to continue with other jobs on other hardware
>>> blocks, but the explicit one will probably want to decide whether to continue,
>>> abort, or retry without the faulty hw block.
>>>
>>>
>>> I realize i am late to the party and that i should have taken a serious
>>> look at all this a long time ago. I apologize for that, and if you consider
>>> this too late then just ignore me, modulo the big warning about the craziness
>>> that callbacks will introduce and how bad things are bound to happen. I am not
>>> saying that bad things can not happen with what i propose, just that
>>> because everything happens inside the process context that is the one
>>> asking/requiring synchronization there will be no interprocess kernel
>>> callback (a callback that was registered by one process and that is called
>>> inside another process time slice because fence signaling is happening
>>> inside this other process time slice).
>>>
>>>
>>> Pseudo code for explicitness :
>>>
>>> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
>>> {
>>>     struct fence *dependency[16], *fence;
>>>     int m;
>>>
>>>     // collect the pending dependencies from the per-file implicit timeline
>>>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>>>                           dependency, 16, &fence);
>>>     if (m < 0)
>>>       return m;
>>>     if (m >= 16) {
>>>         // alloc m and recall;
>>>     }
>>>     return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
>>> }
>>>
>>> int timeline_schedule(ptimeline, hwtimeline, timeout,
>>>                         dependency, mdep, **fence)
>>> {
>>>     // allocate fence set hw_timeline and init work
>>>     // build up list of dependency by looking at list of pending fence in
>>>     // timeline
>>> }
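A minimal sketch of the "return the count, let the caller grow the array and call again" convention that the wrapper above relies on. `struct pipeline_timeline`, its `pending` array and `timeline_collect_deps` are hypothetical stand-ins, not the proposed `timeline_schedule`; fences are reduced to bare sequence numbers so the calling convention is the only thing shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pipeline timeline: just the sequence numbers of the fences
 * that are still pending on it. */
struct pipeline_timeline {
    const uint64_t *pending;
    size_t          num_pending;
};

/* Copy at most max_deps pending fences into deps and return the total
 * number of pending fences. A return value larger than max_deps tells the
 * caller the list was truncated and a bigger array is needed. Passing
 * max_deps == 0 (deps may then be NULL) only queries the count. */
static size_t timeline_collect_deps(const struct pipeline_timeline *tl,
                                    uint64_t *deps, size_t max_deps)
{
    size_t i;

    assert(tl != NULL);
    assert(deps != NULL || max_deps == 0);

    for (i = 0; i < tl->num_pending && i < max_deps; i++)
        deps[i] = tl->pending[i];

    return tl->num_pending;
}
```

A caller would first try a small on-stack array, and only when the returned count exceeds its capacity allocate an array of that size and call again. This keeps the common case (few dependencies) free of allocations.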
>>>
>>>
>>>
>>> // If the device driver schedules a job hoping for all dependencies to be signaled then
>>> // it must also call this function with csdata being a copy of what needs to be
>>> // executed once all dependencies are signaled
>>> void timeline_missed_schedule(timeline, fence, void *csdata)
>>> {
>>>     INITWORK(fence->timedout_work, timeline_missed_schedule_worker)
>>>     fence->csdata = csdata;
>>>     schedule_delayed_work(fence->timedout_work, default_timeout)
>>> }
>>>
>>> void timeline_missed_schedule_worker(work)
>>> {
>>>     // fence is the fence whose timedout_work is this work item
>>>     driver = driver_from_fence_hwtimeline(fence)
>>>
>>>     // Make sure that each of the hwtimeline dependency will fire irq by
>>>     // calling a driver function.
>>>     timeline_wait_for_fence_dependency(fence);
>>>     driver->execute_cs(driver, fence);
>>> }
>>>
>>> // This function is called by driver code that signals a fence (could be called from
>>> // interrupt context). It is the responsibility of the device driver to call that
>>> // function.
>>> void timeline_signal(hwtimeline)
>>> {
>>>    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>>>      wakeup(fence->pipetimeline->wait_queue);
>>>    }
>>> }
>>>
>>>
>>> Cheers,
>>> Jérôme
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel