From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?windows-1252?Q?Christian_K=F6nig?= <deathsimple@vodafone.de>
Subject: Re: Fence, timeline and android sync points
Date: Wed, 13 Aug 2014 09:59:26 +0200
Message-ID: <53EB1ADE.5060104@vodafone.de>
References: <20140812221340.GB5746@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; Format="flowed"
Content-Transfer-Encoding: quoted-printable
Return-path: <dri-devel-bounces@lists.freedesktop.org>
Received: from pegasos-out.vodafone.de (pegasos-out.vodafone.de [80.84.1.38])
 by gabe.freedesktop.org (Postfix) with ESMTP id B71716E1B1
 for <dri-devel@lists.freedesktop.org>; Wed, 13 Aug 2014 01:00:04 -0700 (PDT)
In-Reply-To: <20140812221340.GB5746@gmail.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
To: Jerome Glisse <j.glisse@gmail.com>, dri-devel@lists.freedesktop.org, maarten.lankhorst@canonical.com
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com
List-Id: dri-devel@lists.freedesktop.org

Hi Jerome,

first of all, that finally sounds like somebody is starting to draw the
whole picture for me.

So far all I have seen was a bunch of specialized requirements and some
not-so-obvious design decisions based on those requirements.

So thanks a lot for finally summarizing the requirements from a top-down
view. I fully agree with your analysis of the current fence design and
the downsides of that API.

Apart from that, I also have some comments / requirements that hopefully
can be taken into account as well:

>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                       point on that timeline can be a composite of several
>                       different hardware pipeline point.
>    pipeline: abstract object representing userspace application graphic pipeline
>              of each of the application graphic operations.
In the long term, a requirement for the driver for AMD GFX hardware is
that instead of a fixed pipeline timeline we need a somewhat more
flexible model where concurrent execution on different hardware engines
is possible as well.

So the requirement is that you can do things like submitting a 3D job A,
a DMA job B, a VCE job C and another 3D job D that are executed like this:
     A
    /  \
   B  C
    \  /
     D

(Let's just hope that looks as good in your mail client as it looked for
me).

My current thinking is that we avoid having a pipeline object in the
kernel and instead let userspace specify explicitly which fences we want
to synchronize to, as long as everything stays within the same client.
As soon as any buffer is shared between clients, the kernel would need
to fall back to implicit synchronization to allow backward compatibility
with DRI2/3.
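
To make the diamond above a bit more concrete, here is a minimal
userspace-style sketch in C of what I mean by explicit fence
dependencies. All names in it (hw_timeline, job, job_add_dep, job_ready,
job_complete) are made up for illustration and are not an existing
kernel or driver interface. It also assumes each engine retires its jobs
in order, so a single sequence number per engine is enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEPS 4

/* Hypothetical per-engine timeline; the engine bumps signaled_seq as jobs retire. */
struct hw_timeline {
    uint64_t signaled_seq; /* last sequence number known to be finished */
    uint64_t next_seq;     /* last sequence number handed out */
};

/* A fence is just a point on one engine's timeline. */
struct fence {
    struct hw_timeline *tl;
    uint64_t seq;
};

struct job {
    struct fence fence;                 /* signaled when this job completes */
    const struct fence *deps[MAX_DEPS]; /* fences picked explicitly by userspace */
    size_t ndeps;
};

bool fence_signaled(const struct fence *f)
{
    return f->tl->signaled_seq >= f->seq;
}

/* Allocate the next point on the engine's timeline for this job. */
void job_init(struct job *j, struct hw_timeline *tl)
{
    j->fence.tl = tl;
    j->fence.seq = ++tl->next_seq;
    j->ndeps = 0;
}

/* Record that job j must not start before job "on" has finished. */
void job_add_dep(struct job *j, const struct job *on)
{
    assert(j->ndeps < MAX_DEPS);
    j->deps[j->ndeps++] = &on->fence;
}

/* A job may run once every fence it depends on has signaled. */
bool job_ready(const struct job *j)
{
    for (size_t i = 0; i < j->ndeps; i++)
        if (!fence_signaled(j->deps[i]))
            return false;
    return true;
}

/* Stand-in for the engine retiring the job (in-order completion assumed). */
void job_complete(struct job *j)
{
    j->fence.tl->signaled_seq = j->fence.seq;
}
```

With three timelines (3D, DMA, VCE), B and C each depend only on A's
fence and D depends on both B's and C's fences, so B and C can run
concurrently while D waits for both. No pipeline object is involved.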

>    if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matter can do that. Those who can not should fix
> their command ring processor.
At least for some engines on AMD hardware that isn't possible (UVD, VCE
and to some extent DMA as well), but I don't see any reason why we
shouldn't be able to use software-based scheduling on those engines by
default. So this isn't really a problem, just an additional point to
keep in mind.
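
As a rough sketch of that fallback, assuming a per-engine capability
flag: the condition is the same (memory_address cop value) test in both
cases, only who evaluates it differs. The names below (cond, cond_eval,
engine.has_cond_exec, choose_submit_path) are hypothetical and only
illustrate the decision, they are not taken from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The generic comparisons from the quoted text: ==, <, >, <=, >= */
enum cmp_op { CMP_EQ, CMP_LT, CMP_GT, CMP_LE, CMP_GE };

/* (memory_address cop value) */
struct cond {
    const volatile uint64_t *addr; /* e.g. where a timeline's sequence number lives */
    enum cmp_op op;
    uint64_t value;
};

/* CPU-side evaluation of the condition, used by the software scheduler. */
bool cond_eval(const struct cond *c)
{
    uint64_t cur = *c->addr;

    switch (c->op) {
    case CMP_EQ: return cur == c->value;
    case CMP_LT: return cur <  c->value;
    case CMP_GT: return cur >  c->value;
    case CMP_LE: return cur <= c->value;
    case CMP_GE: return cur >= c->value;
    }
    return false;
}

/* Hypothetical capability flag: can the engine's ring do conditional execution? */
struct engine {
    bool has_cond_exec;
};

enum submit_path {
    SUBMIT_HW_CONDITIONAL, /* emit the condition into the ring, let the hw check it */
    SUBMIT_SW_NOW,         /* condition already true, submit unconditionally */
    SUBMIT_SW_DEFERRED     /* wait on the CPU and submit once the condition holds */
};

enum submit_path choose_submit_path(const struct engine *e, const struct cond *c)
{
    if (e->has_cond_exec)
        return SUBMIT_HW_CONDITIONAL;
    return cond_eval(c) ? SUBMIT_SW_NOW : SUBMIT_SW_DEFERRED;
}
```

An engine like UVD or VCE would have has_cond_exec cleared and always
take one of the two software paths.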

Regards,
Christian.

On 13.08.2014 at 00:13, Jerome Glisse wrote:
> Hi,
>
> So i went over the whole fence and sync point stuff as it's becoming a pressing
> issue. I think we first need to agree on what is the problem we want to solve
> and what would be the requirements to solve it.
>
> Problem :
>    Explicit synchronization between different hardware block over a buffer object.
>
> Requirements :
>    Share common infrastructure.
>    Allow optimal hardware command stream scheduling across hardware block.
>    Allow android sync point to be implemented on top of it.
>    Handle/acknowledge exception (like good old gpu lockup).
>    Minimize driver changes.
>
> Glossary :
>    hardware timeline: timeline bound to a specific hardware block.
>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                       point on that timeline can be a composite of several
>                       different hardware pipeline point.
>    pipeline: abstract object representing userspace application graphic pipeline
>              of each of the application graphic operations.
>    fence: specific point in a timeline where synchronization needs to happen.
>
>
> So now, current include/linux/fence.h implementation is i believe missing the
> objective by confusing hardware and pipeline timeline and by bolting fence to
> buffer object while what is really needed is true and proper timeline for both
> hardware and pipeline. But before going further down that road let me look at
> things and explain how i see them.
>
> Current ttm fence have one and a sole purpose, allow synchronization for buffer
> object move even though some driver like radeon slightly abuse it and use them
> for things like lockup detection.
>
> The new fence want to expose an api that would allow some implementation of a
> timeline. For that it introduces callback and some hard requirement on what the
> driver have to expose :
>    enable_signaling
>    [signaled]
>    wait
>
> Each of those have to do work inside the driver to which the fence belongs and
> each of those can be call more or less from unexpected (with restriction like
> outside irq) context. So we end up with thing like :
>
>   Process 1              Process 2                   Process 3
>   I_A_schedule(fence0)
>                          CI_A_F_B_signaled(fence0)
>                                                      I_A_signal(fence0)
>                                                      CI_B_F_A_callback(fence0)
>                          CI_A_F_B_wait(fence0)
> Legend:
> I_x  in driver x (I_A == in driver A)
> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>
> So this is an happy mess everyone call everyone and this bound to get messy.
> Yes i know there is all kind of requirement on what happen once a fence is
> signaled. But those requirement only looks like they are trying to atone any
> mess that can happen from the whole callback dance.
>
> While i was too seduced by the whole callback idea long time ago, i think it is
> a highly dangerous path to take where the combinatorial of what could happen
> are bound to explode with the increase in the number of players.
>
>
> So now back to how to solve the problem we are trying to address. First i want
> to make an observation, almost all GPU that exist today have a command ring
> on to which userspace command buffer are executed and inside the command ring
> you can do something like :
>
>    if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matter can do that. Those who can not should fix
> their command ring processor.
>
>
> With that in mind, i think proper solution is implementing timeline and having
> fence be a timeline object with a way simpler api. For each hardware timeline
> driver provide a system memory address at which the latest signaled fence
> sequence number can be read. Each fence object is uniquely associated with
> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> queue.
>
> When scheduling something that require synchronization on a hardware timeline
> a fence is created and associated with the pipeline timeline and hardware
> timeline. Other hardware block that need to wait on a fence can use their
> command ring conditional execution to directly check the fence sequence from
> the other hw block so you do optimistic scheduling. If optimistic scheduling
> fails (which would be reported by hw block specific solution and hidden) then
> things can fallback to software cpu wait inside what could be considered the
> kernel thread of the pipeline timeline.
>
>
> From api point of view there is no inter-driver call. All the driver needs to
> do is wakeup the pipeline timeline wait_queue when things are signaled or
> when things go sideways (gpu lockup).
>
>
> So how to implement that with current driver ? Well easy. Currently we assume
> implicit synchronization so all we need is an implicit pipeline timeline per
> userspace process (note this do not prevent inter process synchronization).
> Every time a command buffer is submitted it is added to the implicit timeline
> with the simple fence object :
>
> struct fence {
>    struct list_head   list_hwtimeline;
>    struct list_head   list_pipetimeline;
>    struct hw_timeline *hw_timeline;
>    uint64_t           seq_num;
>    work_t             timedout_work;
>    void               *csdata;
> };
>
> So with set of helper function call by each of the driver command execution
> ioctl you have the implicit timeline that is properly populated and each
> driver command execution get the dependency from the implicit timeline.
>
>
> Of course to take full advantages of all flexibilities this could offer we
> would need to allow userspace to create pipeline timeline and to schedule
> against the pipeline timeline of their choice. We could create file for
> each of the pipeline timeline and have file operation to wait/query
> progress.
>
> Note that the gpu lockup are considered exceptional event, the implicit
> timeline will probably want to continue on other job on other hardware
> block but the explicit one probably will want to decide whether to continue
> or abort or retry without the faulty hw block.
>
>
> I realize i am late to the party and that i should have taken a serious
> look at all this long time ago. I apologize for that and if you consider
> this is too late then just ignore me modulo the big warning about the craziness
> that callback will introduce and how bad things are bound to happen. I am not
> saying that bad things can not happen with what i propose just that
> because everything happen inside the process context that is the one
> asking/requiring synchronization there will be no interprocess kernel
> callback (a callback that was registered by one process and that is call
> inside another process time slice because fence signaling is happening
> inside this other process time slice).
>
>
> Pseudo code for explicitness :
>
> drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>     struct fence *dependency[16], *fence;
>     int m;
>
>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                           dependency, 16, &fence);
>     if (m < 0)
>         return m;
>     if (m >= 16) {
>         // alloc m and recall;
>     }
>     dev->cs_ioctl(dev, data, filp, dev->implicit_pipeline, dependency, fence);
> }
>
> int timeline_schedule(ptimeline, hwtimeline, timeout,
>                         dependency, mdep, **fence)
> {
>     // allocate fence set hw_timeline and init work
>     // build up list of dependency by looking at list of pending fence in
>     // timeline
> }
>
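
One remark on the "alloc m and recall" step: a minimal sketch of how I
read it, assuming the helper returns the total number of dependencies
even when the caller's array was too small (snprintf style). The names
pipe_timeline, timeline_collect_deps and collect_all_deps are invented
for this illustration, and the fence struct is cut down to one field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Cut-down stand-in for the fence object; only the sequence number is kept. */
struct fence {
    uint64_t seq_num;
};

/* Hypothetical pipeline timeline: just an array of still-pending fences. */
struct pipe_timeline {
    struct fence **pending;
    size_t npending;
};

/*
 * Copy at most max pending fences into deps and return the total number
 * of pending fences, so the caller can tell when deps was too small.
 */
size_t timeline_collect_deps(const struct pipe_timeline *pt,
                             struct fence **deps, size_t max)
{
    size_t n = pt->npending < max ? pt->npending : max;

    for (size_t i = 0; i < n; i++)
        deps[i] = pt->pending[i];
    return pt->npending;
}

/*
 * Try the caller's small array first; if it is too small, allocate an
 * array of the reported size and collect again. Returns stack_buf or a
 * malloc'ed array the caller must free, or NULL on allocation failure.
 */
struct fence **collect_all_deps(const struct pipe_timeline *pt,
                                struct fence **stack_buf, size_t stack_max,
                                size_t *count)
{
    size_t m = timeline_collect_deps(pt, stack_buf, stack_max);
    struct fence **heap;

    *count = m;
    if (m <= stack_max)
        return stack_buf;

    heap = malloc(m * sizeof(*heap));
    if (!heap) {
        *count = 0;
        return NULL;
    }
    timeline_collect_deps(pt, heap, m);
    return heap;
}
```

With that convention the retry is only needed when the count is strictly
larger than the array, and the common case never allocates.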
>
>
> // If device driver schedule job hoping for all dependency to be signaled then
> // it must also call this function with csdata being a copy of what needs to be
> // executed once all dependency are signaled
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>     INITWORK(fence->timedout_work, timeline_missed_schedule_worker)
>     fence->csdata = csdata;
>     schedule_delayed_work(fence->timedout_work, default_timeout)
> }
>
> void timeline_missed_schedule_worker(work)
> {
>     driver = driver_from_fence_hwtimeline(fence)
>
>     // Make sure that each of the hwtimeline dependency will fire irq by
>     // calling a driver function.
>     timeline_wait_for_fence_dependency(fence);
>     driver->execute_cs(driver, fence);
> }
>
> // This function is call by driver code that signal fence (could be call from
> // interrupt context). It is the responsibility of device driver to call that
> // function.
> void timeline_signal(hwtimeline)
> {
>    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>      wakeup(fence->pipetimeline->wait_queue);
>    }
> }
>
>
> Cheers,
> Jérôme