From: Jerome Glisse <j.glisse@gmail.com>
Subject: Re: Fence, timeline and android sync points
Date: Tue, 12 Aug 2014 21:23:54 -0400
Message-ID: <20140812234307.GA3001@gmail.com>
References: <20140812221340.GB5746@gmail.com>
In-Reply-To: <20140812221340.GB5746@gmail.com>
To: dri-devel@lists.freedesktop.org, maarten.lankhorst@canonical.com
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com

On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
>
> So I went over the whole fence and sync point stuff, as it is becoming a pressing
> issue. I think we first need to agree on what problem we want to solve
> and what the requirements for solving it are.
>
> Problem:
>   Explicit synchronization between different hardware blocks over a buffer object.
>
> Requirements:
>   Share common infrastructure.
>   Allow optimal hardware command stream scheduling across hardware blocks.
>   Allow Android sync points to be implemented on top of it.
>   Handle/acknowledge exceptions (like the good old GPU lockup).
>   Minimize driver changes.
>
> Glossary:
>   hardware timeline: timeline bound to a specific hardware block.
>   pipeline timeline: timeline bound to a userspace rendering pipeline; each
>                      point on that timeline can be a composite of several
>                      different hardware timeline points.
>   pipeline: abstract object representing the graphics pipeline of a userspace
>             application, i.e. each of the application's graphics operations.
>   fence: specific point in a timeline where synchronization needs to happen.
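
For illustration, a pipeline timeline point that is a composite of several hardware timeline points could be modeled as below. This is a userspace sketch under the assumption that each hardware timeline exposes its last signaled sequence number as a 64-bit value; `hw_point`, `pipeline_point`, `pipeline_point_signaled()` and this layout of `hw_timeline` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware timeline: last_signaled mirrors the sequence number
 * the hardware block writes to memory when a fence completes. */
struct hw_timeline {
    volatile uint64_t last_signaled;
};

/* One point on a hardware timeline. */
struct hw_point {
    struct hw_timeline *hw;
    uint64_t seq;
};

/* A pipeline timeline point may combine several hardware points. */
struct pipeline_point {
    const struct hw_point *parts;
    size_t nparts;
};

static bool hw_point_signaled(const struct hw_point *p)
{
    return p->hw->last_signaled >= p->seq;
}

/* A composite point is reached once every hardware point it covers is. */
static bool pipeline_point_signaled(const struct pipeline_point *pp)
{
    assert(pp != NULL);
    for (size_t i = 0; i < pp->nparts; i++)
        if (!hw_point_signaled(&pp->parts[i]))
            return false;
    return true;
}
```

A point made of a gfx point at seq 4 and a DMA point at seq 3 only reports signaled once both counters have reached those values.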
>
>
> So now, the current include/linux/fence.h implementation is, I believe, missing
> the objective by confusing hardware and pipeline timelines and by bolting fences
> to buffer objects, while what is really needed is true and proper timelines for
> both hardware and pipeline. But before going further down that road, let me look
> at things and explain how I see them.
>
> Current TTM fences have one sole purpose: allowing synchronization for buffer
> object moves, even though some drivers like radeon slightly abuse them and use
> them for things like lockup detection.
>
> The new fence wants to expose an API that would allow some implementation of a
> timeline. For that it introduces callbacks and some hard requirements on what
> the driver has to expose:
>   enable_signaling
>   [signaled]
>   wait
>
> Each of those has to do work inside the driver to which the fence belongs, and
> each of those can be called from more or less unexpected contexts (with
> restrictions such as being outside IRQ context). So we end up with things like:
>
>  Process 1              Process 2                   Process 3
>  I_A_schedule(fence0)
>                         CI_A_F_B_signaled(fence0)
>                                                     I_A_signal(fence0)
>                                                     CI_B_F_A_callback(fence0)
>                         CI_A_F_B_wait(fence0)
> Legend:
> I_x  in driver x (I_A == in driver A)
> CI_x_F_y call into driver x from driver y (CI_A_F_B call into driver A from driver B)
>
> So this is a happy mess: everyone calls everyone, and this is bound to get messy.
> Yes, I know there are all kinds of requirements on what happens once a fence is
> signaled. But those requirements only look like they are trying to atone for any
> mess that can happen from the whole callback dance.
>
> While I too was seduced by the whole callback idea a long time ago, I think it
> is a highly dangerous path to take, where the combinatorics of what could happen
> are bound to explode as the number of players increases.
>
>
> So now back to how to solve the problem we are trying to address. First I want
> to make an observation: almost all GPUs that exist today have a command ring
> on which userspace command buffers are executed, and inside the command ring
> you can do something like:
>
>   if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
> that any GPU that matters even slightly can do that. Those that cannot should
> fix their command ring processor.
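
A CPU-side model of that conditional execution, as a sketch: the `cond_packet` layout and the `cop` enum are hypothetical and do not match any real GPU packet format. It assumes the ring processor reads a 64-bit value from memory and compares it with an immediate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical comparison operators ("cop" in the text). */
enum cop { COP_EQ, COP_LT, COP_GT, COP_LE, COP_GE };

/* Hypothetical conditional-execution packet: compare *addr against value. */
struct cond_packet {
    const volatile uint64_t *addr;   /* memory the ring processor reads */
    enum cop op;
    uint64_t value;
};

/* True means execute_command_buffer, false means skip_command_buffer. */
static bool cond_eval(const struct cond_packet *p)
{
    uint64_t mem = *p->addr;

    switch (p->op) {
    case COP_EQ: return mem == p->value;
    case COP_LT: return mem <  p->value;
    case COP_GT: return mem >  p->value;
    case COP_LE: return mem <= p->value;
    case COP_GE: return mem >= p->value;
    }
    assert(!"unknown cop");
    return false;
}
```

A block waiting on a fence from another block would use `COP_GE` with `value` set to the fence's sequence number, so the guarded command buffer only runs once the producer has written at least that sequence number.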
>
>
> With that in mind, I think the proper solution is to implement timelines and
> have a fence be a timeline object with a much simpler API. For each hardware
> timeline, the driver provides a system memory address at which the latest
> signaled fence sequence number can be read. Each fence object is uniquely
> associated with both a hardware and a pipeline timeline. Each pipeline timeline
> has a wait queue.
>
> When scheduling something that requires synchronization on a hardware timeline,
> a fence is created and associated with the pipeline timeline and the hardware
> timeline. Other hardware blocks that need to wait on a fence can use their
> command ring's conditional execution to directly check the fence sequence from
> the other hardware block, so you do optimistic scheduling. If optimistic
> scheduling fails (which would be reported by a hardware-block-specific solution
> and hidden), then things can fall back to a software CPU wait inside what could
> be considered the kernel thread of the pipeline timeline.
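
The optimistic-scheduling decision can be modeled as follows. This sketch assumes a job guarded by several dependencies and models the moment the ring processor evaluates them; `struct dep`, `schedule_job()` and the result names are hypothetical, and the CPU-wait fallback is only represented by its return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One dependency: the producer's published seq and the seq we need. */
struct dep {
    const volatile uint64_t *signaled_seq;
    uint64_t seq;
};

enum sched_result { RAN_OPTIMISTICALLY, QUEUED_FOR_CPU_WAIT };

/* What the ring's conditional execution checks for each dependency. */
static bool deps_signaled(const struct dep *deps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (*deps[i].signaled_seq < deps[i].seq)
            return false;
    return true;
}

static enum sched_result schedule_job(const struct dep *deps, size_t n)
{
    assert(deps != NULL || n == 0);
    /* Optimistic path: the guarded command buffer runs on the GPU. */
    if (deps_signaled(deps, n))
        return RAN_OPTIMISTICALLY;
    /* Skipped by the ring: the pipeline timeline's kernel thread takes
     * over and does a CPU wait before resubmitting. */
    return QUEUED_FOR_CPU_WAIT;
}
```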
>
>
> From an API point of view there is no inter-driver call. All the driver needs
> to do is wake up the pipeline timeline's wait queue when things are signaled
> or when things go sideways (GPU lockup).
>
>
> So how to implement that with current drivers? Well, easy. Currently we assume
> implicit synchronization, so all we need is an implicit pipeline timeline per
> userspace process (note this does not prevent inter-process synchronization).
> Every time a command buffer is submitted, it is added to the implicit timeline
> with the simple fence object:
>
> struct fence {
>   struct list_head      list_hwtimeline;    /* link in the hw timeline */
>   struct list_head      list_pipetimeline;  /* link in the pipeline timeline */
>   struct hw_timeline    *hw_timeline;
>   struct pipe_timeline  *pipe_timeline;
>   uint64_t              seq_num;
>   struct delayed_work   timedout_work;
>   void                  *csdata;
> };
>
> So with a set of helper functions called by each driver's command execution
> ioctl, the implicit timeline is properly populated and each driver's command
> execution gets its dependencies from the implicit timeline.
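
A userspace sketch of those helpers, under a few assumptions not spelled out above: a singly linked list stands in for list_head, sequence numbers come from a per-hardware-timeline counter, and a dependency is any unsignaled fence on the implicit timeline that belongs to a different hardware timeline (fences on the same ring are assumed to be ordered by the ring itself). `implicit_submit()` and `implicit_deps()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct hw_timeline {
    const volatile uint64_t *signaled_seq;
    uint64_t next_seq;
};

struct fence {
    struct fence *next;              /* next (newer) fence on the pipeline */
    struct hw_timeline *hw;
    uint64_t seq;
};

/* Implicit pipeline timeline: one per process, oldest fence first. */
struct implicit_timeline {
    struct fence *head, *tail;
};

/* Collect unsignaled fences from other hw timelines into deps[] (up to max);
 * returns how many there are in total, which may exceed max. */
static size_t implicit_deps(const struct implicit_timeline *tl,
                            const struct hw_timeline *hw,
                            struct fence **deps, size_t max)
{
    size_t n = 0;

    assert(tl != NULL);
    for (struct fence *f = tl->head; f; f = f->next) {
        if (f->hw == hw || *f->hw->signaled_seq >= f->seq)
            continue;        /* same ring orders itself, or already done */
        if (n < max)
            deps[n] = f;
        n++;
    }
    return n;
}

/* Called on every submission: append a fresh fence to the timeline. */
static struct fence *implicit_submit(struct implicit_timeline *tl,
                                     struct hw_timeline *hw)
{
    struct fence *f = calloc(1, sizeof(*f));

    if (f == NULL)
        return NULL;
    f->hw = hw;
    f->seq = ++hw->next_seq;
    if (tl->tail)
        tl->tail->next = f;
    else
        tl->head = f;
    tl->tail = f;
    return f;
}
```

`implicit_deps()` returns the total count even when it exceeds `max`, which is what lets a caller with a fixed-size array detect that it needs a bigger one.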
>
>
> Of course, to take full advantage of all the flexibility this could offer, we
> would need to allow userspace to create pipeline timelines and to schedule
> against the pipeline timeline of their choice. We could create a file for
> each pipeline timeline and have file operations to wait on or query its
> progress.
>
> Note that GPU lockups are considered exceptional events: the implicit
> timeline will probably want to continue with other jobs on other hardware
> blocks, but an explicit one will probably want to decide whether to continue,
> abort, or retry without the faulty hardware block.
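
One way to express that split, as a sketch: the driver's only action on a lockup is to wake the pipeline timeline, and the policy is decided by the woken waiter. The `lockup_action` values and the `implicit`/`explicit_choice` fields are hypothetical; the wait queue is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>

enum lockup_action { LOCKUP_CONTINUE, LOCKUP_ABORT, LOCKUP_RETRY_ELSEWHERE };

struct pipe_timeline {
    bool implicit;                    /* per-process implicit timeline? */
    enum lockup_action explicit_choice; /* chosen by userspace (hypothetical) */
    bool woken;                       /* stand-in for the wait queue */
};

/* Driver side: on a lockup it only wakes the wait queue, nothing else. */
static void driver_report_lockup(struct pipe_timeline *tl)
{
    tl->woken = true;
}

/* Pipeline side, run by the woken waiter in its own process context. */
static enum lockup_action on_lockup(const struct pipe_timeline *tl)
{
    assert(tl->woken);
    if (tl->implicit)
        return LOCKUP_CONTINUE;       /* keep going on other hw blocks */
    return tl->explicit_choice;       /* continue, abort or retry elsewhere */
}
```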
>
>
> I realize I am late to the party and that I should have taken a serious
> look at all this a long time ago. I apologize for that, and if you consider
> it is too late then just ignore me, modulo the big warning about the craziness
> that callbacks will introduce and how bad things are bound to happen. I am not
> saying that bad things cannot happen with what I propose, just that
> because everything happens inside the process context of the one
> asking for/requiring synchronization, there will be no inter-process kernel
> callbacks (a callback that was registered by one process and that is called
> inside another process's time slice because fence signaling is happening
> inside that other process's time slice).
>
>
> Pseudo code for explicitness:
>
> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>    struct fence *dependency[16], *fence;
>    int m;
>
>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                          dependency, 16, &fence);
>    if (m < 0)
>      return m;
>    if (m > 16) {
>        // more dependencies than fit: allocate room for m and call again
>    }
>    return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
>                         dependency, fence);
> }
>
> int timeline_schedule(struct pipe_timeline *ptimeline,
>                       struct hw_timeline *hwtimeline,
>                       struct fence **dependency, int mdep,
>                       struct fence **fence)
> {
>    // allocate fence, set hw_timeline and init work
>    // build up list of dependencies by looking at list of pending fences in
>    // timeline; return the total count, which may exceed mdep
> }
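
The "allocate m and call again" step of the wrapper can be sketched in userspace like this. `pending_deps()` is a hypothetical stand-in for timeline_schedule() that fills at most `max` entries but always returns the total; `collect_all()` retries with a heap buffer only when the total exceeds the 16 on-stack slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEP_STACK 16   /* matches the on-stack array in the wrapper */

/* Stand-in for timeline_schedule(): every pending seq newer than 'done' is
 * a dependency. Fills at most 'max' entries and returns the total count. */
static size_t pending_deps(const uint64_t *pending, size_t npending,
                           uint64_t done, uint64_t *deps, size_t max)
{
    size_t m = 0;

    for (size_t i = 0; i < npending; i++) {
        if (pending[i] <= done)
            continue;
        if (m < max)
            deps[m] = pending[i];
        m++;
    }
    return m;
}

/* Try the caller's fixed buffer first; only when m > DEP_STACK allocate room
 * for m and call again. Returns buf, or a heap block the caller must free,
 * or NULL on allocation failure. */
static uint64_t *collect_all(const uint64_t *pending, size_t npending,
                             uint64_t done, uint64_t *buf, size_t *count)
{
    assert(buf != NULL && count != NULL);
    size_t m = pending_deps(pending, npending, done, buf, DEP_STACK);
    if (m <= DEP_STACK) {
        *count = m;
        return buf;
    }
    uint64_t *big = malloc(m * sizeof(*big));
    if (big == NULL)
        return NULL;
    *count = pending_deps(pending, npending, done, big, m);
    return big;
}
```

In the kernel the set of pending fences can grow between the two calls, so a real implementation would probably loop until the returned count fits rather than trust the second call.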
>
>
> // If a device driver schedules a job hoping for all dependencies to be
> // signaled, then it must also call this function with csdata being a copy of
> // what needs to be executed once all dependencies are signaled.
> void timeline_missed_schedule(struct pipe_timeline *timeline,
>                               struct fence *fence, void *csdata)
> {
>    INIT_DELAYED_WORK(&fence->timedout_work, timeline_missed_schedule_worker);
>    fence->csdata = csdata;
>    schedule_delayed_work(&fence->timedout_work, default_timeout);
> }
>
> void timeline_missed_schedule_worker(struct work_struct *work)
> {
>    struct fence *fence = container_of(to_delayed_work(work), struct fence,
>                                       timedout_work);
>    struct driver *driver = driver_from_fence_hwtimeline(fence);
>
>    // Make sure that each of the hwtimeline dependencies will fire an IRQ by
>    // calling a driver function.
>    timeline_wait_for_fence_dependency(fence);
>    driver->execute_cs(driver, fence);
> }
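
A userspace model of the missed-schedule path, assuming a single dependency and a fixed-size copy of the command stream; `struct deferred_cs`, `missed_schedule()` and `missed_schedule_worker()` are hypothetical. It shows why csdata has to be a copy: the submitter may reuse its buffer before the deferred job runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical deferred submission, one dependency for brevity. */
struct deferred_cs {
    const volatile uint64_t *dep_signaled;
    uint64_t dep_seq;
    char csdata[64];          /* private copy of the command stream */
    bool executed;
};

/* Models timeline_missed_schedule(): keep a copy of what must run later. */
static void missed_schedule(struct deferred_cs *d,
                            const volatile uint64_t *sig, uint64_t seq,
                            const char *cs)
{
    size_t len = strlen(cs);

    assert(len < sizeof(d->csdata));
    d->dep_signaled = sig;
    d->dep_seq = seq;
    memcpy(d->csdata, cs, len + 1);
    d->executed = false;
}

/* Models the worker: returns the executed stream, or NULL while the
 * dependency is pending (real code would sleep on the wait queue). */
static const char *missed_schedule_worker(struct deferred_cs *d)
{
    if (d->executed || *d->dep_signaled < d->dep_seq)
        return NULL;
    d->executed = true;       /* stand-in for driver->execute_cs() */
    return d->csdata;
}
```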
>
> // This function is called by driver code that signals fences (it could be
> // called from interrupt context). It is the device driver's responsibility to
> // call that function.
> void timeline_signal(struct hw_timeline *hwtimeline)
> {
>   struct fence *fence;
>
>   list_for_each_entry(fence, &hwtimeline->fences, list_hwtimeline) {
>     wake_up(&fence->pipe_timeline->wait_queue);
>   }
> }
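
A possible refinement of timeline_signal(), as a sketch: instead of waking the pipeline of every pending fence, wake only those whose sequence number the hardware has reached and retire them. This assumes fences on one hardware timeline complete in order, keeps the pending list oldest first, and models wake_up() with a counter; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pipe_timeline {
    unsigned wakeups;                 /* stand-in for the wait queue */
};

struct fence {
    struct fence *next;               /* next pending fence on the hw list */
    struct pipe_timeline *pipe;
    uint64_t seq;
};

struct hw_timeline {
    uint64_t signaled_seq;            /* value the hardware last wrote */
    struct fence *pending;            /* oldest first */
};

/* Wake the pipeline of every fence the hardware has reached and drop those
 * fences from the pending list. Returns how many fences were retired. */
static size_t timeline_signal(struct hw_timeline *hw)
{
    size_t retired = 0;

    assert(hw != NULL);
    while (hw->pending && hw->pending->seq <= hw->signaled_seq) {
        struct fence *f = hw->pending;
        hw->pending = f->next;
        f->pipe->wakeups++;           /* stand-in for wake_up() */
        retired++;
    }
    return retired;
}
```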

Btw, as an extra note: because of the implicit timelines, any shared object
scheduled on a hw timeline must add a fence to all the implicit timelines where
this object exists.
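
A sketch of what that implies, assuming each shared buffer object records the implicit timelines it is shared with and that a fence can be represented by its sequence number; `struct bo`, `bo_schedule()` and the fixed capacities are hypothetical. The fence is recorded on the timelines, not on the object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SHARERS 4
#define MAX_FENCES  8

/* Per-process implicit timeline: just the fence seqs it must wait for. */
struct implicit_timeline {
    uint64_t fences[MAX_FENCES];
    size_t nfences;
};

/* A shared buffer object knows which implicit timelines it lives on,
 * but holds no fence pointer itself. */
struct bo {
    struct implicit_timeline *sharers[MAX_SHARERS];
    size_t nsharers;
};

/* Scheduling work on bo at hw sequence 'seq' adds the fence to every
 * implicit timeline sharing the object. Returns -1 if one is full. */
static int bo_schedule(const struct bo *bo, uint64_t seq)
{
    assert(bo != NULL);
    /* Check capacity first so a failure leaves every timeline untouched. */
    for (size_t i = 0; i < bo->nsharers; i++)
        if (bo->sharers[i]->nfences == MAX_FENCES)
            return -1;
    for (size_t i = 0; i < bo->nsharers; i++) {
        struct implicit_timeline *tl = bo->sharers[i];
        tl->fences[tl->nfences++] = seq;
    }
    return 0;
}
```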

Also there is no need to have a fence pointer per object.

>
>
> Cheers,
> Jérôme