From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jerome Glisse <j.glisse@gmail.com>
Subject: Re: Fence, timeline and android sync points
Date: Wed, 13 Aug 2014 13:07:20 -0400
Message-ID: <20140813170719.GD2666@gmail.com>
References: <20140812221340.GB5746@gmail.com>
 <20140813082822.GO10500@phenom.ffwll.local>
 <20140813133602.GA2666@gmail.com>
 <20140813155420.GG10500@phenom.ffwll.local>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Return-path: <dri-devel-bounces@lists.freedesktop.org>
Received: from mail-qc0-f175.google.com (mail-qc0-f175.google.com
 [209.85.216.175])
 by gabe.freedesktop.org (Postfix) with ESMTP id 0B6F86E5AE
 for <dri-devel@lists.freedesktop.org>; Wed, 13 Aug 2014 10:07:09 -0700 (PDT)
Received: by mail-qc0-f175.google.com with SMTP id w7so32859qcr.20
 for <dri-devel@lists.freedesktop.org>; Wed, 13 Aug 2014 10:07:09 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <20140813155420.GG10500@phenom.ffwll.local>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: daniel.vetter@ffwll.ch, dri-devel@lists.freedesktop.org, bskeggs@redhat.com
List-Id: dri-devel@lists.freedesktop.org

On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> > On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> > > > Hi,
> > > > =

> > > > So i want over the whole fence and sync point stuff as it's becomin=
g a pressing
> > > > issue. I think we first need to agree on what is the problem we wan=
t to solve
> > > > and what would be the requirements to solve it.
> > > > =

> > > > Problem :
> > > >   Explicit synchronization btw different hardware block over a buff=
er object.
> > > > =

> > > > Requirements :
> > > >   Share common infrastructure.
> > > >   Allow optimal hardware command stream scheduling accross hardware=
 block.
> > > >   Allow android sync point to be implemented on top of it.
> > > >   Handle/acknowledge exception (like good old gpu lockup).
> > > >   Minimize driver changes.
> > > > =

> > > > Glossary :
> > > >   hardware timeline: timeline bound to a specific hardware block.
> > > >   pipeline timeline: timeline bound to a userspace rendering pipeli=
ne, each
> > > >                      point on that timeline can be a composite of s=
everal
> > > >                      different hardware pipeline point.
> > > >   pipeline: abstract object representing userspace application grap=
hic pipeline
> > > >             of each of the application graphic operations.
> > > >   fence: specific point in a timeline where synchronization needs t=
o happen.
> > > > =

> > > > =

> > > > So now, current include/linux/fence.h implementation is i believe m=
issing the
> > > > objective by confusing hardware and pipeline timeline and by boltin=
g fence to
> > > > buffer object while what is really needed is true and proper timeli=
ne for both
> > > > hardware and pipeline. But before going further down that road let =
me look at
> > > > things and explain how i see them.
> > > =

> > > fences can be used free-standing and no one forces you to integrate t=
hem
> > > with buffers. We actually plan to go this way with the intel svm stuf=
f.
> > > Ofc for dma-buf the plan is to synchronize using such fences, but tha=
t's
> > > somewhat orthogonal I think. At least you only talk about fences and
> > > timeline and not dma-buf here.
> > >  =

> > > > Current ttm fence have one and a sole purpose, allow synchronizatio=
n for buffer
> > > > object move even thought some driver like radeon slightly abuse it =
and use them
> > > > for things like lockup detection.
> > > > =

> > > > The new fence want to expose an api that would allow some implement=
ation of a
> > > > timeline. For that it introduces callback and some hard requirement=
 on what the
> > > > driver have to expose :
> > > >   enable_signaling
> > > >   [signaled]
> > > >   wait
> > > > =

> > > > Each of those have to do work inside the driver to which the fence =
belongs and
> > > > each of those can be call more or less from unexpected (with restri=
ction like
> > > > outside irq) context. So we end up with thing like :
> > > > =

> > > >  Process 1              Process 2                   Process 3
> > > >  I_A_schedule(fence0)
> > > >                         CI_A_F_B_signaled(fence0)
> > > >                                                     I_A_signal(fenc=
e0)
> > > >                                                     CI_B_F_A_callba=
ck(fence0)
> > > >                         CI_A_F_B_wait(fence0)
> > > > Lexique:
> > > > I_x  in driver x (I_A =3D=3D in driver A)
> > > > CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A =
from driver B)
> > > > =

> > > > So this is an happy mess everyone call everyone and this bound to g=
et messy.
> > > > Yes i know there is all kind of requirement on what happen once a f=
ence is
> > > > signaled. But those requirement only looks like they are trying to =
atone any
> > > > mess that can happen from the whole callback dance.
> > > > =

> > > > While i was too seduced by the whole callback idea long time ago, i=
 think it is
> > > > a highly dangerous path to take where the combinatorial of what cou=
ld happen
> > > > are bound to explode with the increase in the number of players.
> > > > =

> > > > =

> > > > So now back to how to solve the problem we are trying to address. F=
irst i want
> > > > to make an observation, almost all GPU that exist today have a comm=
and ring
> > > > on to which userspace command buffer are executed and inside the co=
mmand ring
> > > > you can do something like :
> > > > =

> > > >   if (condition) execute_command_buffer else skip_command_buffer
> > > > =

> > > > where condition is a simple expression (memory_address cop value)) =
with cop one
> > > > of the generic comparison (=3D=3D, <, >, <=3D, >=3D). I think it is=
 a safe assumption
> > > > that any gpu that slightly matter can do that. Those who can not sh=
ould fix
> > > > there command ring processor.
> > > > =

> > > > =

> > > > With that in mind, i think proper solution is implementing timeline=
 and having
> > > > fence be a timeline object with a way simpler api. For each hardwar=
e timeline
> > > > driver provide a system memory address at which the lastest signale=
d fence
> > > > sequence number can be read. Each fence object is uniquely associat=
ed with
> > > > both a hardware and a pipeline timeline. Each pipeline timeline hav=
e a wait
> > > > queue.
> > > > =

> > > > When scheduling something that require synchronization on a hardwar=
e timeline
> > > > a fence is created and associated with the pipeline timeline and ha=
rdware
> > > > timeline. Other hardware block that need to wait on a fence can use=
 there
> > > > command ring conditional execution to directly check the fence sequ=
ence from
> > > > the other hw block so you do optimistic scheduling. If optimistic s=
cheduling
> > > > fails (which would be reported by hw block specific solution and hi=
dden) then
> > > > things can fallback to software cpu wait inside what could be consi=
dered the
> > > > kernel thread of the pipeline timeline.
> > > > =

> > > > =

> > > > From api point of view there is no inter-driver call. All the drive=
r needs to
> > > > do is wakeup the pipeline timeline wait_queue when things are signa=
led or
> > > > when things go sideway (gpu lockup).
> > > > =

> > > > =

> > > > So how to implement that with current driver ? Well easy. Currently=
 we assume
> > > > implicit synchronization so all we need is an implicit pipeline tim=
eline per
> > > > userspace process (note this do not prevent inter process synchroni=
zation).
> > > > Everytime a command buffer is submitted it is added to the implicit=
 timeline
> > > > with the simple fence object :
> > > > =

> > > > struct fence {
> > > >   struct list_head   list_hwtimeline;
> > > >   struct list_head   list_pipetimeline;
> > > >   struct hw_timeline *hw_timeline;
> > > >   uint64_t           seq_num;
> > > >   work_t             timedout_work;
> > > >   void               *csdata;
> > > > };
> > > > =

> > > > So with set of helper function call by each of the driver command e=
xecution
> > > > ioctl you have the implicit timeline that is properly populated and=
 each
> > > > dirver command execution get the dependency from the implicit timel=
ine.
> > > > =

> > > > =

> > > > Of course to take full advantages of all flexibilities this could o=
ffer we
> > > > would need to allow userspace to create pipeline timeline and to sc=
hedule
> > > > against the pipeline timeline of there choice. We could create file=
 for
> > > > each of the pipeline timeline and have file operation to wait/query
> > > > progress.
> > > > =

> > > > Note that the gpu lockup are considered exceptional event, the impl=
icit
> > > > timeline will probably want to continue on other job on other hardw=
are
> > > > block but the explicit one probably will want to decide wether to c=
ontinue
> > > > or abort or retry without the fault hw block.
> > > > =

> > > > =

> > > > I realize i am late to the party and that i should have taken a ser=
ious
> > > > look at all this long time ago. I apologize for that and if you con=
sider
> > > > this is to late then just ignore me modulo the big warning the craz=
yness
> > > > that callback will introduce an how bad things bound to happen. I a=
m not
> > > > saying that bad things can not happen with what i propose just that
> > > > because everything happen inside the process context that is the one
> > > > asking/requiring synchronization there will be not interprocess ker=
nel
> > > > callback (a callback that was registered by one process and that is=
 call
> > > > inside another process time slice because fence signaling is happen=
ing
> > > > inside this other process time slice).
> > > =

> > > So I read through it all and presuming I understand it correctly your
> > > proposal and what we currently have is about the same. The big differ=
ence
> > > is that you make a timeline a first-class object and move the callback
> > > queue from the fence to the timeline, which requires callers to check=
 the
> > > fence/seqno/whatever themselves instead of pushing that responsibilit=
y to
> > > callers.
> > =

> > No, big difference is that there is no callback thus when waiting for a
> > fence you are either inside the process context that need to wait for it
> > or your inside a kernel thread process context. Which means in both case
> > you can do whatever you want. What i hate about the fence code as it is,
> > is the callback stuff, because you never know into which context fence
> > are signaled then you never know into which context callback are execut=
ed.
> =

> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
> The only difference is that you're allowed to have spurious wakeups and
> need to handle that somehow, so need a separate check function.

No this not how wait queue are implemented, ie wait queue do not callback a
random function from a random driver, it callback a limited set of function
from core linux kernel scheduler so that the process thread that was waiting
and out of the scheduler list is added back and marked as something that
should be schedule. Unless this part of the kernel drasticly changed for the
worse recently.

So this is fundamentaly different, fence as they are now allow random driver
callback and this is bound to get ugly this is bound to lead to one driver
doing something that seems innocuous but turn out to break heavoc when call
from some other driver function.

> =

> > > If you actually mandate that the fence is just a seqno or similar whi=
ch
> > > can be read lockless then I could register my own special callback in=
to
> > > that waitqueue (other stuff than waking up threads is allowed) and fr=
om
> > > hard-irq context check the seqno and readd my own callback if that's =
not
> > > yet happened (that needs to go through some other context for hilarit=
y).
> > =

> > Yes mandating a simple number that can be read from anywhere without lo=
ck,
> > i am pretty sure all hw can write to system page and can write a value
> > alongside their command buffer. So either you hw support reading and te=
sting
> > value either you can do it in atomic context right before scheduling.
> =

> Imo that's a step in the wrong direction since reading a bit of system
> memory or checking a bit of irq-safe spinlock protected data or a register
> shouldn't matter. You just arbitrarily disallow that. And allowing random
> other kernel subsystems to read mmio or page mappings not under their
> control is an idea that freaks /me/ out.

Let me make this crystal clear this must be a valid kernel page that have a
valid kernel mapping for the lifetime of the device. Hence there is no acce=
ss
to mmio space or anything, just a regular kernel page. If can not rely on t=
hat
this is a sad world.

That being said, yes i am aware that some device incapacity to write to such
a page. For those dumb hardware what you need to do is have the irq handler
write to this page on behalf of the hardware. But i would like to know any
hardware than can not write a single dword from its ring buffer.

The only tricky part in this, is when device is unloaded and driver is remo=
ving
itself, it obviously need to synchronize itself with anyone possibly waitin=
g on
it and possibly reading. But truly this is not that hard to solve.

So tell me once the above is clear what kind of scary thing can happen when=
 cpu
or a device _read_ a kernel page ?

> =

> > > So from that pov (presuming I didn't miss anything) your proposal is
> > > identical to what we have, minor some different color choices (like w=
here
> > > to place the callback queue).
> > =

> > No callback is the mantra here, and instead of bolting free living fence
> > to buffer object, they are associated with timeline which means you do =
not
> > need to go over all buffer object to know what you need to wait for.
> =

> Ok, then I guess I didn't understand that part of your the proposal. Can
> you please elaborate a bit more how you want to synchronize mulitple
> drivers accessing a dma-buf object and what piece of state we need to
> associate to the dma-buf to make this happen?

Beauty of it you associate ziltch to the buffer. So for existing cs ioctl w=
here
the implicit synchronization is the rule it enforce mandatory synchronizati=
on
accross all hw timeline on which a buffer shows up :
  for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
    if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
      continue;
    for_each_process_sharing_buffer(buffer, process) {
      schedule_fence(process->implicit_timeline, cmdbuf->fence)
    }
  }

Next time another process use current ioctl with implicit sync it will sync=
h with
the last fence for any shared resource. This might sounds bad but truely as=
 it is
right now this is already how it happens (at least for radeon).

So now, why it's better than include/linux/fence.h is because you can only =
link a
single fence to buffer and even if you are ready to waste memory and allow =
linking
several fence to each buffer you would need userspace to start tracking for=
 each
buffer which fence it cares about. Basicly implementing the android sync po=
int
(at least as i understand android sync point) is hard and error prone if no=
t just
impossible with the fence code.

On the contrary with explicit timeline (ie once you expose timeline to user=
space
and allow to schedule thing against a usertime line) you can do crazy stuff=
. For
a single process for instance :
  (htimeline1, seqno1) =3D dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1, utimeline_=
p0, buf0)
  (htimeline2, seqno2) =3D dev_A_hwblock_Y_cmdbuf_ioctl(cmdbuf2, utimeline_=
p0, buf0)
  (htimeline3, seqno3) =3D dev_B_hwblock_X_cmdbuf_ioctl(cmdbuf3, utimeline_=
p0, buf0)
  (htimeline4, seqno4) =3D dev_C_hwblock_Z_cmdbuf_ioctl(cmdbuf4, utimeline_=
p0, buf0)
  (htimeline1, seqno5) =3D dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf5, utimeline_=
p0, buf0)
  timeline_sync(utimeline, (htimeline1_p0, seqno5), (htimeline3, seqno3))
  (htimeline1, seqno6) =3D dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf6, utimeline_=
p0, buf0)

So what this allow is for cmdbuf6 to wait on cmdbuf1, cmdbuf3 and cmdbuf5 a=
nd just
ignore completely what ever cmdbuf4 is doing as a do not care. Truely only =
user
space knows the kind of interdependcy that can exist. And all cmdbuf1, cmdb=
uf2,
cmdbuf3, cmdbuf4 and cmdbuf5 will/can execute concurrently.

Now for inter-process you need the interprocess to communicate with each ot=
her and
one process can do :
  timeline_inter_sync(utimeline1_p1, utimeline1_p0, (htimeline4, seqno4))
  (htimeline1, seqno1_p1) =3D dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1_p1, utim=
eline_p1, buf0)

Then cmdbuf1_p1 will not executed until cmdbuf5 scheduled on htimeline4 by =
the user
timeline of process 0 is done. This will not care at all about any of the o=
ther
cmd using buf0 in process0.


So now truely i do not see how you can pretend allowing such things, which =
is what
android sync point seems to allow, with associated a fence to buffer.

Hope this make it clear.

Cheers,
J=E9r=F4me


> =

> Thanks, Daniel
> -- =

> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch