From: Christian König
Subject: Re: Fence, timeline and android sync points
Date: Wed, 13 Aug 2014 09:59:26 +0200
Message-ID: <53EB1ADE.5060104@vodafone.de>
In-Reply-To: <20140812221340.GB5746@gmail.com>
To: Jerome Glisse, dri-devel@lists.freedesktop.org, maarten.lankhorst@canonical.com
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com

Hi Jerome,

first of all, that finally sounds like somebody is starting to draw the
whole picture for me.

So far all I have seen was a bunch of specialized requirements and some
not-so-obvious design decisions based on those requirements.

So thanks a lot for finally summarizing the requirements from a
high-level view. I fully agree with your analysis of the current fence
design and the downsides of that API.

Apart from that I also have some comments / requirements that hopefully
can be taken into account as well:

>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                      point on that timeline can be a composite of several
>                      different hardware pipeline point.
>   pipeline: abstract object representing userspace application graphic pipeline
>             of each of the application graphic operations.
In the long term, a requirement for the driver for AMD GFX hardware is
that instead of a fixed pipeline timeline we need a somewhat more
flexible model where concurrent execution on different hardware engines
is possible as well.

So the requirement is that you can do things like submitting a 3D job A,
a DMA job B, a VCE job C and another 3D job D that are executed like this:

     A
    / \
   B   C
    \ /
     D

(Let's just hope that looks as good on your mail client as it looked for
me).

My current thinking is that we avoid having a pipeline object in the
kernel and instead let userspace specify explicitly which fence we want
to synchronize to, as long as everything stays within the same client.
As soon as any buffer is shared between clients, the kernel would need
to fall back to implicit synchronization to allow backward compatibility
with DRI2/3.

>   if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matter can do that. Those who can not should fix
> their command ring processor.

At least for some engines on AMD hardware that isn't possible (UVD, VCE
and to some extent DMA as well), but I don't see any reason why we
shouldn't be able to use software-based scheduling on those engines by
default. So this isn't really a problem, just an additional comment to
keep in mind.

Regards,
Christian.

On 13.08.2014 at 00:13, Jerome Glisse wrote:
> Hi,
>
> So I went over the whole fence and sync point stuff as it's becoming a pressing
> issue. I think we first need to agree on what is the problem we want to solve
> and what would be the requirements to solve it.
>
> Problem :
>   Explicit synchronization between different hardware blocks over a buffer object.
>
> Requirements :
>   Share common infrastructure.
>   Allow optimal hardware command stream scheduling across hardware blocks.
>   Allow android sync point to be implemented on top of it.
>   Handle/acknowledge exception (like good old gpu lockup).
>   Minimize driver changes.
>
> Glossary :
>   hardware timeline: timeline bound to a specific hardware block.
>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                      point on that timeline can be a composite of several
>                      different hardware pipeline point.
>   pipeline: abstract object representing userspace application graphic pipeline
>             of each of the application graphic operations.
>   fence: specific point in a timeline where synchronization needs to happen.
>
>
> So now, the current include/linux/fence.h implementation is, I believe, missing
> the objective by confusing hardware and pipeline timelines and by bolting fences
> to buffer objects, while what is really needed is a true and proper timeline for
> both hardware and pipeline. But before going further down that road let me look
> at things and explain how I see them.
>
> The current ttm fence has one sole purpose: allowing synchronization for buffer
> object moves, even though some drivers like radeon slightly abuse it and use it
> for things like lockup detection.
>
> The new fence wants to expose an API that would allow some implementation of a
> timeline. For that it introduces callbacks and some hard requirements on what
> the driver has to expose :
>   enable_signaling
>   [signaled]
>   wait
>
> Each of those has to do work inside the driver to which the fence belongs, and
> each of those can be called from more or less unexpected contexts (with
> restrictions like outside irq).
> So we end up with things like :
>
>  Process 1              Process 2                   Process 3
>  I_A_schedule(fence0)
>                         CI_A_F_B_signaled(fence0)
>                                                     I_A_signal(fence0)
>                                                     CI_B_F_A_callback(fence0)
>                         CI_A_F_B_wait(fence0)
> Legend:
>  I_x       in driver x (I_A == in driver A)
>  CI_x_F_y  call in driver x from driver y (CI_A_F_B == call in driver A from driver B)
>
> So this is a happy mess: everyone calls everyone, and this is bound to get messy.
> Yes, I know there are all kinds of requirements on what happens once a fence is
> signaled. But those requirements only look like they are trying to atone for any
> mess that can happen from the whole callback dance.
>
> While I too was seduced by the whole callback idea a long time ago, I think it
> is a highly dangerous path to take, where the combinatorics of what could happen
> are bound to explode with the increase in the number of players.
>
>
> So now back to how to solve the problem we are trying to address. First I want
> to make an observation: almost all GPUs that exist today have a command ring
> onto which userspace command buffers are executed, and inside the command ring
> you can do something like :
>
>   if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
> that any GPU that slightly matters can do that. Those that cannot should fix
> their command ring processor.
>
>
> With that in mind, I think the proper solution is implementing timelines and
> having fence be a timeline object with a way simpler API. For each hardware
> timeline the driver provides a system memory address at which the latest
> signaled fence sequence number can be read. Each fence object is uniquely
> associated with both a hardware and a pipeline timeline. Each pipeline timeline
> has a wait queue.
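
A minimal C sketch of the sequence-number check described in the quoted paragraph above, under stated assumptions. The trimmed `struct hw_timeline` and `struct fence`, the `signaled_seq` field, and the helpers `fence_is_signaled()` and `can_execute_now()` are hypothetical names for illustration only; they are not part of the proposal or of any existing kernel API. The sketch just mirrors the `(memory_address >= value)` comparison on the CPU side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a hardware timeline: the driver publishes the
 * latest signaled sequence number at a system memory address. */
struct hw_timeline {
    const volatile uint64_t *signaled_seq; /* written by the hw block or driver */
};

/* Hypothetical, trimmed-down fence: only the fields this check needs. */
struct fence {
    struct hw_timeline *hw_timeline;
    uint64_t seq_num;
};

/* Same comparison a command ring could evaluate:
 * (memory_address >= value) means the fence has signaled. */
bool fence_is_signaled(const struct fence *fence)
{
    assert(fence != NULL && fence->hw_timeline != NULL);
    return *fence->hw_timeline->signaled_seq >= fence->seq_num;
}

/* Optimistic scheduling decision: true if every dependency has already
 * signaled; false means the caller has to fall back to waiting. */
bool can_execute_now(const struct fence *const *deps, size_t ndeps)
{
    for (size_t i = 0; i < ndeps; i++) {
        if (!fence_is_signaled(deps[i]))
            return false;
    }
    return true;
}
```

When `can_execute_now()` returns false, a caller would presumably take the software CPU wait path that the proposal describes as the fallback.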
>
> When scheduling something that requires synchronization on a hardware timeline,
> a fence is created and associated with the pipeline timeline and the hardware
> timeline. Other hardware blocks that need to wait on a fence can use their
> command ring conditional execution to directly check the fence sequence from
> the other hw block, so you do optimistic scheduling. If optimistic scheduling
> fails (which would be reported by a hw block specific solution and hidden) then
> things can fall back to a software CPU wait inside what could be considered the
> kernel thread of the pipeline timeline.
>
>
> From the API point of view there is no inter-driver call. All the driver needs
> to do is wake up the pipeline timeline wait_queue when things are signaled or
> when things go sideways (gpu lockup).
>
>
> So how to implement that with the current drivers? Well, easy. Currently we
> assume implicit synchronization, so all we need is an implicit pipeline timeline
> per userspace process (note this does not prevent inter-process synchronization).
> Every time a command buffer is submitted it is added to the implicit timeline
> with the simple fence object :
>
> struct fence {
>   struct list_head   list_hwtimeline;
>   struct list_head   list_pipetimeline;
>   struct hw_timeline *hw_timeline;
>   uint64_t           seq_num;
>   work_t             timedout_work;
>   void               *csdata;
> };
>
> So with a set of helper functions called by each driver's command execution
> ioctl, you have the implicit timeline properly populated, and each driver
> command execution gets its dependencies from the implicit timeline.
>
>
> Of course, to take full advantage of all the flexibility this could offer we
> would need to allow userspace to create pipeline timelines and to schedule
> against the pipeline timeline of their choice. We could create a file for
> each of the pipeline timelines and have file operations to wait/query
> progress.
>
> Note that GPU lockups are considered exceptional events; the implicit
> timeline will probably want to continue with other jobs on other hardware
> blocks, but the explicit one will probably want to decide whether to continue,
> abort, or retry without the faulty hw block.
>
>
> I realize I am late to the party and that I should have taken a serious
> look at all this a long time ago. I apologize for that, and if you consider
> this is too late then just ignore me, modulo the big warning about the craziness
> that callbacks will introduce and how bad things are bound to happen. I am not
> saying that bad things cannot happen with what I propose, just that
> because everything happens inside the process context that is the one
> asking/requiring synchronization, there will be no interprocess kernel
> callback (a callback that was registered by one process and that is called
> inside another process's time slice because fence signaling is happening
> inside this other process's time slice).
>
>
> Pseudo code for explicitness :
>
> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>     struct fence *dependency[16], *fence;
>     int m;
>
>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                           dependency, 16, &fence);
>     if (m < 0)
>         return m;
>     if (m >= 16) {
>         // alloc m and recall;
>     }
>     return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
>                          dependency, fence);
> }
>
> int timeline_schedule(ptimeline, hwtimeline, timeout,
>                       dependency, mdep, **fence)
> {
>     // allocate fence, set hw_timeline and init work
>     // build up list of dependencies by looking at the list of pending fences
>     // in the timeline
> }
>
>
>
> // If the device driver schedules a job hoping for all dependencies to be
> // signaled, then it must also call this function with csdata being a copy of
> // what needs to be executed once all dependencies are signaled
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>     INITWORK(fence->timedout_work, timeline_missed_schedule_worker);
>     fence->csdata
>         = csdata;
>     schedule_delayed_work(fence->timedout_work, default_timeout);
> }
>
> void timeline_missed_schedule_worker(work)
> {
>     // recover the fence that embeds this work item
>     fence = container_of(work, struct fence, timedout_work);
>     driver = driver_from_fence_hwtimeline(fence);
>
>     // Make sure that each of the hwtimeline dependencies will fire an irq by
>     // calling a driver function.
>     timeline_wait_for_fence_dependency(fence);
>     driver->execute_cs(driver, fence);
> }
>
> // This function is called by driver code that signals fences (could be called
> // from interrupt context). It is the responsibility of the device driver to
> // call that function.
> void timeline_signal(hwtimeline)
> {
>     for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>         wakeup(fence->pipetimeline->wait_queue);
>     }
> }
>
>
> Cheers,
> Jérôme
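
A rough, self-contained C sketch of how the `timeline_schedule()` step from the quoted pseudo code might collect dependencies. Everything beyond the names used in the pseudo code is an assumption made for illustration: the fixed-size `pending` array, `MAX_PENDING`, `next_seq`, `struct pipe_timeline`, and `fence_signaled()` are hypothetical, and the `timeout` argument is left out. The return convention here is also a guess: the function returns the number of unsignaled dependencies found, and when that count exceeds `mdep` it creates no fence so the caller can retry with a larger array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64 /* arbitrary bound chosen for this sketch */

struct hw_timeline {
    uint64_t signaled_seq; /* latest signaled sequence number */
    uint64_t next_seq;     /* last sequence number handed out */
};

struct fence {
    struct hw_timeline *hw_timeline;
    uint64_t seq_num;
};

/* Hypothetical pipeline timeline: a flat list of fences in submission
 * order. Signaled fences are never reclaimed in this sketch. */
struct pipe_timeline {
    struct fence pending[MAX_PENDING];
    size_t npending;
};

bool fence_signaled(const struct fence *f)
{
    return f->hw_timeline->signaled_seq >= f->seq_num;
}

/* Collect every still-unsignaled fence on the pipeline timeline as a
 * dependency, then append a new fence bound to hw. Returns the number of
 * dependencies found, or -1 if the timeline is full. */
int timeline_schedule(struct pipe_timeline *pt, struct hw_timeline *hw,
                      struct fence **dependency, int mdep,
                      struct fence **fence)
{
    int m = 0;

    assert(pt != NULL && hw != NULL && fence != NULL);

    for (size_t i = 0; i < pt->npending; i++) {
        struct fence *f = &pt->pending[i];

        if (fence_signaled(f))
            continue;
        if (m < mdep)
            dependency[m] = f;
        m++;
    }
    if (m > mdep)
        return m; /* array too small: caller reallocates and calls again */
    if (pt->npending == MAX_PENDING)
        return -1;

    struct fence *nf = &pt->pending[pt->npending++];
    nf->hw_timeline = hw;
    nf->seq_num = ++hw->next_seq;
    *fence = nf;
    return m;
}
```

A driver ioctl wrapper in the style of `drm_cs_ioctl_wrapper()` would call this once per submission and pass the collected `dependency` array on to its command submission path.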