From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jerome Glisse
Subject: Re: Fence, timeline and android sync points
Date: Tue, 12 Aug 2014 21:23:54 -0400
Message-ID: <20140812234307.GA3001@gmail.com>
References: <20140812221340.GB5746@gmail.com>
In-Reply-To: <20140812221340.GB5746@gmail.com>
To: dri-devel@lists.freedesktop.org, maarten.lankhorst@canonical.com
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com
List-Id: dri-devel@lists.freedesktop.org

On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
>
> So I went over the whole fence and sync point stuff as it's becoming a pressing
> issue. I think we first need to agree on what problem we want to solve
> and what the requirements to solve it would be.
>
> Problem :
>   Explicit synchronization between different hardware blocks over a buffer object.
>
> Requirements :
>   Share common infrastructure.
>   Allow optimal hardware command stream scheduling across hardware blocks.
>   Allow android sync point to be implemented on top of it.
>   Handle/acknowledge exception (like good old gpu lockup).
>   Minimize driver changes.
>
> Glossary :
>   hardware timeline: timeline bound to a specific hardware block.
>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                      point on that timeline can be a composite of several
>                      different hardware pipeline points.
>   pipeline: abstract object representing userspace application graphic pipeline
>             of each of the application graphic operations.
>   fence: specific point in a timeline where synchronization needs to happen.
>
>
> So now, the current include/linux/fence.h implementation is, I believe, missing the
> objective by confusing hardware and pipeline timelines and by bolting fences to
> buffer objects, while what is really needed is a true and proper timeline for both
> hardware and pipeline. But before going further down that road let me look at
> things and explain how I see them.
>
> Current ttm fences have one sole purpose: allow synchronization for buffer
> object moves, even though some drivers like radeon slightly abuse them and use them
> for things like lockup detection.
>
> The new fence wants to expose an api that would allow some implementation of a
> timeline. For that it introduces callbacks and some hard requirements on what the
> driver has to expose :
>   enable_signaling
>   [signaled]
>   wait
>
> Each of those has to do work inside the driver to which the fence belongs and
> each of those can be called from more or less unexpected contexts (with restrictions
> like outside irq). So we end up with things like :
>
>  Process 1              Process 2                   Process 3
>  I_A_schedule(fence0)
>                         CI_A_F_B_signaled(fence0)
>                                                     I_A_signal(fence0)
>                                                     CI_B_F_A_callback(fence0)
>                         CI_A_F_B_wait(fence0)
> Legend:
> I_x      in driver x (I_A == in driver A)
> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>
> So this is a happy mess: everyone calls everyone and this is bound to get messy.
> Yes, I know there are all kinds of requirements on what happens once a fence is
> signaled. But those requirements only look like they are trying to atone for any
> mess that can happen from the whole callback dance.
>
> While I too was seduced by the whole callback idea a long time ago, I think it is
> a highly dangerous path to take, where the combinatorics of what could happen
> are bound to explode with the increase in the number of players.
>
>
> So now back to how to solve the problem we are trying to address. First I want
> to make an observation: almost all GPUs that exist today have a command ring
> on which userspace command buffers are executed, and inside the command ring
> you can do something like :
>
>   if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one
> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matters can do that. Those that can not should fix
> their command ring processor.
>
>
> With that in mind, I think the proper solution is implementing timelines and having
> the fence be a timeline object with a way simpler api. For each hardware timeline
> the driver provides a system memory address at which the latest signaled fence
> sequence number can be read. Each fence object is uniquely associated with
> both a hardware and a pipeline timeline. Each pipeline timeline has a wait
> queue.
>
> When scheduling something that requires synchronization on a hardware timeline
> a fence is created and associated with the pipeline timeline and hardware
> timeline. Other hardware blocks that need to wait on a fence can use their
> command ring conditional execution to directly check the fence sequence from
> the other hw block, so you do optimistic scheduling. If optimistic scheduling
> fails (which would be reported by a hw block specific solution and hidden) then
> things can fall back to a software cpu wait inside what could be considered the
> kernel thread of the pipeline timeline.
>
>
> From an api point of view there is no inter-driver call.
> All the driver needs to
> do is wake up the pipeline timeline wait_queue when things are signaled or
> when things go sideways (gpu lockup).
>
>
> So how to implement that with current drivers? Well, easy. Currently we assume
> implicit synchronization, so all we need is an implicit pipeline timeline per
> userspace process (note this does not prevent inter-process synchronization).
> Every time a command buffer is submitted it is added to the implicit timeline
> with the simple fence object :
>
>   struct fence {
>     struct list_head    list_hwtimeline;
>     struct list_head    list_pipetimeline;
>     struct hw_timeline  *hw_timeline;
>     uint64_t            seq_num;
>     work_t              timedout_work;
>     void                *csdata;
>   };
>
> So with a set of helper functions called by each of the driver command execution
> ioctls you have an implicit timeline that is properly populated, and each
> driver command execution gets its dependencies from the implicit timeline.
>
>
> Of course to take full advantage of all the flexibility this could offer we
> would need to allow userspace to create pipeline timelines and to schedule
> against the pipeline timeline of their choice. We could create a file for
> each pipeline timeline and have file operations to wait/query
> progress.
>
> Note that gpu lockups are considered exceptional events; the implicit
> timeline will probably want to continue with other jobs on other hardware
> blocks, but the explicit one will probably want to decide whether to continue,
> abort, or retry without the faulty hw block.
>
>
> I realize I am late to the party and that I should have taken a serious
> look at all this a long time ago. I apologize for that, and if you consider
> this is too late then just ignore me, modulo the big warning about the craziness
> that callbacks will introduce and how bad things are bound to happen.
> I am not
> saying that bad things can not happen with what I propose, just that
> because everything happens inside the process context that is the one
> asking/requiring synchronization there will be no interprocess kernel
> callback (a callback that was registered by one process and that is called
> inside another process time slice because fence signaling is happening
> inside this other process time slice).
>
>
> Pseudo code for explicitness :
>
> int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> {
>    struct fence *dependency[16], *fence;
>    int m;
>
>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>                          dependency, 16, &fence);
>    if (m < 0)
>       return m;
>    if (m >= 16) {
>        // alloc m and recall;
>    }
>    return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
> }
>
> int timeline_schedule(ptimeline, hwtimeline, timeout,
>                        dependency, mdep, **fence)
> {
>    // allocate fence set hw_timeline and init work
>    // build up list of dependency by looking at list of pending fence in
>    // timeline
> }
>
>
>
> // If the device driver schedules a job hoping for all dependencies to be signaled then
> // it must also call this function with csdata being a copy of what needs to be
> // executed once all dependencies are signaled
> void timeline_missed_schedule(timeline, fence, void *csdata)
> {
>    INITWORK(fence->timedout_work, timeline_missed_schedule_worker);
>    fence->csdata = csdata;
>    schedule_delayed_work(fence->timedout_work, default_timeout);
> }
>
> void timeline_missed_schedule_worker(work)
> {
>    // recover the fence that embeds this work item
>    fence = container_of(work, struct fence, timedout_work);
>    driver = driver_from_fence_hwtimeline(fence);
>
>    // Make sure that each of the hwtimeline dependencies will fire an irq by
>    // calling a driver function.
>    timeline_wait_for_fence_dependency(fence);
>    driver->execute_cs(driver, fence);
> }
>
> // This function is called by driver code that signals fences (could be called from
> // interrupt context). It is the responsibility of the device driver to call that
> // function.
> void timeline_signal(hwtimeline)
> {
>   for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>     wakeup(fence->pipetimeline->wait_queue);
>   }
> }

Btw as an extra note, because of the implicit timeline any shared object scheduled on
a hw timeline must add a fence to all the implicit timelines where this object exists.

Also there is no need to have a fence pointer per object.

>
>
> Cheers,
> Jérôme
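
To make the sequence-number scheme a bit more concrete, here is a minimal
userspace sketch in C. It is only a model under stated assumptions, not kernel
code and not a definitive version of the proposed api: the wakeups counter
stands in for a real wait queue, last_signaled_seq stands in for the system
memory address the driver would expose, the fence_init() and
fence_is_signaled() helpers and the array-based timeline_signal() signature are
hypothetical, and 64-bit sequence numbers are assumed never to wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a hardware timeline. */
struct hw_timeline {
    /* Stand-in for the memory location the hw block writes to. */
    volatile uint64_t last_signaled_seq;
    /* Last sequence number handed out to a fence. */
    uint64_t next_seq;
};

/* Hypothetical model of a pipeline timeline. */
struct pipe_timeline {
    /* Stand-in for a wait queue: counts how often it was woken. */
    unsigned int wakeups;
};

/* Simplified fence: one hw timeline, one pipeline timeline, one seq. */
struct fence {
    struct hw_timeline *hw_timeline;
    struct pipe_timeline *pipe_timeline;
    uint64_t seq_num;
};

/* Create a fence at the next point of the hw timeline. */
void fence_init(struct fence *f, struct hw_timeline *hw,
                struct pipe_timeline *pipe)
{
    assert(f && hw && pipe);
    f->hw_timeline = hw;
    f->pipe_timeline = pipe;
    f->seq_num = ++hw->next_seq;
}

/* Same comparison a command ring could do: (memory_address >= value). */
bool fence_is_signaled(const struct fence *f)
{
    return f->hw_timeline->last_signaled_seq >= f->seq_num;
}

/*
 * Called by the driver when the hw block reaches seq. It publishes the
 * sequence number and wakes the pipeline timeline of every pending fence
 * on this hw timeline. There is no call into any other driver.
 */
void timeline_signal(struct hw_timeline *hw, uint64_t seq,
                     struct fence **pending, size_t n)
{
    hw->last_signaled_seq = seq;
    for (size_t i = 0; i < n; i++) {
        if (pending[i]->hw_timeline == hw)
            pending[i]->pipe_timeline->wakeups++;
    }
}
```

A waiter woken this way would simply recheck fence_is_signaled() for the fence
it cares about and go back to sleep if the sequence number has not been reached
yet.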