From: Christian König
Subject: Re: Fence, timeline and android sync points
Date: Wed, 13 Aug 2014 16:08:14 +0200
Message-ID: <53EB714E.1060102@vodafone.de>
In-Reply-To: <20140813134145.GB2666@gmail.com>
To: Jerome Glisse
Cc: daniel.vetter@ffwll.ch, bskeggs@redhat.com, dri-devel@lists.freedesktop.org

> The whole issue is that today the cs ioctl assumes implied synchronization. So
> this can not change, so for now anything that goes through the cs ioctl would
> need to use an implied timeline and have all rings that use a common buffer
> synchronize on it. As long as those rings use different buffers there is no
> need for sync.

Exactly my thoughts.

> Buffer objects are what link hw timelines.

A couple of people at AMD have a problem with that and I'm currently
working full time on a solution. But solving this and keeping 100%
backward compatibility at the same time is not an easy task.

> Of course there might be a way to be more flexible if timelines are exposed to
> userspace and userspace can create several of them for a single process.

Concurrent execution is mostly used for temporary things, e.g. copying a
result to a userspace buffer while VCE is decoding into the ring buffer
at a different location.
Creating an extra timeline just to tell the kernel that two commands are
allowed to run in parallel sounds like too much overhead to me.

Cheers,
Christian.

On 13.08.2014 at 15:41, Jerome Glisse wrote:
> On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
>> Hi Jerome,
>>
>> first of all that finally sounds like somebody starts to draw the whole
>> picture for me.
>>
>> So far all I have seen was a bunch of specialized requirements and some not
>> so obvious design decisions based on those requirements.
>>
>> So thanks a lot for finally summarizing the requirements from a top-down
>> view and I perfectly agree with your analysis of the current fence design
>> and the downsides of that API.
>>
>> Apart from that I also have some comments / requirements that hopefully can
>> be taken into account as well:
>>
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline points.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>> In the long term a requirement for the driver for AMD GFX hardware is that
>> instead of a fixed pipeline timeline we need a bit more flexible model where
>> concurrent execution on different hardware engines is possible as well.
>>
>> So the requirement is that you can do things like submitting a 3D job A, a
>> DMA job B, a VCE job C and another 3D job D that are executed like this:
>>     A
>>    / \
>>   B   C
>>    \ /
>>     D
>>
>> (Let's just hope that looks as good on your mail client as it looked for
>> me).
> My thinking of hw timelines is that a gpu like amd or nvidia would have several
> different hw timelines. They are per block/engine so one for dma ring, one for
> gfx, one for vce, ....
>
>
>> My current thinking is that we avoid having a pipeline object in the kernel
>> and instead let userspace specify which fence we want to synchronize to
>> explicitly as long as everything stays within the same client. As soon as
>> any buffer is shared between clients the kernel would need to fall back
>> to implicit synchronization to allow backward compatibility with DRI2/3.
> The whole issue is that today the cs ioctl assumes implied synchronization. So
> this can not change, so for now anything that goes through the cs ioctl would
> need to use an implied timeline and have all rings that use a common buffer
> synchronize on it. As long as those rings use different buffers there is no
> need for sync.
>
> Buffer objects are what link hw timelines.
>
> Of course there might be a way to be more flexible if timelines are exposed to
> userspace and userspace can create several of them for a single process.
>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value) with cop one
>>> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matters can do that. Those that can not should fix
>>> their command ring processor.
>> At least for some engines on AMD hardware that isn't possible (UVD, VCE and
>> to some extent DMA as well), but I don't see any reason why we shouldn't be
>> able to use software based scheduling on those engines by default. So this
>> isn't really a problem, but just an additional comment to keep in mind.
> Yes not everything can do that but as it's a simple memory access with a simple
> comparison it's easy to do on the cpu for limited hardware. But this really
> sounds like something so easy to add to hw ring execution that it is a shame
> hw designers have not already added such a thing.
>
>> Regards,
>> Christian.
>>
>> On 13.08.2014 at 00:13, Jerome Glisse wrote:
>>> Hi,
>>>
>>> So I went over the whole fence and sync point stuff as it's becoming a pressing
>>> issue. I think we first need to agree on what is the problem we want to solve
>>> and what would be the requirements to solve it.
>>>
>>> Problem :
>>>    Explicit synchronization between different hardware blocks over a buffer object.
>>>
>>> Requirements :
>>>    Share common infrastructure.
>>>    Allow optimal hardware command stream scheduling across hardware blocks.
>>>    Allow android sync points to be implemented on top of it.
>>>    Handle/acknowledge exceptions (like good old gpu lockup).
>>>    Minimize driver changes.
>>>
>>> Glossary :
>>>    hardware timeline: timeline bound to a specific hardware block.
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline points.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>>>    fence: specific point in a timeline where synchronization needs to happen.
>>>
>>>
>>> So now, the current include/linux/fence.h implementation is, I believe, missing
>>> the objective by confusing hardware and pipeline timelines and by bolting fences
>>> to buffer objects, while what is really needed is a true and proper timeline for
>>> both hardware and pipeline. But before going further down that road let me look
>>> at things and explain how I see them.
>>>
>>> Current ttm fences have one and only one purpose: allow synchronization for
>>> buffer object moves, even though some drivers like radeon slightly abuse them
>>> and use them for things like lockup detection.
>>>
>>> The new fence wants to expose an api that would allow some implementation of a
>>> timeline.
>>> For that it introduces callbacks and some hard requirements on what the
>>> driver has to expose :
>>>    enable_signaling
>>>    [signaled]
>>>    wait
>>>
>>> Each of those has to do work inside the driver to which the fence belongs and
>>> each of those can be called from more or less unexpected contexts (with
>>> restrictions like outside irq). So we end up with things like :
>>>
>>>   Process 1              Process 2                   Process 3
>>>   I_A_schedule(fence0)
>>>                          CI_A_F_B_signaled(fence0)
>>>                                                      I_A_signal(fence0)
>>>                                                      CI_B_F_A_callback(fence0)
>>>                          CI_A_F_B_wait(fence0)
>>> Legend:
>>> I_x  in driver x (I_A == in driver A)
>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>>>
>>> So this is a happy mess, everyone calls everyone, and this is bound to get messy.
>>> Yes I know there are all kinds of requirements on what happens once a fence is
>>> signaled. But those requirements only look like they are trying to atone for any
>>> mess that can happen from the whole callback dance.
>>>
>>> While I too was seduced by the whole callback idea a long time ago, I think it is
>>> a highly dangerous path to take, where the combinatorics of what could happen
>>> are bound to explode with the increase in the number of players.
>>>
>>>
>>> So now back to how to solve the problem we are trying to address. First I want
>>> to make an observation: almost all GPUs that exist today have a command ring
>>> on which userspace command buffers are executed, and inside the command ring
>>> you can do something like :
>>>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value) with cop one
>>> of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matters can do that. Those that can not should fix
>>> their command ring processor.
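For illustration only (this is not code from the thread), the conditional
quoted above could be sketched in plain C roughly as below. It models the
check a command ring processor, or a CPU fallback for engines without that
feature, would perform. The names cond_op, cond, cond_eval, cmd_buf and
ring_exec_cond are all hypothetical, and "executing" a command buffer is
reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Comparison operators ("cop") named in the mail: ==, <, >, <=, >=. */
enum cond_op { COND_EQ, COND_LT, COND_GT, COND_LE, COND_GE };

/* Hypothetical descriptor for "(memory_address cop value)". */
struct cond {
    const volatile uint64_t *addr; /* memory location to read */
    enum cond_op op;               /* comparison to apply */
    uint64_t value;                /* right-hand side of the comparison */
};

/* Read the memory location once and apply the comparison. */
static bool cond_eval(const struct cond *c)
{
    assert(c && c->addr);
    uint64_t cur = *c->addr;

    switch (c->op) {
    case COND_EQ: return cur == c->value;
    case COND_LT: return cur <  c->value;
    case COND_GT: return cur >  c->value;
    case COND_LE: return cur <= c->value;
    case COND_GE: return cur >= c->value;
    }
    return false; /* unknown operator: treat as "skip" */
}

/* Hypothetical command buffer: only counts how often it was run. */
struct cmd_buf { unsigned executed; };

/* if (condition) execute_command_buffer else skip_command_buffer */
static bool ring_exec_cond(const struct cond *c, struct cmd_buf *cb)
{
    if (!cond_eval(c))
        return false;   /* skipped */
    cb->executed++;     /* stand-in for real execution */
    return true;
}
```

With addr pointing at another engine's sequence number and op set to
COND_GE, this is the check that the optimistic scheduling idea relies on.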
>>>
>>>
>>> With that in mind, I think the proper solution is implementing timelines and
>>> having fence be a timeline object with a way simpler api. For each hardware
>>> timeline the driver provides a system memory address at which the latest
>>> signaled fence sequence number can be read. Each fence object is uniquely
>>> associated with both a hardware and a pipeline timeline. Each pipeline timeline
>>> has a wait queue.
>>>
>>> When scheduling something that requires synchronization on a hardware timeline,
>>> a fence is created and associated with the pipeline timeline and hardware
>>> timeline. Other hardware blocks that need to wait on a fence can use their
>>> command ring conditional execution to directly check the fence sequence from
>>> the other hw block, so you do optimistic scheduling. If optimistic scheduling
>>> fails (which would be reported by a hw block specific solution and hidden) then
>>> things can fall back to a software cpu wait inside what could be considered the
>>> kernel thread of the pipeline timeline.
>>>
>>>
>>> From an api point of view there is no inter-driver call. All the driver needs to
>>> do is wake up the pipeline timeline wait_queue when things are signaled or
>>> when things go sideways (gpu lockup).
>>>
>>>
>>> So how to implement that with current drivers ? Well, easy. Currently we assume
>>> implicit synchronization so all we need is an implicit pipeline timeline per
>>> userspace process (note this does not prevent inter process synchronization).
>>> Every time a command buffer is submitted it is added to the implicit timeline
>>> with the simple fence object :
>>>
>>> struct fence {
>>>    struct list_head   list_hwtimeline;
>>>    struct list_head   list_pipetimeline;
>>>    struct hw_timeline *hw_timeline;
>>>    uint64_t           seq_num;
>>>    work_t             timedout_work;
>>>    void               *csdata;
>>> };
>>>
>>> So with a set of helper functions called by each of the driver command execution
>>> ioctls you have the implicit timeline properly populated, and each driver
>>> command execution gets its dependencies from the implicit timeline.
>>>
>>>
>>> Of course to take full advantage of all the flexibility this could offer we
>>> would need to allow userspace to create pipeline timelines and to schedule
>>> against the pipeline timeline of their choice. We could create a file for
>>> each pipeline timeline and have file operations to wait/query progress.
>>>
>>> Note that gpu lockups are considered exceptional events; the implicit
>>> timeline will probably want to continue with other jobs on other hardware
>>> blocks, but the explicit one will probably want to decide whether to continue,
>>> abort, or retry without the faulty hw block.
>>>
>>>
>>> I realize I am late to the party and that I should have taken a serious
>>> look at all this a long time ago. I apologize for that, and if you consider
>>> this is too late then just ignore me, modulo the big warning about the
>>> craziness that callbacks will introduce and how bad things are bound to
>>> happen. I am not saying that bad things can not happen with what I propose,
>>> just that because everything happens inside the process context that is the
>>> one asking/requiring synchronization there will be no interprocess kernel
>>> callback (a callback that was registered by one process and that is called
>>> inside another process time slice because fence signaling is happening
>>> inside this other process time slice).
>>>
>>>
>>> Pseudo code for explicitness :
>>>
>>> drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
>>> {
>>>    struct fence *dependency[16], *fence;
>>>    int m;
>>>
>>>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>>>                          dependency, 16, &fence);
>>>    if (m < 0)
>>>        return m;
>>>    if (m >= 16) {
>>>        // alloc m and recall;
>>>    }
>>>    dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
>>> }
>>>
>>> int timeline_schedule(ptimeline, hwtimeline, timeout,
>>>                       dependency, mdep, **fence)
>>> {
>>>    // allocate fence, set hw_timeline and init work
>>>    // build up list of dependencies by looking at list of pending fences in
>>>    // timeline
>>> }
>>>
>>>
>>>
>>> // If the device driver schedules a job hoping for all dependencies to be
>>> // signaled then it must also call this function with csdata being a copy of
>>> // what needs to be executed once all dependencies are signaled
>>> void timeline_missed_schedule(timeline, fence, void *csdata)
>>> {
>>>    INITWORK(fence->timedout_work, timeline_missed_schedule_worker);
>>>    fence->csdata = csdata;
>>>    schedule_delayed_work(fence->timedout_work, default_timeout);
>>> }
>>>
>>> void timeline_missed_schedule_worker(work)
>>> {
>>>    driver = driver_from_fence_hwtimeline(fence);
>>>
>>>    // Make sure that each of the hwtimeline dependencies will fire an irq by
>>>    // calling a driver function.
>>>    timeline_wait_for_fence_dependency(fence);
>>>    driver->execute_cs(driver, fence);
>>> }
>>>
>>> // This function is called by driver code that signals fences (could be called
>>> // from interrupt context). It is the responsibility of the device driver to
>>> // call that function.
>>> void timeline_signal(hwtimeline)
>>> {
>>>    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>>>        wakeup(fence->pipetimeline->wait_queue);
>>>    }
>>> }
>>>
>>>
>>> Cheers,
>>> Jérôme
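
For illustration only (again not code from the thread), the timeline idea
quoted above can be modelled in userspace C roughly as follows: each hardware
timeline publishes the latest signaled sequence number in memory, a fence is
just a sequence number on that timeline, and signaling only wakes the pipeline
timeline instead of calling into another driver. The struct layouts here are
reduced, hypothetical versions of the ones in the mail, fence_create and
fence_signaled are made-up helpers, and the wait queue is replaced by a simple
wake-up counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware timeline: one per engine (gfx, dma, vce, ...). */
struct hw_timeline {
    volatile uint64_t last_signaled; /* latest signaled seq, readable by anyone */
    uint64_t next_seq;               /* last sequence number handed out */
};

/* Hypothetical pipeline timeline; the counter stands in for a wait queue. */
struct pipe_timeline {
    unsigned wakeups;
};

/* Reduced fence: a point on one hardware and one pipeline timeline. */
struct fence {
    struct hw_timeline *hw;
    struct pipe_timeline *pipe;
    uint64_t seq_num;
};

/* Create a fence for work queued on hw on behalf of pipe. */
static struct fence fence_create(struct hw_timeline *hw,
                                 struct pipe_timeline *pipe)
{
    struct fence f = { hw, pipe, ++hw->next_seq };
    return f;
}

/* A fence is signaled once its hardware timeline reached its seq number. */
static bool fence_signaled(const struct fence *f)
{
    return f->hw->last_signaled >= f->seq_num;
}

/*
 * Driver side: publish progress, then wake the pipeline timeline of every
 * fence on this hardware timeline. No call into any other driver is made.
 */
static void timeline_signal(struct hw_timeline *hw, uint64_t seq,
                            struct fence *fences, size_t n)
{
    assert(seq >= hw->last_signaled); /* a timeline only moves forward */
    hw->last_signaled = seq;
    for (size_t i = 0; i < n; i++)
        if (fences[i].hw == hw)
            fences[i].pipe->wakeups++;
}
```

A waiter woken this way re-checks fence_signaled() in its own context, which
is the property the proposal is after: no callback registered by one process
runs inside another process' time slice.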