From: Daniel Vetter
Subject: Re: [PATCH 2/6] seqno-fence: Hardware dma-buf implementation of fencing (v4)
Date: Mon, 3 Mar 2014 22:01:04 +0100
Message-ID: <20140303210104.GJ17001@phenom.ffwll.local>
References: <20140217155056.20337.25254.stgit@patser>
 <20140217155556.20337.37589.stgit@patser>
 <53023F3E.3080107@vodafone.de>
 <530248B1.2090405@vodafone.de>
 <530257E3.2060508@vodafone.de>
 <5304B0E7.4000802@canonical.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Return-path:
Received: from mail-ea0-f175.google.com ([209.85.215.175]:35439 "EHLO
 mail-ea0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752636AbaCCVBJ (ORCPT ); Mon, 3 Mar 2014 16:01:09 -0500
Received: by mail-ea0-f175.google.com with SMTP id d10so1651164eaj.34
 for ; Mon, 03 Mar 2014 13:01:08 -0800 (PST)
Content-Disposition: inline
In-Reply-To: <5304B0E7.4000802@canonical.com>
Sender: linux-arch-owner@vger.kernel.org
List-ID:
To: Maarten Lankhorst
Cc: Christian König, Rob Clark, linux-arch@vger.kernel.org,
 Linux Kernel Mailing List, "dri-devel@lists.freedesktop.org",
 "linaro-mm-sig@lists.linaro.org", Colin Cross,
 "linux-media@vger.kernel.org"

On Wed, Feb 19, 2014 at 02:25:59PM +0100, Maarten Lankhorst wrote:
> On 17-02-14 19:41, Christian König wrote:
> >On 17.02.2014 19:24, Rob Clark wrote:
> >>On Mon, Feb 17, 2014 at 12:36 PM, Christian König wrote:
> >>>On 17.02.2014 18:27, Rob Clark wrote:
> >>>
> >>>>On Mon, Feb 17, 2014 at 11:56 AM, Christian König wrote:
> >>>>>On 17.02.2014 16:56, Maarten Lankhorst wrote:
> >>>>>
> >>>>>>This type of fence can be used with hardware synchronization for
> >>>>>>simple hardware that can block execution until the condition
> >>>>>>(dma_buf[offset] - value) >= 0 has been met.
> >>>>>
> >>>>>Can't we make that just "dma_buf[offset] != 0" instead?
> >>>>>As far as I know this way it would match the definition M$ uses
> >>>>>in their WDDM specification and so make it much more likely that
> >>>>>hardware supports it.
> >>>>well 'buf[offset] >= value' at least means the same slot can be
> >>>>used for multiple operations (with increasing values of 'value')..
> >>>>not sure if that is something people care about.
> >>>>
> >>>>>=value seems to be possible with adreno and radeon. I'm not
> >>>>>really sure about others (although I presume it is at least
> >>>>>supported for nv desktop stuff). For hw that cannot do >=value,
> >>>>>we can either have a different fence implementation which uses
> >>>>>the !=0 approach. Or change the seqno-fence implementation later
> >>>>>if needed. But if someone has hw that can do !=0 but not >=value,
> >>>>>speak up now ;-)
> >>>
> >>>Here! Radeon can only do >=value on the DMA and 3D engine, but not
> >>>with UVD or VCE. And for the 3D engine it means draining the pipe,
> >>>which isn't really a good idea.
> >>hmm, ok.. forgot you have a few extra rings compared to me. Is UVD
> >>re-ordering from decode-order to display-order for you in hw? If
> >>not, I guess you need sw intervention anyways when a frame is done
> >>for frame re-ordering, so maybe hw->hw sync doesn't really matter as
> >>much as compared to gpu/3d->display. For dma<->3d interactions,
> >>seems like you would care more about hw<->hw sync, but I guess you
> >>aren't likely to use GPU A to do a resolve blit for GPU B..
> >
> >No, UVD isn't reordering, but since frame reordering is predictable
> >you usually end up pipelining everything to the hardware. E.g. you
> >send the decode commands in decode order to the UVD block, and if you
> >have overlay active one of the frames is going to be the first to
> >display, and then you want to wait for it on the display side.
> >
> >>For 3D ring, I assume you probably want a CP_WAIT_FOR_IDLE before a
> >>CP_MEM_WRITE to update the fence value in memory (for the one
> >>signalling the fence). But why would you need that before a
> >>CP_WAIT_REG_MEM (for the one waiting for the fence)? I don't
> >>exactly have documentation for the adreno version of
> >>CP_WAIT_REG_{MEM,EQ,GTE}.. but PFP and ME appear to be the same
> >>instruction set as r600, so I'm pretty sure they should have similar
> >>capabilities.. CP_WAIT_REG_MEM appears to be the same but with 32bit
> >>gpu addresses vs 64b.
> >
> >You shouldn't use any of the CP commands for engine synchronization
> >(neither for wait nor for signal). The PFP and ME are just the top of
> >a quite deep pipeline, and when you use any of the CP_WAIT functions
> >you block them on something, and that drains the pipeline.
> >
> >With the semaphore and fence commands the values are just attached as
> >a prerequisite to the draw command, e.g. the CP sets up the draw
> >environment and issues the command, but the actual execution of it is
> >delayed until the "!= 0" condition hits. And in the meantime the CP
> >already prepares the next draw operation.
> >
> >But at least for compute queues wait semaphores aren't the perfect
> >solution either. What you need then is a GPU scheduler that uses a
> >kernel task to set up the command submission for you when all
> >prerequisites are met.
> nouveau has sort of a scheduler in hardware. It can yield when waiting
> on a semaphore, and each process gets its own context and the
> timeslices can be adjusted. ;-) But I don't mind changing this patch
> when an actual user pops up. Nouveau can do a wait for
> (*sema & mask) != 0 only on nvc0 and newer, where the mask can be
> chosen. But it can do == somevalue and >= somevalue on older relevant
> optimus hardware, so if we know that it was zero before and we know
> the sign of the new value, that could work too.
>
> Adding ops and a separate mask later on when users pop up is fine with
> me; the original design here was chosen so I could map the intel
> status page read-only into the process-specific nvidia vm.

Yeah, I guess in the end we might end up having a pile of different
memory-based (shared through dma-bufs) fences. But imo getting this
thing off the ground is more important, and you can always do crazy
per-platform/generation/whatever hacks and match on that specific fence
type in the interim. That'll cut it for the SoC madness, which also
seems to be what google does with the android syncpoint stuff.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch