From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?ISO-8859-1?Q?Christian_K=F6nig?= Subject: Re: [RFC PATCH] drm/radeon: rework to new fence interface Date: Tue, 20 Aug 2013 16:16:34 +0200 Message-ID: <52137A42.6090503@vodafone.de> References: <20130815124308.14812.58197.stgit@patser> <5211F0C5.2040705@canonical.com> <5212112C.70808@vodafone.de> <52127411.2010106@canonical.com> <52132AE0.4010702@vodafone.de> <521338A9.4040206@canonical.com> <52133C2A.2090200@vodafone.de> <52136D6F.6090408@canonical.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from outgoing.email.vodafone.de ([139.7.28.128]:64617 "EHLO outgoing.email.vodafone.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750934Ab3HTOQw (ORCPT ); Tue, 20 Aug 2013 10:16:52 -0400 In-Reply-To: <52136D6F.6090408@canonical.com> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Maarten Lankhorst Cc: dri-devel@lists.freedesktop.org, linux-arch@vger.kernel.org, linaro-mm-sig@lists.linaro.org, Alex Deucher , linux-kernel@vger.kernel.org, linux-media@vger.kernel.org Am 20.08.2013 15:21, schrieb Maarten Lankhorst: > Op 20-08-13 11:51, Christian K=F6nig schreef: >> Am 20.08.2013 11:36, schrieb Maarten Lankhorst: >> [SNIP] >> >>>>>>> [SNIP] >>>>>>> +/** >>>>>>> + * radeon_fence_enable_signaling - enable signalling on fence >>>>>>> + * @fence: fence >>>>>>> + * >>>>>>> + * This function is called with fence_queue lock held, and add= s a callback >>>>>>> + * to fence_queue that checks if this fence is signaled, and i= f so it >>>>>>> + * signals the fence and removes itself. >>>>>>> + */ >>>>>>> +static bool radeon_fence_enable_signaling(struct fence *f) >>>>>>> +{ >>>>>>> + struct radeon_fence *fence =3D to_radeon_fence(f); >>>>>>> + >>>>>>> + if (atomic64_read(&fence->rdev->fence_drv[fence->ring].las= t_seq) >=3D fence->seq || >>>>>>> + !fence->rdev->ddev->irq_enabled) >>>>>>> + return false; >>>>>>> + >>>>>> Do I get that right that you rely on IRQs to be enabled and work= ing here? Cause that would be a quite bad idea from the conceptual side= =2E >>>>> For cross-device synchronization it would be nice to have working= irqs, it allows signalling fences faster, >>>>> and it allows for callbacks on completion to be called. For inter= nal usage it's no more required than it was before. >>>> That's a big NAK. >>>> >>>> The fence processing is actually very fine tuned to avoid IRQs and= as far as I can see you just leave them enabled by decrementing the at= omic from IRQ context. Additional to that we need allot of special hand= ling in case of a hardware lockup here, which isn't done if you abuse t= he fence interface like this. >>> I think it's not needed to leave the irq enabled, it's a leftover f= rom when I was debugging the mac and no interrupt occurred at all. >>> >>>> Also your approach of leaking the IRQ context outside of the drive= r is a very bad idea from the conceptual side. Please don't modify the = fence interface at all and instead use the wait functions already expos= ed by radeon_fence.c. If you need some kind of signaling mechanism then= wait inside a workqueue instead. >>> The fence takes up the role of a single shot workqueue here. Manual= ly resetting the counter and calling wake_up_all would end up waking al= l active fences, there's no special handling needed inside radeon for t= his. >> Yeah that's actually the point here, you NEED to activate ALL fences= , otherwise the fence handling inside the driver won't work. > It's done in a lazy fashion. If there's no need for an activated fenc= e the interrupt will not be enabled. >>> The fence api does provide a synchronous wait function, but this ca= uses a stall of whomever waits on it. >> Which is perfectly fine. What actually is the use case of not stalli= ng a process who wants to wait for something? > Does radeon call ttm_bo_wait on all bo's before doing a command submi= ssion? No? Why should other drivers do that.. Sure it does if hardware synchronization isn't supported. >>> When I was testing this with intel I used the fence callback to pok= e a register in i915, this allowed it to not block until it hits the wa= it op in the command stream, and even then only if the callback was not= called first. >>> >>> It's documented that the callbacks can be called from any context a= nd will be called with irqs disabled, so nothing scary should be done. = The kernel provides enough debug mechanisms to find any violators. >>> PROVE_LOCKING and DEBUG_ATOMIC_SLEEP for example. >> No thanks, we even abandoned that concept internal in the driver. Pl= ease use the blocking wait functions instead. > No, this just stalls all gpu's that share a bo. > > The idea is to provide a standardized api so bo's can be synchronized= without stalling. The first step to this is ww_mutex. > If this lock is shared between multiple gpu's the same object can be = reserved between multiple devices without causing > a deadlock with circular dependencies. With some small patches it's p= ossible to do this already between multiple drivers > that use ttm. ttm_bo_reserve, ttm_bo_unreserve and all the other code= dealing with ttm reservations have been converted > to use ww_mutex locking. > > Fencing is the next step. When all buffers are locked a callback shou= ld be added to any previous fence, and a single new fence > signaling completion of the command submission should be placed on al= l locked objects. Because the common path is that no > objects are shared, the callback and FIFO stalling will only be neede= d for dma-bufs. When all callbacks have fired the FIFO can be > unblocked. This prevents having to sync the gpu to the cpu. If a bo i= s submitted to 1 gpu, and then immediately to another it will not > stall unless needed. For example in a optimus configuration an applic= ation could copy a rendered frame from VRAM to a shared > dma-buf (xorg's buffer), then have Xorg copying it again (on intel's = gpu) from the dma-buf to a framebuffer . Yeah, that's the same concept we used to have for multiple rings, to=20 avoid stalling if a buffer is currently used in a different part of the= =20 GPU than the current command submission wants it to. After allot of=20 internal discussion we came to the conclusion that it just doesn't wort= h=20 the effort. Have you thought over all the consequences that are implied by having a= =20 serialized stream of command submissions? When a buffer is in use by another device you basically have only two=20 options: Either block for the buffer to become unused or use a hardware= =20 method (semaphores) to sync up your operations. Of course you can=20 optimize by using a multiple reader / single writers model, but for thi= s=20 you basically just have to teach ttm that there might be more than one=20 fence for each bo. Anyway those things are only optimizations, and in the end it always=20 comes down to blocking the command submission if there is no other way=20 of hardware synchronization and I absolutely don't see any reason why w= e=20 shouldn't do this with just a normal blocking call to a waitqueue. Christian.