From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?ISO-8859-1?Q?Christian_K=F6nig?= Subject: Re: [PATCH] drm/radeon: add an exclusive lock for GPU reset Date: Tue, 03 Jul 2012 11:26:43 +0200 Message-ID: <4FF2BAD3.7030402@vodafone.de> References: <1341243563-2559-1-git-send-email-j.glisse@gmail.com> <4FF1CBB5.2060208@vodafone.de> <4FF1D4EF.3040806@vodafone.de> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1"; Format="flowed" Content-Transfer-Encoding: quoted-printable Return-path: Received: from outgoing.email.vodafone.de (outgoing.email.vodafone.de [139.7.28.128]) by gabe.freedesktop.org (Postfix) with ESMTP id BA2C89E804 for ; Tue, 3 Jul 2012 02:26:47 -0700 (PDT) In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org To: Jerome Glisse Cc: Jerome Glisse , dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org On 02.07.2012 19:27, Jerome Glisse wrote: > On Mon, Jul 2, 2012 at 1:05 PM, Christian K=F6nig wrote: >> On 02.07.2012 18:41, Jerome Glisse wrote: >>> On Mon, Jul 2, 2012 at 12:26 PM, Christian K=F6nig >>> wrote: >>>> On 02.07.2012 17:39, j.glisse@gmail.com wrote: >>>>> From: Jerome Glisse >>>>> >>>>> GPU reset need to be exclusive, one happening at a time. For this >>>>> add a rw semaphore so that any path that trigger GPU activities >>>>> have to take the semaphore as a reader thus allowing concurency. >>>>> >>>>> The GPU reset path take the semaphore as a writer ensuring that >>>>> no concurrent reset take place. >>>>> >>>>> Signed-off-by: Jerome Glisse >>>> NAK, that isn't as bad as the cs mutex was but still to complicated. I= t's >>>> just too far up in the call stack, e.g. it tries to catch ioctl >>>> operations, >>>> instead of catching the underlying hardware operation which is caused = by >>>> the >>>> ioctl/ttm/etc... >>>> >>>> Why not just take the ring look as I suggested? >>>> >>>> >>> No we can't use ring lock, we need to protect suspend/resume path and >>> we need an exclusive lock for that so we need a reset mutex at the >>> very least. But instead of having a reset mutex i prefer using a rw >>> lock so that we can stop ioctl until a reset goes through an let >>> pending ioctl take proper action. Think about multiple context trying >>> to reset GPU ... >>> >>> Really this is the best option, the rw locking wont induce any lock >>> contention execept in gpu reset case which is anyway bad news. >> Why? That makes no sense to me. Well I don't want to prevent lock >> contention, but understand why we should add locking at the ioctl level. >> That violates locking rule number one "lock data instead of code" (or in= our >> case "lock hardware access instead of code path") and it really is the >> reason why we ended up with the cs_mutex protecting practically everythi= ng. >> >> Multiple context trying to reset the GPU should be pretty fine, current = it >> would just reset the GPU twice, but in the future asic_reset should be m= uch >> more fine grained and only reset the parts of the GPU which really needs= an >> reset. >> >> Cheers, >> Christian. > No multiple reset is not fine, try it your self and you will see all > kind of oops (strongly advise you to sync you hd before stress testing > that). Yes we need to protect code path because suspend/resume code > path is special one it touch many data in the device structure. GPU > reset is a rare event or should be a rare event. Yeah, but I thought that fixing those oops as the second step. I see the = fact that suspend/resume is unpinning all the ttm memory and then = pinning it again as a bug that needs to be fixed. Or as an alternative = we could split the suspend/resume calls into suspend/disable and = resume/enable calls, where we only call disable/enable in the gpu_reset = path rather than a complete suspend/resume (not really sure about that). Also a GPU reset isn't such a rare event, currently it just occurs when = userspace is doing something bad, for example submitting an invalid = shader or stuff like that. But with VM and IO protection coming into the = driver we are going to need a GPU reset when there is an protection = fault, and with that it is really desirable to just reset the hardware = in a way where we can say: This IB was faulty skip over it and resume = with whatever is after it on the ring. And todo that we need to keep the auxiliary data like sub allocator = memory, blitting shader bo, and especially vm page tables at the same = place in hardware memory. > I stress it we need at very least a mutex to protect gpu reset and i > will stand on that position because there is no other way around. > Using rw lock have a bonus of allowing proper handling of gpu reset > failure and that what the patch i sent to linus fix tree is about, so > to make drm next merge properly while preserving proper behavior in > gpu reset failure the rw semaphore is the best option. Oh well, I'm not arguing that we don't need a look here. I'm just = questioning to put it at the ioctl level (e.g. the driver entry from = userspace), that wasn't such a good idea with the cs_mutex and doesn't = seems like a good idea now. Instead we should place the looking between = the ioctl/ttm and the actual hardware submission and that brings it = pretty close (if not identically) to the ring mutex. Cheers, Christian.