From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?ISO-8859-1?Q?Christian_K=F6nig?= Subject: Re: [PATCH] drm/radeon: add an exclusive lock for GPU reset Date: Tue, 03 Jul 2012 16:52:00 +0200 Message-ID: <4FF30710.7030504@vodafone.de> References: <1341243563-2559-1-git-send-email-j.glisse@gmail.com> <4FF1CBB5.2060208@vodafone.de> <4FF1D4EF.3040806@vodafone.de> <4FF2BAD3.7030402@vodafone.de> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1"; Format="flowed" Content-Transfer-Encoding: quoted-printable Return-path: Received: from outgoing.email.vodafone.de (outgoing.email.vodafone.de [139.7.28.128]) by gabe.freedesktop.org (Postfix) with ESMTP id 73542A02C7 for ; Tue, 3 Jul 2012 07:52:05 -0700 (PDT) In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org To: Jerome Glisse Cc: Jerome Glisse , dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org On 03.07.2012 16:09, Jerome Glisse wrote: > On Tue, Jul 3, 2012 at 5:26 AM, Christian K=F6nig wrote: >> [SNIP] >> Yeah, but I thought that fixing those oops as the second step. I see the >> fact that suspend/resume is unpinning all the ttm memory and then pinnin= g it >> again as a bug that needs to be fixed. Or as an alternative we could spl= it >> the suspend/resume calls into suspend/disable and resume/enable calls, w= here >> we only call disable/enable in the gpu_reset path rather than a complete >> suspend/resume (not really sure about that). > Fixing oops are not second step, they are first step. I am not saying > that the suspend/resume as it happens right now is a good thing, i am > saying it's what it's and we have to deal with it until we do > something better. There is no excuse to not fix oops with a simple > patch 16 lines patch. Completely agree. >> Also a GPU reset isn't such a rare event, currently it just occurs when >> userspace is doing something bad, for example submitting an invalid shad= er >> or stuff like that. But with VM and IO protection coming into the driver= we >> are going to need a GPU reset when there is an protection fault, and with >> that it is really desirable to just reset the hardware in a way where we= can >> say: This IB was faulty skip over it and resume with whatever is after i= t on >> the ring. > There is mecanism to handle that properly from irq handler that AMD > need to release, the pagefault thing could be transparent and should > only need few lines in the irq handler (i think i did a patch for that > and sent it to AMD for review but i am wondering if i wasn't lacking > some doc). I also searched the AMD internal docs for a good description of how the = lightweight recovery is supposed to work, but haven't found anything = clear so far. My expectation is that there should be something like a = "abort current IB" command you can issue by writing an register, but = that doesn't seems to be the case. >> And todo that we need to keep the auxiliary data like sub allocator memo= ry, >> blitting shader bo, and especially vm page tables at the same place in >> hardware memory. > I agree that we need a lightweight reset but we need to keep the heavy > reset as it is right now, if you want to do a light weight reset do it > as a new function. I always intended to have two reset path, hint gpu > soft reset name vs what is call hard reset but not released, i even > made patch for that long time ago but never got them cleared from AMD > review. My idea was to pass in some extra informations, so asic_reset more = clearly knows what todo. An explicit distinction between a soft and a = hard reset also seems like a possible solution, but sounds like a bit of = code duplication. >>> I stress it we need at very least a mutex to protect gpu reset and i >>> will stand on that position because there is no other way around. >>> Using rw lock have a bonus of allowing proper handling of gpu reset >>> failure and that what the patch i sent to linus fix tree is about, so >>> to make drm next merge properly while preserving proper behavior in >>> gpu reset failure the rw semaphore is the best option. >> Oh well, I'm not arguing that we don't need a look here. I'm just >> questioning to put it at the ioctl level (e.g. the driver entry from >> userspace), that wasn't such a good idea with the cs_mutex and doesn't s= eems >> like a good idea now. Instead we should place the looking between the >> ioctl/ttm and the actual hardware submission and that brings it pretty c= lose >> (if not identically) to the ring mutex. >> >> Cheers, >> Christian. > No, locking at the ioctl level make sense please show me figure that > it hurt performance, i did a quick sysprof and i couldn't see them > impacting anything. No matter how much you hate this, this is the best > solution, it avoids each ioctl to do useless things in case of gpu > lockup and it touch a lot less code than moving a lock down the call > path would. So again this is the best solution for the heavy reset, > and i am not saying that a soft reset would need to take this lock or > that we can't improve the way it's done. All i am saying is that ring > lock is the wrong solution for heavy reset, it should be ok for light > weight reset. > I'm not into any performance concerns, it just doesn't seems to be the = right place to me. So are you really sure that the = ttm_bo_delayed_workqueue, pageflips or callbacks to radeon_bo_move = can't hit us here? IIRC that always was the big concern with the = cs_mutex not being held in all cases. Anyway, if you think it will work and fix the crash problem at hand then = I'm ok with commit it. Christian.