From: Christian König
Subject: Re: Reworking of GPU reset logic
Date: Wed, 25 Apr 2012 15:01:37 +0200
Message-ID: <4F97F5B1.4040202@vodafone.de>
References: <1334875160-5454-1-git-send-email-deathsimple@vodafone.de> <4F9280E9.8030100@vodafone.de>
To: Jerome Glisse
Cc: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

On 21.04.2012 16:14, Jerome Glisse wrote:
> 2012/4/21 Christian König:
>> On 20.04.2012 01:47, Jerome Glisse wrote:
>>> 2012/4/19 Christian König:
>>>> This includes mostly fixes for multi ring lockups and GPU resets, but it
>>>> should generally improve the behavior of the kernel mode driver in case
>>>> something goes badly wrong.
>>>>
>>>> On the other hand it completely rewrites the IB pool and semaphore
>>>> handling, so I think there are still a couple of problems in it.
>>>>
>>>> The first four patches were already sent to the list, but the current set
>>>> depends on them so I am resending them.
>>>>
>>>> Cheers,
>>>> Christian.
>>> I did a quick review, it looks mostly good, but as it's sensitive code
>>> I would like to spend some time on it. Probably next week. Note that I
>>> had some work in this area too. I mostly want to drop all the debugfs
>>> files related to this and add something new and more useful (basically
>>> something that allows you to read all the data needed to replay a
>>> locking up ib).
>>> I was also looking into Dave's reset thread, and your solution of
>>> moving the reset into the ioctl return path sounds good too, but I need
>>> to convince myself that it encompasses all possible cases.
>>>
>>> Cheers,
>>> Jerome
>>>
>> After sleeping on it for a night I already reworked the patch for improving
>> the SA performance, so please wait at least for v2 before taking a look at
>> it :)
>>
>> Regarding the debugging of lockups I had the following on my mental to-do
>> list:
>> 1. Rework the chip specific lockup detection code a bit more and probably
>> clean it up a bit.
>> 2. Make the timeout a module parameter, because compute tasks sometimes
>> block a ring for more than 10 seconds.
>> 3. Keep track of the actual RPTR offset a fence is emitted to.
>> 4. Keep track of all the BOs an IB is touching.
>> 5. Now if a lockup happens, start with the last successfully signaled fence
>> and dump the ring content after that RPTR offset up to the first not
>> signaled fence.
>> 6. Then if this fence references an IB, dump its content and the BOs it
>> is touching.
>> 7. Dump everything on the ring after that fence until you reach the RPTR of
>> the next fence or the WPTR of the ring.
>> 8. If there is a next fence, repeat the whole thing from number 6.
>>
>> If I'm not completely wrong that should give you practically all the
>> information available, and we probably should put that behind another module
>> option, because we are going to spam syslog quite a lot here. Feel free to
>> add/modify the ideas on this list.
>>
>> Christian.
> What I have is similar. I am assuming only an ib triggers a lockup; before
> each ib, emit the ib offset in the sa and the ib size to a scratch reg. For
> each ib keep a bo list. On lockup allocate a big chunk of memory to copy the
> whole ib and all the bos referenced by the ib (I am using my bof format as
> I already have userspace tools).
>
> Remove all the debugfs files. Just add a new one that gives you the first
> faulty ib.
> On read of this file the kernel frees the memory. The kernel should also
> free the memory after a while, or better, the lockup copy would only be
> enabled if some kernel radeon option is set.

I just resent my current patchset to the mailing list. It's not as complete
as your solution, but it seems to be a step in the right direction, so please
take a look at it.

Being able to generate something like a "GPU crash dump" on lockup sounds
very valuable to me, but I'm not sure debugfs files are the right direction
to go. Maybe something more like a module parameter containing a directory,
and if it is set we dump all available information (including bo content) in
binary form (instead of the current human readable form of the debugfs
files).

Anyway, the patchset I just sent solves the problem I'm currently looking
into, and I'm running a bit out of time (again), so I don't know if I can
complete that solution....

Cheers,
Christian.
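
For reference, the ring walk from the quoted to-do list (start at the last signaled fence, dump the ring up to each following fence, dump that fence's IB, finish at the WPTR) could be sketched roughly like this. It is a plain userspace illustration, not radeon code: struct sketch_fence, dump_ring_range() and sketch_dump_lockup() are made-up names, the fences are assumed to be stored in emission order, the ring size is assumed to be a power of two, the walk is assumed to start at offset 0 when no fence has signaled yet, and dumping the BO contents is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-fence bookkeeping; not the driver's real structure. */
struct sketch_fence {
    uint32_t rptr;      /* ring offset recorded when the fence was emitted */
    bool signaled;      /* true once the GPU has passed this fence */
    const uint32_t *ib; /* IB referenced by this fence, or NULL */
    size_t ib_len;      /* IB length in dwords */
};

/* Dump ring dwords in [from, to), wrapping at the end of the ring. */
static size_t dump_ring_range(FILE *out, const uint32_t *ring, uint32_t mask,
                              uint32_t from, uint32_t to)
{
    size_t n = 0;

    for (uint32_t p = from & mask; p != (to & mask); p = (p + 1) & mask) {
        fprintf(out, "ring[0x%04x] = 0x%08x\n", (unsigned)p, (unsigned)ring[p]);
        n++;
    }
    return n;
}

/* Walk the ring after a lockup; returns the number of ring dwords dumped. */
size_t sketch_dump_lockup(FILE *out, const uint32_t *ring, uint32_t ring_size,
                          uint32_t wptr, const struct sketch_fence *fences,
                          size_t nfences)
{
    uint32_t mask = ring_size - 1;
    uint32_t pos = 0;  /* assumption: start at 0 if nothing has signaled */
    size_t first = 0;  /* index of the first not signaled fence */
    size_t dumped = 0;

    assert(ring_size != 0 && (ring_size & mask) == 0);

    /* Skip the signaled fences, remembering the offset of the last one. */
    while (first < nfences && fences[first].signaled) {
        pos = fences[first].rptr & mask;
        first++;
    }

    for (size_t i = first; i < nfences; i++) {
        /* Ring content between the previous fence and this one. */
        dumped += dump_ring_range(out, ring, mask, pos, fences[i].rptr);
        pos = fences[i].rptr & mask;

        /* If the fence references an IB, dump its content too. */
        if (fences[i].ib != NULL) {
            for (size_t j = 0; j < fences[i].ib_len; j++)
                fprintf(out, "  ib[%zu] = 0x%08x\n", j,
                        (unsigned)fences[i].ib[j]);
        }
    }

    /* No further fence: dump the rest of the ring up to the WPTR. */
    dumped += dump_ring_range(out, ring, mask, pos, wptr);
    return dumped;
}
```

A caller would pass the ring contents, the ring size in dwords, the current WPTR and the fence array, for example sketch_dump_lockup(stderr, ring, 8, wptr, fences, nfences). Writing to a FILE * rather than syslog is only a convenience of the sketch; it loosely mirrors the idea of dumping into a configurable location instead of debugfs.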