From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?UTF-8?B?Q2hyaXN0aWFuIEvDtm5pZw==?= Subject: Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2 Date: Mon, 4 Aug 2014 16:37:56 +0200 Message-ID: <53DF9AC4.3010700@amd.com> References: <20140731153245.15061.63023.stgit@patser> <20140731153342.15061.54264.stgit@patser> <53DBC1EC.1010001@amd.com> <53DBD269.80807@canonical.com> <53DF462B.2060102@amd.com> <53DF4A7D.3040505@canonical.com> <53DF7516.2010408@amd.com> <53DF8BF2.4000309@canonical.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8"; Format="flowed" Content-Transfer-Encoding: base64 Return-path: In-Reply-To: <53DF8BF2.4000309@canonical.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" To: Maarten Lankhorst , airlied@linux.ie Cc: thellstrom@vmware.com, nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, bskeggs@redhat.com, alexander.deucher@amd.com List-Id: nouveau.vger.kernel.org PiBJdCdhIHBhaW4gdG8gZGVhbCB3aXRoIGdwdSByZXNldC4KClllYWgsIHdlbGwgdGhhdCdzIG5v dGhpbmcgbmV3LgoKPiBJJ3ZlIG5vdyB0cmllZCBvdGhlciBzb2x1dGlvbnMgYnV0IHRoYXQgd291 bGQgbWVhbiByZXZlcnRpbmcgdG8gdGhlIG9sZCBzdHlsZSBkdXJpbmcgZ3B1IGxvY2t1cCByZWNv dmVyeSwgYW5kIG9ubHkgcnVubmluZyB0aGUgZGVsYXllZCB3b3JrIHdoZW4gIWxvY2t1cC4KPiBC dXQgdGhpcyBtZWFudCB0aGF0IHRoZSB0aW1lb3V0IHdhcyB1c2VsZXNzIHRvIGFkZC4gSSB0aGlu ayB0aGUgY2xlYW5lc3QgaXMga2VlcGluZyB0aGUgdjIgcGF0Y2gsIGJlY2F1c2UgcG90ZW50aWFs bHkgYW55IHdhaXRpbmcgY29kZSBjYW4gYmUgY2FsbGVkIGR1cmluZyBsb2NrdXAgcmVjb3Zlcnku CgpUaGUgbG9ja3VwIGNvZGUgaXRzZWxmIHNob3VsZCBuZXZlciBjYWxsIGFueSB3YWl0aW5nIGNv ZGUgYW5kIFYyIGRvZXNuJ3QgCnNlZW0gdG8gaGFuZGxlIGEgY291cGxlIG9mIGNhc2VzIGNvcnJl Y3RseSBlaXRoZXIuCgpIb3cgYWJvdXQgbW92aW5nIHRoZSBmZW5jZSB3YWl0aW5nIG91dCBvZiB0 aGUgcmVzZXQgY29kZT8KCkNocmlzdGlhbi4KCkFtIDA0LjA4LjIwMTQgdW0gMTU6MzQgc2Nocmll YiBNYWFydGVuIExhbmtob3JzdDoKPiBIZXksCj4KPiBvcCAwNC0wOC0xNCAxMzo1NywgQ2hyaXN0 aWFuIEvDtm5pZyBzY2hyZWVmOgo+PiBBbSAwNC4wOC4yMDE0IHVtIDEwOjU1IHNjaHJpZWIgTWFh cnRlbiBMYW5raG9yc3Q6Cj4+PiBvcCAwNC0wOC0xNCAxMDozNiwgQ2hyaXN0aWFuIEvDtm5pZyBz Y2hyZWVmOgo+Pj4+IEhpIE1hYXJ0ZW4sCj4+Pj4KPj4+PiBTb3JyeSBmb3IgdGhlIGRlbGF5LiBJ J3ZlIGdvdCB3YXkgdG8gbXVjaCB0b2RvIHJlY2VudGx5Lgo+Pj4+Cj4+Pj4gQW0gMDEuMDguMjAx NCB1bSAxOTo0NiBzY2hyaWViIE1hYXJ0ZW4gTGFua2hvcnN0Ogo+Pj4+PiBPbiAwMS0wOC0xNCAx ODozNSwgQ2hyaXN0aWFuIEvDtm5pZyB3cm90ZToKPj4+Pj4+IEFtIDMxLjA3LjIwMTQgdW0gMTc6 MzMgc2NocmllYiBNYWFydGVuIExhbmtob3JzdDoKPj4+Pj4+PiBTaWduZWQtb2ZmLWJ5OiBNYWFy dGVuIExhbmtob3JzdCA8bWFhcnRlbi5sYW5raG9yc3RAY2Fub25pY2FsLmNvbT4KPj4+Pj4+PiAt LS0KPj4+Pj4+PiBWMSBoYWQgYSBuYXN0eSBidWcgYnJlYWtpbmcgZ3B1IGxvY2t1cCByZWNvdmVy eS4gVGhlIGZpeCBpcyBub3QKPj4+Pj4+PiBhbGxvd2luZyByYWRlb25fZmVuY2VfZHJpdmVyX2No ZWNrX2xvY2t1cCB0byB0YWtlIGV4Y2x1c2l2ZV9sb2NrLAo+Pj4+Pj4+IGFuZCBraWxsIGl0IGR1 cmluZyBsb2NrdXAgcmVjb3ZlcnkgaW5zdGVhZC4KPj4+Pj4+IFRoYXQgbG9va3MgbGlrZSB0aGUg ZGVsYXllZCB3b3JrIHN0YXJ0cyBydW5uaW5nIGFzIHNvb24gYXMgd2Ugc3VibWl0IGEgZmVuY2Us IGFuZCBub3Qgd2hlbiBpdCdzIG5lZWRlZCBmb3Igd2FpdGluZy4KPj4+Pj4+Cj4+Pj4+PiBTaW5j ZSBpdCdzIGEgYmFja3VwIGZvciBmYWlsaW5nIElSUXMgSSB3b3VsZCByYXRoZXIgcHV0IGl0IGlu dG8gcmFkZW9uX2lycV9rbXMuYyBhbmQgc3RhcnQvc3RvcCBpdCB3aGVuIHRoZSBJUlFzIGFyZSBz dGFydGVkL3N0b3BlZC4KPj4+Pj4gVGhlIGRlbGF5ZWQgd29yayBpcyBub3QganVzdCBmb3IgZmFp bGluZyBpcnEncywgaXQncyBhbHNvIHRoZSBoYW5kbGVyIHRoYXQncyB1c2VkIHRvIGRldGVjdCBs b2NrdXBzLCB3aGljaCBpcyB3aHkgSSB0cmlnZ2VyIGFmdGVyIHByb2Nlc3NpbmcgZmVuY2VzLCBh bmQgcmVzZXQgdGhlIHRpbWVyIGFmdGVyIHByb2Nlc3NpbmcuCj4+Pj4gVGhlIGlkZWEgd2FzIHR1 cm5pbmcgdGhlIGRlbGF5ZWQgd29yayBvbiBhbmQgb2ZmIHdoZW4gd2UgdHVybiB0aGUgaXJxIG9u IGFuZCBvZmYgYXMgd2VsbCwgcHJvY2Vzc2luZyBvZiB0aGUgZGVsYXllZCB3b3JrIGhhbmRsZXIg Y2FuIHN0aWxsIGhhcHBlbiBpbiByYWRlb25fZmVuY2UuYwo+Pj4+Cj4+Pj4+IFNwZWNpZmljYWxs eSB3aGF0IGhhcHBlbmVkIHdhcyB0aGlzIHNjZW5hcmlvOgo+Pj4+Pgo+Pj4+PiAtIGxvY2sgdXAg b2NjdXJzCj4+Pj4+IC0gd3JpdGUgbG9jayB0YWtlbiBieSBncHVfcmVzZXQKPj4+Pj4gLSBkZWxh eWVkIHdvcmsgcnVucywgdHJpZXMgdG8gYWNxdWlyZSByZWFkIGxvY2ssIGJsb2Nrcwo+Pj4+PiAt IGdwdV9yZXNldCB0cmllcyB0byBjYW5jZWwgZGVsYXllZCB3b3JrIHN5bmNocm9ub3VzbHkKPj4+ Pj4gLSBoYXMgdG8gd2FpdCBmb3IgZGVsYXllZCB3b3JrIHRvIGZpbmlzaCAtPiBkZWFkbG9jawo+ Pj4+IFdoeSBkbyB5b3Ugd2FudCB0byBkaXNhYmxlIHRoZSB3b3JrIGl0ZW0gZnJvbSB0aGUgbG9j a3VwIGhhbmRsZXIgaW4gdGhlIGZpcnN0IHBsYWNlPwo+Pj4+Cj4+Pj4gSnVzdCB0YWtlIHRoZSBl eGNsdXNpdmUgbG9jayBpbiB0aGUgd29yayBpdGVtLCB3aGVuIGl0IGNvbmN1cnJlbnRseSBydW5z IHdpdGggdGhlIGxvY2t1cCBoYW5kbGVyIGl0IHdpbGwganVzdCBibG9jayBmb3IgdGhlIGxvY2t1 cCBoYW5kbGVyIHRvIGNvbXBsZXRlLgo+Pj4gV2l0aCB0aGUgZGVsYXllZCB3b3JrIHJhZGVvbl9m ZW5jZV93YWl0IG5vIGxvbmdlciBoYW5kbGVzIHVucmVsaWFibGUgaW50ZXJydXB0cyBpdHNlbGYs IHNvIGl0IGhhcyB0byBydW4gZnJvbSB0aGUgbG9ja3VwIGhhbmRsZXIuCj4+PiBCdXQgYW4gYWx0 ZXJuYXRpdmUgc29sdXRpb24gY291bGQgYmUgYWRkaW5nIGEgcmFkZW9uX2ZlbmNlX3dhaXRfdGlt ZW91dCwgaWdub3JlIHRoZSB0aW1lb3V0IGFuZCBjaGVjayBpZiBmZW5jZSBpcyBzaWduYWxlZCBv biB0aW1lb3V0Lgo+Pj4gVGhpcyB3b3VsZCBwcm9iYWJseSBiZSBhIGNsZWFuZXIgc29sdXRpb24u Cj4+IFllYWgsIGFncmVlLiBNYW51YWxseSBzcGVjaWZ5aW5nIGEgdGltZW91dCBpbiB0aGUgZmVu Y2Ugd2FpdCBvbiBsb2NrdXAgaGFuZGxpbmcgc291bmRzIGxpa2UgdGhlIGJlc3QgYWx0ZXJuYXRp dmUgdG8gbWUuCj4gSXQnYSBwYWluIHRvIGRlYWwgd2l0aCBncHUgcmVzZXQuCj4KPiBJJ3ZlIG5v dyB0cmllZCBvdGhlciBzb2x1dGlvbnMgYnV0IHRoYXQgd291bGQgbWVhbiByZXZlcnRpbmcgdG8g dGhlIG9sZCBzdHlsZSBkdXJpbmcgZ3B1IGxvY2t1cCByZWNvdmVyeSwgYW5kIG9ubHkgcnVubmlu ZyB0aGUgZGVsYXllZCB3b3JrIHdoZW4gIWxvY2t1cC4KPiBCdXQgdGhpcyBtZWFudCB0aGF0IHRo ZSB0aW1lb3V0IHdhcyB1c2VsZXNzIHRvIGFkZC4gSSB0aGluayB0aGUgY2xlYW5lc3QgaXMga2Vl cGluZyB0aGUgdjIgcGF0Y2gsIGJlY2F1c2UgcG90ZW50aWFsbHkgYW55IHdhaXRpbmcgY29kZSBj YW4gYmUgY2FsbGVkIGR1cmluZyBsb2NrdXAgcmVjb3ZlcnkuCj4KPiB+TWFhcnRlbgo+CgpfX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fXwpkcmktZGV2ZWwgbWFp bGluZyBsaXN0CmRyaS1kZXZlbEBsaXN0cy5mcmVlZGVza3RvcC5vcmcKaHR0cDovL2xpc3RzLmZy ZWVkZXNrdG9wLm9yZy9tYWlsbWFuL2xpc3RpbmZvL2RyaS1kZXZlbAo= From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751709AbaHDOiK (ORCPT ); Mon, 4 Aug 2014 10:38:10 -0400 Received: from mail-bn1lp0139.outbound.protection.outlook.com ([207.46.163.139]:55549 "EHLO na01-bn1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1750919AbaHDOiH convert rfc822-to-8bit (ORCPT ); Mon, 4 Aug 2014 10:38:07 -0400 X-WSS-ID: 0N9SDZ8-08-OJ4-02 X-M-MSG: Message-ID: <53DF9AC4.3010700@amd.com> Date: Mon, 4 Aug 2014 16:37:56 +0200 From: =?UTF-8?B?Q2hyaXN0aWFuIEvDtm5pZw==?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0 MIME-Version: 1.0 To: Maarten Lankhorst , CC: , , , , , Subject: Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2 References: <20140731153245.15061.63023.stgit@patser> <20140731153342.15061.54264.stgit@patser> <53DBC1EC.1010001@amd.com> <53DBD269.80807@canonical.com> <53DF462B.2060102@amd.com> <53DF4A7D.3040505@canonical.com> <53DF7516.2010408@amd.com> <53DF8BF2.4000309@canonical.com> In-Reply-To: <53DF8BF2.4000309@canonical.com> Content-Type: text/plain; charset="utf-8"; format=flowed X-Originating-IP: [10.224.152.188] Content-Transfer-Encoding: 8BIT X-EOPAttributedMessage: 0 X-Forefront-Antispam-Report: CIP:165.204.84.222;CTRY:US;IPV:NLI;IPV:NLI;EFV:NLI;SFV:NSPM;SFS:(6009001)(428002)(24454002)(377424004)(199002)(189002)(51704005)(19580395003)(83322001)(19580405001)(44976005)(80316001)(65956001)(80022001)(65806001)(23676002)(76482001)(97736001)(77982001)(81342001)(4396001)(86362001)(50466002)(74662001)(64126003)(64706001)(20776003)(47776003)(74502001)(81542001)(31966008)(36756003)(46102001)(106466001)(68736004)(85202003)(92566001)(95666004)(101416001)(85182001)(87936001)(54356999)(85852003)(102836001)(76176999)(83072002)(105586002)(83506001)(50986999)(33656002)(84676001)(85306004)(92726001)(99396002)(21056001)(79102001)(65816999)(93886004)(87266999)(107046002);DIR:OUT;SFP:;SCL:1;SRVR:BY2PR02MB042;H:atltwp02.amd.com;FPR:;MLV:sfv;PTR:InfoDomainNonexistent;MX:1;LANG:en; X-Microsoft-Antispam: BCL:0;PCL:0;RULEID: X-Forefront-PRVS: 0293D40691 Authentication-Results: spf=none (sender IP is 165.204.84.222) smtp.mailfrom=Christian.Koenig@amd.com; X-OriginatorOrg: amd4.onmicrosoft.com Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > It'a pain to deal with gpu reset. Yeah, well that's nothing new. > I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup. > But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery. The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either. How about moving the fence waiting out of the reset code? Christian. Am 04.08.2014 um 15:34 schrieb Maarten Lankhorst: > Hey, > > op 04-08-14 13:57, Christian König schreef: >> Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst: >>> op 04-08-14 10:36, Christian König schreef: >>>> Hi Maarten, >>>> >>>> Sorry for the delay. I've got way to much todo recently. >>>> >>>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst: >>>>> On 01-08-14 18:35, Christian König wrote: >>>>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst: >>>>>>> Signed-off-by: Maarten Lankhorst >>>>>>> --- >>>>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not >>>>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock, >>>>>>> and kill it during lockup recovery instead. >>>>>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting. >>>>>> >>>>>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped. >>>>> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing. >>>> The idea was turning the delayed work on and off when we turn the irq on and off as well, processing of the delayed work handler can still happen in radeon_fence.c >>>> >>>>> Specifically what happened was this scenario: >>>>> >>>>> - lock up occurs >>>>> - write lock taken by gpu_reset >>>>> - delayed work runs, tries to acquire read lock, blocks >>>>> - gpu_reset tries to cancel delayed work synchronously >>>>> - has to wait for delayed work to finish -> deadlock >>>> Why do you want to disable the work item from the lockup handler in the first place? >>>> >>>> Just take the exclusive lock in the work item, when it concurrently runs with the lockup handler it will just block for the lockup handler to complete. >>> With the delayed work radeon_fence_wait no longer handles unreliable interrupts itself, so it has to run from the lockup handler. >>> But an alternative solution could be adding a radeon_fence_wait_timeout, ignore the timeout and check if fence is signaled on timeout. >>> This would probably be a cleaner solution. >> Yeah, agree. Manually specifying a timeout in the fence wait on lockup handling sounds like the best alternative to me. > It'a pain to deal with gpu reset. > > I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup. > But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery. > > ~Maarten >