From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?UTF-8?B?Q2hyaXN0aWFuIEvDtm5pZw==?= Subject: Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT Date: Mon, 14 Oct 2013 11:19:52 +0200 Message-ID: <525BB738.1070904@vodafone.de> References: <5252961D.8010502@vodafone.de> <52543111.9000405@vodafone.de> <52553950.3020804@vodafone.de> <525A965A.1040504@vodafone.de> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="------------000500000407060108040301" Return-path: Received: from smtp-03.vodafone.de (mxout.vodafone.de [80.84.1.40]) by gabe.freedesktop.org (Postfix) with ESMTP id 4CC71E5F62 for ; Mon, 14 Oct 2013 02:19:56 -0700 (PDT) In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org To: =?UTF-8?B?TWFyZWsgT2zFocOhaw==?= Cc: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org This is a multi-part message in MIME format. --------------000500000407060108040301 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Ok, that one was easy to fix. Please apply the attached patch as well. Going to send out both for inclusion in 3.12 in a minute. Christian. Am 13.10.2013 22:16, schrieb Marek Ol=C5=A1=C3=A1k: > This seems to be better. It can do about 3-5 resets correctly, then > the GPU resuming fails: > > [ 246.882780] [drm:cik_resume] *ERROR* cik startup failed on resume > > and then the GPU is being reset again and again endlessly without succe= ss. > > The dmesg of the endless resets is attached. > > Marek > > On Sun, Oct 13, 2013 at 2:47 PM, Christian K=C3=B6nig > wrote: >> I've figured out what was wrong with the patch. We need to reset the >> "needs_reset" flag earlier, otherwise the IB test might think we are i= n a >> lockup and aborts the reset after waiting for the minimum timeout peri= od. >> >> Please try the attached patch instead. >> >> Thanks, >> Christian. >> >> Am 09.10.2013 14:04, schrieb Marek Ol=C5=A1=C3=A1k: >> >>> The ring test of the first compute ring always fails and it shouldn't >>> affect the GPU reset in any way. >>> >>> I can't tell if the deadlock issue is fixed, because the GPU reset >>> usually fails with your patch. It always succeeded without your patch= . >>> >>> Marek >>> >>> On Wed, Oct 9, 2013 at 1:09 PM, Christian K=C3=B6nig >>> wrote: >>>> Mhm, that doesn't looks like anything related but more like the rese= t of >>>> the >>>> compute ring didn't worked. >>>> >>>> How often does that happen? And do you still get the problem where X >>>> waits >>>> for a fence that never comes back? >>>> >>>> Christian. >>>> >>>> Am 09.10.2013 12:36, schrieb Marek Ol=C5=A1=C3=A1k: >>>> >>>>> I'm afraid your patch sometimes causes the GPU reset to fail, which >>>>> had never happened before IIRC. >>>>> >>>>> The dmesg log from the failure is attached. >>>>> >>>>> Marek >>>>> >>>>> On Tue, Oct 8, 2013 at 6:21 PM, Christian K=C3=B6nig >>>>> >>>>> wrote: >>>>>> Hi Marek, >>>>>> >>>>>> please try the attached patch as a replacement for your signaling = all >>>>>> fences >>>>>> patch. I'm not 100% sure if it fixes all issues, but it's at least= a >>>>>> start. >>>>>> >>>>>> Thanks, >>>>>> Christian. >>>>>> >>>>>> Am 07.10.2013 13:08, schrieb Christian K=C3=B6nig: >>>>>> >>>>>>>> First of all, I can't complain about the reliability of the hard= ware >>>>>>>> GPU reset. It's mostly the kernel driver that happens to run int= o a >>>>>>>> deadlock at the same time. >>>>>>> >>>>>>> Alex and I spend quite some time on making this reliable again af= ter >>>>>>> activating more rings and adding VM support. The main problem is = that >>>>>>> I >>>>>>> couldn't figure out where the CPU deadlock comes from, cause I >>>>>>> couldn't >>>>>>> reliable reproduce the issue. >>>>>>> >>>>>>> What is the content of /proc//task/*/stack and >>>>>>> sys/kernel/debug/dri/0/radeon_fence_info when the X server is stu= ck in >>>>>>> the >>>>>>> deadlock situation? >>>>>>> >>>>>>> I'm pretty sure that we nearly always have a problem when two thr= eads >>>>>>> are >>>>>>> waiting for fences on one of them detects that we have a lockup w= hile >>>>>>> the >>>>>>> other one keeps holding the exclusive lock. Signaling all fences = might >>>>>>> work >>>>>>> around that problem, but it probably would be better to fix the >>>>>>> underlying >>>>>>> issue. >>>>>>> >>>>>>> Going to take a deeper look into it. >>>>>>> >>>>>>> Christian. >>>>>>> >>>>>>> Am 03.10.2013 02:45, schrieb Marek Ol=C5=A1=C3=A1k: >>>>>>>> First of all, I can't complain about the reliability of the hard= ware >>>>>>>> GPU reset. It's mostly the kernel driver that happens to run int= o a >>>>>>>> deadlock at the same time. >>>>>>>> >>>>>>>> Regarding the issue with fences, the problem is that the GPU res= et >>>>>>>> completes successfully according to dmesg, but X doesn't respond= . I >>>>>>>> can move the cursor on the screen, but I can't do anything else = and >>>>>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I c= an >>>>>>>> easily reproduce this, because it's the most common reason why a= GPU >>>>>>>> lockup leads to frozen X. The GPU actually recovers, but X is hu= ng. I >>>>>>>> can't tell whether the fences are just not signalled or whether = there >>>>>>>> is actually a real CPU deadlock I can't see. >>>>>>>> >>>>>>>> This patch makes the problem go away and GPU resets are successf= ul >>>>>>>> (except for extreme cases, see below). With a small enough locku= p >>>>>>>> timeout, the lockups are just a minor annoyance and I thought I = could >>>>>>>> get through a piglit run just with a few tens or hundreds of GPU >>>>>>>> resets... >>>>>>>> >>>>>>>> A different type of deadlock showed up, though it needs a lot of >>>>>>>> concurrently-running apps like piglit. What happened is that the >>>>>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably= due >>>>>>>> to a GPU hang while holding onto the exclusive lock, and another >>>>>>>> thread wanting to do the GPU reset was unable to acquire the loc= k. >>>>>>>> >>>>>>>> That said, I will use the patch locally, because it helps a lot.= I >>>>>>>> got >>>>>>>> a few lockups while writing this email and I'm glad I didn't hav= e to >>>>>>>> reboot. >>>>>>>> >>>>>>>> Marek >>>>>>>> >>>>>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian K=C3=B6nig >>>>>>>> >>>>>>>> wrote: >>>>>>>>> Possible, but I would rather guess that this doesn't work becau= se >>>>>>>>> the >>>>>>>>> IB >>>>>>>>> test runs into a deadlock situation and so the GPU reset never = fully >>>>>>>>> completes. >>>>>>>>> >>>>>>>>> Can you reproduce the problem? >>>>>>>>> >>>>>>>>> If you want to make GPU resets more reliable I would rather sug= gest >>>>>>>>> to >>>>>>>>> remove the ring lock dependency. >>>>>>>>> Then we should try to give all the fence wait functions a (reli= able) >>>>>>>>> timeout and move reset handling a layer up into the ioctl funct= ions. >>>>>>>>> But for >>>>>>>>> this you need to rip out the old PM code first. >>>>>>>>> >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>> Marek Ol=C5=A1=C3=A1k schrieb: >>>>>>>>> >>>>>>>>>> I'm afraid signalling the fences with an IB test is not reliab= le. >>>>>>>>>> >>>>>>>>>> Marek >>>>>>>>>> >>>>>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian K=C3=B6nig >>>>>>>>>> wrote: >>>>>>>>>>> NAK, after recovering from a lockup the first thing we do is >>>>>>>>>>> signalling all remaining fences with an IB test. >>>>>>>>>>> >>>>>>>>>>> If we don't recover we indeed signal all fences manually. >>>>>>>>>>> >>>>>>>>>>> Signalling all fences regardless of the outcome of the reset >>>>>>>>>>> creates >>>>>>>>>>> problems with both types of partial resets. >>>>>>>>>>> >>>>>>>>>>> Christian. >>>>>>>>>>> >>>>>>>>>>> Marek Ol=C5=A1=C3=A1k schrieb: >>>>>>>>>>> >>>>>>>>>>>> From: Marek Ol=C5=A1=C3=A1k >>>>>>>>>>>> >>>>>>>>>>>> After a lockup, fences are not signalled sometimes, causing >>>>>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes res= ults >>>>>>>>>>>> in an X server freeze. >>>>>>>>>>>> >>>>>>>>>>>> This fixes only one of many deadlocks which can occur during= a >>>>>>>>>>>> lockup. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Marek Ol=C5=A1=C3=A1k >>>>>>>>>>>> --- >>>>>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>>>>>>>>>> 1 file changed, 5 insertions(+) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>>>>> index 841d0e0..7b97baa 100644 >>>>>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_de= vice >>>>>>>>>>>> *rdev) >>>>>>>>>>>> radeon_save_bios_scratch_regs(rdev); >>>>>>>>>>>> /* block TTM */ >>>>>>>>>>>> resched =3D >>>>>>>>>>>> ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev); >>>>>>>>>>>> + >>>>>>>>>>>> + mutex_lock(&rdev->ring_lock); >>>>>>>>>>>> + radeon_fence_driver_force_completion(rdev); >>>>>>>>>>>> + mutex_unlock(&rdev->ring_lock); >>>>>>>>>>>> + >>>>>>>>>>>> radeon_pm_suspend(rdev); >>>>>>>>>>>> radeon_suspend(rdev); >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> 1.8.1.2 >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> dri-devel mailing list >>>>>>>>>>>> dri-devel@lists.freedesktop.org >>>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>>>>>> >>>>>>> _______________________________________________ >>>>>>> dri-devel mailing list >>>>>>> dri-devel@lists.freedesktop.org >>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>>>>> --------------000500000407060108040301 Content-Type: text/x-diff; name="0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch" Content-Disposition: attachment; filename="0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch" Content-Transfer-Encoding: quoted-printable >>From 3b690a7b016dc63ef49ae7ac593c8f7f09f80d0d Mon Sep 17 00:00:00 2001 From: =3D?UTF-8?q?Christian=3D20K=3DC3=3DB6nig?=3D Date: Mon, 14 Oct 2013 11:11:28 +0200 Subject: [PATCH 1/2] drm/radeon: stop the leaks in cik_ib_test MIME-Version: 1.0 Content-Type: text/plain; charset=3DUTF-8 Content-Transfer-Encoding: 8bit Stop leaking IB memory and scratch register space when the test fails. Signed-off-by: Christian K=C3=B6nig --- drivers/gpu/drm/radeon/cik.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c index b874ccd..8f393df 100644 --- a/drivers/gpu/drm/radeon/cik.c +++ b/drivers/gpu/drm/radeon/cik.c @@ -3182,6 +3182,7 @@ int cik_ib_test(struct radeon_device *rdev, struct = radeon_ring *ring) r =3D radeon_ib_get(rdev, ring->idx, &ib, NULL, 256); if (r) { DRM_ERROR("radeon: failed to get ib (%d).\n", r); + radeon_scratch_free(rdev, scratch); return r; } ib.ptr[0] =3D PACKET3(PACKET3_SET_UCONFIG_REG, 1); @@ -3198,6 +3199,8 @@ int cik_ib_test(struct radeon_device *rdev, struct = radeon_ring *ring) r =3D radeon_fence_wait(ib.fence, false); if (r) { DRM_ERROR("radeon: fence wait failed (%d).\n", r); + radeon_scratch_free(rdev, scratch); + radeon_ib_free(rdev, &ib); return r; } for (i =3D 0; i < rdev->usec_timeout; i++) { --=20 1.8.1.2 --------------000500000407060108040301 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel --------------000500000407060108040301--