From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?UTF-8?B?Q2hyaXN0aWFuIEvDtm5pZw==?= <deathsimple@vodafone.de>
Subject: Re: [PATCH] drm/radeon: signal all fences after lockup to avoid
	endless waiting in GEM_WAIT
Date: Mon, 14 Oct 2013 11:19:52 +0200
Message-ID: <525BB738.1070904@vodafone.de>
References: <gl3jv7su2g673eov9yul50yn.1380725459625@email.android.com>
	<CAAxE2A486bgeJsC9=7CgLteg_oOxNQUu9kQLLwehPXskxVu7qQ@mail.gmail.com>
	<5252961D.8010502@vodafone.de> <52543111.9000405@vodafone.de>
	<CAAxE2A776DS-BoCWwd7n8H8eLcTq2SSfOk_D9_OjOtLRTFQkGw@mail.gmail.com>
	<52553950.3020804@vodafone.de>
	<CAAxE2A6AmNJ_=qvODPRU=FJaLuef3qN-y_iZ5x1MW7Vyvdizjw@mail.gmail.com>
	<525A965A.1040504@vodafone.de>
	<CAAxE2A42biUwDXUj8Cn4xhLBTy919KRN7Yh_8TPuv6k2Jvo2Tg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------000500000407060108040301"
Return-path: <dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org>
Received: from smtp-03.vodafone.de (mxout.vodafone.de [80.84.1.40])
	by gabe.freedesktop.org (Postfix) with ESMTP id 4CC71E5F62
	for <dri-devel@lists.freedesktop.org>;
	Mon, 14 Oct 2013 02:19:56 -0700 (PDT)
In-Reply-To: <CAAxE2A42biUwDXUj8Cn4xhLBTy919KRN7Yh_8TPuv6k2Jvo2Tg@mail.gmail.com>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/dri-devel>,
	<mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/dri-devel>,
	<mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org
Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org
To: =?UTF-8?B?TWFyZWsgT2zFocOhaw==?= <maraeo@gmail.com>
Cc: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

This is a multi-part message in MIME format.
--------------000500000407060108040301
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Ok, that one was easy to fix. Please apply the attached patch as well.

Going to send out both for inclusion in 3.12 in a minute.

Christian.

Am 13.10.2013 22:16, schrieb Marek Ol=C5=A1=C3=A1k:
> This seems to be better. It can do about 3-5 resets correctly, then
> the GPU resuming fails:
>
> [  246.882780] [drm:cik_resume] *ERROR* cik startup failed on resume
>
> and then the GPU is being reset again and again endlessly without succe=
ss.
>
> The dmesg of the endless resets is attached.
>
> Marek
>
> On Sun, Oct 13, 2013 at 2:47 PM, Christian K=C3=B6nig
> <deathsimple@vodafone.de> wrote:
>> I've figured out what was wrong with the patch. We need to reset the
>> "needs_reset" flag earlier, otherwise the IB test might think we are i=
n a
>> lockup and aborts the reset after waiting for the minimum timeout peri=
od.
>>
>> Please try the attached patch instead.
>>
>> Thanks,
>> Christian.
>>
>> Am 09.10.2013 14:04, schrieb Marek Ol=C5=A1=C3=A1k:
>>
>>> The ring test of the first compute ring always fails and it shouldn't
>>> affect the GPU reset in any way.
>>>
>>> I can't tell if the deadlock issue is fixed, because the GPU reset
>>> usually fails with your patch. It always succeeded without your patch=
.
>>>
>>> Marek
>>>
>>> On Wed, Oct 9, 2013 at 1:09 PM, Christian K=C3=B6nig <deathsimple@vod=
afone.de>
>>> wrote:
>>>> Mhm, that doesn't looks like anything related but more like the rese=
t of
>>>> the
>>>> compute ring didn't worked.
>>>>
>>>> How often does that happen? And do you still get the problem where X
>>>> waits
>>>> for a fence that never comes back?
>>>>
>>>> Christian.
>>>>
>>>> Am 09.10.2013 12:36, schrieb Marek Ol=C5=A1=C3=A1k:
>>>>
>>>>> I'm afraid your patch sometimes causes the GPU reset to fail, which
>>>>> had never happened before IIRC.
>>>>>
>>>>> The dmesg log from the failure is attached.
>>>>>
>>>>> Marek
>>>>>
>>>>> On Tue, Oct 8, 2013 at 6:21 PM, Christian K=C3=B6nig
>>>>> <deathsimple@vodafone.de>
>>>>> wrote:
>>>>>> Hi Marek,
>>>>>>
>>>>>> please try the attached patch as a replacement for your signaling =
all
>>>>>> fences
>>>>>> patch. I'm not 100% sure if it fixes all issues, but it's at least=
 a
>>>>>> start.
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>> Am 07.10.2013 13:08, schrieb Christian K=C3=B6nig:
>>>>>>
>>>>>>>> First of all, I can't complain about the reliability of the hard=
ware
>>>>>>>> GPU reset. It's mostly the kernel driver that happens to run int=
o a
>>>>>>>> deadlock at the same time.
>>>>>>>
>>>>>>> Alex and I spend quite some time on making this reliable again af=
ter
>>>>>>> activating more rings and adding VM support. The main problem is =
that
>>>>>>> I
>>>>>>> couldn't figure out where the CPU deadlock comes from, cause I
>>>>>>> couldn't
>>>>>>> reliable reproduce the issue.
>>>>>>>
>>>>>>> What is the content of /proc/<pid of X server>/task/*/stack and
>>>>>>> sys/kernel/debug/dri/0/radeon_fence_info when the X server is stu=
ck in
>>>>>>> the
>>>>>>> deadlock situation?
>>>>>>>
>>>>>>> I'm pretty sure that we nearly always have a problem when two thr=
eads
>>>>>>> are
>>>>>>> waiting for fences on one of them detects that we have a lockup w=
hile
>>>>>>> the
>>>>>>> other one keeps holding the exclusive lock. Signaling all fences =
might
>>>>>>> work
>>>>>>> around that problem, but it probably would be better to fix the
>>>>>>> underlying
>>>>>>> issue.
>>>>>>>
>>>>>>> Going to take a deeper look into it.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 03.10.2013 02:45, schrieb Marek Ol=C5=A1=C3=A1k:
>>>>>>>> First of all, I can't complain about the reliability of the hard=
ware
>>>>>>>> GPU reset. It's mostly the kernel driver that happens to run int=
o a
>>>>>>>> deadlock at the same time.
>>>>>>>>
>>>>>>>> Regarding the issue with fences, the problem is that the GPU res=
et
>>>>>>>> completes successfully according to dmesg, but X doesn't respond=
. I
>>>>>>>> can move the cursor on the screen, but I can't do anything else =
and
>>>>>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I c=
an
>>>>>>>> easily reproduce this, because it's the most common reason why a=
 GPU
>>>>>>>> lockup leads to frozen X. The GPU actually recovers, but X is hu=
ng. I
>>>>>>>> can't tell whether the fences are just not signalled or whether =
there
>>>>>>>> is actually a real CPU deadlock I can't see.
>>>>>>>>
>>>>>>>> This patch makes the problem go away and GPU resets are successf=
ul
>>>>>>>> (except for extreme cases, see below). With a small enough locku=
p
>>>>>>>> timeout, the lockups are just a minor annoyance and I thought I =
could
>>>>>>>> get through a piglit run just with a few tens or hundreds of GPU
>>>>>>>> resets...
>>>>>>>>
>>>>>>>> A different type of deadlock showed up, though it needs a lot of
>>>>>>>> concurrently-running apps like piglit. What happened is that the
>>>>>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably=
 due
>>>>>>>> to a GPU hang while holding onto the exclusive lock, and another
>>>>>>>> thread wanting to do the GPU reset was unable to acquire the loc=
k.
>>>>>>>>
>>>>>>>> That said, I will use the patch locally, because it helps a lot.=
 I
>>>>>>>> got
>>>>>>>> a few lockups while writing this email and I'm glad I didn't hav=
e to
>>>>>>>> reboot.
>>>>>>>>
>>>>>>>> Marek
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian K=C3=B6nig
>>>>>>>> <deathsimple@vodafone.de>
>>>>>>>> wrote:
>>>>>>>>> Possible, but I would rather guess that this doesn't work becau=
se
>>>>>>>>> the
>>>>>>>>> IB
>>>>>>>>> test runs into a deadlock situation and so the GPU reset never =
fully
>>>>>>>>> completes.
>>>>>>>>>
>>>>>>>>> Can you reproduce the problem?
>>>>>>>>>
>>>>>>>>> If you want to make GPU resets more reliable I would rather sug=
gest
>>>>>>>>> to
>>>>>>>>> remove the ring lock dependency.
>>>>>>>>> Then we should try to give all the fence wait functions a (reli=
able)
>>>>>>>>> timeout and move reset handling a layer up into the ioctl funct=
ions.
>>>>>>>>> But for
>>>>>>>>> this you need to rip out the old PM code first.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Marek Ol=C5=A1=C3=A1k <maraeo@gmail.com> schrieb:
>>>>>>>>>
>>>>>>>>>> I'm afraid signalling the fences with an IB test is not reliab=
le.
>>>>>>>>>>
>>>>>>>>>> Marek
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian K=C3=B6nig
>>>>>>>>>> <deathsimple@vodafone.de> wrote:
>>>>>>>>>>> NAK, after recovering from a lockup the first thing we do is
>>>>>>>>>>> signalling all remaining fences with an IB test.
>>>>>>>>>>>
>>>>>>>>>>> If we don't recover we indeed signal all fences manually.
>>>>>>>>>>>
>>>>>>>>>>> Signalling all fences regardless of the outcome of the reset
>>>>>>>>>>> creates
>>>>>>>>>>> problems with both types of partial resets.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Marek Ol=C5=A1=C3=A1k <maraeo@gmail.com> schrieb:
>>>>>>>>>>>
>>>>>>>>>>>> From: Marek Ol=C5=A1=C3=A1k <marek.olsak@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> After a lockup, fences are not signalled sometimes, causing
>>>>>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes res=
ults
>>>>>>>>>>>> in an X server freeze.
>>>>>>>>>>>>
>>>>>>>>>>>> This fixes only one of many deadlocks which can occur during=
 a
>>>>>>>>>>>> lockup.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Marek Ol=C5=A1=C3=A1k <marek.olsak@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> index 841d0e0..7b97baa 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_de=
vice
>>>>>>>>>>>> *rdev)
>>>>>>>>>>>>           radeon_save_bios_scratch_regs(rdev);
>>>>>>>>>>>>           /* block TTM */
>>>>>>>>>>>>           resched =3D
>>>>>>>>>>>> ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>>>>>>>> +
>>>>>>>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>>>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>>>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>>>>>>>> +
>>>>>>>>>>>>           radeon_pm_suspend(rdev);
>>>>>>>>>>>>           radeon_suspend(rdev);
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> 1.8.1.2
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> dri-devel mailing list
>>>>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>


--------------000500000407060108040301
Content-Type: text/x-diff;
 name="0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch"
Content-Disposition: attachment;
 filename="0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch"
Content-Transfer-Encoding: quoted-printable

>>From 3b690a7b016dc63ef49ae7ac593c8f7f09f80d0d Mon Sep 17 00:00:00 2001
From: =3D?UTF-8?q?Christian=3D20K=3DC3=3DB6nig?=3D <christian.koenig@amd.=
com>
Date: Mon, 14 Oct 2013 11:11:28 +0200
Subject: [PATCH 1/2] drm/radeon: stop the leaks in cik_ib_test
MIME-Version: 1.0
Content-Type: text/plain; charset=3DUTF-8
Content-Transfer-Encoding: 8bit

Stop leaking IB memory and scratch register space when the test fails.

Signed-off-by: Christian K=C3=B6nig <christian.koenig@amd.com>
---
 drivers/gpu/drm/radeon/cik.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
index b874ccd..8f393df 100644
--- a/drivers/gpu/drm/radeon/cik.c
+++ b/drivers/gpu/drm/radeon/cik.c
@@ -3182,6 +3182,7 @@ int cik_ib_test(struct radeon_device *rdev, struct =
radeon_ring *ring)
 	r =3D radeon_ib_get(rdev, ring->idx, &ib, NULL, 256);
 	if (r) {
 		DRM_ERROR("radeon: failed to get ib (%d).\n", r);
+		radeon_scratch_free(rdev, scratch);
 		return r;
 	}
 	ib.ptr[0] =3D PACKET3(PACKET3_SET_UCONFIG_REG, 1);
@@ -3198,6 +3199,8 @@ int cik_ib_test(struct radeon_device *rdev, struct =
radeon_ring *ring)
 	r =3D radeon_fence_wait(ib.fence, false);
 	if (r) {
 		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
+		radeon_scratch_free(rdev, scratch);
+		radeon_ib_free(rdev, &ib);
 		return r;
 	}
 	for (i =3D 0; i < rdev->usec_timeout; i++) {
--=20
1.8.1.2


--------------000500000407060108040301
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

--------------000500000407060108040301--