Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

From: "Christian König" <deathsimple@vodafone.de>
To: "Marek Olšák" <maraeo@gmail.com>
Cc: dri-devel@lists.freedesktop.org
Subject: Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT
Date: Mon, 14 Oct 2013 11:19:52 +0200	[thread overview]
Message-ID: <525BB738.1070904@vodafone.de> (raw)
In-Reply-To: <CAAxE2A42biUwDXUj8Cn4xhLBTy919KRN7Yh_8TPuv6k2Jvo2Tg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8893 bytes --]

Ok, that one was easy to fix. Please apply the attached patch as well.

Going to send out both for inclusion in 3.12 in a minute.

Christian.

Am 13.10.2013 22:16, schrieb Marek Olšák:
> This seems to be better. It can do about 3-5 resets correctly, then
> the GPU resuming fails:
>
> [  246.882780] [drm:cik_resume] *ERROR* cik startup failed on resume
>
> and then the GPU is being reset again and again endlessly without success.
>
> The dmesg of the endless resets is attached.
>
> Marek
>
> On Sun, Oct 13, 2013 at 2:47 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> I've figured out what was wrong with the patch. We need to reset the
>> "needs_reset" flag earlier, otherwise the IB test might think we are in a
>> lockup and aborts the reset after waiting for the minimum timeout period.
>>
>> Please try the attached patch instead.
>>
>> Thanks,
>> Christian.
>>
>> Am 09.10.2013 14:04, schrieb Marek Olšák:
>>
>>> The ring test of the first compute ring always fails and it shouldn't
>>> affect the GPU reset in any way.
>>>
>>> I can't tell if the deadlock issue is fixed, because the GPU reset
>>> usually fails with your patch. It always succeeded without your patch.
>>>
>>> Marek
>>>
>>> On Wed, Oct 9, 2013 at 1:09 PM, Christian König <deathsimple@vodafone.de>
>>> wrote:
>>>> Mhm, that doesn't looks like anything related but more like the reset of
>>>> the
>>>> compute ring didn't worked.
>>>>
>>>> How often does that happen? And do you still get the problem where X
>>>> waits
>>>> for a fence that never comes back?
>>>>
>>>> Christian.
>>>>
>>>> Am 09.10.2013 12:36, schrieb Marek Olšák:
>>>>
>>>>> I'm afraid your patch sometimes causes the GPU reset to fail, which
>>>>> had never happened before IIRC.
>>>>>
>>>>> The dmesg log from the failure is attached.
>>>>>
>>>>> Marek
>>>>>
>>>>> On Tue, Oct 8, 2013 at 6:21 PM, Christian König
>>>>> <deathsimple@vodafone.de>
>>>>> wrote:
>>>>>> Hi Marek,
>>>>>>
>>>>>> please try the attached patch as a replacement for your signaling all
>>>>>> fences
>>>>>> patch. I'm not 100% sure if it fixes all issues, but it's at least a
>>>>>> start.
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>> Am 07.10.2013 13:08, schrieb Christian König:
>>>>>>
>>>>>>>> First of all, I can't complain about the reliability of the hardware
>>>>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>>>>> deadlock at the same time.
>>>>>>>
>>>>>>> Alex and I spend quite some time on making this reliable again after
>>>>>>> activating more rings and adding VM support. The main problem is that
>>>>>>> I
>>>>>>> couldn't figure out where the CPU deadlock comes from, cause I
>>>>>>> couldn't
>>>>>>> reliable reproduce the issue.
>>>>>>>
>>>>>>> What is the content of /proc/<pid of X server>/task/*/stack and
>>>>>>> sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in
>>>>>>> the
>>>>>>> deadlock situation?
>>>>>>>
>>>>>>> I'm pretty sure that we nearly always have a problem when two threads
>>>>>>> are
>>>>>>> waiting for fences on one of them detects that we have a lockup while
>>>>>>> the
>>>>>>> other one keeps holding the exclusive lock. Signaling all fences might
>>>>>>> work
>>>>>>> around that problem, but it probably would be better to fix the
>>>>>>> underlying
>>>>>>> issue.
>>>>>>>
>>>>>>> Going to take a deeper look into it.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 03.10.2013 02:45, schrieb Marek Olšák:
>>>>>>>> First of all, I can't complain about the reliability of the hardware
>>>>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>>>>> deadlock at the same time.
>>>>>>>>
>>>>>>>> Regarding the issue with fences, the problem is that the GPU reset
>>>>>>>> completes successfully according to dmesg, but X doesn't respond. I
>>>>>>>> can move the cursor on the screen, but I can't do anything else and
>>>>>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
>>>>>>>> easily reproduce this, because it's the most common reason why a GPU
>>>>>>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I
>>>>>>>> can't tell whether the fences are just not signalled or whether there
>>>>>>>> is actually a real CPU deadlock I can't see.
>>>>>>>>
>>>>>>>> This patch makes the problem go away and GPU resets are successful
>>>>>>>> (except for extreme cases, see below). With a small enough lockup
>>>>>>>> timeout, the lockups are just a minor annoyance and I thought I could
>>>>>>>> get through a piglit run just with a few tens or hundreds of GPU
>>>>>>>> resets...
>>>>>>>>
>>>>>>>> A different type of deadlock showed up, though it needs a lot of
>>>>>>>> concurrently-running apps like piglit. What happened is that the
>>>>>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
>>>>>>>> to a GPU hang while holding onto the exclusive lock, and another
>>>>>>>> thread wanting to do the GPU reset was unable to acquire the lock.
>>>>>>>>
>>>>>>>> That said, I will use the patch locally, because it helps a lot. I
>>>>>>>> got
>>>>>>>> a few lockups while writing this email and I'm glad I didn't have to
>>>>>>>> reboot.
>>>>>>>>
>>>>>>>> Marek
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König
>>>>>>>> <deathsimple@vodafone.de>
>>>>>>>> wrote:
>>>>>>>>> Possible, but I would rather guess that this doesn't work because
>>>>>>>>> the
>>>>>>>>> IB
>>>>>>>>> test runs into a deadlock situation and so the GPU reset never fully
>>>>>>>>> completes.
>>>>>>>>>
>>>>>>>>> Can you reproduce the problem?
>>>>>>>>>
>>>>>>>>> If you want to make GPU resets more reliable I would rather suggest
>>>>>>>>> to
>>>>>>>>> remove the ring lock dependency.
>>>>>>>>> Then we should try to give all the fence wait functions a (reliable)
>>>>>>>>> timeout and move reset handling a layer up into the ioctl functions.
>>>>>>>>> But for
>>>>>>>>> this you need to rip out the old PM code first.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Marek Olšák <maraeo@gmail.com> schrieb:
>>>>>>>>>
>>>>>>>>>> I'm afraid signalling the fences with an IB test is not reliable.
>>>>>>>>>>
>>>>>>>>>> Marek
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König
>>>>>>>>>> <deathsimple@vodafone.de> wrote:
>>>>>>>>>>> NAK, after recovering from a lockup the first thing we do is
>>>>>>>>>>> signalling all remaining fences with an IB test.
>>>>>>>>>>>
>>>>>>>>>>> If we don't recover we indeed signal all fences manually.
>>>>>>>>>>>
>>>>>>>>>>> Signalling all fences regardless of the outcome of the reset
>>>>>>>>>>> creates
>>>>>>>>>>> problems with both types of partial resets.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> Marek Olšák <maraeo@gmail.com> schrieb:
>>>>>>>>>>>
>>>>>>>>>>>> From: Marek Olšák <marek.olsak@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> After a lockup, fences are not signalled sometimes, causing
>>>>>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>>>>>>>>> in an X server freeze.
>>>>>>>>>>>>
>>>>>>>>>>>> This fixes only one of many deadlocks which can occur during a
>>>>>>>>>>>> lockup.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Marek Olšák <marek.olsak@amd.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> index 841d0e0..7b97baa 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device
>>>>>>>>>>>> *rdev)
>>>>>>>>>>>>           radeon_save_bios_scratch_regs(rdev);
>>>>>>>>>>>>           /* block TTM */
>>>>>>>>>>>>           resched =
>>>>>>>>>>>> ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>>>>>>>> +
>>>>>>>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>>>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>>>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>>>>>>>> +
>>>>>>>>>>>>           radeon_pm_suspend(rdev);
>>>>>>>>>>>>           radeon_suspend(rdev);
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> 1.8.1.2
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> dri-devel mailing list
>>>>>>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch --]
[-- Type: text/x-diff; name="0001-drm-radeon-stop-the-leaks-in-cik_ib_test.patch", Size: 1411 bytes --]

>From 3b690a7b016dc63ef49ae7ac593c8f7f09f80d0d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Mon, 14 Oct 2013 11:11:28 +0200
Subject: [PATCH 1/2] drm/radeon: stop the leaks in cik_ib_test
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stop leaking IB memory and scratch register space when the test fails.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/radeon/cik.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
index b874ccd..8f393df 100644
--- a/drivers/gpu/drm/radeon/cik.c
+++ b/drivers/gpu/drm/radeon/cik.c
@@ -3182,6 +3182,7 @@ int cik_ib_test(struct radeon_device *rdev, struct radeon_ring *ring)
 	r = radeon_ib_get(rdev, ring->idx, &ib, NULL, 256);
 	if (r) {
 		DRM_ERROR("radeon: failed to get ib (%d).\n", r);
+		radeon_scratch_free(rdev, scratch);
 		return r;
 	}
 	ib.ptr[0] = PACKET3(PACKET3_SET_UCONFIG_REG, 1);
@@ -3198,6 +3199,8 @@ int cik_ib_test(struct radeon_device *rdev, struct radeon_ring *ring)
 	r = radeon_fence_wait(ib.fence, false);
 	if (r) {
 		DRM_ERROR("radeon: fence wait failed (%d).\n", r);
+		radeon_scratch_free(rdev, scratch);
+		radeon_ib_free(rdev, &ib);
 		return r;
 	}
 	for (i = 0; i < rdev->usec_timeout; i++) {
-- 
1.8.1.2


[-- Attachment #3: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel