From mboxrd@z Thu Jan 1 00:00:00 1970 From: =?windows-1252?Q?Christian_K=F6nig?= Subject: Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT Date: Wed, 09 Oct 2013 13:09:04 +0200 Message-ID: <52553950.3020804@vodafone.de> References: <5252961D.8010502@vodafone.de> <52543111.9000405@vodafone.de> Mime-Version: 1.0 Content-Type: text/plain; charset="windows-1252"; Format="flowed" Content-Transfer-Encoding: quoted-printable Return-path: Received: from smtp-03.vodafone.de (mxout.vodafone.de [80.84.1.40]) by gabe.freedesktop.org (Postfix) with ESMTP id 4C414E5C3E for ; Wed, 9 Oct 2013 04:09:07 -0700 (PDT) In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org Errors-To: dri-devel-bounces+sf-dri-devel=m.gmane.org@lists.freedesktop.org To: =?windows-1252?Q?Marek_Ol=9A=E1k?= Cc: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org Mhm, that doesn't looks like anything related but more like the reset of = the compute ring didn't worked. How often does that happen? And do you still get the problem where X = waits for a fence that never comes back? Christian. Am 09.10.2013 12:36, schrieb Marek Ol=9A=E1k: > I'm afraid your patch sometimes causes the GPU reset to fail, which > had never happened before IIRC. > > The dmesg log from the failure is attached. > > Marek > > On Tue, Oct 8, 2013 at 6:21 PM, Christian K=F6nig wrote: >> Hi Marek, >> >> please try the attached patch as a replacement for your signaling all fe= nces >> patch. I'm not 100% sure if it fixes all issues, but it's at least a sta= rt. >> >> Thanks, >> Christian. >> >> Am 07.10.2013 13:08, schrieb Christian K=F6nig: >> >>>> First of all, I can't complain about the reliability of the hardware >>>> GPU reset. It's mostly the kernel driver that happens to run into a >>>> deadlock at the same time. >>> >>> Alex and I spend quite some time on making this reliable again after >>> activating more rings and adding VM support. The main problem is that I >>> couldn't figure out where the CPU deadlock comes from, cause I couldn't >>> reliable reproduce the issue. >>> >>> What is the content of /proc//task/*/stack and >>> sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in = the >>> deadlock situation? >>> >>> I'm pretty sure that we nearly always have a problem when two threads a= re >>> waiting for fences on one of them detects that we have a lockup while t= he >>> other one keeps holding the exclusive lock. Signaling all fences might = work >>> around that problem, but it probably would be better to fix the underly= ing >>> issue. >>> >>> Going to take a deeper look into it. >>> >>> Christian. >>> >>> Am 03.10.2013 02:45, schrieb Marek Ol=9A=E1k: >>>> First of all, I can't complain about the reliability of the hardware >>>> GPU reset. It's mostly the kernel driver that happens to run into a >>>> deadlock at the same time. >>>> >>>> Regarding the issue with fences, the problem is that the GPU reset >>>> completes successfully according to dmesg, but X doesn't respond. I >>>> can move the cursor on the screen, but I can't do anything else and >>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can >>>> easily reproduce this, because it's the most common reason why a GPU >>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I >>>> can't tell whether the fences are just not signalled or whether there >>>> is actually a real CPU deadlock I can't see. >>>> >>>> This patch makes the problem go away and GPU resets are successful >>>> (except for extreme cases, see below). With a small enough lockup >>>> timeout, the lockups are just a minor annoyance and I thought I could >>>> get through a piglit run just with a few tens or hundreds of GPU >>>> resets... >>>> >>>> A different type of deadlock showed up, though it needs a lot of >>>> concurrently-running apps like piglit. What happened is that the >>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due >>>> to a GPU hang while holding onto the exclusive lock, and another >>>> thread wanting to do the GPU reset was unable to acquire the lock. >>>> >>>> That said, I will use the patch locally, because it helps a lot. I got >>>> a few lockups while writing this email and I'm glad I didn't have to >>>> reboot. >>>> >>>> Marek >>>> >>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian K=F6nig >>>> wrote: >>>>> Possible, but I would rather guess that this doesn't work because the= IB >>>>> test runs into a deadlock situation and so the GPU reset never fully >>>>> completes. >>>>> >>>>> Can you reproduce the problem? >>>>> >>>>> If you want to make GPU resets more reliable I would rather suggest to >>>>> remove the ring lock dependency. >>>>> Then we should try to give all the fence wait functions a (reliable) >>>>> timeout and move reset handling a layer up into the ioctl functions. = But for >>>>> this you need to rip out the old PM code first. >>>>> >>>>> Christian. >>>>> >>>>> Marek Ol=9A=E1k schrieb: >>>>> >>>>>> I'm afraid signalling the fences with an IB test is not reliable. >>>>>> >>>>>> Marek >>>>>> >>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian K=F6nig >>>>>> wrote: >>>>>>> NAK, after recovering from a lockup the first thing we do is >>>>>>> signalling all remaining fences with an IB test. >>>>>>> >>>>>>> If we don't recover we indeed signal all fences manually. >>>>>>> >>>>>>> Signalling all fences regardless of the outcome of the reset creates >>>>>>> problems with both types of partial resets. >>>>>>> >>>>>>> Christian. >>>>>>> >>>>>>> Marek Ol=9A=E1k schrieb: >>>>>>> >>>>>>>> From: Marek Ol=9A=E1k >>>>>>>> >>>>>>>> After a lockup, fences are not signalled sometimes, causing >>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results >>>>>>>> in an X server freeze. >>>>>>>> >>>>>>>> This fixes only one of many deadlocks which can occur during a >>>>>>>> lockup. >>>>>>>> >>>>>>>> Signed-off-by: Marek Ol=9A=E1k >>>>>>>> --- >>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++ >>>>>>>> 1 file changed, 5 insertions(+) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>> index 841d0e0..7b97baa 100644 >>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c >>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device >>>>>>>> *rdev) >>>>>>>> radeon_save_bios_scratch_regs(rdev); >>>>>>>> /* block TTM */ >>>>>>>> resched =3D ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev= ); >>>>>>>> + >>>>>>>> + mutex_lock(&rdev->ring_lock); >>>>>>>> + radeon_fence_driver_force_completion(rdev); >>>>>>>> + mutex_unlock(&rdev->ring_lock); >>>>>>>> + >>>>>>>> radeon_pm_suspend(rdev); >>>>>>>> radeon_suspend(rdev); >>>>>>>> >>>>>>>> -- >>>>>>>> 1.8.1.2 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> dri-devel mailing list >>>>>>>> dri-devel@lists.freedesktop.org >>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>> >>> _______________________________________________ >>> dri-devel mailing list >>> dri-devel@lists.freedesktop.org >>> http://lists.freedesktop.org/mailman/listinfo/dri-devel >>