From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Dawson Subject: Re: [PATCH] drm/radeon: Avoid double gpu reset by adding a timeout on IB ring tests. Date: Mon, 08 Feb 2016 09:51:03 -0500 Message-ID: <5775429.5kGhVQn69T@ring00> References: <1453616332-30763-1-git-send-email-matthew@mjdsystems.ca> <1689212.YJ1RXmWsVz@ring00> <56B1F191.5020607@vodafone.de> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0964624743==" Return-path: Received: from scadrial.mjdsystems.ca (scadrial.mjdsystems.ca [198.100.154.185]) by gabe.freedesktop.org (Postfix) with ESMTP id 889426E3DF for ; Mon, 8 Feb 2016 06:51:09 -0800 (PST) In-Reply-To: <56B1F191.5020607@vodafone.de> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" To: Christian =?ISO-8859-1?Q?K=F6nig?= Cc: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org --===============0964624743== Content-Type: multipart/signed; boundary="nextPart15317036.GkK3HZAq3G"; micalg="pgp-sha256"; protocol="application/pgp-signature" --nextPart15317036.GkK3HZAq3G Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="iso-8859-1" On Wednesday, February 3, 2016 1:24:49 PM EST Christian K=F6nig wrote: > Am 31.01.2016 um 19:50 schrieb Matthew Dawson: > > On Sunday, January 24, 2016 10:49:00 AM EST Christian K=F6nig wrote: > >> Am 24.01.2016 um 07:18 schrieb Matthew Dawson: > >>> When the radeon driver resets a gpu, it attempts to test whether all = the > >>> rings can successfully handle an IB. If these rings fail to respond, > >>> the > >>> process will wait forever. Another gpu reset can't happen at this > >>> point, > >>> as the current reset holds a lock required to do so. Instead, make a= ll > >>> the IB tests run with a timeout, so the system can attempt to recover > >>> in this case. > >>>=20 > >>> While this doesn't fix the underlying issue with card resets failing,= it > >>> gives the system a higher chance of recovering. These timeouts have > >>> been > >>> confirmed to help both a Tathi and Hawaii card recover after a gpu > >>> reset. > >>>=20 > >>> I haven't been able to test this on other cards, but everything > >>> compiles. > >>> I wasn't sure what to use for a timeout value, so for now I've used > >>> radeon_lockup_timeout. When that is 0, the timeout functionality in > >>> this > >>> patch is also disabled. > >>>=20 > >>> This also adds a new function, radeon_fence_wait_timeout, that behaves > >>> like > >>> radeon_fence_wait except allowing a different timeout value to be pas= sed > >>> to > >>> radeon_fence_wait_seq_timeout. > >>>=20 > >>> Signed-off-by: Matthew Dawson > >>=20 > >> Really good idea, but please don't use radeon_lockup_timeout for this > >> cause the 10 seconds default of that is way to long. Instead just use a > >> fixed, let's say 100ms timeout defined in radeon.h. > >=20 > > I originally tried 100ms, but I found that to be too short (my system > > would > > just lockup completely (no network even)). I found 1s is long enough, = and > > isn't so long to be annoying. Would that be ok? >=20 > Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like > another bug to me. Probably, but it has a tendency to crash my entire system, so I'm not sure = how=20 to debug it further. Netconsole maybe? I probably won't be able to dig in= to=20 this due to time since 1s does work. >=20 > >> Also don't define our own radeon_fence_wait_timeout, just use > >> fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this. > >=20 > > I switched everything over to this, however the ib test always times out > > after a gpu reset (on startup, everything is normal). I'm not sure why > > fence_wait_timeout isn't being signaled while > > radeon_fence_wait_seq_timeout > > is. I'm suspicious that either radeon_fence_enable_signaling is not do= ing > > something necessary to get the fence completion to actual signal > > radeon_fence_default_wait, or because we hold the exclusive lock during= a > > reset radeon_fence_check_lockup never runs and thus never triggers the > > fence or enables some irq lines. > >=20 > > Is it worth digging through the interactions here to get dma-buf fences= to > > work correctly during a lockup, or would it be better to just keep > > radeon_fence_wait_timeout? If option 1 is preferred, I'm happy to learn > > but I need some help learning how it is supposed to work. >=20 > In this case feel free to add the radeon_fence_wait_timeout. It's > probably a good idea to get rid of the exclusive lock sooner or later in > radeon as well, but that really is a long term project. >=20 > If you're really bored I can give you tons of input where to start with > that. That sounds like a good project to learn more about the driver, so I'd be=20 interested. I don't have a lot of time right now to play with it though. = I=20 do want to finish off the tf2 crasher first as well. Once I have time I'll= =20 poke the list to get started. Thanks for the reviews! =2D-=20 Matthew --nextPart15317036.GkK3HZAq3G Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part. Content-Transfer-Encoding: 7Bit -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJWuKtaAAoJEMZc6/oq9HsgO5AP/0kaItMjtPwm7kfjh2YiW8kV aXDu2q+mMi3WskI0Kb08BLgCYVhjnEfrZaYbwjMe/Ixuu4TYjw1ih/tTjhYa/p1z TUNRl+a/o9O0sxrsZOkH1ZGxNkg0I+jWm/dxUVn4kZWCXN6HG1gRapL2dkLCiv3H JMj0skTPKhxtW8Bo0gMRvW3gN6CDEDgwslyQVgw2THZLVDaR8JAiG6JwCaue1vAD 0ExZCxuJwBeriEBuD+oR8lWy+hs3Ss9C0YZylNr9NqPVDGTsaWdrU2qL+OIeqMhv 6Xr1X9jPIv8jeh0CDvK5UH25HYSYq6OyODXVMGZP8/Hfz7wFRsHiIcc3WtIy/pvV mHTkMs2b3+IlA7mbk+n0oLOuqb8fanQHsiK0pU2asZlxNLcFHZMggRLRTtH2dQyC /O3QTGyC7StnI/lTMln2VsQCYH9VSwC2Mm2BQxEEjQ4OXOcf67dHsrcJ5YRmfaGN thwiNqQ5hVCVxkTWmAk35WpUx5pYg/7Jj1vxQlANgAvyeAcv5AiSLoUz09vs7qm4 BzgYaSvKY8F3iw1bKtFhmFpXDnAlJqYAxIA9lck2K5qb3MmCaY/PAB2loHjoiQxx SQcwqdl0+PRNQ94+ak7GtNpftmMvkLsfr4c5zJ5b4y+0mQeGWVSoX3QOiP/z6rVm 8vM0vU69XKEokIgDEswo =9X6V -----END PGP SIGNATURE----- --nextPart15317036.GkK3HZAq3G-- --===============0964624743== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: base64 Content-Disposition: inline X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX18KZHJpLWRldmVs IG1haWxpbmcgbGlzdApkcmktZGV2ZWxAbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHBzOi8vbGlz dHMuZnJlZWRlc2t0b3Aub3JnL21haWxtYW4vbGlzdGluZm8vZHJpLWRldmVsCg== --===============0964624743==--