Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore

From: "Ville Syrjälä" <ville.syrjala@linux.intel.com>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	intel-gfx@lists.freedesktop.org, stable@vger.kernel.org,
	dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
Date: Fri, 30 Jun 2017 18:44:32 +0300	[thread overview]
Message-ID: <20170630154432.GZ12629@intel.com> (raw)
In-Reply-To: <20170630133033.27yqunbfoawuz2wl@phenom.ffwll.local>

On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > How does this solve the deadlock? Afaiui the deadlock is that the gpu
> > reset stopped unconditionally completing all requests before it did
> > anything else, including anything with the hw or taking modeset locks.
> > 
> > This ensured that any outstanding flips (we only had pageflips, no atomic
> > ones back then) could complete (although maybe displaying garbage). The
> > only thing we had to do was grab the locks (to avoid concurrent modesets)
> > and then we could happily nuke the hw (since the flips where lost in the
> > CS anyway), and restore it afterwards.
> > 
> > Then the TDR rewrite came around and broke that, which now makes atomic
> > time out waiting for the gpu to complete (because the gpu reset waits for
> > the modeset to complete first). Which means if you want to fix this
> > without breaking TDR, then you need to cancel the pending atomic commits.
> > That seems somewhat what you're attempting here with trying to figure out
> > what the latest hw-committed step is (but that function gets it wrong),
> > and lots more locking and stuff on top.
> > 
> > Why exactly can't we just go back to the old world of force-completing
> > dead requests on gen4 and earlier? That would be tons simpler imo instead
> > of throwing a pile of hacks (which really needs a complete rewrite of the
> > atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> > earlier for years, going back to that isn't going to hurt anyone.
> > 
> > Making working gen4 gpu reset contigent on cancellable atomic commits and
> > the full commit queue is imo nuts.
> 
> And if the GEM folks insist the old behavior can't be restored, then we
> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> in i915_sw_fence. Force-completing all render requests atomic updates
> depend upon is imo the simplest solution to this, and we've had a driver
> that worked like that for years.

And it used to break all the time. I think we've had to fix it at least
three times by now. So I tend to think it's better to fix it in a way
that won't break so easily.

-- 
Ville Syrjälä
Intel OTC
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx