[PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw

stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
       [not found] ` <20170629134948.5614-6-ville.syrjala@linux.intel.com>
@ 2017-06-29 14:36   ` ville.syrjala
  2017-06-29 17:57     ` Chris Wilson
  2017-06-30 13:25   ` [PATCH " Daniel Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: ville.syrjala @ 2017-06-29 14:36 UTC (permalink / raw)
  To: intel-gfx; +Cc: dri-devel, stable, Daniel Vetter, Chris Wilson

From: Ville Syrjälä <ville.syrjala@linux.intel.com>

Introduce an rw_semaphore to protect the display commits. All normal
commits use down_read() and hence can proceed in parallel, but GPU reset
will use down_write() making sure no other commits are in progress when
we have to pull the plug on the display engine on pre-g4x platforms.
There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
itself, and we wait for all dependencies before the down_read(), and
thus there is no chance of deadlocks with this scheme.

During reset we should be recommiting the state that was committed last.
But for now we'll settle for recommiting the last state for each object.
Hence we may commit things a bit too soon when a GPU reset occurs. The
rw_semaphore should guarantee that whatever state we observe in
obj->state during reset sticks around while we do the commit. The
obj->state pointer might change for some objects if another swap_state()
occurs while we do our thing, so there migth be some theoretical
mismatch which I tink we could avoid by grabbing the rw_semaphore also
around the swap_state(), but for now I didn't do that.

Another source of mismatch can happen because we sometimes use the
intel_crtc->config during the actual commit, and that only gets updated
when we do the commuit. Hence we may get some state via ->state, some
via the duplicated ->state, and some via ->config. We'll want to
fix this all by tracking the commited state properly, but that will
some bigger refactroing so for now we'll just choose to accept these
potential mismatches.

I left out the state readout from the post-reset display
reinitialization as that still likes to clobber crtc->state etc.
If we make it use a free standing atomic state and mke sure it doesn't
need any locks we could reintroduce it. For now I just mark the
post-reset display state as dirty as possible to make sure we
reinitialize everything.

One other potential issue remains in the form of display detection.
Either we need to protect that with the same rw_semaphore as well, or
perhaps it would be enough to force gmbus into bitbanging mode while
the reset is happening and we don't have interrupts, and just across
the actual hardware GPU reset we could hold the gmbus mutex.

v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
v3: Drop all the committed_state refactoring to make this less
    obnoxious to backport (Daniel)
v4: Preserve the wedge timeout mechanism (Chris)

Cc: <stable@vger.kernel.org> # for the brave
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
 2 files changed, 138 insertions(+), 63 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index effbe4f72a64..88ddd27f61b0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2237,6 +2237,8 @@ struct drm_i915_private {
 	struct drm_atomic_state *modeset_restore_state;
 	struct drm_modeset_acquire_ctx reset_ctx;
 
+	struct rw_semaphore commit_sem;
+
 	struct list_head vm_list; /* Global list of all address spaces */
 	struct i915_ggtt ggtt; /* VM representing the global address space */
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 7cdd6ec97f80..f13c7d81d4a9 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
 static void intel_modeset_setup_hw_state(struct drm_device *dev,
 					 struct drm_modeset_acquire_ctx *ctx);
 static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
 
 struct intel_limit {
 	struct {
@@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
 		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
 }
 
-void intel_prepare_reset(struct drm_i915_private *dev_priv)
+static void init_intel_state(struct intel_atomic_state *state)
+{
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
+	int i;
+
+	state->modeset = true;
+
+	for_each_oldnew_crtc_in_state(&state->base, crtc,
+				      old_crtc_state, new_crtc_state, i) {
+		if (new_crtc_state->active)
+			state->active_crtcs |= 1 << i;
+		else
+			state->active_crtcs &= ~(1 << i);
+
+		if (old_crtc_state->active != new_crtc_state->active)
+			state->active_pipe_changes |= drm_crtc_mask(crtc);
+	}
+}
+
+static struct drm_atomic_state *
+intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
 {
-	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
 	struct drm_atomic_state *state;
-	int ret;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	int i;
 
-	/*
-	 * Need mode_config.mutex so that we don't
-	 * trample ongoing ->detect() and whatnot.
-	 */
-	mutex_lock(&dev->mode_config.mutex);
-	drm_modeset_acquire_init(ctx, 0);
-	while (1) {
-		ret = drm_modeset_lock_all_ctx(dev, ctx);
-		if (ret != -EDEADLK)
-			break;
+	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
+	if (IS_ERR(state)) {
+		DRM_ERROR("Duplicating state failed with %ld\n",
+			  PTR_ERR(state));
+		return NULL;
+	}
+
+	to_intel_atomic_state(state)->active_crtcs = 0;
+	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
+	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
+
+	init_intel_state(to_intel_atomic_state(state));
+
+	/* force a full update */
+	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
+		struct intel_crtc_state *intel_crtc_state =
+			to_intel_crtc_state(crtc_state);
+
+		if (!crtc_state->active)
+			continue;
 
-		drm_modeset_backoff(ctx);
+		crtc_state->mode_changed = true;
+		crtc_state->active_changed = true;
+		crtc_state->planes_changed = true;
+		crtc_state->connectors_changed = true;
+		crtc_state->color_mgmt_changed = true;
+		crtc_state->zpos_changed = true;
+
+		intel_crtc_state->update_pipe = true;
+		intel_crtc_state->disable_lp_wm = true;
+		intel_crtc_state->disable_cxsr = true;
+		intel_crtc_state->update_wm_post = true;
+		intel_crtc_state->fb_changed = true;
+		intel_crtc_state->fifo_changed = true;
+		intel_crtc_state->wm.need_postvbl_update = true;
 	}
 
+	return state;
+}
+
+void intel_prepare_reset(struct drm_i915_private *dev_priv)
+{
+	struct drm_atomic_state *disable_state, *restore_state = NULL;
+	struct drm_crtc *crtc;
+	struct drm_crtc_state *crtc_state;
+	struct drm_plane *plane;
+	struct drm_plane_state *plane_state;
+	int i;
+
+	down_write(&dev_priv->commit_sem);
+
 	/* reset doesn't touch the display, but flips might get nuked anyway, */
 	if (!i915.force_reset_modeset_test &&
 	    !gpu_reset_clobbers_display(dev_priv))
@@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	 * Disabling the crtcs gracefully seems nicer. Also the
 	 * g33 docs say we should at least disable all the planes.
 	 */
-	state = drm_atomic_helper_duplicate_state(dev, ctx);
-	if (IS_ERR(state)) {
-		ret = PTR_ERR(state);
-		DRM_ERROR("Duplicating state failed with %i\n", ret);
-		return;
-	}
+	disable_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(disable_state))
+		goto out;
 
-	ret = drm_atomic_helper_disable_all(dev, ctx);
-	if (ret) {
-		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
-		drm_atomic_state_put(state);
-		return;
-	}
+	to_intel_atomic_state(disable_state)->active_crtcs = 0;
 
-	dev_priv->modeset_restore_state = state;
-	state->acquire_ctx = ctx;
+	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+	__intel_atomic_commit_tail(disable_state, true);
+
+	drm_atomic_helper_clean_committed_state(disable_state);
+	drm_atomic_state_put(disable_state);
+
+	restore_state = intel_duplicate_committed_state(dev_priv);
+	if (IS_ERR(restore_state))
+		restore_state = NULL;
+
+	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
+		crtc_state->active = false;
+	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
+		plane_state->visible = false;
+
+out:
+	dev_priv->modeset_restore_state = restore_state;
 }
 
 void intel_finish_reset(struct drm_i915_private *dev_priv)
 {
 	struct drm_device *dev = &dev_priv->drm;
-	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
-	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
-	int ret;
+	struct drm_atomic_state *restore_state =
+		dev_priv->modeset_restore_state;
 
 	/*
 	 * Flips in the rings will be nuked by the reset,
@@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 
 	/* reset doesn't touch the display */
 	if (!gpu_reset_clobbers_display(dev_priv)) {
-		if (!state) {
+		if (!restore_state) {
 			/*
 			 * Flips in the rings have been nuked by the reset,
 			 * so update the base address of all primary
@@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			 */
 			intel_update_primary_planes(dev);
 		} else {
-			ret = __intel_display_resume(dev, state, ctx);
-			if (ret)
-				DRM_ERROR("Restoring old state failed with %i\n", ret);
+			__intel_atomic_commit_tail(restore_state, true);
 		}
 	} else {
+		i915_redisable_vga(dev_priv);
+
 		/*
 		 * The display has been reset as well,
 		 * so need a full re-initialization.
@@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 			dev_priv->display.hpd_irq_setup(dev_priv);
 		spin_unlock_irq(&dev_priv->irq_lock);
 
-		ret = __intel_display_resume(dev, state, ctx);
-		if (ret)
-			DRM_ERROR("Restoring old state failed with %i\n", ret);
+		__intel_atomic_commit_tail(restore_state, true);
 
 		intel_hpd_init(dev_priv);
 	}
 
-	if (state)
-		drm_atomic_state_put(state);
-	drm_modeset_drop_locks(ctx);
-	drm_modeset_acquire_fini(ctx);
-	mutex_unlock(&dev->mode_config.mutex);
+	if (restore_state) {
+		drm_atomic_helper_clean_committed_state(restore_state);
+		drm_atomic_state_put(restore_state);
+	}
+
+	up_write(&dev_priv->commit_sem);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
 {
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
 	struct drm_i915_private *dev_priv = to_i915(state->dev);
-	struct drm_crtc *crtc;
-	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
-	int ret = 0, i;
+	int ret = 0;
 
 	if (!check_digital_port_conflicts(state)) {
 		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
 		return -EINVAL;
 	}
 
-	intel_state->modeset = true;
 	intel_state->active_crtcs = dev_priv->active_crtcs;
 	intel_state->cdclk.logical = dev_priv->cdclk.logical;
 	intel_state->cdclk.actual = dev_priv->cdclk.actual;
 
-	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
-		if (new_crtc_state->active)
-			intel_state->active_crtcs |= 1 << i;
-		else
-			intel_state->active_crtcs &= ~(1 << i);
-
-		if (old_crtc_state->active != new_crtc_state->active)
-			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
-	}
+	init_intel_state(intel_state);
 
 	/*
 	 * See if the config requires any additional preparation, e.g.
@@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	/* Only after disabling all output pipelines that will be changed can we
 	 * update the the output configuration. */
-	intel_modeset_update_crtc_state(state);
+	if (!is_reset)
+		intel_modeset_update_crtc_state(state);
 
 	if (intel_state->modeset) {
-		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		if (!is_reset) {
+			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		} else {
+			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
+				if (new_crtc_state->enable)
+					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
+			}
+		}
 
 		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
 
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (!intel_can_enable_sagv(state))
 			intel_disable_sagv(dev_priv);
 
-		intel_modeset_verify_disabled(dev, state);
+		if (!is_reset)
+			intel_modeset_verify_disabled(dev, state);
 	}
 
 	/* Complete the events for pipes that have now been disabled */
@@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (put_domains[i])
 			modeset_put_power_domains(dev_priv, put_domains[i]);
 
-		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
+		if (!is_reset)
+			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
 	}
 
 	if (intel_state->modeset && intel_can_enable_sagv(state))
@@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 
 	drm_atomic_helper_wait_for_dependencies(state);
 
-	__intel_atomic_commit_tail(state);
+	down_read(&dev_priv->commit_sem);
+
+	__intel_atomic_commit_tail(state, false);
 
 	drm_atomic_helper_commit_hw_done(state);
 
+	up_read(&dev_priv->commit_sem);
+
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	mutex_unlock(&dev->struct_mutex);
@@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);
 
+	init_rwsem(&dev_priv->commit_sem);
+
 	intel_init_quirks(dev);
 
 	intel_init_pm(dev_priv);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 14:36   ` [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-29 17:57     ` Chris Wilson
  2017-06-29 19:26       ` Ville Syrjälä
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Wilson @ 2017-06-29 17:57 UTC (permalink / raw)
  To: ville.syrjala, intel-gfx; +Cc: dri-devel, stable, Daniel Vetter

Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> Introduce an rw_semaphore to protect the display commits. All normal
> commits use down_read() and hence can proceed in parallel, but GPU reset
> will use down_write() making sure no other commits are in progress when
> we have to pull the plug on the display engine on pre-g4x platforms.
> There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> itself, and we wait for all dependencies before the down_read(), and
> thus there is no chance of deadlocks with this scheme.

This matches what I thought should be done (I didn't think of using
rwsem just a mutex, nice touch). The point I got stuck on was what
should be done after the reset? I expected another modeset to return the
state back or otherwise the inflight would get confused?
 
> During reset we should be recommiting the state that was committed last.
> But for now we'll settle for recommiting the last state for each object.

Ah, I guess that explains the above. What is the complication with
restoring the current state as opposed to the next state?

But I have to leave debating the merits of atomic internals to others.  :|
-Chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 17:57     ` Chris Wilson
@ 2017-06-29 19:26       ` Ville Syrjälä
  2017-06-30 13:35         ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2017-06-29 19:26 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter

On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> This matches what I thought should be done (I didn't think of using
> rwsem just a mutex, nice touch). The point I got stuck on was what
> should be done after the reset? I expected another modeset to return the
> state back or otherwise the inflight would get confused?

I guess that can happen. For instance, if we have a crtc_enable() in flight,
and then we do a reset before it gets committed we would end up doing
crtc_enable() twice in a row without a crtc_disable in between. For page
flips and such this shouldn't be a big deal in general.

>  
> > During reset we should be recommiting the state that was committed last.
> > But for now we'll settle for recommiting the last state for each object.
> 
> Ah, I guess that explains the above. What is the complication with
> restoring the current state as opposed to the next state?

Well the main thing is just tracking which is the current state. That
just needs refactoring the .atomic_duplicate_state() calling convention
across the whole tree so that we can then duplicate the committed state
rather than the latest state.

Also due to the commit_hw_done() being potentially done after the
modeset locks have been dropped, I don't think we can be certain
of it getting called in the same order as swap_state(), hence
when we track the committed state in commit_hw_done() we'll have
to have some way to figure out if our new state is in fact the
latest committed state for each object or if the calls got
reordered. We don't insert any dependencies between two commits
unless they touch the same active crtc, thus this reordering
seems very much possible. Dunno if we should add some way to add
such dependeencies whenever the same object is part of two otherwise
independent commits, or if we just want to try and work with the
reordered calls. My idea for the latter was some kind of seqno/age
counter on the object states that allows me to recongnize which
state is more recent. The object states aren't refcounted so hanging
on to the wrong pointer could cause an oops the next time we have to
perform a GPU reset.

-- 
Ville Syrjï¿½lï¿½
Intel OTC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
       [not found] ` <20170629134948.5614-6-ville.syrjala@linux.intel.com>
  2017-06-29 14:36   ` [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
@ 2017-06-30 13:25   ` Daniel Vetter
  2017-06-30 13:30     ` Daniel Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:25 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> 
> Introduce an rw_semaphore to protect the display commits. All normal
> commits use down_read() and hence can proceed in parallel, but GPU reset
> will use down_write() making sure no other commits are in progress when
> we have to pull the plug on the display engine on pre-g4x platforms.
> There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> itself, and we wait for all dependencies before the down_read(), and
> thus there is no chance of deadlocks with this scheme.

How does this solve the deadlock? Afaiui the deadlock is that the gpu
reset stopped unconditionally completing all requests before it did
anything else, including anything with the hw or taking modeset locks.

This ensured that any outstanding flips (we only had pageflips, no atomic
ones back then) could complete (although maybe displaying garbage). The
only thing we had to do was grab the locks (to avoid concurrent modesets)
and then we could happily nuke the hw (since the flips where lost in the
CS anyway), and restore it afterwards.

Then the TDR rewrite came around and broke that, which now makes atomic
time out waiting for the gpu to complete (because the gpu reset waits for
the modeset to complete first). Which means if you want to fix this
without breaking TDR, then you need to cancel the pending atomic commits.
That seems somewhat what you're attempting here with trying to figure out
what the latest hw-committed step is (but that function gets it wrong),
and lots more locking and stuff on top.

Why exactly can't we just go back to the old world of force-completing
dead requests on gen4 and earlier? That would be tons simpler imo instead
of throwing a pile of hacks (which really needs a complete rewrite of the
atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
earlier for years, going back to that isn't going to hurt anyone.

Making working gen4 gpu reset contigent on cancellable atomic commits and
the full commit queue is imo nuts.
-Daniel

> 
> During reset we should be recommiting the state that was committed last.
> But for now we'll settle for recommiting the last state for each object.
> Hence we may commit things a bit too soon when a GPU reset occurs. The
> rw_semaphore should guarantee that whatever state we observe in
> obj->state during reset sticks around while we do the commit. The
> obj->state pointer might change for some objects if another swap_state()
> occurs while we do our thing, so there migth be some theoretical
> mismatch which I tink we could avoid by grabbing the rw_semaphore also
> around the swap_state(), but for now I didn't do that.
> 
> Another source of mismatch can happen because we sometimes use the
> intel_crtc->config during the actual commit, and that only gets updated
> when we do the commuit. Hence we may get some state via ->state, some
> via the duplicated ->state, and some via ->config. We'll want to
> fix this all by tracking the commited state properly, but that will
> some bigger refactroing so for now we'll just choose to accept these
> potential mismatches.
> 
> I left out the state readout from the post-reset display
> reinitialization as that still likes to clobber crtc->state etc.
> If we make it use a free standing atomic state and mke sure it doesn't
> need any locks we could reintroduce it. For now I just mark the
> post-reset display state as dirty as possible to make sure we
> reinitialize everything.
> 
> One other potential issue remains in the form of display detection.
> Either we need to protect that with the same rw_semaphore as well, or
> perhaps it would be enough to force gmbus into bitbanging mode while
> the reset is happening and we don't have interrupts, and just across
> the actual hardware GPU reset we could hold the gmbus mutex.
> 
> v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
> v3: Drop all the committed_state refactoring to make this less
>     obnoxious to backport (Daniel)
> 
> Cc: stable@vger.kernel.org # for the brave
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
> Signed-off-by: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |   2 +
>  drivers/gpu/drm/i915/i915_irq.c      |  44 +-------
>  drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
>  3 files changed, 139 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index effbe4f72a64..88ddd27f61b0 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2237,6 +2237,8 @@ struct drm_i915_private {
>  	struct drm_atomic_state *modeset_restore_state;
>  	struct drm_modeset_acquire_ctx reset_ctx;
>  
> +	struct rw_semaphore commit_sem;
> +
>  	struct list_head vm_list; /* Global list of all address spaces */
>  	struct i915_ggtt ggtt; /* VM representing the global address space */
>  
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index eb4f1dca2077..9d591f17fda3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>  	return ret;
>  }
>  
> -struct wedge_me {
> -	struct delayed_work work;
> -	struct drm_i915_private *i915;
> -	const char *name;
> -};
> -
> -static void wedge_me(struct work_struct *work)
> -{
> -	struct wedge_me *w = container_of(work, typeof(*w), work.work);
> -
> -	dev_err(w->i915->drm.dev,
> -		"%s timed out, cancelling all in-flight rendering.\n",
> -		w->name);
> -	i915_gem_set_wedged(w->i915);
> -}
> -
> -static void __init_wedge(struct wedge_me *w,
> -			 struct drm_i915_private *i915,
> -			 long timeout,
> -			 const char *name)
> -{
> -	w->i915 = i915;
> -	w->name = name;
> -
> -	INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
> -	schedule_delayed_work(&w->work, timeout);
> -}
> -
> -static void __fini_wedge(struct wedge_me *w)
> -{
> -	cancel_delayed_work_sync(&w->work);
> -	destroy_delayed_work_on_stack(&w->work);
> -	w->i915 = NULL;
> -}
> -
> -#define i915_wedge_on_timeout(W, DEV, TIMEOUT)				\
> -	for (__init_wedge((W), (DEV), (TIMEOUT), __func__);		\
> -	     (W)->i915;							\
> -	     __fini_wedge((W)))
> -
>  /**
>   * i915_reset_device - do process context error handling work
>   * @dev_priv: i915 device private
> @@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -	struct wedge_me w;
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
>  	DRM_DEBUG_DRIVER("resetting chip\n");
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> -	/* Use a watchdog to ensure that our reset completes */
> -	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
> +	{
>  		intel_prepare_reset(dev_priv);
>  
>  		/* Signal that locked waiters should reset the GPU */
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 7cdd6ec97f80..f13c7d81d4a9 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc);
>  static void intel_modeset_setup_hw_state(struct drm_device *dev,
>  					 struct drm_modeset_acquire_ctx *ctx);
>  static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc);
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
>  
>  struct intel_limit {
>  	struct {
> @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv)
>  		INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv);
>  }
>  
> -void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +static void init_intel_state(struct intel_atomic_state *state)
> +{
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> +	int i;
> +
> +	state->modeset = true;
> +
> +	for_each_oldnew_crtc_in_state(&state->base, crtc,
> +				      old_crtc_state, new_crtc_state, i) {
> +		if (new_crtc_state->active)
> +			state->active_crtcs |= 1 << i;
> +		else
> +			state->active_crtcs &= ~(1 << i);
> +
> +		if (old_crtc_state->active != new_crtc_state->active)
> +			state->active_pipe_changes |= drm_crtc_mask(crtc);
> +	}
> +}
> +
> +static struct drm_atomic_state *
> +intel_duplicate_committed_state(struct drm_i915_private *dev_priv)
>  {
> -	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
>  	struct drm_atomic_state *state;
> -	int ret;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	int i;
>  
> -	/*
> -	 * Need mode_config.mutex so that we don't
> -	 * trample ongoing ->detect() and whatnot.
> -	 */
> -	mutex_lock(&dev->mode_config.mutex);
> -	drm_modeset_acquire_init(ctx, 0);
> -	while (1) {
> -		ret = drm_modeset_lock_all_ctx(dev, ctx);
> -		if (ret != -EDEADLK)
> -			break;
> +	state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
> +	if (IS_ERR(state)) {
> +		DRM_ERROR("Duplicating state failed with %ld\n",
> +			  PTR_ERR(state));
> +		return NULL;
> +	}
> +
> +	to_intel_atomic_state(state)->active_crtcs = 0;
> +	to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
> +	to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
> +
> +	init_intel_state(to_intel_atomic_state(state));
> +
> +	/* force a full update */
> +	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
> +		struct intel_crtc_state *intel_crtc_state =
> +			to_intel_crtc_state(crtc_state);
> +
> +		if (!crtc_state->active)
> +			continue;
>  
> -		drm_modeset_backoff(ctx);
> +		crtc_state->mode_changed = true;
> +		crtc_state->active_changed = true;
> +		crtc_state->planes_changed = true;
> +		crtc_state->connectors_changed = true;
> +		crtc_state->color_mgmt_changed = true;
> +		crtc_state->zpos_changed = true;
> +
> +		intel_crtc_state->update_pipe = true;
> +		intel_crtc_state->disable_lp_wm = true;
> +		intel_crtc_state->disable_cxsr = true;
> +		intel_crtc_state->update_wm_post = true;
> +		intel_crtc_state->fb_changed = true;
> +		intel_crtc_state->fifo_changed = true;
> +		intel_crtc_state->wm.need_postvbl_update = true;
>  	}
>  
> +	return state;
> +}
> +
> +void intel_prepare_reset(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_atomic_state *disable_state, *restore_state = NULL;
> +	struct drm_crtc *crtc;
> +	struct drm_crtc_state *crtc_state;
> +	struct drm_plane *plane;
> +	struct drm_plane_state *plane_state;
> +	int i;
> +
> +	down_write(&dev_priv->commit_sem);
> +
>  	/* reset doesn't touch the display, but flips might get nuked anyway, */
>  	if (!i915.force_reset_modeset_test &&
>  	    !gpu_reset_clobbers_display(dev_priv))
> @@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>  	 * Disabling the crtcs gracefully seems nicer. Also the
>  	 * g33 docs say we should at least disable all the planes.
>  	 */
> -	state = drm_atomic_helper_duplicate_state(dev, ctx);
> -	if (IS_ERR(state)) {
> -		ret = PTR_ERR(state);
> -		DRM_ERROR("Duplicating state failed with %i\n", ret);
> -		return;
> -	}
> +	disable_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(disable_state))
> +		goto out;
>  
> -	ret = drm_atomic_helper_disable_all(dev, ctx);
> -	if (ret) {
> -		DRM_ERROR("Suspending crtc's failed with %i\n", ret);
> -		drm_atomic_state_put(state);
> -		return;
> -	}
> +	to_intel_atomic_state(disable_state)->active_crtcs = 0;
>  
> -	dev_priv->modeset_restore_state = state;
> -	state->acquire_ctx = ctx;
> +	for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_new_plane_in_state(disable_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +	__intel_atomic_commit_tail(disable_state, true);
> +
> +	drm_atomic_helper_clean_committed_state(disable_state);
> +	drm_atomic_state_put(disable_state);
> +
> +	restore_state = intel_duplicate_committed_state(dev_priv);
> +	if (IS_ERR(restore_state))
> +		restore_state = NULL;
> +
> +	for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
> +		crtc_state->active = false;
> +	for_each_old_plane_in_state(restore_state, plane, plane_state, i)
> +		plane_state->visible = false;
> +
> +out:
> +	dev_priv->modeset_restore_state = restore_state;
>  }
>  
>  void intel_finish_reset(struct drm_i915_private *dev_priv)
>  {
>  	struct drm_device *dev = &dev_priv->drm;
> -	struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
> -	struct drm_atomic_state *state = dev_priv->modeset_restore_state;
> -	int ret;
> +	struct drm_atomic_state *restore_state =
> +		dev_priv->modeset_restore_state;
>  
>  	/*
>  	 * Flips in the rings will be nuked by the reset,
> @@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  
>  	/* reset doesn't touch the display */
>  	if (!gpu_reset_clobbers_display(dev_priv)) {
> -		if (!state) {
> +		if (!restore_state) {
>  			/*
>  			 * Flips in the rings have been nuked by the reset,
>  			 * so update the base address of all primary
> @@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			 */
>  			intel_update_primary_planes(dev);
>  		} else {
> -			ret = __intel_display_resume(dev, state, ctx);
> -			if (ret)
> -				DRM_ERROR("Restoring old state failed with %i\n", ret);
> +			__intel_atomic_commit_tail(restore_state, true);
>  		}
>  	} else {
> +		i915_redisable_vga(dev_priv);
> +
>  		/*
>  		 * The display has been reset as well,
>  		 * so need a full re-initialization.
> @@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
>  			dev_priv->display.hpd_irq_setup(dev_priv);
>  		spin_unlock_irq(&dev_priv->irq_lock);
>  
> -		ret = __intel_display_resume(dev, state, ctx);
> -		if (ret)
> -			DRM_ERROR("Restoring old state failed with %i\n", ret);
> +		__intel_atomic_commit_tail(restore_state, true);
>  
>  		intel_hpd_init(dev_priv);
>  	}
>  
> -	if (state)
> -		drm_atomic_state_put(state);
> -	drm_modeset_drop_locks(ctx);
> -	drm_modeset_acquire_fini(ctx);
> -	mutex_unlock(&dev->mode_config.mutex);
> +	if (restore_state) {
> +		drm_atomic_helper_clean_committed_state(restore_state);
> +		drm_atomic_state_put(restore_state);
> +	}
> +
> +	up_write(&dev_priv->commit_sem);
>  }
>  
>  static bool abort_flip_on_reset(struct intel_crtc *crtc)
> @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state)
>  {
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
>  	struct drm_i915_private *dev_priv = to_i915(state->dev);
> -	struct drm_crtc *crtc;
> -	struct drm_crtc_state *old_crtc_state, *new_crtc_state;
> -	int ret = 0, i;
> +	int ret = 0;
>  
>  	if (!check_digital_port_conflicts(state)) {
>  		DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n");
>  		return -EINVAL;
>  	}
>  
> -	intel_state->modeset = true;
>  	intel_state->active_crtcs = dev_priv->active_crtcs;
>  	intel_state->cdclk.logical = dev_priv->cdclk.logical;
>  	intel_state->cdclk.actual = dev_priv->cdclk.actual;
>  
> -	for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
> -		if (new_crtc_state->active)
> -			intel_state->active_crtcs |= 1 << i;
> -		else
> -			intel_state->active_crtcs &= ~(1 << i);
> -
> -		if (old_crtc_state->active != new_crtc_state->active)
> -			intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
> -	}
> +	init_intel_state(intel_state);
>  
>  	/*
>  	 * See if the config requires any additional preparation, e.g.
> @@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
>  	intel_atomic_helper_free_state(dev_priv);
>  }
>  
> -static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
> +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
>  {
>  	struct drm_device *dev = state->dev;
>  	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
> @@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	/* Only after disabling all output pipelines that will be changed can we
>  	 * update the the output configuration. */
> -	intel_modeset_update_crtc_state(state);
> +	if (!is_reset)
> +		intel_modeset_update_crtc_state(state);
>  
>  	if (intel_state->modeset) {
> -		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		if (!is_reset) {
> +			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
> +		} else {
> +			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
> +				if (new_crtc_state->enable)
> +					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
> +			}
> +		}
>  
>  		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
>  
> @@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (!intel_can_enable_sagv(state))
>  			intel_disable_sagv(dev_priv);
>  
> -		intel_modeset_verify_disabled(dev, state);
> +		if (!is_reset)
> +			intel_modeset_verify_disabled(dev, state);
>  	}
>  
>  	/* Complete the events for pipes that have now been disabled */
> @@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
>  		if (put_domains[i])
>  			modeset_put_power_domains(dev_priv, put_domains[i]);
>  
> -		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
> +		if (!is_reset)
> +			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
>  	}
>  
>  	if (intel_state->modeset && intel_can_enable_sagv(state))
> @@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
>  
>  	drm_atomic_helper_wait_for_dependencies(state);
>  
> -	__intel_atomic_commit_tail(state);
> +	down_read(&dev_priv->commit_sem);
> +
> +	__intel_atomic_commit_tail(state, false);
>  
>  	drm_atomic_helper_commit_hw_done(state);
>  
> +	up_read(&dev_priv->commit_sem);
> +
>  	mutex_lock(&dev->struct_mutex);
>  	drm_atomic_helper_cleanup_planes(dev, state);
>  	mutex_unlock(&dev->struct_mutex);
> @@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
>  	INIT_WORK(&dev_priv->atomic_helper.free_work,
>  		  intel_atomic_helper_free_state_worker);
>  
> +	init_rwsem(&dev_priv->commit_sem);
> +
>  	intel_init_quirks(dev);
>  
>  	intel_init_pm(dev_priv);
> -- 
> 2.13.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:25   ` [PATCH " Daniel Vetter
@ 2017-06-30 13:30     ` Daniel Vetter
  2017-06-30 15:44       ` Ville Syrjälä
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:30 UTC (permalink / raw)
  To: ville.syrjala; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > 
> > Introduce an rw_semaphore to protect the display commits. All normal
> > commits use down_read() and hence can proceed in parallel, but GPU reset
> > will use down_write() making sure no other commits are in progress when
> > we have to pull the plug on the display engine on pre-g4x platforms.
> > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > itself, and we wait for all dependencies before the down_read(), and
> > thus there is no chance of deadlocks with this scheme.
> 
> How does this solve the deadlock? Afaiui the deadlock is that the gpu
> reset stopped unconditionally completing all requests before it did
> anything else, including anything with the hw or taking modeset locks.
> 
> This ensured that any outstanding flips (we only had pageflips, no atomic
> ones back then) could complete (although maybe displaying garbage). The
> only thing we had to do was grab the locks (to avoid concurrent modesets)
> and then we could happily nuke the hw (since the flips where lost in the
> CS anyway), and restore it afterwards.
> 
> Then the TDR rewrite came around and broke that, which now makes atomic
> time out waiting for the gpu to complete (because the gpu reset waits for
> the modeset to complete first). Which means if you want to fix this
> without breaking TDR, then you need to cancel the pending atomic commits.
> That seems somewhat what you're attempting here with trying to figure out
> what the latest hw-committed step is (but that function gets it wrong),
> and lots more locking and stuff on top.
> 
> Why exactly can't we just go back to the old world of force-completing
> dead requests on gen4 and earlier? That would be tons simpler imo instead
> of throwing a pile of hacks (which really needs a complete rewrite of the
> atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> earlier for years, going back to that isn't going to hurt anyone.
> 
> Making working gen4 gpu reset contigent on cancellable atomic commits and
> the full commit queue is imo nuts.

And if the GEM folks insist the old behavior can't be restored, then we
just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
in i915_sw_fence. Force-completing all render requests atomic updates
depend upon is imo the simplest solution to this, and we've had a driver
that worked like that for years.

And as long as TDR resubmits the batches the render-glitch will at most be
temporary, but that's totally fine: We're killing the entire display block
in the reset anyway, there's no way the user won't notice _that_ kind of
glitch.

Either way, let's please not over-engineer something for a dead-old
platform when something much, much, much simpler worked for years, and
should keep on working for another few years.
-Daniel

PS: One issue with atomic is that there's the small matter of having to
wait for pending atomic commits to complete before we nuke the display, to
avoid upsetting the display code. We could do that with a dummy commit, or
just having a special wait_for_depencies that waits for all CRTC to
complete their pending atomic updates.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-29 19:26       ` Ville Syrjälä
@ 2017-06-30 13:35         ` Daniel Vetter
  2017-06-30 13:53           ` Ville Syrjälä
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-06-30 13:35 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjï¿½lï¿½ wrote:
> On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > This matches what I thought should be done (I didn't think of using
> > rwsem just a mutex, nice touch). The point I got stuck on was what
> > should be done after the reset? I expected another modeset to return the
> > state back or otherwise the inflight would get confused?
> 
> I guess that can happen. For instance, if we have a crtc_enable() in flight,
> and then we do a reset before it gets committed we would end up doing
> crtc_enable() twice in a row without a crtc_disable in between. For page
> flips and such this shouldn't be a big deal in general.

atomic commits are ordered. You have to wait for the previous ones to
complete before you do a new one. If you don't do that, then all hell
breaks loose.

What you really can't do with atomic (without rewriting everything once
more) is cancel a commit. Pre-atomic we could do that on gen4 since the
mmio flips died with the gpu, but that's the one design change we need to
cope with (plus TDR insisting it can't force-complete requests anymore).

> > > During reset we should be recommiting the state that was committed last.
> > > But for now we'll settle for recommiting the last state for each object.
> > 
> > Ah, I guess that explains the above. What is the complication with
> > restoring the current state as opposed to the next state?
> 
> Well the main thing is just tracking which is the current state. That
> just needs refactoring the .atomic_duplicate_state() calling convention
> across the whole tree so that we can then duplicate the committed state
> rather than the latest state.
> 
> Also due to the commit_hw_done() being potentially done after the
> modeset locks have been dropped, I don't think we can be certain
> of it getting called in the same order as swap_state(), hence
> when we track the committed state in commit_hw_done() we'll have
> to have some way to figure out if our new state is in fact the
> latest committed state for each object or if the calls got
> reordered. We don't insert any dependencies between two commits
> unless they touch the same active crtc, thus this reordering
> seems very much possible. Dunno if we should add some way to add
> such dependeencies whenever the same object is part of two otherwise
> independent commits, or if we just want to try and work with the
> reordered calls. My idea for the latter was some kind of seqno/age
> counter on the object states that allows me to recongnize which
> state is more recent. The object states aren't refcounted so hanging
> on to the wrong pointer could cause an oops the next time we have to
> perform a GPU reset.

Atomic commits are strongly ordered on a given CRTC, so stuff can't be
out-of-order within one. Across them the idea was to just add more CRTC
states into the atomic commit to make sure stuff is ordered correctly.

Laurent just pointed out that for switching planes that doesn't happen
atm, but for i915 we should be safe (I guess I thought too much about i915
when typing the commit tracking code).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:35         ` Daniel Vetter
@ 2017-06-30 13:53           ` Ville Syrjälä
  2017-06-30 15:30             ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2017-06-30 13:53 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Chris Wilson, intel-gfx, dri-devel, stable, Daniel Vetter

On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjï¿½lï¿½ wrote:
> > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > > > 
> > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > will use down_write() making sure no other commits are in progress when
> > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > itself, and we wait for all dependencies before the down_read(), and
> > > > thus there is no chance of deadlocks with this scheme.
> > > 
> > > This matches what I thought should be done (I didn't think of using
> > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > should be done after the reset? I expected another modeset to return the
> > > state back or otherwise the inflight would get confused?
> > 
> > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > and then we do a reset before it gets committed we would end up doing
> > crtc_enable() twice in a row without a crtc_disable in between. For page
> > flips and such this shouldn't be a big deal in general.
> 
> atomic commits are ordered. You have to wait for the previous ones to
> complete before you do a new one. If you don't do that, then all hell
> breaks loose.

What we're effectively doing here is inserting two new commits in
the middle of the timeline, one to disable everything, and another
one to re-enable everything. At the end of the the re-enable we would
want the hardware state should match exactly what was there before
the disable, hence any commits still in the timeline should work
correctly. That is, their old_state will match the hardware state
when it comes time to commit them.

But we can only do that properly after we start to track the committed
state. Without that tracking we can get into trouble wrt. the
hardware state not matching the old state when it comes time to
commit the new state.

> 
> What you really can't do with atomic (without rewriting everything once
> more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> mmio flips died with the gpu, but that's the one design change we need to
> cope with (plus TDR insisting it can't force-complete requests anymore).
> 
> > > > During reset we should be recommiting the state that was committed last.
> > > > But for now we'll settle for recommiting the last state for each object.
> > > 
> > > Ah, I guess that explains the above. What is the complication with
> > > restoring the current state as opposed to the next state?
> > 
> > Well the main thing is just tracking which is the current state. That
> > just needs refactoring the .atomic_duplicate_state() calling convention
> > across the whole tree so that we can then duplicate the committed state
> > rather than the latest state.
> > 
> > Also due to the commit_hw_done() being potentially done after the
> > modeset locks have been dropped, I don't think we can be certain
> > of it getting called in the same order as swap_state(), hence
> > when we track the committed state in commit_hw_done() we'll have
> > to have some way to figure out if our new state is in fact the
> > latest committed state for each object or if the calls got
> > reordered. We don't insert any dependencies between two commits
> > unless they touch the same active crtc, thus this reordering
> > seems very much possible. Dunno if we should add some way to add
> > such dependeencies whenever the same object is part of two otherwise
> > independent commits, or if we just want to try and work with the
> > reordered calls. My idea for the latter was some kind of seqno/age
> > counter on the object states that allows me to recongnize which
> > state is more recent. The object states aren't refcounted so hanging
> > on to the wrong pointer could cause an oops the next time we have to
> > perform a GPU reset.
> 
> Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> out-of-order within one. Across them the idea was to just add more CRTC
> states into the atomic commit to make sure stuff is ordered correctly.

And atomic commits not touching the same crtc will not be ordered in any
way. Thus if they touch the same object (eg. disabled plane or connector)
we can't currently tell if the commit_hw_done() calls happened in the same
order as the swap_state() calls for that particular object.

-- 
Ville Syrjï¿½lï¿½
Intel OTC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:53           ` Ville Syrjälä
@ 2017-06-30 15:30             ` Daniel Vetter
  2017-06-30 15:39               ` Chris Wilson
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-06-30 15:30 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, Chris Wilson, intel-gfx, dri-devel, stable,
	Daniel Vetter

On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjï¿½lï¿½ wrote:
> On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjï¿½lï¿½ wrote:
> > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > > > > 
> > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > will use down_write() making sure no other commits are in progress when
> > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > thus there is no chance of deadlocks with this scheme.
> > > > 
> > > > This matches what I thought should be done (I didn't think of using
> > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > should be done after the reset? I expected another modeset to return the
> > > > state back or otherwise the inflight would get confused?
> > > 
> > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > and then we do a reset before it gets committed we would end up doing
> > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > flips and such this shouldn't be a big deal in general.
> > 
> > atomic commits are ordered. You have to wait for the previous ones to
> > complete before you do a new one. If you don't do that, then all hell
> > breaks loose.
> 
> What we're effectively doing here is inserting two new commits in
> the middle of the timeline, one to disable everything, and another
> one to re-enable everything. At the end of the the re-enable we would
> want the hardware state should match exactly what was there before
> the disable, hence any commits still in the timeline should work
> correctly. That is, their old_state will match the hardware state
> when it comes time to commit them.
> 
> But we can only do that properly after we start to track the committed
> state. Without that tracking we can get into trouble wrt. the
> hardware state not matching the old state when it comes time to
> commit the new state.

Yeah, but my point is that this here is an extremely fancy and fragile
(and afaics in this form, incomplete) fix for something that in the past
was fixed much, much simpler. Why do we need this massive amount of
complexity now? Who's asking for all this (we don't even have anyone yet
asking for fully queued atomic commits, much less on gen4)?

I mean rewriting the entire modeset code is fun and all, but for gen3-4?

> > What you really can't do with atomic (without rewriting everything once
> > more) is cancel a commit. Pre-atomic we could do that on gen4 since the
> > mmio flips died with the gpu, but that's the one design change we need to
> > cope with (plus TDR insisting it can't force-complete requests anymore).
> > 
> > > > > During reset we should be recommiting the state that was committed last.
> > > > > But for now we'll settle for recommiting the last state for each object.
> > > > 
> > > > Ah, I guess that explains the above. What is the complication with
> > > > restoring the current state as opposed to the next state?
> > > 
> > > Well the main thing is just tracking which is the current state. That
> > > just needs refactoring the .atomic_duplicate_state() calling convention
> > > across the whole tree so that we can then duplicate the committed state
> > > rather than the latest state.
> > > 
> > > Also due to the commit_hw_done() being potentially done after the
> > > modeset locks have been dropped, I don't think we can be certain
> > > of it getting called in the same order as swap_state(), hence
> > > when we track the committed state in commit_hw_done() we'll have
> > > to have some way to figure out if our new state is in fact the
> > > latest committed state for each object or if the calls got
> > > reordered. We don't insert any dependencies between two commits
> > > unless they touch the same active crtc, thus this reordering
> > > seems very much possible. Dunno if we should add some way to add
> > > such dependeencies whenever the same object is part of two otherwise
> > > independent commits, or if we just want to try and work with the
> > > reordered calls. My idea for the latter was some kind of seqno/age
> > > counter on the object states that allows me to recongnize which
> > > state is more recent. The object states aren't refcounted so hanging
> > > on to the wrong pointer could cause an oops the next time we have to
> > > perform a GPU reset.
> > 
> > Atomic commits are strongly ordered on a given CRTC, so stuff can't be
> > out-of-order within one. Across them the idea was to just add more CRTC
> > states into the atomic commit to make sure stuff is ordered correctly.
> 
> And atomic commits not touching the same crtc will not be ordered in any
> way. Thus if they touch the same object (eg. disabled plane or connector)
> we can't currently tell if the commit_hw_done() calls happened in the same
> order as the swap_state() calls for that particular object.

Yes, but I think for i915 it's ok, because we don't have planes that move
around (not supported at least), and for other shared stuff we just grab
them all. It is a bug in general though, and for i915 it would probably be
best if we'd add the various drm_crtc_commit waits to the i915_sw_fence.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:30             ` Daniel Vetter
@ 2017-06-30 15:39               ` Chris Wilson
  2017-07-03  8:03                 ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Wilson @ 2017-06-30 15:39 UTC (permalink / raw)
  To: Daniel Vetter, Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Daniel Vetter

Quoting Daniel Vetter (2017-06-30 16:30:59)
> On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
> > On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
> > > On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
> > > > On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
> > > > > Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
> > > > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > > > > > 
> > > > > > Introduce an rw_semaphore to protect the display commits. All normal
> > > > > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > > > > will use down_write() making sure no other commits are in progress when
> > > > > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > > > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > > > > itself, and we wait for all dependencies before the down_read(), and
> > > > > > thus there is no chance of deadlocks with this scheme.
> > > > > 
> > > > > This matches what I thought should be done (I didn't think of using
> > > > > rwsem just a mutex, nice touch). The point I got stuck on was what
> > > > > should be done after the reset? I expected another modeset to return the
> > > > > state back or otherwise the inflight would get confused?
> > > > 
> > > > I guess that can happen. For instance, if we have a crtc_enable() in flight,
> > > > and then we do a reset before it gets committed we would end up doing
> > > > crtc_enable() twice in a row without a crtc_disable in between. For page
> > > > flips and such this shouldn't be a big deal in general.
> > > 
> > > atomic commits are ordered. You have to wait for the previous ones to
> > > complete before you do a new one. If you don't do that, then all hell
> > > breaks loose.
> > 
> > What we're effectively doing here is inserting two new commits in
> > the middle of the timeline, one to disable everything, and another
> > one to re-enable everything. At the end of the the re-enable we would
> > want the hardware state should match exactly what was there before
> > the disable, hence any commits still in the timeline should work
> > correctly. That is, their old_state will match the hardware state
> > when it comes time to commit them.
> > 
> > But we can only do that properly after we start to track the committed
> > state. Without that tracking we can get into trouble wrt. the
> > hardware state not matching the old state when it comes time to
> > commit the new state.
> 
> Yeah, but my point is that this here is an extremely fancy and fragile
> (and afaics in this form, incomplete) fix for something that in the past
> was fixed much, much simpler. Why do we need this massive amount of
> complexity now? Who's asking for all this (we don't even have anyone yet
> asking for fully queued atomic commits, much less on gen4)?

It was never "fixed", it was always borked; broken by design.
-Chris

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 13:30     ` Daniel Vetter
@ 2017-06-30 15:44       ` Ville Syrjälä
  2017-06-30 18:23         ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2017-06-30 15:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Daniel Vetter, Chris Wilson

On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
> > On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
> > > From: Ville Syrjï¿½lï¿½ <ville.syrjala@linux.intel.com>
> > > 
> > > Introduce an rw_semaphore to protect the display commits. All normal
> > > commits use down_read() and hence can proceed in parallel, but GPU reset
> > > will use down_write() making sure no other commits are in progress when
> > > we have to pull the plug on the display engine on pre-g4x platforms.
> > > There are no modeset/gem locks taken inside __intel_atomic_commit_tail()
> > > itself, and we wait for all dependencies before the down_read(), and
> > > thus there is no chance of deadlocks with this scheme.
> > 
> > How does this solve the deadlock? Afaiui the deadlock is that the gpu
> > reset stopped unconditionally completing all requests before it did
> > anything else, including anything with the hw or taking modeset locks.
> > 
> > This ensured that any outstanding flips (we only had pageflips, no atomic
> > ones back then) could complete (although maybe displaying garbage). The
> > only thing we had to do was grab the locks (to avoid concurrent modesets)
> > and then we could happily nuke the hw (since the flips where lost in the
> > CS anyway), and restore it afterwards.
> > 
> > Then the TDR rewrite came around and broke that, which now makes atomic
> > time out waiting for the gpu to complete (because the gpu reset waits for
> > the modeset to complete first). Which means if you want to fix this
> > without breaking TDR, then you need to cancel the pending atomic commits.
> > That seems somewhat what you're attempting here with trying to figure out
> > what the latest hw-committed step is (but that function gets it wrong),
> > and lots more locking and stuff on top.
> > 
> > Why exactly can't we just go back to the old world of force-completing
> > dead requests on gen4 and earlier? That would be tons simpler imo instead
> > of throwing a pile of hacks (which really needs a complete rewrite of the
> > atomic code in i915) in as a work-around. We didn't have TDR on gen4 and
> > earlier for years, going back to that isn't going to hurt anyone.
> > 
> > Making working gen4 gpu reset contigent on cancellable atomic commits and
> > the full commit queue is imo nuts.
> 
> And if the GEM folks insist the old behavior can't be restored, then we
> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> in i915_sw_fence. Force-completing all render requests atomic updates
> depend upon is imo the simplest solution to this, and we've had a driver
> that worked like that for years.

And it used to break all the time. I think we've had to fix it at least
three times by now. So I tend to think it's better to fix it in a way
that won't break so easily.

-- 
Ville Syrjï¿½lï¿½
Intel OTC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:44       ` Ville Syrjälä
@ 2017-06-30 18:23         ` Daniel Vetter
  2017-06-30 18:46           ` Ville Syrjälä
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-06-30 18:23 UTC (permalink / raw)
  To: Ville Syrjälä; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä
<ville.syrjala@linux.intel.com> wrote:
>> And if the GEM folks insist the old behavior can't be restored, then we
>> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
>> in i915_sw_fence. Force-completing all render requests atomic updates
>> depend upon is imo the simplest solution to this, and we've had a driver
>> that worked like that for years.
>
> And it used to break all the time. I think we've had to fix it at least
> three times by now. So I tend to think it's better to fix it in a way
> that won't break so easily.

Why exactly is making the atomic code massive more tricky the easy
fix? That's the part I don't get. Yes it got broken a bunch because no
one runs CI and everyone forgets that gen3/4 reset the display in gpu
reset, but in the end we do have a depency loop, and either the
modeset side or the render side needs to bail out and cancel it's
async stuff (whether that's a request or a nonblocking flip/atomic
commit doesn't matter). In my opinion, cancelling the request (even if
we're clever and only cancel the requests for the modeset waiters,
which isn't to hard to pull off) seems about the simplest option.
Especially since we need that code anyway, even TDR can't safe
everything and resubmit under all circumstances (at least the buggy
batch can't be resubmitted).

Cancelling any kind of atomic commit otoh looks like a lot more
complexity. Why do you think this is the easier, or at least less
fragile option? This patch series is full of FIXMEs, and even the more
complete set seems to have a pile of holes. Plus we need to stop using
obj->state, and I don't see any easy way to test for that (since the
gen3/4 gpu reset case is the only corner cases that currently needs
that).

So not seeing how this is easier or more robust at all. What do I miss?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 18:23         ` Daniel Vetter
@ 2017-06-30 18:46           ` Ville Syrjälä
  2017-07-03  7:55             ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2017-06-30 18:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjï¿½lï¿½
> <ville.syrjala@linux.intel.com> wrote:
> >> And if the GEM folks insist the old behavior can't be restored, then we
> >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> >> in i915_sw_fence. Force-completing all render requests atomic updates
> >> depend upon is imo the simplest solution to this, and we've had a driver
> >> that worked like that for years.
> >
> > And it used to break all the time. I think we've had to fix it at least
> > three times by now. So I tend to think it's better to fix it in a way
> > that won't break so easily.
> 
> Why exactly is making the atomic code massive more tricky the easy
> fix?

I don't see what this massive trickyness is. Compared to the rest of
atomic what I have is absolutely trivial. Just the
duplicate_committed_state() and the '.committed_state = foo'
assignments in hw_done(). That's it really.

> That's the part I don't get. Yes it got broken a bunch because no
> one runs CI and everyone forgets that gen3/4 reset the display in gpu
> reset, but in the end we do have a depency loop, and either the
> modeset side or the render side needs to bail out and cancel it's
> async stuff (whether that's a request or a nonblocking flip/atomic
> commit doesn't matter). In my opinion, cancelling the request (even if
> we're clever and only cancel the requests for the modeset waiters,
> which isn't to hard to pull off) seems about the simplest option.
> Especially since we need that code anyway, even TDR can't safe
> everything and resubmit under all circumstances (at least the buggy
> batch can't be resubmitted).
> 
> Cancelling any kind of atomic commit otoh looks like a lot more
> complexity.

I'm not cancelling anything.

> Why do you think this is the easier, or at least less
> fragile option? This patch series is full of FIXMEs, and even the more
> complete set seems to have a pile of holes. Plus we need to stop using
> obj->state, and I don't see any easy way to test for that (since the
> gen3/4 gpu reset case is the only corner cases that currently needs
> that).

We need to fix that stuff anyway if we ever want to queue up multiple
commits for the same crtc. The stuff I have that is specific to this
reset stuff is actually very simple. The rest is just fixing up the
huge mess we've already made.

-- 
Ville Syrjï¿½lï¿½
Intel OTC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 18:46           ` Ville Syrjälä
@ 2017-07-03  7:55             ` Daniel Vetter
  2017-07-03  9:30               ` Ville Syrjälä
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-07-03  7:55 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjï¿½lï¿½ wrote:
> On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjï¿½lï¿½
> > <ville.syrjala@linux.intel.com> wrote:
> > >> And if the GEM folks insist the old behavior can't be restored, then we
> > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > >> depend upon is imo the simplest solution to this, and we've had a driver
> > >> that worked like that for years.
> > >
> > > And it used to break all the time. I think we've had to fix it at least
> > > three times by now. So I tend to think it's better to fix it in a way
> > > that won't break so easily.
> > 
> > Why exactly is making the atomic code massive more tricky the easy
> > fix?
> 
> I don't see what this massive trickyness is. Compared to the rest of
> atomic what I have is absolutely trivial. Just the
> duplicate_committed_state() and the '.committed_state = foo'
> assignments in hw_done(). That's it really.

>From a quick look and your description it seems full of races. I'm not
sure it'll still be simple once those are fixed.

> > That's the part I don't get. Yes it got broken a bunch because no
> > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > reset, but in the end we do have a depency loop, and either the
> > modeset side or the render side needs to bail out and cancel it's
> > async stuff (whether that's a request or a nonblocking flip/atomic
> > commit doesn't matter). In my opinion, cancelling the request (even if
> > we're clever and only cancel the requests for the modeset waiters,
> > which isn't to hard to pull off) seems about the simplest option.
> > Especially since we need that code anyway, even TDR can't safe
> > everything and resubmit under all circumstances (at least the buggy
> > batch can't be resubmitted).
> > 
> > Cancelling any kind of atomic commit otoh looks like a lot more
> > complexity.
> 
> I'm not cancelling anything.

Well by overtaking the in-flight commit you are at least fighting with
that. Either you need to cancel that one, or insert the gpu reset commit
at the right point (and with the right state). Current code drops that and
instead seems to just hope it doesn't lead to tears too much.

> > Why do you think this is the easier, or at least less
> > fragile option? This patch series is full of FIXMEs, and even the more
> > complete set seems to have a pile of holes. Plus we need to stop using
> > obj->state, and I don't see any easy way to test for that (since the
> > gen3/4 gpu reset case is the only corner cases that currently needs
> > that).
> 
> We need to fix that stuff anyway if we ever want to queue up multiple
> commits for the same crtc. The stuff I have that is specific to this
> reset stuff is actually very simple. The rest is just fixing up the
> huge mess we've already made.

Rewriting the world for a regression fix seems a bit much is all I'm
saying. And I'm not sure your approach works without that "rewrite the
world" step. Defacto what your current patches seem to result in is
- we commit the final sw state in gpu reset
- before we resubmit the rendering

That's much easier to pull of by simply force-completing all
i915_sw_fences before we take any modeset locks in the gpu reset path.
Note that we don't need to force-complete any i915_gem_request, we can
just force-complete the i915_sw_fences the work item is blocked on. Needs
some care to avoid races with a new atomic commit (since we need to
force-complete before we grab locks one might sneak in), but that's a
standard pattern.

Plus we then need to wait for all outstanding nonblocking commits once we
do have all modeset locks, since with atomic holding the locks only syncs
against synchronous hw commits (i.e do a fake synchronous commit before we
nuke the display and we're good). A variant of wait_for_dependencies that
waits for all crtc instead of just the crtc in an atomic commit would do I
think.

None of this requires any of the prep work we need the fancy additional
atomic stuff you plan to do. Which I think is really good for a regression
fix.

So again, why do we need to rewrite the world (since these patches here
seem to just be the racy poc) to fix reset on gen3/4?

I know you want to do all this, but tangling up a regression fix in a
rewrite isn't a good idea in my opinion. I'm not against your long-term
plans, I just think it'd be good if we can have them as orthogonal pieces
if feasible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-06-30 15:39               ` Chris Wilson
@ 2017-07-03  8:03                 ` Daniel Vetter
       [not found]                   ` <149906981450.32733.596187619908591705@mail.alporthouse.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2017-07-03  8:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Ville Syrjälä, intel-gfx, dri-devel, stable

On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Yeah, but my point is that this here is an extremely fancy and fragile
>> (and afaics in this form, incomplete) fix for something that in the past
>> was fixed much, much simpler. Why do we need this massive amount of
>> complexity now? Who's asking for all this (we don't even have anyone yet
>> asking for fully queued atomic commits, much less on gen4)?
>
> It was never "fixed", it was always borked; broken by design.

Hm, what was broken by design in gen3/4 reset? We never bothered to
resubmit rendering when the gpu died, but besides that I'm not aware
of a deisgn issue in that logic. We nuked in-flight pageflips (and
restored those), and we stalled for any pending modesets (grabbing
locks did that since all modesets where blocking), and that made sure
the hw was in a consistent state.

We always leaked the vblank state to userspace, but this approach here
also doesn't fix this. Plus broken rendering, but for these old
platforms I'm not too worried about displaying a few wrong frames
(with the new reset we will resubmit, so proper rendering should show
up soonish) - after all gpu reset nukes the entire display, there's no
way for the user to not notice that.

It would be neat to not have to do that, and Ville has a plan, but
meanwhile we still have this regression at hand that seems to be the
blocker for adding more machines to CI. I'd like to have the least
complex path to get that address (but maybe not long-term fixed, I'm
clear on that). If feasible.

If that's a unicorn, then let's go with Ville's approach, but then I
think we need the full thing with the races properly closed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  7:55             ` Daniel Vetter
@ 2017-07-03  9:30               ` Ville Syrjälä
  2017-07-03 16:48                 ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Ville Syrjälä @ 2017-07-03  9:30 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 09:55:48AM +0200, Daniel Vetter wrote:
> On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjï¿½lï¿½ wrote:
> > On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
> > > On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjï¿½lï¿½
> > > <ville.syrjala@linux.intel.com> wrote:
> > > >> And if the GEM folks insist the old behavior can't be restored, then we
> > > >> just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere
> > > >> in i915_sw_fence. Force-completing all render requests atomic updates
> > > >> depend upon is imo the simplest solution to this, and we've had a driver
> > > >> that worked like that for years.
> > > >
> > > > And it used to break all the time. I think we've had to fix it at least
> > > > three times by now. So I tend to think it's better to fix it in a way
> > > > that won't break so easily.
> > > 
> > > Why exactly is making the atomic code massive more tricky the easy
> > > fix?
> > 
> > I don't see what this massive trickyness is. Compared to the rest of
> > atomic what I have is absolutely trivial. Just the
> > duplicate_committed_state() and the '.committed_state = foo'
> > assignments in hw_done(). That's it really.
> 
> >From a quick look and your description it seems full of races.

This "simple" version has problems, yes. The full versions has just
the one potential race between swap_state() and hw_done(). That
seems like pretty easy to handle.

> I'm not
> sure it'll still be simple once those are fixed.
> 
> > > That's the part I don't get. Yes it got broken a bunch because no
> > > one runs CI and everyone forgets that gen3/4 reset the display in gpu
> > > reset, but in the end we do have a depency loop, and either the
> > > modeset side or the render side needs to bail out and cancel it's
> > > async stuff (whether that's a request or a nonblocking flip/atomic
> > > commit doesn't matter). In my opinion, cancelling the request (even if
> > > we're clever and only cancel the requests for the modeset waiters,
> > > which isn't to hard to pull off) seems about the simplest option.
> > > Especially since we need that code anyway, even TDR can't safe
> > > everything and resubmit under all circumstances (at least the buggy
> > > batch can't be resubmitted).
> > > 
> > > Cancelling any kind of atomic commit otoh looks like a lot more
> > > complexity.
> > 
> > I'm not cancelling anything.
> 
> Well by overtaking the in-flight commit you are at least fighting with
> that. Either you need to cancel that one, or insert the gpu reset commit
> at the right point (and with the right state). Current code drops that and
> instead seems to just hope it doesn't lead to tears too much.

The simple version is pretty much "cross our fingers and hope for
the best" type of thing, yes. The full version I cooked up earlier
doesn't rely on hope.

In fact I was able to come up with a testcase that does make the simple
version explode, whereas the full version works just fine. So we should
abandon the idea of using this simple version actually.

> 
> > > Why do you think this is the easier, or at least less
> > > fragile option? This patch series is full of FIXMEs, and even the more
> > > complete set seems to have a pile of holes. Plus we need to stop using
> > > obj->state, and I don't see any easy way to test for that (since the
> > > gen3/4 gpu reset case is the only corner cases that currently needs
> > > that).
> > 
> > We need to fix that stuff anyway if we ever want to queue up multiple
> > commits for the same crtc. The stuff I have that is specific to this
> > reset stuff is actually very simple. The rest is just fixing up the
> > huge mess we've already made.
> 
> Rewriting the world for a regression fix seems a bit much is all I'm
> saying. And I'm not sure your approach works without that "rewrite the
> world" step. Defacto what your current patches seem to result in is
> - we commit the final sw state in gpu reset
> - before we resubmit the rendering

That's true. And we do appear to need the refactoring to make it
actually work. Most of the refactoring amounts to a couple of small
cocci scripts though so not a big deal really.

> 
> That's much easier to pull of by simply force-completing all
> i915_sw_fences before we take any modeset locks in the gpu reset path.
> Note that we don't need to force-complete any i915_gem_request, we can
> just force-complete the i915_sw_fences the work item is blocked on. Needs
> some care to avoid races with a new atomic commit (since we need to
> force-complete before we grab locks one might sneak in), but that's a
> standard pattern.
> 
> Plus we then need to wait for all outstanding nonblocking commits once we
> do have all modeset locks, since with atomic holding the locks only syncs
> against synchronous hw commits (i.e do a fake synchronous commit before we
> nuke the display and we're good). A variant of wait_for_dependencies that
> waits for all crtc instead of just the crtc in an atomic commit would do I
> think.
> 
> None of this requires any of the prep work we need the fancy additional
> atomic stuff you plan to do. Which I think is really good for a regression
> fix.
> 
> So again, why do we need to rewrite the world (since these patches here
> seem to just be the racy poc) to fix reset on gen3/4?
> 
> I know you want to do all this, but tangling up a regression fix in a
> rewrite isn't a good idea in my opinion. I'm not against your long-term
> plans, I just think it'd be good if we can have them as orthogonal pieces
> if feasible.

I've pretty much decided that I'll be happy if I can solve this for
future kernels. If someone else wants to try and cook up some kind
of simple regression fix I don't have any real objections.

-- 
Ville Syrjï¿½lï¿½
Intel OTC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
       [not found]                   ` <149906981450.32733.596187619908591705@mail.alporthouse.com>
@ 2017-07-03 16:42                     ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:42 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Daniel Vetter, Ville Syrjälä, intel-gfx, dri-devel,
	stable

On Mon, Jul 03, 2017 at 09:16:54AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-07-03 09:03:36)
> > On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > >> Yeah, but my point is that this here is an extremely fancy and fragile
> > >> (and afaics in this form, incomplete) fix for something that in the past
> > >> was fixed much, much simpler. Why do we need this massive amount of
> > >> complexity now? Who's asking for all this (we don't even have anyone yet
> > >> asking for fully queued atomic commits, much less on gen4)?
> > >
> > > It was never "fixed", it was always borked; broken by design.
> > 
> > Hm, what was broken by design in gen3/4 reset? We never bothered to
> > resubmit rendering when the gpu died, but besides that I'm not aware
> > of a deisgn issue in that logic. We nuked in-flight pageflips (and
> > restored those), and we stalled for any pending modesets (grabbing
> > locks did that since all modesets where blocking), and that made sure
> > the hw was in a consistent state.
> 
> KMS reset was taking mutexes within reset without any means of breaking the
> inherent deadlock, instead relying on something else to magically fix
> it. We only ever engineered around struct_mutex for reset, the remains
> of the deadlock upon struct_mutex within modeset is an issue but not the
> one causing trouble here.

So on the kms side we've had:
- grab kms locks -> grab struct_mutex -> wait for rendering

- on the reset side we've had the same order afair, with the caveat that
  we've broken the "wait for rendering" complete before trying to grab any
  of the locks in the reset path.

I thought that the problem is that gpu reset stopping force-completing
everything asap, which then lead the kms side to time-out at a point where
it shouldn't and start to fall over.

So before

commit 221fe7994554cc3985fc5d761ed7e44dcae0fa52
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 9 14:11:51 2016 +0100

    drm/i915: Perform a direct reset of the GPU from the waiter

what was the deadlock we've had? Besides breaking the "wait for rendering"
I don't remember any inversions we've had. And for the breaking we've had
a fairly complex dance of barriers and reset_in_progress and waking up
waiters to make sure it would catch them all ...

> Please do note the bugs that indicate that even without modeset reset,
> hw is not in a consistent state.

So there's more bugs, not sure how that is relevant for gpu reset?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
  2017-07-03  9:30               ` Ville Syrjälä
@ 2017-07-03 16:48                 ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-03 16:48 UTC (permalink / raw)
  To: Ville Syrjälä
  Cc: Daniel Vetter, intel-gfx, dri-devel, stable, Chris Wilson

On Mon, Jul 03, 2017 at 12:30:48PM +0300, Ville Syrjï¿½lï¿½ wrote:
> I've pretty much decided that I'll be happy if I can solve this for
> future kernels. If someone else wants to try and cook up some kind
> of simple regression fix I don't have any real objections.

Imo fixing the regression isn't the job of the display side, since display
isn't the one that broke the reset contract. Well apart from not syncing
with pending non-blocking atomic modeset commits, which we could fix
easily by dropping DRIVER_ATOMIC from gen2-4. But I'm still trying to
figure out why Chris thinks this isn't a gem regression ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-07-03 16:48 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20170629134948.5614-1-ville.syrjala@linux.intel.com>
     [not found] ` <20170629134948.5614-6-ville.syrjala@linux.intel.com>
2017-06-29 14:36   ` [PATCH v4 5/5] drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore ville.syrjala
2017-06-29 17:57     ` Chris Wilson
2017-06-29 19:26       ` Ville Syrjälä
2017-06-30 13:35         ` Daniel Vetter
2017-06-30 13:53           ` Ville Syrjälä
2017-06-30 15:30             ` Daniel Vetter
2017-06-30 15:39               ` Chris Wilson
2017-07-03  8:03                 ` Daniel Vetter
     [not found]                   ` <149906981450.32733.596187619908591705@mail.alporthouse.com>
2017-07-03 16:42                     ` Daniel Vetter
2017-06-30 13:25   ` [PATCH " Daniel Vetter
2017-06-30 13:30     ` Daniel Vetter
2017-06-30 15:44       ` Ville Syrjälä
2017-06-30 18:23         ` Daniel Vetter
2017-06-30 18:46           ` Ville Syrjälä
2017-07-03  7:55             ` Daniel Vetter
2017-07-03  9:30               ` Ville Syrjälä
2017-07-03 16:48                 ` Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).