Re: [PATCH v3 2/2] drm/i915: Restore engine->submit_request before unwedging

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Subject: Re: [PATCH v3 2/2] drm/i915: Restore engine->submit_request before unwedging
Date: Wed, 15 Mar 2017 09:23:01 +0000	[thread overview]
Message-ID: <cdf9081e-ed02-c03d-9dbd-00332a53683d@linux.intel.com> (raw)
In-Reply-To: <20170314093441.7397-2-chris@chris-wilson.co.uk>

On 14/03/2017 09:34, Chris Wilson wrote:
> When we wedge the device, we override engine->submit_request with a nop
> to ensure that all in-flight requests are marked in error. However, igt
> would like to unwedge the device to test -EIO handling. This requires us
> to flush those in-flight requests and restore the original
> engine->submit_request.
>
> v2: Use a vfunc to unify enabling request submission to engines
> v3: Split new vfunc to a separate patch.
>
> Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
> Testcase: igt/gem_eio
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c |  2 +-
>  drivers/gpu/drm/i915/i915_drv.h |  1 +
>  drivers/gpu/drm/i915/i915_gem.c | 51 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e312b61ba6bb..11d1066b673c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1822,7 +1822,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  		return;
>
>  	/* Clear any previous failed attempts at recovery. Time to try again. */
> -	__clear_bit(I915_WEDGED, &error->flags);
> +	i915_gem_unset_wedged(dev_priv);
>  	error->reset_count++;
>
>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 48ff64812289..53a791d8d992 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3406,6 +3406,7 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
>  void i915_gem_reset(struct drm_i915_private *dev_priv);
>  void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
> +void i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
>
>  void i915_gem_init_mmio(struct drm_i915_private *i915);
>  int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 202bb850f260..e06830916a05 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3000,6 +3000,57 @@ void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
>  	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
>  }
>
> +void i915_gem_unset_wedged(struct drm_i915_private *i915)
> +{
> +	struct i915_gem_timeline *tl;
> +	int i;
> +
> +	lockdep_assert_held(&i915->drm.struct_mutex);
> +	if (!test_bit(I915_WEDGED, &i915->gpu_error.flags))
> +		return;
> +
> +	/* Before unwedging, make sure that all pending operations
> +	 * are flushed and errored out. No more can be submitted until
> +	 * we reset the wedged bit.
> +	 */
> +	list_for_each_entry(tl, &i915->gt.timelines, link) {
> +		for (i = 0; i < ARRAY_SIZE(tl->engine); i++) {
> +			struct drm_i915_gem_request *rq;
> +
> +			rq = i915_gem_active_peek(&tl->engine[i].last_request,
> +						  &i915->drm.struct_mutex);
> +			if (!rq)
> +				continue;
> +
> +			/* We can't use our normal waiter as we want to
> +			 * avoid recursively trying to handle the current
> +			 * reset. The basic dma_fence_default_wait() installs
> +			 * a callback for dma_fence_signal(), which is
> +			 * triggered by our nop handler (indirectly, the
> +			 * callback enables the signaler thread which is
> +			 * woken by the nop_submit_request() advancing the seqno
> +			 * and when the seqno passes the fence, the signaler
> +			 * then signals the fence waking us up).
> +			 */
> +			dma_fence_default_wait(&rq->fence, false,
> +					       MAX_SCHEDULE_TIMEOUT);
> +		}
> +	}
> +
> +	/* Undo nop_submit_request. We prevent all new i915 requests from
> +	 * being queued (by disallowing execbuf whilst wedged) so having
> +	 * waited for all active requests above, we know the system is idle
> +	 * and do not have to worry about a thread being inside
> +	 * engine->submit_request() as we swap over. So unlike installing
> +	 * the nop_submit_request on reset, we can do this from normal
> +	 * context and do not require stop_machine().
> +	 */
> +	intel_engines_enable_submission(i915);

So the point of the dma_fence_default_wait above is it to ensure all 
nop_submit_request call backs have completed? I don't at the moment 
understand how could there be such callbacks since unwedge is happening 
after the wedge. So the wedge already installed the nop handler, and by 
the time we get to another reset attempt, isn't it already guaranteed 
all of those have exited?

Regards,

Tvrtko

> +
> +	smp_mb__before_atomic(); /* complete takeover before enabling execbuf */
> +	clear_bit(I915_WEDGED, &i915->gpu_error.flags);
> +}
> +
>  static void
>  i915_gem_retire_work_handler(struct work_struct *work)
>  {
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx