intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/3] drm/i915: Disable hangcheck when wedged
@ 2016-11-18  9:37 Chris Wilson
  2016-11-18  9:37 ` [PATCH 2/3] drm/i915: Complete requests in nop_submit_request Chris Wilson
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Chris Wilson @ 2016-11-18  9:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

If the gpu reset fails and the machine is terminally wedged, further
hangchecks achieve nothing but noise. Disable them, with a corollary
that we re-enable hangchecking after a successful GPU reset in case the
user is artificially bringing the machine back to life through the debug
interface.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.c        | 2 ++
 drivers/gpu/drm/i915/intel_hangcheck.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 445fec9c2841..d3ee72449025 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1814,6 +1814,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 		goto error;
 	}
 
+	i915_queue_hangcheck(dev_priv);
+
 wakeup:
 	wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
 	return;
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 53df5b11bff4..c0cfa5b8b87e 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -319,6 +319,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (!READ_ONCE(dev_priv->gt.awake))
 		return;
 
+	if (i915_terminally_wedged(&dev_priv->gpu_error))
+		return;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
-- 
2.10.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] drm/i915: Complete requests in nop_submit_request
  2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
@ 2016-11-18  9:37 ` Chris Wilson
  2016-11-18 12:56   ` Tvrtko Ursulin
  2016-11-18  9:37 ` [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler Chris Wilson
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Chris Wilson @ 2016-11-18  9:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
transfer onto execution timeline to actual hw submission") the
global seqno advance was deferred until the submit_request callback.
After wedging the GPU, we were installing a nop_submit_request handler
(to avoid waking up the dead hw) but I had missed converting this over
to the new scheme. Under the new scheme, we have to explicitly call
i915_gem_submit_request() from the submit_request handler to mark the
request as on the hardware. If we don't the request is always pending,
and any waiter will continue to wait indefinitely and hangcheck will not
be able to resolve the lockup.

References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
Testcase: igt/gem_eio/in-flight
Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7b9f5b99b0f3..7037a8b26903 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 
 static void nop_submit_request(struct drm_i915_gem_request *request)
 {
+	i915_gem_request_submit(request);
+	intel_engine_init_global_seqno(request->engine, request->global_seqno);
 }
 
 static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
-- 
2.10.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler
  2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
  2016-11-18  9:37 ` [PATCH 2/3] drm/i915: Complete requests in nop_submit_request Chris Wilson
@ 2016-11-18  9:37 ` Chris Wilson
  2016-11-18 14:38   ` Chris Wilson
  2016-11-21 12:40   ` [PATCH v2] " Chris Wilson
  2016-11-18 10:16 ` ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915: Disable hangcheck when wedged Patchwork
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: Chris Wilson @ 2016-11-18  9:37 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

In order to prevent a race between the old callback submitting an
incomplete request and i915_gem_set_wedged() installing its nop handler,
we must ensure that the swap occurs when the machine is idle
(stop_machine).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7037a8b26903..6b1df3de90f0 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -38,6 +38,7 @@
 #include <linux/reservation.h>
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
+#include <linux/stop_machine.h>
 #include <linux/swap.h>
 #include <linux/pci.h>
 #include <linux/dma-buf.h>
@@ -2768,6 +2769,12 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
 
 static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
 {
+	/* We need to be sure that no thread is running the old callback as
+	 * we install the nop handler (otherwise we would submit a request
+	 * to hardware that will never complete). In order to prevent this
+	 * race, we wait until the machine is idle before making the swap
+	 * (using stop_machine()).
+	 */
 	engine->submit_request = nop_submit_request;
 
 	/* Mark all pending requests as complete so that any concurrent
@@ -2798,19 +2805,27 @@ static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
 	}
 }
 
-void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
+static int __i915_gem_set_wedged_BKL(void *data)
 {
+	struct drm_i915_private *i915 = data;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
+	i915_gem_context_lost(i915);
+	for_each_engine(engine, i915, id)
+		i915_gem_cleanup_engine(engine);
+
+	return 0;
+}
+
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
+{
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 
-	i915_gem_context_lost(dev_priv);
-	for_each_engine(engine, dev_priv, id)
-		i915_gem_cleanup_engine(engine);
-	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
+	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
 
+	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
 	i915_gem_retire_requests(dev_priv);
 }
 
-- 
2.10.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915: Disable hangcheck when wedged
  2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
  2016-11-18  9:37 ` [PATCH 2/3] drm/i915: Complete requests in nop_submit_request Chris Wilson
  2016-11-18  9:37 ` [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler Chris Wilson
@ 2016-11-18 10:16 ` Patchwork
  2016-11-18 11:40 ` [PATCH 1/3] " Mika Kuoppala
  2016-11-21 16:45 ` ✗ Fi.CI.BAT: warning for series starting with [1/3] drm/i915: Disable hangcheck when wedged (rev2) Patchwork
  4 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2016-11-18 10:16 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915: Disable hangcheck when wedged
URL   : https://patchwork.freedesktop.org/series/15543/
State : success

== Summary ==

Series 15543v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/15543/revisions/1/mbox/


fi-bdw-5557u     total:244  pass:229  dwarn:0   dfail:0   fail:0   skip:15 
fi-bsw-n3050     total:244  pass:204  dwarn:0   dfail:0   fail:0   skip:40 
fi-byt-j1900     total:244  pass:216  dwarn:0   dfail:0   fail:0   skip:28 
fi-byt-n2820     total:244  pass:212  dwarn:0   dfail:0   fail:0   skip:32 
fi-hsw-4770r     total:244  pass:224  dwarn:0   dfail:0   fail:0   skip:20 
fi-ilk-650       total:244  pass:191  dwarn:0   dfail:0   fail:0   skip:53 
fi-ivb-3520m     total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-ivb-3770      total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-kbl-7200u     total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-skl-6260u     total:244  pass:230  dwarn:0   dfail:0   fail:0   skip:14 
fi-skl-6700hq    total:244  pass:223  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6700k     total:244  pass:222  dwarn:1   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:244  pass:230  dwarn:0   dfail:0   fail:0   skip:14 
fi-snb-2520m     total:244  pass:212  dwarn:0   dfail:0   fail:0   skip:32 
fi-snb-2600      total:244  pass:211  dwarn:0   dfail:0   fail:0   skip:33 

65a7649362717adfae612dc8fbc7555e3f3ea64f drm-intel-nightly: 2016y-11m-18d-07h-27m-44s UTC integration manifest
b9c1edd drm/i915: Stop the machine as we install the wedged submit_request handler
6ec0221 drm/i915: Complete requests in nop_submit_request
ada8a5b drm/i915: Disable hangcheck when wedged

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3050/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] drm/i915: Disable hangcheck when wedged
  2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
                   ` (2 preceding siblings ...)
  2016-11-18 10:16 ` ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915: Disable hangcheck when wedged Patchwork
@ 2016-11-18 11:40 ` Mika Kuoppala
  2016-11-21 16:45 ` ✗ Fi.CI.BAT: warning for series starting with [1/3] drm/i915: Disable hangcheck when wedged (rev2) Patchwork
  4 siblings, 0 replies; 13+ messages in thread
From: Mika Kuoppala @ 2016-11-18 11:40 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> If the gpu reset fails and the machine is terminally wedged, further
> hangchecks achieve nothing but noise. Disable them, with a corollary
> that we re-enable hangchecking after a successful GPU reset in case the
> user is artificially bringing the machine back to life through the debug
> interface.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/i915_drv.c        | 2 ++
>  drivers/gpu/drm/i915/intel_hangcheck.c | 3 +++
>  2 files changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 445fec9c2841..d3ee72449025 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1814,6 +1814,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  		goto error;
>  	}
>  
> +	i915_queue_hangcheck(dev_priv);
> +
>  wakeup:
>  	wake_up_bit(&error->flags, I915_RESET_IN_PROGRESS);
>  	return;
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 53df5b11bff4..c0cfa5b8b87e 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -319,6 +319,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	if (!READ_ONCE(dev_priv->gt.awake))
>  		return;
>  
> +	if (i915_terminally_wedged(&dev_priv->gpu_error))
> +		return;
> +
>  	/* As enabling the GPU requires fairly extensive mmio access,
>  	 * periodically arm the mmio checker to see if we are triggering
>  	 * any invalid access.
> -- 
> 2.10.2
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] drm/i915: Complete requests in nop_submit_request
  2016-11-18  9:37 ` [PATCH 2/3] drm/i915: Complete requests in nop_submit_request Chris Wilson
@ 2016-11-18 12:56   ` Tvrtko Ursulin
  2016-11-18 13:03     ` Mika Kuoppala
  0 siblings, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2016-11-18 12:56 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: mika.kuoppala


On 18/11/2016 09:37, Chris Wilson wrote:
> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
> transfer onto execution timeline to actual hw submission") the
> global seqno advance was deferred until the submit_request callback.
> After wedging the GPU, we were installing a nop_submit_request handler
> (to avoid waking up the dead hw) but I had missed converting this over
> to the new scheme. Under the new scheme, we have to explicitly call
> i915_gem_submit_request() from the submit_request handler to mark the
> request as on the hardware. If we don't the request is always pending,
> and any waiter will continue to wait indefinitely and hangcheck will not
> be able to resolve the lockup.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
> Testcase: igt/gem_eio/in-flight
> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 7b9f5b99b0f3..7037a8b26903 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>
>  static void nop_submit_request(struct drm_i915_gem_request *request)
>  {
> +	i915_gem_request_submit(request);
> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);

Slight deja vu but not sure - we don't have a way of marking these as 
failed so what happens in practice here? This as at the point of no 
return, no replay, or allowing the context to recover or something?

>  }
>
>  static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] drm/i915: Complete requests in nop_submit_request
  2016-11-18 12:56   ` Tvrtko Ursulin
@ 2016-11-18 13:03     ` Mika Kuoppala
  2016-11-18 13:33       ` Chris Wilson
  0 siblings, 1 reply; 13+ messages in thread
From: Mika Kuoppala @ 2016-11-18 13:03 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx

Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:

> On 18/11/2016 09:37, Chris Wilson wrote:
>> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
>> transfer onto execution timeline to actual hw submission") the
>> global seqno advance was deferred until the submit_request callback.
>> After wedging the GPU, we were installing a nop_submit_request handler
>> (to avoid waking up the dead hw) but I had missed converting this over
>> to the new scheme. Under the new scheme, we have to explicitly call
>> i915_gem_submit_request() from the submit_request handler to mark the
>> request as on the hardware. If we don't the request is always pending,
>> and any waiter will continue to wait indefinitely and hangcheck will not
>> be able to resolve the lockup.
>>
>> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
>> Testcase: igt/gem_eio/in-flight
>> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 7b9f5b99b0f3..7037a8b26903 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>>
>>  static void nop_submit_request(struct drm_i915_gem_request *request)
>>  {
>> +	i915_gem_request_submit(request);
>> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
>
> Slight deja vu but not sure - we don't have a way of marking these as 
> failed so what happens in practice here? This as at the point of no 
> return, no replay, or allowing the context to recover or something?
>

I have another question. If we do this, why not
do it (also) when we prepare to start in the postfix of failed
request.

In another words, write the seqno with mmio, and start
replay not in the postfi,  but at the start of next whole request?

-Mika

>>  }
>>
>>  static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>>
>
> Regards,
>
> Tvrtko
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] drm/i915: Complete requests in nop_submit_request
  2016-11-18 13:03     ` Mika Kuoppala
@ 2016-11-18 13:33       ` Chris Wilson
  2016-11-18 14:22         ` Tvrtko Ursulin
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Wilson @ 2016-11-18 13:33 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Fri, Nov 18, 2016 at 03:03:21PM +0200, Mika Kuoppala wrote:
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
> 
> > On 18/11/2016 09:37, Chris Wilson wrote:
> >> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
> >> transfer onto execution timeline to actual hw submission") the
> >> global seqno advance was deferred until the submit_request callback.
> >> After wedging the GPU, we were installing a nop_submit_request handler
> >> (to avoid waking up the dead hw) but I had missed converting this over
> >> to the new scheme. Under the new scheme, we have to explicitly call
> >> i915_gem_submit_request() from the submit_request handler to mark the
> >> request as on the hardware. If we don't the request is always pending,
> >> and any waiter will continue to wait indefinitely and hangcheck will not
> >> be able to resolve the lockup.
> >>
> >> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
> >> Testcase: igt/gem_eio/in-flight
> >> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
> >> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >> index 7b9f5b99b0f3..7037a8b26903 100644
> >> --- a/drivers/gpu/drm/i915/i915_gem.c
> >> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
> >>
> >>  static void nop_submit_request(struct drm_i915_gem_request *request)
> >>  {
> >> +	i915_gem_request_submit(request);
> >> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> >
> > Slight deja vu but not sure - we don't have a way of marking these as 
> > failed so what happens in practice here? This as at the point of no 
> > return, no replay, or allowing the context to recover or something?

This is past the point of no return. We failed to reset the hardware, so
we need to catch all the inflight requests and signal them. Treating
them in flight is nasty as they are a part of a giant web of
dependencies, so I wanted to have them just complete quietly as they
became ready (to be careful that we don't start dependent third party
work before other third party work finishes).
 
> I have another question. If we do this, why not
> do it (also) when we prepare to start in the postfix of failed
> request.

Different situation entirely, There we are fixing up the hw in such a
way that we want to avoid locks, specifically the struct_mutex, in the
future. (Probably moot now as we probably have the tools to avoid
struct_mutex when handling active requests.) As the hardware is working
we don't want to manually send the interrupt as that just causes a race
with the hardware (i.e. we will treat the request as free whilst it is
still in use by the hw). A very small but possible explosion.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] drm/i915: Complete requests in nop_submit_request
  2016-11-18 13:33       ` Chris Wilson
@ 2016-11-18 14:22         ` Tvrtko Ursulin
  0 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2016-11-18 14:22 UTC (permalink / raw)
  To: Chris Wilson, Mika Kuoppala, intel-gfx


On 18/11/2016 13:33, Chris Wilson wrote:
> On Fri, Nov 18, 2016 at 03:03:21PM +0200, Mika Kuoppala wrote:
>> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
>>
>>> On 18/11/2016 09:37, Chris Wilson wrote:
>>>> Since the submit/execute split in commit d55ac5bf97c6 ("drm/i915: Defer
>>>> transfer onto execution timeline to actual hw submission") the
>>>> global seqno advance was deferred until the submit_request callback.
>>>> After wedging the GPU, we were installing a nop_submit_request handler
>>>> (to avoid waking up the dead hw) but I had missed converting this over
>>>> to the new scheme. Under the new scheme, we have to explicitly call
>>>> i915_gem_submit_request() from the submit_request handler to mark the
>>>> request as on the hardware. If we don't the request is always pending,
>>>> and any waiter will continue to wait indefinitely and hangcheck will not
>>>> be able to resolve the lockup.
>>>>
>>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=98748
>>>> Testcase: igt/gem_eio/in-flight
>>>> Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> ---
>>>>  drivers/gpu/drm/i915/i915_gem.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>> index 7b9f5b99b0f3..7037a8b26903 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -2762,6 +2762,8 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>>>>
>>>>  static void nop_submit_request(struct drm_i915_gem_request *request)
>>>>  {
>>>> +	i915_gem_request_submit(request);
>>>> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
>>>
>>> Slight deja vu but not sure - we don't have a way of marking these as
>>> failed so what happens in practice here? This as at the point of no
>>> return, no replay, or allowing the context to recover or something?
>
> This is past the point of no return. We failed to reset the hardware, so
> we need to catch all the inflight requests and signal them. Treating
> them in flight is nasty as they are a part of a giant web of
> dependencies, so I wanted to have them just complete quietly as they
> became ready (to be careful that we don't start dependent third party
> work before other third party work finishes).

Right, I wanted to confirm if I was reading it correctly. In this case:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler
  2016-11-18  9:37 ` [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler Chris Wilson
@ 2016-11-18 14:38   ` Chris Wilson
  2016-11-21 12:40   ` [PATCH v2] " Chris Wilson
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Wilson @ 2016-11-18 14:38 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

On Fri, Nov 18, 2016 at 09:37:08AM +0000, Chris Wilson wrote:
> In order to prevent a race between the old callback submitting an
> incomplete request and i915_gem_set_wedged() installing its nop handler,
> we must ensure that the swap occurs when the machine is idle
> (stop_machine).
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 7037a8b26903..6b1df3de90f0 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -38,6 +38,7 @@
>  #include <linux/reservation.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/slab.h>
> +#include <linux/stop_machine.h>
>  #include <linux/swap.h>
>  #include <linux/pci.h>
>  #include <linux/dma-buf.h>
> @@ -2768,6 +2769,12 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>  
>  static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>  {
> +	/* We need to be sure that no thread is running the old callback as
> +	 * we install the nop handler (otherwise we would submit a request
> +	 * to hardware that will never complete). In order to prevent this
> +	 * race, we wait until the machine is idle before making the swap
> +	 * (using stop_machine()).
> +	 */
>  	engine->submit_request = nop_submit_request;
>  
>  	/* Mark all pending requests as complete so that any concurrent
> @@ -2798,19 +2805,27 @@ static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>  	}
>  }
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> +static int __i915_gem_set_wedged_BKL(void *data)
>  {
> +	struct drm_i915_private *i915 = data;
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  
> +	i915_gem_context_lost(i915);
> +	for_each_engine(engine, i915, id)
> +		i915_gem_cleanup_engine(engine);
> +
> +	return 0;
> +}
> +
> +void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> +{
>  	lockdep_assert_held(&dev_priv->drm.struct_mutex);
>  	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
>  
> -	i915_gem_context_lost(dev_priv);
> -	for_each_engine(engine, dev_priv, id)
> -		i915_gem_cleanup_engine(engine);
> -	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
> +	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
>  
> +	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
>  	i915_gem_retire_requests(dev_priv);

mod_delayed_work() should be after retire_requests as retire_requests
should hopefully also do mod_delayed_work (we could prefix if with
if (!active_requests) mod_delayed_work())

Also considering pull context lost is proably best after the
stop_machine, before the retire_requests. There is no reason to do that
inside the stop_machine (it's just the takeover of the fence callback
that is racy with fence signaling).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2] drm/i915: Stop the machine as we install the wedged submit_request handler
  2016-11-18  9:37 ` [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler Chris Wilson
  2016-11-18 14:38   ` Chris Wilson
@ 2016-11-21 12:40   ` Chris Wilson
  2016-11-22 14:07     ` Mika Kuoppala
  1 sibling, 1 reply; 13+ messages in thread
From: Chris Wilson @ 2016-11-21 12:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: mika.kuoppala

In order to prevent a race between the old callback submitting an
incomplete request and i915_gem_set_wedged() installing its nop handler,
we must ensure that the swap occurs when the machine is idle
(stop_machine).

v2: move context lost from out of BKL.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index d0dcaf35b429..e80ad6906fb4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -38,6 +38,7 @@
 #include <linux/reservation.h>
 #include <linux/shmem_fs.h>
 #include <linux/slab.h>
+#include <linux/stop_machine.h>
 #include <linux/swap.h>
 #include <linux/pci.h>
 #include <linux/dma-buf.h>
@@ -2770,6 +2771,12 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
 
 static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
 {
+	/* We need to be sure that no thread is running the old callback as
+	 * we install the nop handler (otherwise we would submit a request
+	 * to hardware that will never complete). In order to prevent this
+	 * race, we wait until the machine is idle before making the swap
+	 * (using stop_machine()).
+	 */
 	engine->submit_request = nop_submit_request;
 
 	/* Mark all pending requests as complete so that any concurrent
@@ -2800,20 +2807,29 @@ static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
 	}
 }
 
-void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
+static int __i915_gem_set_wedged_BKL(void *data)
 {
+	struct drm_i915_private *i915 = data;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
+	for_each_engine(engine, i915, id)
+		i915_gem_cleanup_engine(engine);
+
+	return 0;
+}
+
+void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
+{
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
 
-	i915_gem_context_lost(dev_priv);
-	for_each_engine(engine, dev_priv, id)
-		i915_gem_cleanup_engine(engine);
-	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
+	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
 
+	i915_gem_context_lost(dev_priv);
 	i915_gem_retire_requests(dev_priv);
+
+	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
 }
 
 static void
-- 
2.10.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* ✗ Fi.CI.BAT: warning for series starting with [1/3] drm/i915: Disable hangcheck when wedged (rev2)
  2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
                   ` (3 preceding siblings ...)
  2016-11-18 11:40 ` [PATCH 1/3] " Mika Kuoppala
@ 2016-11-21 16:45 ` Patchwork
  4 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2016-11-21 16:45 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/3] drm/i915: Disable hangcheck when wedged (rev2)
URL   : https://patchwork.freedesktop.org/series/15543/
State : warning

== Summary ==

Series 15543v2 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/15543/revisions/2/mbox/

Test kms_pipe_crc_basic:
        Subgroup hang-read-crc-pipe-b:
                pass       -> DMESG-WARN (fi-ilk-650)

fi-bdw-5557u     total:244  pass:229  dwarn:0   dfail:0   fail:0   skip:15 
fi-bsw-n3050     total:244  pass:204  dwarn:0   dfail:0   fail:0   skip:40 
fi-bxt-t5700     total:244  pass:216  dwarn:0   dfail:0   fail:0   skip:28 
fi-byt-j1900     total:244  pass:216  dwarn:0   dfail:0   fail:0   skip:28 
fi-byt-n2820     total:244  pass:212  dwarn:0   dfail:0   fail:0   skip:32 
fi-hsw-4770      total:244  pass:224  dwarn:0   dfail:0   fail:0   skip:20 
fi-hsw-4770r     total:244  pass:224  dwarn:0   dfail:0   fail:0   skip:20 
fi-ilk-650       total:244  pass:190  dwarn:1   dfail:0   fail:0   skip:53 
fi-ivb-3520m     total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-ivb-3770      total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-kbl-7200u     total:244  pass:222  dwarn:0   dfail:0   fail:0   skip:22 
fi-skl-6260u     total:244  pass:230  dwarn:0   dfail:0   fail:0   skip:14 
fi-skl-6700hq    total:244  pass:223  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6700k     total:244  pass:222  dwarn:1   dfail:0   fail:0   skip:21 
fi-skl-6770hq    total:244  pass:230  dwarn:0   dfail:0   fail:0   skip:14 
fi-snb-2520m     total:244  pass:212  dwarn:0   dfail:0   fail:0   skip:32 
fi-snb-2600      total:244  pass:211  dwarn:0   dfail:0   fail:0   skip:33 

d6149d212b69a8e1d9229fe80fca034a0abe1d0e drm-intel-nightly: 2016y-11m-21d-12h-48m-13s UTC integration manifest
1886e59 drm/i915: Stop the machine as we install the wedged submit_request handler
0012d66 drm/i915: Complete requests in nop_submit_request
8355d47 drm/i915: Disable hangcheck when wedged

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3071/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] drm/i915: Stop the machine as we install the wedged submit_request handler
  2016-11-21 12:40   ` [PATCH v2] " Chris Wilson
@ 2016-11-22 14:07     ` Mika Kuoppala
  0 siblings, 0 replies; 13+ messages in thread
From: Mika Kuoppala @ 2016-11-22 14:07 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> In order to prevent a race between the old callback submitting an
> incomplete request and i915_gem_set_wedged() installing its nop handler,
> we must ensure that the swap occurs when the machine is idle
> (stop_machine).
>
> v2: move context lost from out of BKL.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> ---
>  drivers/gpu/drm/i915/i915_gem.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index d0dcaf35b429..e80ad6906fb4 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -38,6 +38,7 @@
>  #include <linux/reservation.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/slab.h>
> +#include <linux/stop_machine.h>
>  #include <linux/swap.h>
>  #include <linux/pci.h>
>  #include <linux/dma-buf.h>
> @@ -2770,6 +2771,12 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>  
>  static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>  {
> +	/* We need to be sure that no thread is running the old callback as
> +	 * we install the nop handler (otherwise we would submit a request
> +	 * to hardware that will never complete). In order to prevent this
> +	 * race, we wait until the machine is idle before making the swap
> +	 * (using stop_machine()).
> +	 */
>  	engine->submit_request = nop_submit_request;
>  
>  	/* Mark all pending requests as complete so that any concurrent
> @@ -2800,20 +2807,29 @@ static void i915_gem_cleanup_engine(struct intel_engine_cs *engine)
>  	}
>  }
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> +static int __i915_gem_set_wedged_BKL(void *data)
>  {
> +	struct drm_i915_private *i915 = data;
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  
> +	for_each_engine(engine, i915, id)
> +		i915_gem_cleanup_engine(engine);
> +
> +	return 0;
> +}
> +
> +void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> +{
>  	lockdep_assert_held(&dev_priv->drm.struct_mutex);
>  	set_bit(I915_WEDGED, &dev_priv->gpu_error.flags);
>  
> -	i915_gem_context_lost(dev_priv);
> -	for_each_engine(engine, dev_priv, id)
> -		i915_gem_cleanup_engine(engine);
> -	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
> +	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
>  
> +	i915_gem_context_lost(dev_priv);
>  	i915_gem_retire_requests(dev_priv);
> +
> +	mod_delayed_work(dev_priv->wq, &dev_priv->gt.idle_work, 0);
>  }
>  
>  static void
> -- 
> 2.10.2
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-11-22 14:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-18  9:37 [PATCH 1/3] drm/i915: Disable hangcheck when wedged Chris Wilson
2016-11-18  9:37 ` [PATCH 2/3] drm/i915: Complete requests in nop_submit_request Chris Wilson
2016-11-18 12:56   ` Tvrtko Ursulin
2016-11-18 13:03     ` Mika Kuoppala
2016-11-18 13:33       ` Chris Wilson
2016-11-18 14:22         ` Tvrtko Ursulin
2016-11-18  9:37 ` [PATCH 3/3] drm/i915: Stop the machine as we install the wedged submit_request handler Chris Wilson
2016-11-18 14:38   ` Chris Wilson
2016-11-21 12:40   ` [PATCH v2] " Chris Wilson
2016-11-22 14:07     ` Mika Kuoppala
2016-11-18 10:16 ` ✓ Fi.CI.BAT: success for series starting with [1/3] drm/i915: Disable hangcheck when wedged Patchwork
2016-11-18 11:40 ` [PATCH 1/3] " Mika Kuoppala
2016-11-21 16:45 ` ✗ Fi.CI.BAT: warning for series starting with [1/3] drm/i915: Disable hangcheck when wedged (rev2) Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).