public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915: Try harder to reset the gen8+ engines
@ 2016-09-05  9:30 Chris Wilson
  2016-09-05  9:50 ` ✓ Fi.CI.BAT: success for " Patchwork
  2016-09-06 11:11 ` [PATCH] " Mika Kuoppala
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Wilson @ 2016-09-05  9:30 UTC (permalink / raw)
  To: intel-gfx
  Cc: Chris Wilson, Mika Kuoppala, Arun Siluvery, Daniel Vetter, stable

If at first we don't succeed, try again.

Running the reset and recovery routines in a loop ends in a "reset
request timeout" with a mtbf of an hour on Braswell. This is eerily
similar to the unrecoverable reset condition that first prompted us to
use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
request handling for gen8+"). Repeating the reset request makes the
failure much harder to reproduce (but there is no reason to believe that
it is more than mere paper over a timing or other issue).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index e9f68cd56e32..1be8ced03ba5 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_request_engine_reset(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
-	int ret;
+	int loop = 3;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	do {
+		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
+			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
-	ret = intel_wait_for_register_fw(dev_priv,
-					 RING_RESET_CTL(engine->mmio_base),
-					 RESET_CTL_READY_TO_RESET,
-					 RESET_CTL_READY_TO_RESET,
-					 700);
-	if (ret)
-		DRM_ERROR("%s: reset request timeout\n", engine->name);
+		if (!intel_wait_for_register_fw(dev_priv,
+						RING_RESET_CTL(engine->mmio_base),
+						RESET_CTL_READY_TO_RESET,
+						RESET_CTL_READY_TO_RESET,
+						700))
+			return 0;
+	} while (--loop);
 
-	return ret;
+	DRM_ERROR("%s: reset request timeout\n", engine->name);
+	return -EIO;
 }
 
 static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Try harder to reset the gen8+ engines
  2016-09-05  9:30 [PATCH] drm/i915: Try harder to reset the gen8+ engines Chris Wilson
@ 2016-09-05  9:50 ` Patchwork
  2016-09-06 11:11 ` [PATCH] " Mika Kuoppala
  1 sibling, 0 replies; 5+ messages in thread
From: Patchwork @ 2016-09-05  9:50 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Try harder to reset the gen8+ engines
URL   : https://patchwork.freedesktop.org/series/11998/
State : success

== Summary ==

Series 11998v1 drm/i915: Try harder to reset the gen8+ engines
http://patchwork.freedesktop.org/api/1.0/series/11998/revisions/1/mbox/

Test kms_pipe_crc_basic:
        Subgroup nonblocking-crc-pipe-b:
                skip       -> PASS       (fi-hsw-4770r)
        Subgroup read-crc-pipe-b:
                skip       -> PASS       (fi-skl-6260u)
        Subgroup read-crc-pipe-c:
                skip       -> PASS       (fi-hsw-4770r)

fi-bdw-5557u     total:252  pass:233  dwarn:2   dfail:1   fail:1   skip:15 
fi-bsw-n3050     total:252  pass:203  dwarn:1   dfail:1   fail:1   skip:46 
fi-byt-n2820     total:252  pass:206  dwarn:2   dfail:1   fail:2   skip:41 
fi-hsw-4770k     total:252  pass:226  dwarn:2   dfail:1   fail:1   skip:22 
fi-hsw-4770r     total:252  pass:222  dwarn:2   dfail:1   fail:1   skip:26 
fi-ivb-3520m     total:252  pass:217  dwarn:2   dfail:1   fail:1   skip:31 
fi-skl-6260u     total:252  pass:234  dwarn:2   dfail:1   fail:1   skip:14 
fi-skl-6700k     total:252  pass:219  dwarn:3   dfail:1   fail:1   skip:28 
fi-snb-2520m     total:252  pass:204  dwarn:2   dfail:1   fail:2   skip:43 
fi-snb-2600      total:252  pass:205  dwarn:2   dfail:1   fail:1   skip:43 

Results at /archive/results/CI_IGT_test/Patchwork_2468/

9baa666b3e48f71b46c5f63541f57d2a95a1b1c0 drm-intel-nightly: 2016y-09m-03d-12h-12m-15s UTC integration manifest
c5cdaa1 drm/i915: Try harder to reset the gen8+ engines

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
  2016-09-05  9:30 [PATCH] drm/i915: Try harder to reset the gen8+ engines Chris Wilson
  2016-09-05  9:50 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2016-09-06 11:11 ` Mika Kuoppala
  2016-09-06 11:25   ` Chris Wilson
  1 sibling, 1 reply; 5+ messages in thread
From: Mika Kuoppala @ 2016-09-06 11:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Daniel Vetter, stable


Resending my r-b...

Chris Wilson <chris@chris-wilson.co.uk> writes:

> If at first we don't succeed, try again.
>
> Running the reset and recovery routines in a loop ends in a "reset
> request timeout" with a mtbf of an hour on Braswell. This is eerily
> similar to the unrecoverable reset condition that first prompted us to
> use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
> request handling for gen8+"). Repeating the reset request makes the
> failure much harder to reproduce (but there is no reason to believe that
> it is more than mere paper over a timing or other issue).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: stable@vger.kernel.org
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index e9f68cd56e32..1be8ced03ba5 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
>  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
>  {
>  	struct drm_i915_private *dev_priv = engine->i915;
> -	int ret;
> +	int loop = 3;
>

retries?

> -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
> -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +	do {
> +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
> +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>  
> -	ret = intel_wait_for_register_fw(dev_priv,
> -					 RING_RESET_CTL(engine->mmio_base),
> -					 RESET_CTL_READY_TO_RESET,
> -					 RESET_CTL_READY_TO_RESET,
> -					 700);
> -	if (ret)
> -		DRM_ERROR("%s: reset request timeout\n", engine->name);
> +		if (!intel_wait_for_register_fw(dev_priv,
> +						RING_RESET_CTL(engine->mmio_base),
> +						RESET_CTL_READY_TO_RESET,
> +						RESET_CTL_READY_TO_RESET,
> +						700))


Did you check between the write didn't stick vs the readyness didn't
signal?

With gen8, we might get away with just resetting regardless of the
the ready state. Needs some experimenting/testing first tho.

Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

> +			return 0;
> +	} while (--loop);
>  
> -	return ret;
> +	DRM_ERROR("%s: reset request timeout\n", engine->name);
> +	return -EIO;
>  }
>  
>  static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
> -- 
> 2.9.3
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
  2016-09-06 11:11 ` [PATCH] " Mika Kuoppala
@ 2016-09-06 11:25   ` Chris Wilson
  2016-09-06 13:25     ` Mika Kuoppala
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Wilson @ 2016-09-06 11:25 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx, Arun Siluvery, Daniel Vetter, stable

On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote:
> 
> Resending my r-b...

It's not enough, even retrying the reset a few times, we still
eventually get a timeout.

This is just the usual
	10: 0xffffffff
	20: goto 10
hanging batch
> 
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If at first we don't succeed, try again.
> >
> > Running the reset and recovery routines in a loop ends in a "reset
> > request timeout" with a mtbf of an hour on Braswell. This is eerily
> > similar to the unrecoverable reset condition that first prompted us to
> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
> > request handling for gen8+"). Repeating the reset request makes the
> > failure much harder to reproduce (but there is no reason to believe that
> > it is more than mere paper over a timing or other issue).
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: stable@vger.kernel.org
> > ---
> >  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index e9f68cd56e32..1be8ced03ba5 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
> >  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
> >  {
> >  	struct drm_i915_private *dev_priv = engine->i915;
> > -	int ret;
> > +	int loop = 3;
> >
> 
> retries?
> 
> > -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
> > -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> > +	do {
> > +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
> > +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> >  
> > -	ret = intel_wait_for_register_fw(dev_priv,
> > -					 RING_RESET_CTL(engine->mmio_base),
> > -					 RESET_CTL_READY_TO_RESET,
> > -					 RESET_CTL_READY_TO_RESET,
> > -					 700);
> > -	if (ret)
> > -		DRM_ERROR("%s: reset request timeout\n", engine->name);
> > +		if (!intel_wait_for_register_fw(dev_priv,
> > +						RING_RESET_CTL(engine->mmio_base),
> > +						RESET_CTL_READY_TO_RESET,
> > +						RESET_CTL_READY_TO_RESET,
> > +						700))
> 
> 
> Did you check between the write didn't stick vs the readyness didn't
> signal?

The read will flush the write. So reading back the bit we just set shows
us that the hw is not yet ready.

Shouldn't we also be waiting for the reest bit to clear on cancel? And
verifying that the reset itself did?

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 6b84bd63310c..36d9a485b788 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
 
        I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
                      _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+       intel_wait_for_register_fw(dev_priv,
+                                  RING_RESET_CTL(engine->mmio_base),
+                                  RESET_CTL_READY_TO_RESET, 0, 700);
 }
 
 static int gen8_reset_engines(struct drm_i915_private *dev_priv,
@@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
 {
        struct intel_engine_cs *engine;
        unsigned int tmp;
+       int ret = -EIO;
 
        for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
                if (gen8_request_engine_reset(engine))
                        goto not_ready;
 
-       return gen6_reset_engines(dev_priv, engine_mask);
+       ret = gen6_reset_engines(dev_priv, engine_mask);
 
-not_ready:
        for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
                gen8_unrequest_engine_reset(engine);
 
-       return -EIO;
+       return ret;
 }
 
 typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask);

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
  2016-09-06 11:25   ` Chris Wilson
@ 2016-09-06 13:25     ` Mika Kuoppala
  0 siblings, 0 replies; 5+ messages in thread
From: Mika Kuoppala @ 2016-09-06 13:25 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Daniel Vetter, intel-gfx, stable

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote:
>> 
>> Resending my r-b...
>
> It's not enough, even retrying the reset a few times, we still
> eventually get a timeout.
>
> This is just the usual
> 	10: 0xffffffff
> 	20: goto 10
> hanging batch
>> 
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If at first we don't succeed, try again.
>> >
>> > Running the reset and recovery routines in a loop ends in a "reset
>> > request timeout" with a mtbf of an hour on Braswell. This is eerily
>> > similar to the unrecoverable reset condition that first prompted us to
>> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
>> > request handling for gen8+"). Repeating the reset request makes the
>> > failure much harder to reproduce (but there is no reason to believe that
>> > it is more than mere paper over a timing or other issue).
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> > Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
>> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> > Cc: stable@vger.kernel.org
>> > ---
>> >  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
>> >  1 file changed, 13 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> > index e9f68cd56e32..1be8ced03ba5 100644
>> > --- a/drivers/gpu/drm/i915/intel_uncore.c
>> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
>> >  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
>> >  {
>> >  	struct drm_i915_private *dev_priv = engine->i915;
>> > -	int ret;
>> > +	int loop = 3;
>> >
>> 
>> retries?
>> 
>> > -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> > +	do {
>> > +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> >  
>> > -	ret = intel_wait_for_register_fw(dev_priv,
>> > -					 RING_RESET_CTL(engine->mmio_base),
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 700);
>> > -	if (ret)
>> > -		DRM_ERROR("%s: reset request timeout\n", engine->name);
>> > +		if (!intel_wait_for_register_fw(dev_priv,
>> > +						RING_RESET_CTL(engine->mmio_base),
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						700))
>> 
>> 
>> Did you check between the write didn't stick vs the readyness didn't
>> signal?
>
> The read will flush the write. So reading back the bit we just set shows
> us that the hw is not yet ready.
>
> Shouldn't we also be waiting for the reest bit to clear on cancel? And
> verifying that the reset itself did?
>

Request -> timeout -> cancel and wait 2 lowest bit to be zero and
then only after that retry?

And would not hurt to check they are zero before continuing after
a succesful reset also.

-Mika

> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 6b84bd63310c..36d9a485b788 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
>  
>         I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>                       _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +       intel_wait_for_register_fw(dev_priv,
> +                                  RING_RESET_CTL(engine->mmio_base),
> +                                  RESET_CTL_READY_TO_RESET, 0, 700);
>  }
>  
>  static int gen8_reset_engines(struct drm_i915_private *dev_priv,
> @@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
>  {
>         struct intel_engine_cs *engine;
>         unsigned int tmp;
> +       int ret = -EIO;
>  
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 if (gen8_request_engine_reset(engine))
>                         goto not_ready;
>  
> -       return gen6_reset_engines(dev_priv, engine_mask);
> +       ret = gen6_reset_engines(dev_priv, engine_mask);
>  
> -not_ready:
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 gen8_unrequest_engine_reset(engine);
>  
> -       return -EIO;
> +       return ret;
>  }
>  
>  typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask);
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-09-06 13:27 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-09-05  9:30 [PATCH] drm/i915: Try harder to reset the gen8+ engines Chris Wilson
2016-09-05  9:50 ` ✓ Fi.CI.BAT: success for " Patchwork
2016-09-06 11:11 ` [PATCH] " Mika Kuoppala
2016-09-06 11:25   ` Chris Wilson
2016-09-06 13:25     ` Mika Kuoppala

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox