All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	intel-gfx@lists.freedesktop.org, stable@vger.kernel.org
Subject: Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
Date: Tue, 06 Sep 2016 16:25:11 +0300	[thread overview]
Message-ID: <87h99t88js.fsf@gaia.fi.intel.com> (raw)
In-Reply-To: <20160906112533.GA16682@nuc-i3427.alporthouse.com>

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote:
>> 
>> Resending my r-b...
>
> It's not enough, even retrying the reset a few times, we still
> eventually get a timeout.
>
> This is just the usual
> 	10: 0xffffffff
> 	20: goto 10
> hanging batch
>> 
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If at first we don't succeed, try again.
>> >
>> > Running the reset and recovery routines in a loop ends in a "reset
>> > request timeout" with a mtbf of an hour on Braswell. This is eerily
>> > similar to the unrecoverable reset condition that first prompted us to
>> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
>> > request handling for gen8+"). Repeating the reset request makes the
>> > failure much harder to reproduce (but there is no reason to believe that
>> > it is more than mere paper over a timing or other issue).
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> > Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
>> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> > Cc: stable@vger.kernel.org
>> > ---
>> >  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
>> >  1 file changed, 13 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> > index e9f68cd56e32..1be8ced03ba5 100644
>> > --- a/drivers/gpu/drm/i915/intel_uncore.c
>> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
>> >  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
>> >  {
>> >  	struct drm_i915_private *dev_priv = engine->i915;
>> > -	int ret;
>> > +	int loop = 3;
>> >
>> 
>> retries?
>> 
>> > -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> > +	do {
>> > +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> >  
>> > -	ret = intel_wait_for_register_fw(dev_priv,
>> > -					 RING_RESET_CTL(engine->mmio_base),
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 700);
>> > -	if (ret)
>> > -		DRM_ERROR("%s: reset request timeout\n", engine->name);
>> > +		if (!intel_wait_for_register_fw(dev_priv,
>> > +						RING_RESET_CTL(engine->mmio_base),
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						700))
>> 
>> 
>> Did you check between the write didn't stick vs the readyness didn't
>> signal?
>
> The read will flush the write. So reading back the bit we just set shows
> us that the hw is not yet ready.
>
> Shouldn't we also be waiting for the reest bit to clear on cancel? And
> verifying that the reset itself did?
>

Request -> timeout -> cancel and wait 2 lowest bit to be zero and
then only after that retry?

And would not hurt to check they are zero before continuing after
a succesful reset also.

-Mika

> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 6b84bd63310c..36d9a485b788 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
>  
>         I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>                       _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +       intel_wait_for_register_fw(dev_priv,
> +                                  RING_RESET_CTL(engine->mmio_base),
> +                                  RESET_CTL_READY_TO_RESET, 0, 700);
>  }
>  
>  static int gen8_reset_engines(struct drm_i915_private *dev_priv,
> @@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
>  {
>         struct intel_engine_cs *engine;
>         unsigned int tmp;
> +       int ret = -EIO;
>  
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 if (gen8_request_engine_reset(engine))
>                         goto not_ready;
>  
> -       return gen6_reset_engines(dev_priv, engine_mask);
> +       ret = gen6_reset_engines(dev_priv, engine_mask);
>  
> -not_ready:
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 gen8_unrequest_engine_reset(engine);
>  
> -       return -EIO;
> +       return ret;
>  }
>  
>  typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask);
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

WARNING: multiple messages have this Message-ID (diff)
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org,
	Arun Siluvery <arun.siluvery@linux.intel.com>,
	Daniel Vetter <daniel.vetter@ffwll.ch>,
	stable@vger.kernel.org
Subject: Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
Date: Tue, 06 Sep 2016 16:25:11 +0300	[thread overview]
Message-ID: <87h99t88js.fsf@gaia.fi.intel.com> (raw)
In-Reply-To: <20160906112533.GA16682@nuc-i3427.alporthouse.com>

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote:
>> 
>> Resending my r-b...
>
> It's not enough, even retrying the reset a few times, we still
> eventually get a timeout.
>
> This is just the usual
> 	10: 0xffffffff
> 	20: goto 10
> hanging batch
>> 
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If at first we don't succeed, try again.
>> >
>> > Running the reset and recovery routines in a loop ends in a "reset
>> > request timeout" with a mtbf of an hour on Braswell. This is eerily
>> > similar to the unrecoverable reset condition that first prompted us to
>> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
>> > request handling for gen8+"). Repeating the reset request makes the
>> > failure much harder to reproduce (but there is no reason to believe that
>> > it is more than mere paper over a timing or other issue).
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> > Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
>> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> > Cc: stable@vger.kernel.org
>> > ---
>> >  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
>> >  1 file changed, 13 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> > index e9f68cd56e32..1be8ced03ba5 100644
>> > --- a/drivers/gpu/drm/i915/intel_uncore.c
>> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
>> >  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
>> >  {
>> >  	struct drm_i915_private *dev_priv = engine->i915;
>> > -	int ret;
>> > +	int loop = 3;
>> >
>> 
>> retries?
>> 
>> > -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> > +	do {
>> > +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> >  
>> > -	ret = intel_wait_for_register_fw(dev_priv,
>> > -					 RING_RESET_CTL(engine->mmio_base),
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 700);
>> > -	if (ret)
>> > -		DRM_ERROR("%s: reset request timeout\n", engine->name);
>> > +		if (!intel_wait_for_register_fw(dev_priv,
>> > +						RING_RESET_CTL(engine->mmio_base),
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						700))
>> 
>> 
>> Did you check between the write didn't stick vs the readyness didn't
>> signal?
>
> The read will flush the write. So reading back the bit we just set shows
> us that the hw is not yet ready.
>
> Shouldn't we also be waiting for the reest bit to clear on cancel? And
> verifying that the reset itself did?
>

Request -> timeout -> cancel and wait 2 lowest bit to be zero and
then only after that retry?

And would not hurt to check they are zero before continuing after
a succesful reset also.

-Mika

> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 6b84bd63310c..36d9a485b788 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
>  
>         I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>                       _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +       intel_wait_for_register_fw(dev_priv,
> +                                  RING_RESET_CTL(engine->mmio_base),
> +                                  RESET_CTL_READY_TO_RESET, 0, 700);
>  }
>  
>  static int gen8_reset_engines(struct drm_i915_private *dev_priv,
> @@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
>  {
>         struct intel_engine_cs *engine;
>         unsigned int tmp;
> +       int ret = -EIO;
>  
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 if (gen8_request_engine_reset(engine))
>                         goto not_ready;
>  
> -       return gen6_reset_engines(dev_priv, engine_mask);
> +       ret = gen6_reset_engines(dev_priv, engine_mask);
>  
> -not_ready:
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 gen8_unrequest_engine_reset(engine);
>  
> -       return -EIO;
> +       return ret;
>  }
>  
>  typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask);
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

  reply	other threads:[~2016-09-06 13:27 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-05  9:30 [PATCH] drm/i915: Try harder to reset the gen8+ engines Chris Wilson
2016-09-05  9:50 ` ✓ Fi.CI.BAT: success for " Patchwork
2016-09-06 11:11 ` [PATCH] " Mika Kuoppala
2016-09-06 11:11   ` Mika Kuoppala
2016-09-06 11:25   ` Chris Wilson
2016-09-06 13:25     ` Mika Kuoppala [this message]
2016-09-06 13:25       ` Mika Kuoppala

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h99t88js.fsf@gaia.fi.intel.com \
    --to=mika.kuoppala@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=daniel.vetter@ffwll.ch \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.