Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	intel-gfx@lists.freedesktop.org, stable@vger.kernel.org
Subject: Re: [PATCH] drm/i915: Try harder to reset the gen8+ engines
Date: Tue, 06 Sep 2016 16:25:11 +0300	[thread overview]
Message-ID: <87h99t88js.fsf@gaia.fi.intel.com> (raw)
In-Reply-To: <20160906112533.GA16682@nuc-i3427.alporthouse.com>

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Tue, Sep 06, 2016 at 02:11:47PM +0300, Mika Kuoppala wrote:
>> 
>> Resending my r-b...
>
> It's not enough, even retrying the reset a few times, we still
> eventually get a timeout.
>
> This is just the usual
> 	10: 0xffffffff
> 	20: goto 10
> hanging batch
>> 
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If at first we don't succeed, try again.
>> >
>> > Running the reset and recovery routines in a loop ends in a "reset
>> > request timeout" with a mtbf of an hour on Braswell. This is eerily
>> > similar to the unrecoverable reset condition that first prompted us to
>> > use the reset-request mechanism in commit 7fd2d26921d1 ("drm/i915: Reset
>> > request handling for gen8+"). Repeating the reset request makes the
>> > failure much harder to reproduce (but there is no reason to believe that
>> > it is more than mere paper over a timing or other issue).
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> > Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
>> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> > Cc: stable@vger.kernel.org
>> > ---
>> >  drivers/gpu/drm/i915/intel_uncore.c | 24 +++++++++++++-----------
>> >  1 file changed, 13 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> > index e9f68cd56e32..1be8ced03ba5 100644
>> > --- a/drivers/gpu/drm/i915/intel_uncore.c
>> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> > @@ -1688,20 +1688,22 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
>> >  static int gen8_request_engine_reset(struct intel_engine_cs *engine)
>> >  {
>> >  	struct drm_i915_private *dev_priv = engine->i915;
>> > -	int ret;
>> > +	int loop = 3;
>> >
>> 
>> retries?
>> 
>> > -	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > -		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> > +	do {
>> > +		I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>> > +			      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
>> >  
>> > -	ret = intel_wait_for_register_fw(dev_priv,
>> > -					 RING_RESET_CTL(engine->mmio_base),
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 RESET_CTL_READY_TO_RESET,
>> > -					 700);
>> > -	if (ret)
>> > -		DRM_ERROR("%s: reset request timeout\n", engine->name);
>> > +		if (!intel_wait_for_register_fw(dev_priv,
>> > +						RING_RESET_CTL(engine->mmio_base),
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						RESET_CTL_READY_TO_RESET,
>> > +						700))
>> 
>> 
>> Did you check between the write didn't stick vs the readyness didn't
>> signal?
>
> The read will flush the write. So reading back the bit we just set shows
> us that the hw is not yet ready.
>
> Shouldn't we also be waiting for the reest bit to clear on cancel? And
> verifying that the reset itself did?
>

Request -> timeout -> cancel and wait 2 lowest bit to be zero and
then only after that retry?

And would not hurt to check they are zero before continuing after
a succesful reset also.

-Mika

> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 6b84bd63310c..36d9a485b788 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1701,6 +1701,9 @@ static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
>  
>         I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
>                       _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +       intel_wait_for_register_fw(dev_priv,
> +                                  RING_RESET_CTL(engine->mmio_base),
> +                                  RESET_CTL_READY_TO_RESET, 0, 700);
>  }
>  
>  static int gen8_reset_engines(struct drm_i915_private *dev_priv,
> @@ -1708,18 +1711,18 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
>  {
>         struct intel_engine_cs *engine;
>         unsigned int tmp;
> +       int ret = -EIO;
>  
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 if (gen8_request_engine_reset(engine))
>                         goto not_ready;
>  
> -       return gen6_reset_engines(dev_priv, engine_mask);
> +       ret = gen6_reset_engines(dev_priv, engine_mask);
>  
> -not_ready:
>         for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>                 gen8_unrequest_engine_reset(engine);
>  
> -       return -EIO;
> +       return ret;
>  }
>  
>  typedef int (*reset_func)(struct drm_i915_private *, unsigned engine_mask);
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx