Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery

From: Arun Siluvery <arun.siluvery@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org,
	Mika Kuoppala <mika.kuoppala@linux.intel.com>,
	Tomas Elf <tomas.elf@intel.com>
Subject: Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
Date: Wed, 27 Jul 2016 12:49:13 +0100	[thread overview]
Message-ID: <57989FB9.5000308@linux.intel.com> (raw)
In-Reply-To: <20160726215121.GF31051@nuc-i3427.alporthouse.com>

On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>   - force engine to idle: this is done by issuing a reset request
>>   - identifies the request that caused the hang and it is dropped
>>   - reset and re-init engine
>>   - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> Possible reasons for failure,
>>   - engine not ready for reset
>>   - if the HW and SW doesn't agree on the context that caused the hang
>>   - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>>   drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>   drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>   drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>>   5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>>    * Returns zero on successful reset or otherwise an error code.
>>    *
>>    * Procedure is fairly simple:
>> - *  - force engine to idle
>> - *  - save current state which includes head and current request
>> - *  - reset engine
>> - *  - restore saved state and resubmit context
>> + *    - force engine to idle: this is done by issuing a reset request
>> + *    - identifies the request that caused the hang and it is retired
>> + *    - reset engine
>> + *    - restart submissions to the engine
>>    */
>>   int i915_reset_engine(struct intel_engine_cs *engine)
>>   {
>> +	struct drm_i915_private *dev_priv = engine->i915;
>>   	int ret;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/* Ensure irq handler finishes or is cancelled. */
>> +	tasklet_kill(&engine->irq_tasklet);
>> +
>> +	i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
engine is hung and intel_lrc_irq_handler() also won't be running at this 
point and we should've received all seqno updates, is it still not safe? 
do I really need struct_mutex for this? only caller is i915_gem_reset() 
and it takes it.

>> +	/*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well disappearing, the question is why is that important now
> across the entirety of this sequence? That's what you explain in
> comments.
>
yes, not useful. It is carried over from initial patch but I agree I 
should've updated this.

>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	ret = intel_request_for_reset(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto out;
>> +	}
>> +
>> +	ret = intel_execlists_reset_prepare(engine);
>> +	if (ret)
>> +		goto enable_engine;
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		goto enable_engine;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto out;
>> +
>> +	/* Restart submissions to the engine after reset */
>> +	intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw().
In init_hw() we initialize hw state and keep it ready then we feed 
submissions.

>> +enable_engine:
>> +	/*
>> +	 * we only need to enable engine if we cannot prepare engine for
>> +	 * reset or reset fails. If the reset is successful, engine gets
>> +	 * enabled automatically so we can skip this step.
>> +	 */
>> +	if (ret)
>
> Then why is the normal path coming through here?
>
>> +		intel_clear_reset_request(engine);
>> +
there is no harm in doing this even when reset is successful, hence 
included in the normal path. I will remove ret check to avoid confusion.

>> +out:
>> +	/* Wake up anything waiting on this engine's queue */
>> +	intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.

sorry, I will come back on this one, I need to better understand few 
more things.
>
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>>   	return ret;
>>   }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferrable if we can avoid using request here as we are
> talking about reseting the hung request elsewhere in this chain.
>
the term request is overloaded but here infact we are actually 
requesting the hw to prepare for reset and wait for it to acknowledge.

intel_engine_reset_begin() sounds fine.

>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This msg is probably unnecessary as we go for engine reset only when it 
is available.
>
>> +		return -EINVAL;
> return -ENODEV;

ok.

>
>> +	}
>> +
>> +	return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>> +		return -EINVAL;
>> +	}
>> +
>> +	gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)

ok, thanks for the renaming suggestions.

regards
Arun

>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx