public inbox for intel-gfx@lists.freedesktop.org
 help / color / mirror / Atom feed
From: Arun Siluvery <arun.siluvery@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org,
	Mika Kuoppala <mika.kuoppala@linux.intel.com>,
	Tomas Elf <tomas.elf@intel.com>
Subject: Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
Date: Wed, 27 Jul 2016 12:49:13 +0100	[thread overview]
Message-ID: <57989FB9.5000308@linux.intel.com> (raw)
In-Reply-To: <20160726215121.GF31051@nuc-i3427.alporthouse.com>

On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>   - force engine to idle: this is done by issuing a reset request
>>   - identifies the request that caused the hang and it is dropped
>>   - reset and re-init engine
>>   - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> Possible reasons for failure,
>>   - engine not ready for reset
>>   - if the HW and SW doesn't agree on the context that caused the hang
>>   - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>>   drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>   drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>   drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>>   5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>>    * Returns zero on successful reset or otherwise an error code.
>>    *
>>    * Procedure is fairly simple:
>> - *  - force engine to idle
>> - *  - save current state which includes head and current request
>> - *  - reset engine
>> - *  - restore saved state and resubmit context
>> + *    - force engine to idle: this is done by issuing a reset request
>> + *    - identifies the request that caused the hang and it is retired
>> + *    - reset engine
>> + *    - restart submissions to the engine
>>    */
>>   int i915_reset_engine(struct intel_engine_cs *engine)
>>   {
>> +	struct drm_i915_private *dev_priv = engine->i915;
>>   	int ret;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/* Ensure irq handler finishes or is cancelled. */
>> +	tasklet_kill(&engine->irq_tasklet);
>> +
>> +	i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
engine is hung and intel_lrc_irq_handler() also won't be running at this 
point and we should've received all seqno updates, is it still not safe? 
do I really need struct_mutex for this? only caller is i915_gem_reset() 
and it takes it.

>> +	/*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well disappearing, the question is why is that important now
> across the entirety of this sequence? That's what you explain in
> comments.
>
yes, not useful. It is carried over from initial patch but I agree I 
should've updated this.

>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	ret = intel_request_for_reset(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto out;
>> +	}
>> +
>> +	ret = intel_execlists_reset_prepare(engine);
>> +	if (ret)
>> +		goto enable_engine;
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		goto enable_engine;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto out;
>> +
>> +	/* Restart submissions to the engine after reset */
>> +	intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw().
In init_hw() we initialize hw state and keep it ready then we feed 
submissions.

>> +enable_engine:
>> +	/*
>> +	 * we only need to enable engine if we cannot prepare engine for
>> +	 * reset or reset fails. If the reset is successful, engine gets
>> +	 * enabled automatically so we can skip this step.
>> +	 */
>> +	if (ret)
>
> Then why is the normal path coming through here?
>
>> +		intel_clear_reset_request(engine);
>> +
there is no harm in doing this even when reset is successful, hence 
included in the normal path. I will remove ret check to avoid confusion.

>> +out:
>> +	/* Wake up anything waiting on this engine's queue */
>> +	intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.

sorry, I will come back on this one, I need to better understand few 
more things.
>
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>>   	return ret;
>>   }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferrable if we can avoid using request here as we are
> talking about reseting the hung request elsewhere in this chain.
>
the term request is overloaded but here infact we are actually 
requesting the hw to prepare for reset and wait for it to acknowledge.

intel_engine_reset_begin() sounds fine.

>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This msg is probably unnecessary as we go for engine reset only when it 
is available.
>
>> +		return -EINVAL;
> return -ENODEV;

ok.

>
>> +	}
>> +
>> +	return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>> +		return -EINVAL;
>> +	}
>> +
>> +	gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)

ok, thanks for the renaming suggestions.

regards
Arun

>
>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2016-07-27 11:49 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
2016-07-26 16:40 ` [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter Arun Siluvery
2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
2016-07-26 21:52   ` Chris Wilson
2016-07-27 11:16     ` Arun Siluvery
2016-07-27 11:41       ` Chris Wilson
2016-07-27 11:52         ` Arun Siluvery
2016-07-26 16:40 ` [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery Arun Siluvery
2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
2016-07-26 21:37   ` Chris Wilson
2016-07-27 11:54     ` Arun Siluvery
2016-07-27 12:27       ` Chris Wilson
2016-07-27 21:35         ` Resubmitting requests following the hang Chris Wilson
2016-07-27 21:35           ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
2016-07-27 21:35           ` [PATCH 2/3] lrc Chris Wilson
2016-07-27 21:36           ` [PATCH 3/3] reset-request-recovery Chris Wilson
2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
2016-07-26 21:32   ` Chris Wilson
2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
2016-07-26 21:51   ` Chris Wilson
2016-07-27 11:49     ` Arun Siluvery [this message]
2016-07-26 16:40 ` [PATCH 08/11] drm/i915: Skip reset request if there is one already Arun Siluvery
2016-07-26 16:40 ` [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state Arun Siluvery
2016-07-26 16:40 ` [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs Arun Siluvery
2016-07-26 16:40 ` [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support Arun Siluvery
2016-07-26 17:11 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches Patchwork
2016-07-28  5:40 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57989FB9.5000308@linux.intel.com \
    --to=arun.siluvery@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=mika.kuoppala@linux.intel.com \
    --cc=tomas.elf@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox