From: Arun Siluvery <arun.siluvery@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
intel-gfx@lists.freedesktop.org,
Mika Kuoppala <mika.kuoppala@linux.intel.com>,
Tomas Elf <tomas.elf@intel.com>
Subject: Re: [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery
Date: Wed, 27 Jul 2016 12:49:13 +0100 [thread overview]
Message-ID: <57989FB9.5000308@linux.intel.com> (raw)
In-Reply-To: <20160726215121.GF31051@nuc-i3427.alporthouse.com>
On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>> - force engine to idle: this is done by issuing a reset request
>> - identifies the request that caused the hang and it is dropped
>> - reset and re-init engine
>> - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> Possible reasons for failure,
>> - engine not ready for reset
>> - if the HW and SW doesn't agree on the context that caused the hang
>> - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.c | 55 +++++++++++++++++++++++++++++++++----
>> drivers/gpu/drm/i915/i915_drv.h | 3 ++
>> drivers/gpu/drm/i915/i915_gem.c | 2 +-
>> drivers/gpu/drm/i915/intel_lrc.h | 4 +++
>> drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>> 5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>> * Returns zero on successful reset or otherwise an error code.
>> *
>> * Procedure is fairly simple:
>> - * - force engine to idle
>> - * - save current state which includes head and current request
>> - * - reset engine
>> - * - restore saved state and resubmit context
>> + * - force engine to idle: this is done by issuing a reset request
>> + * - identifies the request that caused the hang and it is retired
>> + * - reset engine
>> + * - restart submissions to the engine
>> */
>> int i915_reset_engine(struct intel_engine_cs *engine)
>> {
>> + struct drm_i915_private *dev_priv = engine->i915;
>> int ret;
>>
>> - /* FIXME: replace me with engine reset sequence */
>> - ret = -ENODEV;
>> + /* Ensure irq handler finishes or is cancelled. */
>> + tasklet_kill(&engine->irq_tasklet);
>> +
>> + i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
engine is hung and intel_lrc_irq_handler() also won't be running at this
point and we should've received all seqno updates, is it still not safe?
do I really need struct_mutex for this? only caller is i915_gem_reset()
and it takes it.
>> + /*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well disappearing, the question is why is that important now
> across the entirety of this sequence? That's what you explain in
> comments.
>
yes, not useful. It is carried over from initial patch but I agree I
should've updated this.
>> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> + ret = intel_request_for_reset(engine);
>> + if (ret) {
>> + DRM_ERROR("Failed to disable %s\n", engine->name);
>> + goto out;
>> + }
>> +
>> + ret = intel_execlists_reset_prepare(engine);
>> + if (ret)
>> + goto enable_engine;
>> +
>> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> + if (ret) {
>> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> + goto enable_engine;
>> + }
>> +
>> + ret = engine->init_hw(engine);
>> + if (ret)
>> + goto out;
>> +
>> + /* Restart submissions to the engine after reset */
>> + intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw().
In init_hw() we initialize hw state and keep it ready then we feed
submissions.
>> +enable_engine:
>> + /*
>> + * we only need to enable engine if we cannot prepare engine for
>> + * reset or reset fails. If the reset is successful, engine gets
>> + * enabled automatically so we can skip this step.
>> + */
>> + if (ret)
>
> Then why is the normal path coming through here?
>
>> + intel_clear_reset_request(engine);
>> +
there is no harm in doing this even when reset is successful, hence
included in the normal path. I will remove ret check to avoid confusion.
>> +out:
>> + /* Wake up anything waiting on this engine's queue */
>> + intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.
sorry, I will come back on this one, I need to better understand few
more things.
>
>> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>> return ret;
>> }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferrable if we can avoid using request here as we are
> talking about reseting the hung request elsewhere in this chain.
>
the term request is overloaded but here infact we are actually
requesting the hw to prepare for reset and wait for it to acknowledge.
intel_engine_reset_begin() sounds fine.
>> +{
>> + if (!intel_has_engine_reset(engine->i915)) {
>> + DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> + INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This msg is probably unnecessary as we go for engine reset only when it
is available.
>
>> + return -EINVAL;
> return -ENODEV;
ok.
>
>> + }
>> +
>> + return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +{
>> + if (!intel_has_engine_reset(engine->i915)) {
>> + DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> + INTEL_INFO(engine->i915)->gen);
>> + return -EINVAL;
>> + }
>> +
>> + gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)
ok, thanks for the renaming suggestions.
regards
Arun
>
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2016-07-27 11:49 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-07-26 16:40 [PATCH 00/11] Execlist based Engine reset patches Arun Siluvery
2016-07-26 16:40 ` [PATCH 01/11] drm/i915: Update i915.reset to handle engine resets Arun Siluvery
2016-07-26 16:40 ` [PATCH 02/11] drm/i915: Separate out reset flags from the reset counter Arun Siluvery
2016-07-26 16:40 ` [PATCH 03/11] drm/i915/tdr: Update reset_in_progress to account for engine reset Arun Siluvery
2016-07-26 21:52 ` Chris Wilson
2016-07-27 11:16 ` Arun Siluvery
2016-07-27 11:41 ` Chris Wilson
2016-07-27 11:52 ` Arun Siluvery
2016-07-26 16:40 ` [PATCH 04/11] drm/i915/tdr: Modify error handler for per engine hang recovery Arun Siluvery
2016-07-26 16:40 ` [PATCH 05/11] drm/i915/tdr: Identify hung request and drop it Arun Siluvery
2016-07-26 21:37 ` Chris Wilson
2016-07-27 11:54 ` Arun Siluvery
2016-07-27 12:27 ` Chris Wilson
2016-07-27 21:35 ` Resubmitting requests following the hang Chris Wilson
2016-07-27 21:35 ` [PATCH 1/3] drm/i915: Record the position of the start of the request Chris Wilson
2016-07-27 21:35 ` [PATCH 2/3] lrc Chris Wilson
2016-07-27 21:36 ` [PATCH 3/3] reset-request-recovery Chris Wilson
2016-07-26 16:40 ` [PATCH 06/11] drm/i915/tdr: Restart submission after engine reset Arun Siluvery
2016-07-26 21:32 ` Chris Wilson
2016-07-26 16:40 ` [PATCH 07/11] drm/i915/tdr: Add support for per engine reset recovery Arun Siluvery
2016-07-26 21:51 ` Chris Wilson
2016-07-27 11:49 ` Arun Siluvery [this message]
2016-07-26 16:40 ` [PATCH 08/11] drm/i915: Skip reset request if there is one already Arun Siluvery
2016-07-26 16:40 ` [PATCH 09/11] drm/i915/tdr: Add engine reset count to error state Arun Siluvery
2016-07-26 16:40 ` [PATCH 10/11] drm/i915/tdr: Export reset count info to debugfs Arun Siluvery
2016-07-26 16:40 ` [PATCH 11/11] drm/i915/tdr: Enable Engine reset and recovery support Arun Siluvery
2016-07-26 17:11 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches Patchwork
2016-07-28 5:40 ` ✗ Ro.CI.BAT: failure for Execlist based Engine reset patches (rev4) Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=57989FB9.5000308@linux.intel.com \
--to=arun.siluvery@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
--cc=mika.kuoppala@linux.intel.com \
--cc=tomas.elf@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox