From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John.C.Harrison@Intel.com, Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
Date: Thu, 12 Jan 2023 10:15:12 +0000
Message-ID: <393edad8-fa78-4b28-46ac-86da56d03de0@linux.intel.com>
In-Reply-To: <20230112025311.2577084-4-John.C.Harrison@Intel.com>

On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..99d09e3394597 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
> guc->submission_state.reset_fail_mask = 0;
> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>
> - if (likely(reset_fail_mask))
> + if (likely(reset_fail_mask)) {
> + struct intel_engine_cs *engine;
> + enum intel_engine_id id;
> +
> + /*
> + * GuC is toast at this point - it dead loops after sending the failed
> + * reset notification. So need to manually determine the guilty context.
> + * Note that it should be safe/reliable to do this here because the GuC
> + * is toast and will not be scheduling behind the KMD's back.
> + */
> + for_each_engine_masked(engine, gt, reset_fail_mask, id)
> + intel_guc_find_hung_context(engine);
> +
> intel_gt_handle_error(gt, reset_fail_mask,
> I915_ERROR_CAPTURE,
> - "GuC failed to reset engine mask=0x%x\n",
> + "GuC failed to reset engine mask=0x%x",
> reset_fail_mask);
> + }
> }
>
> int intel_guc_engine_failure_process_msg(struct intel_guc *guc,

I don't feel "at home" enough in this code to give an r-b. Just a
question: can we be sure at this point that GuC is 100% stuck, and that
there is no chance it somehow comes back to life and starts running in
parallel (driven by a different "thread" in i915), which would
invalidate the assumption made in the comment?
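
To make the question concrete, here is a minimal user-space sketch of the
pattern the patch relies on. It is only a model, not i915 code: the struct,
NUM_ENGINES, find_hung_context() and reset_fail_worker() are hypothetical
stand-ins, and a C11 atomic exchange is used in place of the spinlock the
driver takes around reset_fail_mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_ENGINES 8 /* hypothetical engine count for this model */

struct submission_state {
	_Atomic uint32_t reset_fail_mask; /* one bit per engine whose reset failed */
};

/* Records which engines were searched; stands in for real engine state. */
static uint32_t visited_mask;

/* Hypothetical stand-in for intel_guc_find_hung_context(). */
static void find_hung_context(unsigned int engine_id)
{
	assert(engine_id < NUM_ENGINES);
	visited_mask |= UINT32_C(1) << engine_id;
}

/*
 * Take and clear the failure mask in one atomic step, then walk the set
 * bits. The walk itself is unprotected: it is only safe if nothing else
 * changes engine state while it runs, which is the assumption in question.
 */
static uint32_t reset_fail_worker(struct submission_state *s)
{
	uint32_t mask = atomic_exchange(&s->reset_fail_mask, 0);

	for (unsigned int id = 0; id < NUM_ENGINES; id++)
		if (mask & (UINT32_C(1) << id))
			find_hung_context(id);

	return mask;
}
```

The mask handover is well protected in this model. The per-engine search is
not, so its correctness rests entirely on GuC staying dead for the duration.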

Regards,

Tvrtko