From: Matthew Brost <matthew.brost@intel.com>
To: John.C.Harrison@Intel.com
Cc: Intel-GFX@Lists.FreeDesktop.Org, DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: Don't error on reset of banned context
Date: Tue, 11 Jan 2022 10:45:41 -0800 [thread overview]
Message-ID: <20220111184541.GA14954@jons-linux-dev-box> (raw)
In-Reply-To: <20220107003143.326046-1-John.C.Harrison@Intel.com>
On Thu, Jan 06, 2022 at 04:31:43PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> There is a race (already documented in the code) whereby a context can
> be (re-)queued for submission at the same time as it is being banned
> due to a hang and reset. That leads to a hang/reset report from GuC
> for a context which i915 thinks is already banned.
>
I think there are 2 issues here.
1. Banning of context (e.g. user closes a non-persistent context)
results in an context reset. In this case we will receive a G2H
indicating a context reset and we want to convert the context reset to a
nop.
2. A GT reset races with a context reset result in the context getting
banned before the G2H is processed. Again we want to convert the context
reset to a nop. This race should be sealed when we can flush the G2H
handler in the reset path. Flushing G2H handler depends on the error
capture not allocating memory in non-sleeping contexts. Thomas H had a
patch for this.
In both cases we shouldn't print an error.
> While the race is indented to be fixed in a future GuC update, there
> is no actual harm beyond the wasted execution time of that new hang
> detection period. The context has already been banned for bad
> behaviour so a fresh hang is hardly surprising and certainly isn't
> going to be losing any work that wouldn't already have been lost if
> there was no race.
>
See above, I think you are confusing the issues here. This won't be
fixed by an updated GuC firmware.
> So don't treat this situation as an error. The error message is seen
> by the CI system as something fatal and causes test failures. Instead,
> just print an informational so the user at least knows a context reset
> occurred (given that the error capture is being skipped).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9989d121127d..e8a32a7e7daf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3978,6 +3978,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> !context_blocked(ce))) {
> capture_error_state(guc, ce);
> guc_context_replay(ce);
> + } else if (intel_context_is_banned(ce)) {
> + drm_info(&guc_to_gt(guc)->i915->drm,
> + "Reset notificaion for banned context 0x%04X on %s",
> + ce->guc_id.id, ce->engine->name);
The context being blocking isn't an error either. I think real fix is
changing the below drm_err to drm_info and call it a day.
Matt
> } else {
> drm_err(&guc_to_gt(guc)->i915->drm,
> "Invalid GuC engine reset notificaion for 0x%04X on %s: banned = %d, blocked = %d",
> --
> 2.25.1
>
WARNING: multiple messages have this Message-ID (diff)
From: Matthew Brost <matthew.brost@intel.com>
To: John.C.Harrison@Intel.com
Cc: Intel-GFX@Lists.FreeDesktop.Org, DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [PATCH] drm/i915/guc: Don't error on reset of banned context
Date: Tue, 11 Jan 2022 10:45:41 -0800 [thread overview]
Message-ID: <20220111184541.GA14954@jons-linux-dev-box> (raw)
In-Reply-To: <20220107003143.326046-1-John.C.Harrison@Intel.com>
On Thu, Jan 06, 2022 at 04:31:43PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> There is a race (already documented in the code) whereby a context can
> be (re-)queued for submission at the same time as it is being banned
> due to a hang and reset. That leads to a hang/reset report from GuC
> for a context which i915 thinks is already banned.
>
I think there are 2 issues here.
1. Banning of context (e.g. user closes a non-persistent context)
results in an context reset. In this case we will receive a G2H
indicating a context reset and we want to convert the context reset to a
nop.
2. A GT reset races with a context reset result in the context getting
banned before the G2H is processed. Again we want to convert the context
reset to a nop. This race should be sealed when we can flush the G2H
handler in the reset path. Flushing G2H handler depends on the error
capture not allocating memory in non-sleeping contexts. Thomas H had a
patch for this.
In both cases we shouldn't print an error.
> While the race is indented to be fixed in a future GuC update, there
> is no actual harm beyond the wasted execution time of that new hang
> detection period. The context has already been banned for bad
> behaviour so a fresh hang is hardly surprising and certainly isn't
> going to be losing any work that wouldn't already have been lost if
> there was no race.
>
See above, I think you are confusing the issues here. This won't be
fixed by an updated GuC firmware.
> So don't treat this situation as an error. The error message is seen
> by the CI system as something fatal and causes test failures. Instead,
> just print an informational so the user at least knows a context reset
> occurred (given that the error capture is being skipped).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9989d121127d..e8a32a7e7daf 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -3978,6 +3978,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> !context_blocked(ce))) {
> capture_error_state(guc, ce);
> guc_context_replay(ce);
> + } else if (intel_context_is_banned(ce)) {
> + drm_info(&guc_to_gt(guc)->i915->drm,
> + "Reset notificaion for banned context 0x%04X on %s",
> + ce->guc_id.id, ce->engine->name);
The context being blocking isn't an error either. I think real fix is
changing the below drm_err to drm_info and call it a day.
Matt
> } else {
> drm_err(&guc_to_gt(guc)->i915->drm,
> "Invalid GuC engine reset notificaion for 0x%04X on %s: banned = %d, blocked = %d",
> --
> 2.25.1
>
next prev parent reply other threads:[~2022-01-11 18:51 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-07 0:31 [Intel-gfx] [PATCH] drm/i915/guc: Don't error on reset of banned context John.C.Harrison
2022-01-07 0:31 ` John.C.Harrison
2022-01-07 0:35 ` [Intel-gfx] " John Harrison
2022-01-07 0:35 ` John Harrison
2022-01-07 1:49 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2022-01-07 17:06 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
2022-01-11 18:45 ` Matthew Brost [this message]
2022-01-11 18:45 ` [PATCH] " Matthew Brost
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220111184541.GA14954@jons-linux-dev-box \
--to=matthew.brost@intel.com \
--cc=DRI-Devel@Lists.FreeDesktop.Org \
--cc=Intel-GFX@Lists.FreeDesktop.Org \
--cc=John.C.Harrison@Intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.