Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: John Harrison <john.c.harrison@intel.com>
To: Andrzej Hajda <andrzej.hajda@intel.com>,
	<intel-gfx@lists.freedesktop.org>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>,
	Rodrigo Vivi <rodrigo.vivi@intel.com>
Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: add CAT error handler
Date: Wed, 19 Oct 2022 09:21:00 -0700	[thread overview]
Message-ID: <dd25282b-655b-eae9-6e02-ea4b54b167ea@intel.com> (raw)
In-Reply-To: <20221019083325.214960-1-andrzej.hajda@intel.com>

[-- Attachment #1: Type: text/plain, Size: 5734 bytes --]

On 10/19/2022 01:33, Andrzej Hajda wrote:
> In case of catastrophic errors GuC sends notification, which results in
> cryptic message. Let's add handler which, for starters, dumps state
> of affected engine.
See below - the notification is sent by the GPU memory subsystem not the 
GuC. Also, not sure what you mean by 'cryptic message'. It would be 
better to say something like:

    Bad GPU memory accesses can result in catastrophic error
    notifications being send from the GPU to the KMD via the GuC. Add a
    handler to process the notification by printing a kernel message and
    dumping the related engine state (if appropriate).


>
> Signed-off-by: Andrzej Hajda<andrzej.hajda@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/abi/guc_actions_abi.h  |  1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  2 ++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c     |  3 ++
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 31 +++++++++++++++++++
>   4 files changed, 37 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> index f359bef046e0b2..f9a1c5642855e3 100644
> --- a/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> +++ b/drivers/gpu/drm/i915/gt/uc/abi/guc_actions_abi.h
> @@ -138,6 +138,7 @@ enum intel_guc_action {
>   	INTEL_GUC_ACTION_REGISTER_CONTEXT_MULTI_LRC = 0x4601,
>   	INTEL_GUC_ACTION_CLIENT_SOFT_RESET = 0x5507,
>   	INTEL_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
> +	INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR = 0x6000,
>   	INTEL_GUC_ACTION_STATE_CAPTURE_NOTIFICATION = 0x8002,
>   	INTEL_GUC_ACTION_NOTIFY_FLUSH_LOG_BUFFER_TO_FILE = 0x8003,
>   	INTEL_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED = 0x8004,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> index 804133df1ac9b4..61b412732d095a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h
> @@ -445,6 +445,8 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len);
>   int intel_guc_error_capture_process_msg(struct intel_guc *guc,
>   					const u32 *msg, u32 len);
> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
> +				    const u32 *msg, u32 len);
>   
>   struct intel_engine_cs *
>   intel_guc_lookup_engine(struct intel_guc *guc, u8 guc_class, u8 instance);
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> index 2b22065e87bf9a..f55f724e264407 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> @@ -1035,6 +1035,9 @@ static int ct_process_request(struct intel_guc_ct *ct, struct ct_incoming_msg *r
>   		CT_ERROR(ct, "Received GuC exception notification!\n");
>   		ret = 0;
>   		break;
> +	case INTEL_GUC_ACTION_NOTIFY_MEMORY_CAT_ERROR:
> +		ret = intel_guc_cat_error_process_msg(guc, payload, len);
> +		break;
>   	default:
>   		ret = -EOPNOTSUPP;
>   		break;
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 693b07a977893d..94f91dfa3ec456 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4659,6 +4659,37 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   	return 0;
>   }
>   
> +int intel_guc_cat_error_process_msg(struct intel_guc *guc,
> +				    const u32 *msg, u32 len)
> +{
> +	struct drm_i915_private *i915 = guc_to_gt(guc)->i915;
> +	struct intel_engine_cs *engine;
> +	struct intel_context *ce;
> +	struct drm_printer p;
> +	unsigned long flags;
> +	int ctx_id;
> +
> +	if (unlikely(len != 1)) {
> +		drm_dbg(&i915->drm, "Invalid length %u", len);
> +		return -EPROTO;
> +	}
> +	ctx_id = msg[0];
> +
> +	xa_lock_irqsave(&guc->context_lookup, flags);
> +	ce = g2h_context_lookup(guc, ctx_id);
> +	if (ce)
> +		engine = ce->engine;
> +	xa_unlock_irqrestore(&guc->context_lookup, flags);
> +	if (unlikely(!ce || !engine))
> +		return -EPROTO;
This is not correct. There is no guarantee that a context will be 
associated with the catastrophic error. Certain categories of error are 
caused by context activity but other categories are caused by 'global' 
activity (e.g. CPU hitting a bad address over the PCIe bus, other 
non-engine hardware in the GPU, etc.). In those cases, the ctx_id field 
will be set to an invalid context (~0, I believe) and the ce will come 
back as NULL.

> +
> +	drm_err(&i915->drm, "%s: CAT error reported by GuC\n", engine->name);
This is not accurate. The error is reported *via* GuC but it is reported 
by the hardware. The GuC does not do anything to detect memory errors. 
GuC receives an interrupt from the hardware to say a catastrophic error 
has occurred, it then reads some data from a FIFO and basically returns 
that data back to the KMD.  The sole extent of the GuC processing is to 
convert the hardware id into a context id.

Saying that GuC is reporting the error implies that GuC knows what went 
wrong or could maybe even do something about it. Certainly it makes it 
likely that bugs will be logged against the GuC which is not correct.

Also, you are not reporting the context in the error message. Assuming 
that a context has been provided then that is important information to 
debug where the error came from. You should at least include 
ce->guc_id.id (which should match ctx_id) in the message.

John.

> +	p = drm_info_printer(i915->drm.dev);
> +	intel_engine_dump(engine, &p, "%s\n", engine->name);
> +
> +	return 0;
> +}
> +
>   void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   {
>   	struct intel_guc *guc = &engine->gt->uc.guc;

[-- Attachment #2: Type: text/html, Size: 6650 bytes --]

  parent reply	other threads:[~2022-10-19 16:23 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-19  8:33 [Intel-gfx] [PATCH] drm/i915/guc: add CAT error handler Andrzej Hajda
2022-10-19 12:47 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2022-10-19 16:21 ` John Harrison [this message]
2022-10-19 18:31 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dd25282b-655b-eae9-6e02-ea4b54b167ea@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=andrzej.hajda@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=lucas.demarchi@intel.com \
    --cc=rodrigo.vivi@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox