From: Matthew Brost <matthew.brost@intel.com>
To: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 11/16] drm/xe/multi_queue: Handle CGP context error
Date: Mon, 3 Nov 2025 09:18:12 -0800
Message-ID: <aQjj1HQMSkMF87sO@lstrano-desk.jf.intel.com>
In-Reply-To: <aQjb4PmAC9VBZ3FP@nvishwa1-desk>

On Mon, Nov 03, 2025 at 08:44:16AM -0800, Niranjana Vishwanathapura wrote:
> On Sun, Nov 02, 2025 at 10:29:32AM -0800, Matthew Brost wrote:
> > On Fri, Oct 31, 2025 at 11:29:31AM -0700, Niranjana Vishwanathapura wrote:
> > > Trigger multi-queue context cleanup upon CGP context error
> > > notification from GuC.
> > > 
> > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/abi/guc_actions_abi.h |  1 +
> > >  drivers/gpu/drm/xe/xe_guc_ct.c           |  4 +++
> > >  drivers/gpu/drm/xe/xe_guc_submit.c       | 33 ++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_guc_submit.h       |  2 ++
> > >  drivers/gpu/drm/xe/xe_trace.h            |  5 ++++
> > >  5 files changed, 45 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/abi/guc_actions_abi.h b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> > > index 3e9fbed9cda6..8af3691626bf 100644
> > > --- a/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> > > +++ b/drivers/gpu/drm/xe/abi/guc_actions_abi.h
> > > @@ -142,6 +142,7 @@ enum xe_guc_action {
> > >  	XE_GUC_ACTION_REGISTER_CONTEXT_MULTI_QUEUE = 0x4602,
> > >  	XE_GUC_ACTION_MULTI_QUEUE_CONTEXT_CGP_SYNC = 0x4603,
> > >  	XE_GUC_ACTION_NOTIFY_MULTI_QUEUE_CONTEXT_CGP_SYNC_DONE = 0x4604,
> > > +	XE_GUC_ACTION_NOTIFY_MULTI_QUEUE_CGP_CONTEXT_ERROR = 0x4605,
> > >  	XE_GUC_ACTION_CLIENT_SOFT_RESET = 0x5507,
> > >  	XE_GUC_ACTION_SET_ENG_UTIL_BUFF = 0x550A,
> > >  	XE_GUC_ACTION_SET_DEVICE_ENGINE_ACTIVITY_BUFFER = 0x550C,
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > index 48b5006eb080..d0e19af0b4d2 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > @@ -1574,6 +1574,10 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
> > >  	case XE_GUC_ACTION_NOTIFY_MULTI_QUEUE_CONTEXT_CGP_SYNC_DONE:
> > >  		ret = xe_guc_exec_queue_cgp_sync_done_handler(guc, payload, adj_len);
> > >  		break;
> > > +	case XE_GUC_ACTION_NOTIFY_MULTI_QUEUE_CGP_CONTEXT_ERROR:
> > > +		ret = xe_guc_exec_queue_cgp_context_error_handler(guc, payload,
> > > +								  adj_len);
> > > +		break;
> > >  	default:
> > >  		xe_gt_err(gt, "unexpected G2H action 0x%04x\n", action);
> > >  	}
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > index 87c13feb2cef..605352145d76 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > @@ -48,6 +48,8 @@
> > >  #include "xe_vm.h"
> > >  #include "xe_bo.h"
> > > 
> > > +#define XE_GUC_EXEC_QUEUE_CGP_CONTEXT_ERROR_LEN		6
> > > +
> > >  static struct xe_guc *
> > >  exec_queue_to_guc(struct xe_exec_queue *q)
> > >  {
> > > @@ -3001,6 +3003,37 @@ int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32 *msg, u32 le
> > >  	return 0;
> > >  }
> > > 
> > > +int xe_guc_exec_queue_cgp_context_error_handler(struct xe_guc *guc, u32 *msg,
> > > +						u32 len)
> > > +{
> > > +	struct xe_gt *gt = guc_to_gt(guc);
> > > +	struct xe_device *xe = guc_to_xe(guc);
> > > +	struct xe_exec_queue *q;
> > > +	u32 guc_id = msg[2];
> > > +
> > > +	if (unlikely(len != XE_GUC_EXEC_QUEUE_CGP_CONTEXT_ERROR_LEN)) {
> > > +		drm_err(&xe->drm, "Invalid length %u", len);
> > > +		return -EPROTO;
> > > +	}
> > > +
> > > +	q = g2h_exec_queue_lookup(guc, guc_id);
> > > +	if (unlikely(!q))
> > > +		return -EPROTO;
> > > +
> > > +	xe_gt_dbg(gt,
> > > +		  "CGP context error: region=%s err=0x%x, context=0x%x LRCA=0x%x:0x%x SgId=0x%x",
> > > +		  msg[0] & 1 ? "uc" : "kmd", msg[1], msg[2], msg[4], msg[3], msg[5]);
> > > +
> > > +	trace_xe_exec_queue_cgp_context_error(q);
> > > +
> > > +	/* Treat the same as engine reset */
> > > +	set_exec_queue_reset(q);
> > > +	if (!exec_queue_banned(q) && !exec_queue_check_timeout(q))
> > 
> > I don't think you need the exec_queue_check_timeout check.
> > 
> 
> The check here is the same as in other GuC error handlers like
> xe_guc_exec_queue_reset_handler() and xe_guc_exec_queue_memory_cat_error_handler().

Ah, it is, but it doesn't seem right in those cases either. Maybe I'm
forgetting something; I'll try to page this information back in.

> Hence the reason to keep it here also. Doesn't exec_queue_check_timeout()
> mean TDR is already underway?

Yes, it does.

So with the current state of the code, this is correct, so:
Reviewed-by: Matthew Brost <matthew.brost@intel.com>

If I later conclude this check can be dropped, let's drop it everywhere
in a single shot.
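
For reference, the guard being discussed can be modeled in isolation. This is
only a minimal userspace sketch under stated assumptions, not the driver code:
the model_* names and Q_STATE_* bits are hypothetical stand-ins for the state
helpers in xe_guc_submit.c, and the real flags are managed differently. It
shows the behavior agreed on above: the reset state is always recorded, but
cleanup is only triggered when the queue is not banned and TDR is not already
underway.

```c
/* Hypothetical state bits, for illustration only (not the driver's values). */
enum {
	Q_STATE_RESET         = 1u << 0,
	Q_STATE_BANNED        = 1u << 1,
	Q_STATE_CHECK_TIMEOUT = 1u << 2,	/* models "TDR already underway" */
};

/* Hypothetical stand-in for an exec queue; counts cleanup requests. */
struct model_queue {
	unsigned int state;
	int cleanup_count;
};

/* Stand-in for the cleanup trigger; just records that it was called. */
static void model_trigger_cleanup(struct model_queue *q)
{
	q->cleanup_count++;
}

/* Models the tail of the error handler: mark reset, then guard the cleanup. */
static void model_cgp_context_error(struct model_queue *q)
{
	q->state |= Q_STATE_RESET;

	/* Skip cleanup if the queue is banned or TDR already owns it. */
	if (!(q->state & Q_STATE_BANNED) &&
	    !(q->state & Q_STATE_CHECK_TIMEOUT))
		model_trigger_cleanup(q);
}
```

In this model, calling model_cgp_context_error() on a queue with
Q_STATE_CHECK_TIMEOUT set leaves cleanup_count untouched, which is the case
the extra check covers. Dropping the check would make the handler request
cleanup even while TDR is running; whether that is harmless in the real
driver is the open question above.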

Matt

> 
> Niranjana
> 
> > Otherwise LGTM.
> > 
> > Matt
> > 
> > > +		xe_guc_exec_queue_trigger_cleanup(q);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /**
> > >   * xe_guc_exec_queue_cgp_sync_done_handler - CGP synchronization done handler
> > >   * @guc: guc
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.h b/drivers/gpu/drm/xe/xe_guc_submit.h
> > > index abfa94bce391..01b013a90b1b 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_submit.h
> > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.h
> > > @@ -35,6 +35,8 @@ int xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 *msg,
> > >  int xe_guc_exec_queue_reset_failure_handler(struct xe_guc *guc, u32 *msg, u32 len);
> > >  int xe_guc_error_capture_handler(struct xe_guc *guc, u32 *msg, u32 len);
> > >  int xe_guc_exec_queue_cgp_sync_done_handler(struct xe_guc *guc, u32 *msg, u32 len);
> > > +int xe_guc_exec_queue_cgp_context_error_handler(struct xe_guc *guc, u32 *msg,
> > > +						u32 len);
> > > 
> > >  struct xe_guc_submit_exec_queue_snapshot *
> > >  xe_guc_exec_queue_snapshot_capture(struct xe_exec_queue *q);
> > > diff --git a/drivers/gpu/drm/xe/xe_trace.h b/drivers/gpu/drm/xe/xe_trace.h
> > > index 79a97b086cb2..c9d0748dae9d 100644
> > > --- a/drivers/gpu/drm/xe/xe_trace.h
> > > +++ b/drivers/gpu/drm/xe/xe_trace.h
> > > @@ -172,6 +172,11 @@ DEFINE_EVENT(xe_exec_queue, xe_exec_queue_memory_cat_error,
> > >  	     TP_ARGS(q)
> > >  );
> > > 
> > > +DEFINE_EVENT(xe_exec_queue, xe_exec_queue_cgp_context_error,
> > > +	     TP_PROTO(struct xe_exec_queue *q),
> > > +	     TP_ARGS(q)
> > > +);
> > > +
> > >  DEFINE_EVENT(xe_exec_queue, xe_exec_queue_stop,
> > >  	     TP_PROTO(struct xe_exec_queue *q),
> > >  	     TP_ARGS(q)
> > > --
> > > 2.43.0
> > > 
