Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Stuart Summers <stuart.summers@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling
Date: Fri, 3 Oct 2025 11:54:08 -0700	[thread overview]
Message-ID: <aOAb0JnFULgND3ad@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <20251002230444.313505-8-stuart.summers@intel.com>

On Thu, Oct 02, 2025 at 11:04:44PM +0000, Stuart Summers wrote:
> In the event the GuC becomes unresponsive during a scheduling
> disable event, we still want the driver to be able to recover.
> This patch follows the same methodology we already have in place
> for TLB invalidation requests, where we send a request to GuC
> and wait for that invalidation done response. If the response
> doesn't come back in time we then at least print a message
> indicating the invalidation failed for some reason.
> 
> In this case, we send the schedule disable and the expectation
> is that GuC will respond with a schedule done response. The KMD
> then catches that response and in turn sends a context deregistration
> response. So in the event GuC becomes unresponsive after we send
> the schedule disable, we actually have two g2h responses that
> have been reserved but never received.
> 
> To handle this, make sure the pending disable event in the
> exec queue gets cleared (i.e. we received that response from
> GuC). If it doesn't in a reasonable amount of time, assume
> GuC is dead: ban the exec queue, queue up a GT reset, and
> manually call the schedule done handler. Then in the schedule
> done handler, in turn, check whether the context had been
> banned. If so, manually call the deregistration done handler
> to ensure all resources related to that exec queue get
> cleaned up properly. Without this, if the device becomes
> wedged after an exec queue has been created, the attached
> resources, like the LRC, will not get freed properly,
> resulting in a memory leak.
> 
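The two reserved-but-never-received G2H responses described above can be sketched as a toy credit model. This is not driver code; `struct g2h_model` and its helpers are hypothetical stand-ins for the CT layer's G2H credit accounting:

```c
/* Toy model (NOT xe driver code) of the G2H credits reserved by a
 * schedule-disable + deregister pair: the H2G send reserves two
 * response credits up front, and each G2H handler releases one. If
 * the GuC dies after the disable is sent, both stay reserved. */
#include <assert.h>

struct g2h_model {
	int reserved;	/* outstanding G2H response credits */
};

/* Mirrors the xe_guc_ct_send() above reserving
 * G2H_LEN_DW_SCHED_CONTEXT_MODE_SET + G2H_LEN_DW_DEREGISTER_CONTEXT */
static void send_disable_deregister(struct g2h_model *ct)
{
	ct->reserved += 2;	/* sched-done + deregister-done */
}

static void recv_sched_done(struct g2h_model *ct)
{
	ct->reserved--;		/* first expected response */
}

static void recv_deregister_done(struct g2h_model *ct)
{
	ct->reserved--;		/* second expected response */
}
```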
> Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 45b72bebfc63..a177d87c8524 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -939,6 +939,9 @@ int xe_guc_read_stopped(struct xe_guc *guc)
>  		GUC_CONTEXT_##enable_disable,				\
>  	}
>  
> +static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> +			      u32 runnable_state);
> +
>  static void disable_scheduling_deregister(struct xe_guc *guc,
>  					  struct xe_exec_queue *q)
>  {
> @@ -974,6 +977,17 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
>  	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action),
>  		       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET +
>  		       G2H_LEN_DW_DEREGISTER_CONTEXT, 2);
> +
> +	ret = wait_event_timeout(guc->ct.wq,
> +				 !exec_queue_pending_disable(q) ||
> +				 xe_guc_read_stopped(guc),
> +				 HZ * 5);

This doesn't look right. Deregister is designed to be fully async. If
this flow stops working for whatever reason, the GuC is dead and
eventually something in the driver will detect this and trigger a GT
reset, which will clean up all the lost H2Gs.
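As a sketch of what I mean (a toy model, not xe code; `struct queue_model` and the function names are made up): the pending-disable bit is cleared either by the normal G2H handler or by GT-reset cleanup, so the send path never needs to block:

```c
/* Toy model (NOT xe driver code) of the async lifecycle: a
 * schedule-disable is fire-and-forget, and the pending bit is
 * cleared by one of two paths, neither of which is a wait in the
 * sender. */
#include <assert.h>
#include <stdbool.h>

struct queue_model {
	bool pending_disable;
	bool banned;
};

static void send_sched_disable(struct queue_model *q)
{
	q->pending_disable = true;	/* fire-and-forget H2G */
}

static void g2h_sched_done(struct queue_model *q)
{
	q->pending_disable = false;	/* normal async completion */
}

/* GT reset sweeps every queue with a lost H2G; no per-send timeout */
static void gt_reset_cleanup(struct queue_model *q)
{
	if (q->pending_disable) {
		q->pending_disable = false;
		q->banned = true;
	}
}
```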

> +	if (!ret || xe_guc_read_stopped(guc)) {
> +		xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
> +		set_exec_queue_banned(q);
> +		handle_sched_done(guc, q, 0);
> +		xe_gt_reset_async(q->gt);
> +	}
>  }
>  
>  static void xe_guc_exec_queue_trigger_cleanup(struct xe_exec_queue *q)
> @@ -2117,6 +2131,8 @@ g2h_exec_queue_lookup(struct xe_guc *guc, u32 guc_id)
>  	return q;
>  }
>  
> +static void handle_deregister_done(struct xe_guc *guc, struct xe_exec_queue *q);
> +
>  static void deregister_exec_queue(struct xe_guc *guc, struct xe_exec_queue *q)
>  {
>  	u32 action[] = {
> @@ -2131,7 +2147,12 @@ static void deregister_exec_queue(struct xe_guc *guc, struct xe_exec_queue *q)
>  
>  	trace_xe_exec_queue_deregister(q);
>  
> -	xe_guc_ct_send_g2h_handler(&guc->ct, action, ARRAY_SIZE(action));
> +	if (exec_queue_banned(q)) {
> +		handle_deregister_done(guc, q);

This would leave the GuC with a reference to the guc_id, and a
subsequent reuse of the guc_id (i.e., the next register) will fail.
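To illustrate (a toy model, not the actual firmware interface; `struct guc_model` and its helpers are hypothetical): skipping the H2G deregister only cleans up the host side, so the firmware still believes the id is live and a later register with the same id collides:

```c
/* Toy model (NOT xe driver code) of firmware-side guc_id state:
 * only a real deregister H2G releases the id on the GuC side, so a
 * host-only cleanup leaves the id unusable. */
#include <assert.h>
#include <stdbool.h>

#define MAX_GUC_ID 4

struct guc_model {
	bool registered[MAX_GUC_ID];	/* ids the firmware thinks are live */
};

/* Register fails if the firmware never saw a deregister for this id */
static bool guc_register(struct guc_model *guc, int id)
{
	if (guc->registered[id])
		return false;		/* id collision */
	guc->registered[id] = true;
	return true;
}

/* The async H2G deregister actually releases the firmware-side id */
static void guc_deregister_h2g(struct guc_model *guc, int id)
{
	guc->registered[id] = false;
}

/* The banned-queue path above: host frees its resources, but the
 * firmware-side entry is untouched. */
static void host_only_cleanup(struct guc_model *guc, int id)
{
	(void)guc;
	(void)id;
}
```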

Matt

> +	} else {
> +		xe_guc_ct_send_g2h_handler(&guc->ct, action,
> +					   ARRAY_SIZE(action));
> +	}
>  }
>  
>  static void handle_sched_done(struct xe_guc *guc, struct xe_exec_queue *q,
> -- 
> 2.34.1
> 


Thread overview: 28+ messages
2025-10-02 23:04 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-02 23:04 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-02 23:04 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-02 23:04 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-02 23:04 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-02 23:04 ` [PATCH 5/7] drm/xe: Handle missing migration VM on VM creation Stuart Summers
2025-10-02 23:34   ` Lin, Shuicheng
2025-10-03  6:56     ` Matthew Brost
2025-10-03 14:33       ` Summers, Stuart
2025-10-02 23:04 ` [PATCH 6/7] drm/xe: Don't send a CLEANUP message on sched pause Stuart Summers
2025-10-03 18:50   ` Matthew Brost
2025-10-03 18:53     ` Summers, Stuart
2025-10-02 23:04 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-03 18:54   ` Matthew Brost [this message]
2025-10-03 18:58     ` Summers, Stuart
2025-10-03 19:38       ` Matthew Brost
2025-10-03 19:42         ` Summers, Stuart
2025-10-03 19:49           ` Matthew Brost
2025-10-03 19:53             ` Summers, Stuart
2025-10-02 23:11 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks Patchwork
2025-10-02 23:12 ` ✓ CI.KUnit: success " Patchwork
2025-10-02 23:58 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-03  2:16 ` ✗ Xe.CI.Full: " Patchwork
2025-10-03 14:38   ` Summers, Stuart
  -- strict thread matches above, loose matches on Subject: below --
2025-10-13 16:24 [PATCH 0/7] " Stuart Summers
2025-10-13 16:25 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-13 22:31 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-13 22:31 ` [PATCH 7/7] drm/xe: Check for GuC responses on disabling scheduling Stuart Summers
2025-10-14  2:09   ` Matthew Brost
2025-10-14  3:10     ` Summers, Stuart
