Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Jan Maslak <jan.maslak@intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <maciej.patelczyk@intel.com>,
	<joonas.lahtinen@intel.com>
Subject: Re: [PATCH 1/1] drm/xe/guc: Don't run jobs during GT reset
Date: Thu, 20 Nov 2025 19:39:44 -0800
Message-ID: <aR/fAFW8Tyzef5he@lstrano-desk.jf.intel.com>
In-Reply-To: <20251120151611.2205914-2-jan.maslak@intel.com>

On Thu, Nov 20, 2025 at 04:16:11PM +0100, Jan Maslak wrote:
> During GT reset, hardware is in an inconsistent state and jobs might not
> execute correctly. Fail the job with -ECANCELED, allowing userspace to
> retry after reset completes.
> 

What is the motivation here? Is this fixing a bug, for example?

> Signed-off-by: Jan Maslak <jan.maslak@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_guc_submit.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 7e0882074a99..1a5f5bfb05bb 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -876,6 +876,12 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
>  	xe_gt_assert(guc_to_gt(guc), !(exec_queue_destroyed(q) || exec_queue_pending_disable(q)) ||
>  		     exec_queue_banned(q) || exec_queue_suspended(q));
>  
> +	/* Don't run jobs while GT reset is in progress */
> +	if (work_busy(&guc_to_gt(guc)->reset.worker)) {
> +		xe_sched_job_set_error(job, -ECANCELED);

This doesn’t return an error to user space. It just signals the job’s
fence without actually running it, and there’s no mechanism in the
drm_syncobj_wait uAPI [1][2] to communicate a fence error. Nor does it
tear down the queue on error, so the user will think something ran when
it didn’t, which is considerably worse than where we are now (see below).
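
A minimal sketch of the problem described above, as a user-space toy
model. The names toy_fence, toy_skip_job, toy_complete_job and toy_wait
are hypothetical and are not kernel or DRM APIs; the only assumption
carried over is that the wait reports "signaled or not" and nothing else.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical type, not the kernel's dma_fence. */
struct toy_fence {
	bool signaled;	/* the only state a waiter can observe */
	int error;	/* recorded on the fence, never reported by the wait */
	bool ran;	/* whether the job actually executed */
};

/* Models the patch: record an error and signal without running the job. */
static void toy_skip_job(struct toy_fence *f, int err)
{
	assert(f != NULL);
	f->error = err;
	f->ran = false;
	f->signaled = true;
}

/* Models a job that ran to completion. */
static void toy_complete_job(struct toy_fence *f)
{
	assert(f != NULL);
	f->error = 0;
	f->ran = true;
	f->signaled = true;
}

/* Models a wait that returns 0 once the fence is signaled, -1 otherwise. */
static int toy_wait(const struct toy_fence *f)
{
	assert(f != NULL);
	return f->signaled ? 0 : -1;
}
```

In this model, toy_wait() returns 0 both for a fence completed with
toy_complete_job() and for one skipped with toy_skip_job(&f, -ECANCELED),
so a caller that only waits cannot tell the two apart.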

What a GT reset does is stop all queues (i.e., guc_exec_queue_run_job
can’t be called, though this step can race with the GT reset), reset the
HW/GuC (nothing is left on the hardware), then check all queue states
and ban any queue with a job that started but didn’t complete. The ban
actually shows up in user space: IOCTLs that submit to the queue report
it, and queries for queue reset status do as well. Any queue that seems
okay resubmits everything after the GuC is brought back up. As far as I
can tell, this is about the best we can do for a situation that should
never happen unless we have KMD, GuC, or hardware bugs.
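
The reset sequence above can be sketched as a user-space toy model.
Everything here (toy_queue, toy_gt_reset, the job counters) is
hypothetical and is not the Xe driver's code. In particular, leaving a
banned queue stopped and without resubmission is an assumption of the
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical type, not an Xe driver structure. */
struct toy_queue {
	bool stopped;		/* no job can be run while set */
	bool banned;		/* visible to user space in the real driver */
	int started;		/* jobs that began before the reset */
	int completed;		/* jobs that finished before the reset */
	int pending;		/* jobs queued but never started */
	int resubmitted;	/* jobs replayed after the reset */
};

/* Runs the modelled reset; returns the number of queues banned. */
static int toy_gt_reset(struct toy_queue *qs, size_t n)
{
	int banned = 0;
	size_t i;

	assert(qs != NULL || n == 0);

	/* Step 1: stop every queue so no new job can be run. */
	for (i = 0; i < n; i++)
		qs[i].stopped = true;

	/* Step 2: reset the HW/GuC (a no-op in this model). */

	/* Step 3: ban queues with a started-but-unfinished job. */
	/* Step 4: restart the remaining queues and replay their jobs. */
	for (i = 0; i < n; i++) {
		if (qs[i].started > qs[i].completed) {
			qs[i].banned = true;
			banned++;
			continue;	/* assumption: banned queues stay stopped */
		}
		qs[i].resubmitted = qs[i].pending;
		qs[i].pending = 0;
		qs[i].stopped = false;
	}

	return banned;
}
```

With two queues, one holding a job that started but never completed and
one with only pending jobs, toy_gt_reset() bans the first and replays
the pending jobs of the second.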

Matt

[1] https://elixir.bootlin.com/linux/v6.17.8/source/include/uapi/drm/drm.h#L934
[2] https://elixir.bootlin.com/linux/v6.17.8/source/include/uapi/drm/drm.h#L952

> +		return NULL;
> +	}
> +
>  	trace_xe_sched_job_run(job);
>  
>  	if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
> -- 
> 2.34.1
> 

Thread overview: 6+ messages
2025-11-20 15:16 [PATCH 0/1] drm/xe/guc: Don't run jobs during GT reset Jan Maslak
2025-11-20 15:16 ` [PATCH 1/1] " Jan Maslak
2025-11-21  3:39   ` Matthew Brost [this message]
2025-11-20 17:49 ` ✓ CI.KUnit: success for " Patchwork
2025-11-20 18:29 ` ✓ Xe.CI.BAT: " Patchwork
2025-11-20 22:39 ` ✗ Xe.CI.Full: failure " Patchwork
