Re: [PATCH 1/3] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Matthew Auld <matthew.auld@intel.com>,
	Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Subject: Re: [PATCH 1/3] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
Date: Wed, 3 Jun 2026 15:00:12 -0400
Message-ID: <aiB5vKto1zWv9S8B@intel.com>
In-Reply-To: <5f49eefd-a984-4d96-875a-87173a99a775@intel.com>

On Wed, Jun 03, 2026 at 10:01:39PM +0530, Ghimiray, Himal Prasad wrote:
> 
> 
> On 03-06-2026 20:31, Rodrigo Vivi wrote:
> > Jobs that GuC never scheduled were silently errored out instead of
> > triggering a GT reset. Kernel jobs that exhaust all recovery attempts
> > should wedge the device rather than silently fail, and userspace VM bind
> > queues should stay permanently banned rather than being reset and retried.
> > 
> > The queue is banned early in the timeout handler to signal the G2H
> > scheduling-done handler so it wakes the disable-scheduling waiter; without
> > it the waiter sleeps the full 5s timeout. For kernel queues the ban is
> > cleared before rearming so that guc_exec_queue_start() can resubmit jobs
> > after the GT reset — a banned queue would block resubmission and cause an
> > infinite TDR loop.
> > 
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
> > Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Assisted-by: GitHub-Copilot:claude-sonnet-4.6
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_guc_submit.c | 31 +++++++++++++++++++++---------
> >   1 file changed, 22 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index ab501513d806..bbccba367626 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -158,6 +158,11 @@ static void set_exec_queue_banned(struct xe_exec_queue *q)
> >   	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
> >   }
> > +static void clear_exec_queue_banned(struct xe_exec_queue *q)
> > +{
> > +	atomic_andnot(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
> > +}
> > +
> >   static bool exec_queue_suspended(struct xe_exec_queue *q)
> >   {
> >   	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_SUSPENDED;
> > @@ -1376,7 +1381,8 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
> >   			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> >   			   q->guc->id);
> > -		return xe_sched_invalidate_job(job, 2);
> > +		/* GuC never scheduled this job - let the caller trigger a GT reset. */
> > +		return true;
> 
> 
> Sounds sane. But we will also need clear_exec_queue_banned() and a GT reset
> for user exec queues if the job wasn't started. The changes below do it
> only for migration queues.

Do you mean something like this:

 if (!xe_sched_job_started(job)) {
        clear_exec_queue_banned(q);

regardless of the type of the exec queue?
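
To make the question concrete, here is a small userspace model of the
decision, not driver code. Every name in it (tdr_decide, struct tdr_input,
the TDR_* values) is hypothetical, and retries_exhausted only stands in for
xe_sched_invalidate_job(job, 2) returning true. It also assumes that the
retry budget would apply to unstarted user jobs as well, which this thread
has not settled.

```c
/* Userspace model only; none of these names exist in the xe driver. */
#include <assert.h>
#include <stdbool.h>

enum tdr_action {
	TDR_RESET_AND_REARM,	/* clear the ban, reset the GT, rearm the job */
	TDR_WEDGE,		/* kernel job with no retries left */
	TDR_ERROR_OUT,		/* queue stays banned, jobs are marked bad */
};

struct tdr_input {
	bool wedged;		/* device already wedged */
	bool kernel_queue;	/* models EXEC_QUEUE_FLAG_KERNEL */
	bool job_started;	/* models xe_sched_job_started(job) */
	bool retries_exhausted;	/* models xe_sched_invalidate_job(job, 2) */
};

/*
 * reset_any_unstarted == false models the patch as posted: only kernel
 * queues get the reset-and-rearm path.
 * reset_any_unstarted == true models the variant asked about above: any
 * queue whose job never started also gets it (assumption: same budget).
 */
static enum tdr_action tdr_decide(const struct tdr_input *in,
				  bool reset_any_unstarted)
{
	bool wants_reset;

	if (in->wedged)
		return TDR_ERROR_OUT;

	wants_reset = in->kernel_queue ||
		      (reset_any_unstarted && !in->job_started);
	if (wants_reset && !in->retries_exhausted)
		return TDR_RESET_AND_REARM;

	return in->kernel_queue ? TDR_WEDGE : TDR_ERROR_OUT;
}

/* Usage example: an unstarted job on a user queue under both variants. */
static bool tdr_selfcheck(void)
{
	struct tdr_input user_unstarted = { .job_started = false };

	assert(tdr_decide(&user_unstarted, false) == TDR_ERROR_OUT);
	assert(tdr_decide(&user_unstarted, true) == TDR_RESET_AND_REARM);
	return true;
}
```

With the flag off, an unstarted user job errors out; with it on, it takes
the same reset path as a kernel queue.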


> 
> /Himal
> 
> >   	}
> >   	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
> > @@ -1622,19 +1628,26 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   			       q->guc->id, q->flags);
> >   	/*
> > -	 * Kernel jobs should never fail, nor should VM jobs if they do
> > -	 * somethings has gone wrong and the GT needs a reset
> > +	 * Kernel jobs should never fail permanently. Attempt GT reset and
> > +	 * resubmit; if karma is exhausted the hardware is unrecoverable so
> > +	 * wedge the device.
> > +	 *
> > +	 * Userspace VM bind queues are banned permanently on timeout.
> > +	 * No reset is attempted, the ban already
> > +	 * signals the G2H handler, and the queue stays banned so the job
> > +	 * errors out cleanly.
> >   	 */
> > -	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
> > -		   "Kernel-submitted job timed out\n");
> > -	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
> > -		   "VM job timed out on non-killed execqueue\n");
> > -	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
> > -			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
> > +	if (!wedged && q->flags & EXEC_QUEUE_FLAG_KERNEL) {
> >   		if (!xe_sched_invalidate_job(job, 2)) {
> > +			clear_exec_queue_banned(q);
> >   			xe_gt_reset_async(q->gt);
> >   			goto rearm;
> >   		}
> > +		xe_gt_WARN(q->gt, true, "Kernel-submitted job timed out\n");
> > +		xe_device_declare_wedged(gt_to_xe(q->gt));
> > +	} else if (!wedged && q->flags & EXEC_QUEUE_FLAG_VM &&
> > +		   !exec_queue_killed(q)) {
> > +		xe_gt_WARN(q->gt, true, "VM job timed out on non-killed execqueue\n");
> >   	}
> >   	/* Mark all outstanding jobs as bad, thus completing them */
>
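
For reference, the set/clear pair from the first hunk can be modeled in
userspace with C11 atomics. This is only a sketch: the model_* names and
the bit values are made up here, and atomic_fetch_or()/atomic_fetch_and()
stand in for the kernel's atomic_or()/atomic_andnot(). The point is that
clearing the banned bit leaves every other state bit untouched.

```c
/* Userspace model only; names and bit values are not the driver's. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_STATE_SUSPENDED	(1u << 0)	/* hypothetical bit position */
#define MODEL_STATE_BANNED	(1u << 1)	/* hypothetical bit position */

struct model_queue {
	atomic_uint state;
};

static void model_set_banned(struct model_queue *q)
{
	/* Counterpart of atomic_or(EXEC_QUEUE_STATE_BANNED, ...) */
	atomic_fetch_or(&q->state, MODEL_STATE_BANNED);
}

static void model_clear_banned(struct model_queue *q)
{
	/* Counterpart of atomic_andnot(EXEC_QUEUE_STATE_BANNED, ...) */
	atomic_fetch_and(&q->state, ~MODEL_STATE_BANNED);
}

static bool model_banned(struct model_queue *q)
{
	return atomic_load(&q->state) & MODEL_STATE_BANNED;
}

/* Usage example: ban and unban a queue that is also suspended. */
static bool model_flags_selfcheck(void)
{
	struct model_queue q;

	atomic_init(&q.state, MODEL_STATE_SUSPENDED);

	model_set_banned(&q);
	assert(model_banned(&q));

	model_clear_banned(&q);
	assert(!model_banned(&q));
	/* The unrelated bit survives the set/clear round trip. */
	assert(atomic_load(&q.state) == MODEL_STATE_SUSPENDED);
	return true;
}
```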

Thread overview: 11+ messages
2026-06-03 15:01 [PATCH 1/3] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Rodrigo Vivi
2026-06-03 15:01 ` [PATCH 2/3] drm/xe/lrc: fix spurious warning when reading context timestamp Rodrigo Vivi
2026-06-03 15:01 ` [PATCH 3/3] drm/xe/lrc: remove engine_id PPHWSP stash Rodrigo Vivi
2026-06-03 15:06 ` ✗ CI.checkpatch: warning for series starting with [1/3] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Patchwork
2026-06-03 15:08 ` ✓ CI.KUnit: success " Patchwork
2026-06-03 15:52 ` ✓ Xe.CI.BAT: " Patchwork
2026-06-03 16:31 ` [PATCH 1/3] " Ghimiray, Himal Prasad
2026-06-03 19:00   ` Rodrigo Vivi [this message]
2026-06-04  6:43     ` Ghimiray, Himal Prasad
2026-06-04  2:32 ` ✓ Xe.CI.FULL: success for series starting with [1/3] " Patchwork
2026-06-04  6:46 ` [PATCH 1/3] " Yadav, Sanjay Kumar
