Re: [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	Matthew Auld <matthew.auld@intel.com>,
	Sanjay Yadav <sanjay.kumar.yadav@intel.com>,
	Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Subject: Re: [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
Date: Tue, 9 Jun 2026 12:05:12 -0400	[thread overview]
Message-ID: <aig5uLonYEgjeZaA@intel.com> (raw)
In-Reply-To: <aigyd1rBHQHc8FKl@gsse-cloud1.jf.intel.com>

On Tue, Jun 09, 2026 at 08:34:15AM -0700, Matthew Brost wrote:
> On Tue, Jun 09, 2026 at 10:44:13AM -0400, Rodrigo Vivi wrote:
> > Jobs that GuC never scheduled were silently errored out instead of
> > triggering a GT reset. Kernel jobs that exhaust all recovery attempts
> > should wedge the device rather than silently fail, and userspace VM bind
> > queues should stay permanently banned rather than being reset and retried.
> > 
> > The queue is banned early in the timeout handler to signal the G2H
> > scheduling-done handler so it wakes the disable-scheduling waiter; without
> > it the waiter sleeps the full 5s timeout. For not started works the ban is
> > cleared before rearming so that guc_exec_queue_start() can resubmit jobs
> > after the GT reset — a banned queue would block resubmission and cause an
> > infinite TDR loop.
> > 
> > v2: (Himal) Do it for any queue type, not just kernel/migration
> > 
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
> > Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Assisted-by: GitHub-Copilot:claude-sonnet-4.6
> > Assisted-by: GitHub-Copilot:claude-opus-4.8
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_guc_submit.c | 41 ++++++++++++++++++++----------
> >  1 file changed, 27 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 4b247a3019d2..5c40eee41103 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -157,6 +157,11 @@ static void set_exec_queue_banned(struct xe_exec_queue *q)
> >  	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
> >  }
> >  
> > +static void clear_exec_queue_banned(struct xe_exec_queue *q)
> > +{
> > +	atomic_andnot(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
> > +}
> > +
> >  static bool exec_queue_suspended(struct xe_exec_queue *q)
> >  {
> >  	return atomic_read(&q->guc->state) & EXEC_QUEUE_STATE_SUSPENDED;
> > @@ -1363,7 +1368,8 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
> >  			   xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> >  			   q->guc->id);
> >  
> > -		return xe_sched_invalidate_job(job, 2);
> > +		/* GuC never scheduled this job - let the caller trigger a GT reset. */
> > +		return true;
> 
> I think there are some edge cases in the VK conformance tests where,
> with a large number of queues, 5 seconds isn’t enough time for queues to
> start after submission. Based on a 1 ms timeslice, this would correspond
> to more than 5,000 queues.
> 
> The current code allows a 10-second window, which is just as arbitrary
> as 5s. This isn’t a blocker—just something to consider.

Indeed. The ambiguous 'reasonable time'. But better this then the loop I believe...

> 
> >  	}
> >  
> >  	ctx_timestamp = lower_32_bits(xe_lrc_timestamp(q->lrc[0]));
> > @@ -1460,6 +1466,12 @@ static void disable_scheduling(struct xe_exec_queue *q, bool immediate)
> >  			       G2H_LEN_DW_SCHED_CONTEXT_MODE_SET, 1);
> >  }
> >  
> > +/* Unstarted jobs (GuC scheduling failure) and kernel queues recover via GT reset */
> > +static bool timeout_needs_gt_reset(struct xe_exec_queue *q, struct xe_sched_job *job)
> > +{
> > +	return !xe_sched_job_started(job) || (q->flags & EXEC_QUEUE_FLAG_KERNEL);
> > +}
> > +
> >  static enum drm_gpu_sched_stat
> >  guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  {
> > @@ -1608,19 +1620,20 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >  			       xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
> >  			       q->guc->id, q->flags);
> >  
> > -	/*
> > -	 * Kernel jobs should never fail, nor should VM jobs if they do
> > -	 * somethings has gone wrong and the GT needs a reset
> > -	 */
> > -	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_KERNEL,
> > -		   "Kernel-submitted job timed out\n");
> > -	xe_gt_WARN(q->gt, q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q),
> > -		   "VM job timed out on non-killed execqueue\n");
> > -	if (!wedged && (q->flags & EXEC_QUEUE_FLAG_KERNEL ||
> > -			(q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)))) {
> > -		if (!xe_sched_invalidate_job(job, 2)) {
> > -			xe_gt_reset_async(q->gt);
> > -			goto rearm;
> > +	if (!wedged) {
> > +		if (timeout_needs_gt_reset(q, job)) {
> > +			/* Retry after a GT reset; wedge a kernel queue once karma is exhausted */
> > +			if (!xe_sched_invalidate_job(job, 2)) {
> > +				clear_exec_queue_banned(q);
> > +				xe_gt_reset_async(q->gt);
> > +				goto rearm;
> > +			}
> > +			if (q->flags & EXEC_QUEUE_FLAG_KERNEL) {
> > +				xe_gt_WARN(q->gt, true, "Kernel-submitted job timed out\n");
> > +				xe_device_declare_wedged(gt_to_xe(q->gt));
> > +			}
> 
> This part LGTM. I have seen when a device gets in a bad state and kernel
> jobs fail (typically a bug somewhere else in the driver) kernel jobs
> just spun forever - I never spent the time trying to fix that. I think
> this should fix this problem?

That's exactly my hope.

Issues like Linus was facing here:
https://lore.kernel.org/intel-xe/CAHk-=whiv=b+dAvjaZDsZkfUEzjZMSSLExDOWVcbJ0exsCj6_Q@mail.gmail.com/

and some other issues that had similar signatures:

https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7810
https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7814
https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7893
https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/8003

I don't believe that this patch here will solve any of these issues themselves.
At least not the source of the initial GPU Hang.

But at least the machine won't be locked in the infinite TDR handling
with never started jobs, what will allow us to debug the true bug when
that happens...

Thanks,
Rodrigo.

> 
> Matt
> 
> > +		} else if (q->flags & EXEC_QUEUE_FLAG_VM && !exec_queue_killed(q)) {
> > +			xe_gt_WARN(q->gt, true, "VM job timed out on non-killed execqueue\n");
> >  		}
> >  	}
> >  
> > -- 
> > 2.54.0
> >

next prev parent reply	other threads:[~2026-06-09 16:05 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-09 14:44 [PATCH 1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Rodrigo Vivi
2026-06-09 14:44 ` [PATCH 2/2] drm/xe/lrc: fix spurious warning when reading context timestamp Rodrigo Vivi
2026-06-09 16:23   ` Ghimiray, Himal Prasad
2026-06-09 14:49 ` ✗ CI.checkpatch: warning for series starting with [1/2] drm/xe: fix job timeout recovery for unstarted jobs and kernel queues Patchwork
2026-06-09 14:51 ` ✓ CI.KUnit: success " Patchwork
2026-06-09 15:34 ` [PATCH 1/2] " Matthew Brost
2026-06-09 16:05   ` Rodrigo Vivi [this message]
2026-06-09 16:12     ` Matthew Brost
2026-06-09 15:46 ` ✓ Xe.CI.BAT: success for series starting with [1/2] " Patchwork
2026-06-09 16:16 ` [PATCH 1/2] " Ghimiray, Himal Prasad
2026-06-10  3:38 ` ✗ Xe.CI.FULL: failure for series starting with [1/2] " Patchwork
2026-06-10  9:47 ` [PATCH 1/2] " Yadav, Sanjay Kumar
2026-06-10 15:24   ` Rodrigo Vivi
  -- strict thread matches above, loose matches on Subject: below --
2026-06-10 15:25 Rodrigo Vivi
2026-06-10 16:07 ` Ghimiray, Himal Prasad
2026-06-10 16:30 ` Matthew Brost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aig5uLonYEgjeZaA@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=sanjay.kumar.yadav@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.