Re: [RFC PATCH 1/3] drm/xe: skip banning kernel migration queue on TDR timeout

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>,
	<intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>, <nirmoy.das@intel.com>,
	<umesh.nerlige.ramappa@intel.com>,
	<thomas.hellstrom@linux.intel.com>, <matthew.brost@intel.com>,
	<niranjana.vishwanathapura@intel.com>,
	<thomas.hellstrom@intel.com>, <fei.yang@intel.com>,
	<himal.prasad.ghimiray@intel.com>, <matthew.d.roper@intel.com>,
	<maarten.lankhorst@intel.com>, <joonas.lahtinen@intel.com>
Subject: Re: [RFC PATCH 1/3] drm/xe: skip banning kernel migration queue on TDR timeout
Date: Wed, 3 Jun 2026 09:52:10 -0400
Message-ID: <aiAxiiYO8nkE2gvl@intel.com>
In-Reply-To: <5634e7fc-6931-465f-ba3d-4068b4fe53ba@intel.com>

On Wed, Jun 03, 2026 at 01:42:25PM +0100, Matthew Auld wrote:
> On 03/06/2026 13:06, Sanjay Yadav wrote:
> > guc_exec_queue_timedout_job() unconditionally bans the queue once a
> > job times out. For the kernel migration queue this is fatal — once
> > banned, no page table migrations can complete and the GPU is
> > effectively dead until driver reload.
> > 
> > The submission is already stopped and the timed-out job is erred out,
> > so banning is not needed for correctness. GT reset handles the actual
> > hardware recovery. Skip banning for kernel queues so they remain
> > available after reset.
> 
> Is wedging/reload not the more correct thing here? A kernel job is usually
> performing critical and potentially security-sensitive work, like memory
> clearing, migrations, binding etc. If something goes wrong in one of those
> jobs, how should we go about recovering from that? Is driver reload/wedge
> not the more appropriate thing here, or would it at least need a more
> elaborate recovery?
> 
> For example, if a memclear gets nuked, what stops the user from accessing
> uncleared memory later? Or if a migration/copy/save/restore job gets nuked,
> how do we recover from that from a correctness point of view?

I agree with Matt here; something is off. We cannot blindly skip these
kernel submission cases... (this applies to this patch and to the other
one in this series).

> 
> > 
> > Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR")
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > Assisted-by: Claude:claude-opus-4.6
> > Suggested-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_guc_submit.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index ab501513d806..e6ad57cbbf0e 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1543,7 +1543,8 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
> >   	if (!exec_queue_killed(q))
> >   		wedged = guc_submit_hint_wedged(exec_queue_to_guc(q));
> > -	set_exec_queue_banned(q);
> > +	if (!(q->flags & EXEC_QUEUE_FLAG_KERNEL))
> > +		set_exec_queue_banned(q);
> >   	/* Kick job / queue off hardware */
> >   	if (!wedged && (exec_queue_enabled(primary) ||
>
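
To make the trade-off under discussion concrete, here is a minimal user-space
sketch of the three timeout policies mentioned in this thread: banning every
timed-out queue (the behaviour before the patch), skipping the ban for kernel
queues (what the RFC proposes), and wedging the device when a kernel queue
times out (the alternative raised in review). It is not xe driver code. All
names prefixed with sketch_ are hypothetical, and the model deliberately
ignores GT reset, job erroring and the GuC entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the xe driver's definitions. */
#define SKETCH_QUEUE_FLAG_KERNEL (1u << 0)

struct sketch_queue {
	unsigned int flags;
	bool banned;
};

struct sketch_device {
	bool wedged;
};

enum sketch_policy {
	SKETCH_POLICY_ALWAYS_BAN,   /* behaviour before the patch */
	SKETCH_POLICY_SKIP_KERNEL,  /* behaviour proposed in the RFC */
	SKETCH_POLICY_WEDGE_KERNEL, /* alternative raised in review */
};

/* Apply one of the three policies to a queue whose job timed out. */
static void sketch_handle_timeout(struct sketch_device *dev,
				  struct sketch_queue *q,
				  enum sketch_policy policy)
{
	bool is_kernel;

	assert(dev != NULL && q != NULL);
	is_kernel = (q->flags & SKETCH_QUEUE_FLAG_KERNEL) != 0;

	switch (policy) {
	case SKETCH_POLICY_ALWAYS_BAN:
		q->banned = true;
		break;
	case SKETCH_POLICY_SKIP_KERNEL:
		/* Kernel queues stay usable; nothing records the failure. */
		if (!is_kernel)
			q->banned = true;
		break;
	case SKETCH_POLICY_WEDGE_KERNEL:
		/* A failed kernel job takes the whole device down. */
		q->banned = true;
		if (is_kernel)
			dev->wedged = true;
		break;
	}
}

/* A queue accepts new work only if neither it nor the device is dead. */
static bool sketch_can_submit(const struct sketch_device *dev,
			      const struct sketch_queue *q)
{
	return !dev->wedged && !q->banned;
}
```

In this model, SKETCH_POLICY_SKIP_KERNEL leaves a kernel queue submittable
after a timeout with no trace that a clear or migration was lost, which is the
correctness concern raised above. SKETCH_POLICY_WEDGE_KERNEL blocks all further
submission until the (modelled) device is reloaded.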

Thread overview: 8+ messages
2026-06-03 12:06 [RFC PATCH 1/3] drm/xe: skip banning kernel migration queue on TDR timeout Sanjay Yadav
2026-06-03 12:06 ` [RFC PATCH 2/3] drm/sched: fix drm_sched_tdr_queue_imm to not corrupt timeout value Sanjay Yadav
2026-06-03 13:47   ` Rodrigo Vivi
2026-06-03 12:06 ` [RFC PATCH 3/3] drm/xe: don't cancel other pending jobs on kernel migration queue timeout Sanjay Yadav
2026-06-03 12:21 ` ✓ CI.KUnit: success for series starting with [RFC,1/3] drm/xe: skip banning kernel migration queue on TDR timeout Patchwork
2026-06-03 12:42 ` [RFC PATCH 1/3] " Matthew Auld
2026-06-03 13:52   ` Rodrigo Vivi [this message]
2026-06-03 15:13     ` Hellstrom, Thomas
