public inbox for intel-xe@lists.freedesktop.org
From: Matthew Brost <matthew.brost@intel.com>
To: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH] drm/xe: Suppress reset log for destroyed queues
Date: Mon, 6 Apr 2026 13:08:34 -0700	[thread overview]
Message-ID: <adQSwo+SkbQc4goC@gsse-cloud1.jf.intel.com> (raw)
In-Reply-To: <c06619c4-59ea-4ed1-a76b-4d0a70fc8900@intel.com>

On Fri, Apr 03, 2026 at 10:33:55AM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 4/2/2026 5:45 PM, Matthew Brost wrote:
> > On Thu, Apr 02, 2026 at 03:51:36PM -0700, Daniele Ceraolo Spurio wrote:
> > > 
> > > On 4/2/2026 2:37 PM, Matthew Brost wrote:
> > > > On Thu, Apr 02, 2026 at 02:30:30PM -0700, Daniele Ceraolo Spurio wrote:
> > > > > When a queue is destroyed while still active on the HW (for example
> > > > > because the app owning it is exiting abruptly), the driver tells the GuC
> > > > > to preempt it off the HW immediately and to reset it if it doesn't
> > > > > preempt. This can cause a reset log to be printed to dmesg, which can
> > > > > be confusing to users as resets are commonly tied to errors, while in
> > > > > this case the reset is just done to speed up the cleanup.
> > > > > Given that a queue is only destroyed once all refs on it have been
> > > > > released (i.e., no one cares about it anymore), the log of it being
> > > > > reset is not useful and therefore we can simply suppress it to avoid
> > > > > confusion.
> > > > > 
> > > > > Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >    drivers/gpu/drm/xe/xe_guc_submit.c | 7 ++++---
> > > > >    1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > index 10556156eaad..e6702bd99309 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > @@ -2968,9 +2968,10 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
> > > > >    	if (unlikely(!q))
> > > > >    		return -EPROTO;
> > > > > -	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d, state=0x%0x",
> > > > > -		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id,
> > > > > -		   atomic_read(&q->guc->state));
> > > > > +	if (!exec_queue_destroyed(q))
> > > > I think you want killed here—right? Destroyed is tied to the refcount,
> > > I actually wanted destroyed here. I was thinking that a queue can be killed
> > > outside of the user control while the user is still using it (e.g. PXP
> > > queues are killed when the PXP session is invalidated), in which case it
> > > might be useful to keep the log.
> > > I have been pinged multiple times about engine reset logs on destroyed
> > > queues and I kind of got tired of having to explain that it was irrelevant,
> > > which is why I'd like to just suppress the log in that scenario. Maybe I can
> > > rework the commit message to be clearer on where I am coming from?
> > This was my mistake—I didn’t read your commit message closely enough.
> > 
> > However, destroyed can only be set once the queue is off the hardware
> > due to the reference counting between jobs and the queue (job hold a ref
> > to the queue). Note that disable_scheduling_deregister, which sets
> > set_exec_queue_destroyed, is called from __guc_exec_queue_process_msg_cleanup,
> > which only happens when the queue refcount reaches zero.
> > 
> > We can only reach a refcount of zero once all jobs have naturally
> > completed, or after we time out jobs, disable scheduling (a one-way
> > transition), and then signal the job fences.
> > 
> > So we should likely assert !destroyed in
> > xe_guc_exec_queue_reset_handler, since reaching that state afterward
> > really shouldn’t be possible.
> > 
> > Compare this to a user pressing Ctrl-C on an app (which sets kill). We
> > can trivially trigger this message while the queue is still on the GPU
> > because the TDR (immediately kicked) sets the preemption timeout to the
> > minimum value, disables scheduling, making an engine reset highly
> > likely. I think this is the case you’re trying to fix, unless I’m
> > misunderstanding the problem.
> > 
> > I haven't looked at the PXP cases so I can't comment, but I will look
> > in a bit if we need to discuss further.
> > 
> > Matt
> 
> mmm, in the logs I've seen the queue was marked as both killed and destroyed

Ah, yes—I probably missed a race involving the final fence signal: the
queue is closed, we issue a scheduling disable with a preempt timeout
of 0, and the hardware context switch-out is still in flight. Perhaps
disable_scheduling_deregister should be a bit more forgiving with the
preempt timeout; having the GuC issue a reset while the context is
already switching out likely just wastes cycles. Somewhat of a
separate issue, though.

> when the reset occurred (state = 0x99), but the job also seemed to have
> completed naturally (i.e., the engine reset happens during the context
> switch out).
> However, your reasoning makes sense and my case is not what usually happens
> on ctrl+c, so it might be easier to just check for killed here and put an
> explicit log in the PXP code to cover that use-case (which AFAICT is the
> only case where a user queue can be killed while the fd is still in use).
> Let me think about it a bit more.
> 

Yes, I think this is the way to go. Skipping the timed-out job
messages and devcoredumps on Ctrl-C was the reason for checking killed
in the TDR: we received feedback from Mesa along the lines of, “Why
are messages popping up?” or “Why do I get a devcoredump?” when we
force an application to close.

Matt

> Thanks for the feedback!
> Daniele
> 
> > > > whereas killed is tied to either a user closing an exec_queue or the DRM
> > > > render FD being closed. There may also be a multi-queue case that needs
> > > > slightly different logic. I forget the exact multi-queue teardown flow,
> > > > but IIRC it is slightly different. Niranjana would likely know that
> > > > offhand; otherwise, I’d have to reverse-engineer those flows again.
> > > I'll check that out.
> > > 
> > > Thanks,
> > > Daniele
> > > 
> > > > Matt
> > > > 
> > > > > +		xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d, state=0x%0x",
> > > > > +			   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id,
> > > > > +			   atomic_read(&q->guc->state));
> > > > >    	trace_xe_exec_queue_reset(q);
> > > > > -- 
> > > > > 2.43.0
> > > > > 
> 


Thread overview: 9+ messages
2026-04-02 21:30 [PATCH] drm/xe: Suppress reset log for destroyed queues Daniele Ceraolo Spurio
2026-04-02 21:37 ` Matthew Brost
2026-04-02 22:51   ` Daniele Ceraolo Spurio
2026-04-03  0:45     ` Matthew Brost
2026-04-03 17:33       ` Daniele Ceraolo Spurio
2026-04-06 20:08         ` Matthew Brost [this message]
2026-04-02 21:37 ` ✓ CI.KUnit: success for " Patchwork
2026-04-02 22:14 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-03 11:29 ` ✗ Xe.CI.FULL: failure " Patchwork
