Re: [PATCH] drm/xe: Suppress reset log for destroyed queues

public inbox for intel-xe@lists.freedesktop.org
 help / color / mirror / Atom feed

From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH] drm/xe: Suppress reset log for destroyed queues
Date: Fri, 3 Apr 2026 10:33:55 -0700	[thread overview]
Message-ID: <c06619c4-59ea-4ed1-a76b-4d0a70fc8900@intel.com> (raw)
In-Reply-To: <ac8NoEXghE1XbfbE@gsse-cloud1.jf.intel.com>



On 4/2/2026 5:45 PM, Matthew Brost wrote:
> On Thu, Apr 02, 2026 at 03:51:36PM -0700, Daniele Ceraolo Spurio wrote:
>>
>> On 4/2/2026 2:37 PM, Matthew Brost wrote:
>>> On Thu, Apr 02, 2026 at 02:30:30PM -0700, Daniele Ceraolo Spurio wrote:
>>>> When a queue is destroyed while still active on the HW (for example
>>>> because the app owning it is exiting abruptly), the driver tells the GuC
>>>> to preempt it off the HW immediately and to reset it if it doesn't
>>>> preempt. This can cause a reset log to be printed to dmesg, which can
>>>> be confusing to users as resets are commonly tied to errors, while in
>>>> this case the reset is just done to speed up the cleanup.
>>>> Given that a queue is only destroyed once all refs on it have been
>>>> released (i.e., no one cares about it anymore), the log of it being
>>>> reset is not useful and therefore we can simply suppress it to avoid
>>>> confusion.
>>>>
>>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_guc_submit.c | 7 ++++---
>>>>    1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> index 10556156eaad..e6702bd99309 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>> @@ -2968,9 +2968,10 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
>>>>    	if (unlikely(!q))
>>>>    		return -EPROTO;
>>>> -	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d, state=0x%0x",
>>>> -		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id,
>>>> -		   atomic_read(&q->guc->state));
>>>> +	if (!exec_queue_destroyed(q))
>>> I think you want killed here—right? Destroyed is tied to the refcount,
>> I actually wanted destroyed here. I was thinking that a queue can be killed
>> outside of the user control while the user is still using it (e.g. PXP
>> queues are killed when the PXP session is invalidated), in which case it
>> might be useful to keep the log.
>> I have been pinged multiple times about engine reset logs on destroyed
>> queues and I kind of got tired of having to explain that it was irrelevant,
>> which is why I'd like to just suppress the log in that scenario. Maybe I can
>> rework the commit message to be clearer on where I am coming from?
> This was my mistake—I didn’t read your commit message closely enough.
>
> However, destroyed can only be set once the queue is off the hardware
> due to the reference counting between jobs and the queue (job hold a ref
> to the queue). Note that disable_scheduling_deregister, which sets
> set_exec_queue_destroyed, is called from __guc_exec_queue_process_msg_cleanup,
> which only happens when the queue refcount reaches zero.
>
> We can only reach a refcount of zero once all jobs have naturally
> completed, or after we time out jobs, disable scheduling (a one-way
> transition), and then signal the job fences.
>
> So we should likely assert !destroyed in
> xe_guc_exec_queue_reset_handler, since reaching that state afterward
> really shouldn’t be possible.
>
> Compare this to a user pressing Ctrl-C on an app (which sets kill). We
> can trivially trigger this message while the queue is still on the GPU
> because the TDR (immediated kicked) sets the preemption timeout to the
> minimum value, disables scheduling, making an engine reset highly
> likely. I think this is the case you’re trying to fix, unless I’m
> misunderstanding the problem.
>
> I haven't looked at the PXP cases so can't comment but will look in bit
> if we to discuss further.
>
> Matt

mmm, in the logs I've seen the queue was marked as both killed and 
destroyed when the reset occurred (state = 0x99), but the job also 
seemed to have completed naturally (i.e., the engine reset happens 
during the context switch out).
However, your reasoning makes sense and my case is not what usually 
happens on ctrl+c, so it might be easier to just check for killed here 
and put an explicit log in the PXP code to cover that use-case (which 
AFAICT is the only case where a user queue can be killed while the fd is 
still in use).
Let me think about it a bit more.

Thanks for the feedback!
Daniele

>>> whereas killed is tied to either a user closing an exec_queue or the DRM
>>> render FD being closed. There may also be a multi-queue case that needs
>>> slightly different logic. I forget the exact multi-queue teardown flow,
>>> but IIRC it is slightly different. Niranjana would likely know that
>>> offhand; otherwise, I’d have to reverse-engineer those flows again.
>> I'll check that out.
>>
>> Thanks,
>> Daniele
>>
>>> Matt
>>>
>>>> +		xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d, state=0x%0x",
>>>> +			   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id,
>>>> +			   atomic_read(&q->guc->state));
>>>>    	trace_xe_exec_queue_reset(q);
>>>> -- 
>>>> 2.43.0
>>>>

next prev parent reply	other threads:[~2026-04-03 17:34 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-02 21:30 [PATCH] drm/xe: Suppress reset log for destroyed queues Daniele Ceraolo Spurio
2026-04-02 21:37 ` Matthew Brost
2026-04-02 22:51   ` Daniele Ceraolo Spurio
2026-04-03  0:45     ` Matthew Brost
2026-04-03 17:33       ` Daniele Ceraolo Spurio [this message]
2026-04-06 20:08         ` Matthew Brost
2026-04-02 21:37 ` ✓ CI.KUnit: success for " Patchwork
2026-04-02 22:14 ` ✓ Xe.CI.BAT: " Patchwork
2026-04-03 11:29 ` ✗ Xe.CI.FULL: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c06619c4-59ea-4ed1-a76b-4d0a70fc8900@intel.com \
    --to=daniele.ceraolospurio@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.brost@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox