From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Dong, Zhanjun" <zhanjun.dong@intel.com>,
"Lin, Shuicheng" <shuicheng.lin@intel.com>,
"Vishwanathapura,
Niranjana" <niranjana.vishwanathapura@intel.com>
Subject: Re: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
Date: Thu, 23 Oct 2025 17:43:34 +0000
Message-ID: <b0ddbf8914338d94bce2ba6c092f6c3fd0635c57.camel@intel.com>
In-Reply-To: <aPlJihKuwV3A4cLP@lstrano-desk.jf.intel.com>
On Wed, 2025-10-22 at 14:15 -0700, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 09:45:28PM +0000, Stuart Summers wrote:
> > When the driver is wedged during a hardware failure, there
> > is a chance the queue kill coming from those events can
> > race with either the scheduler teardown or the queue
> > deregistration with GuC. Basically the following two
> > scenarios can occur (from event trace):
> >
> > Scheduler start missing:
> > xe_exec_queue_create
>
> The queues should be initialized in a started state unless a GT reset
> or VF migration is in progress. In both cases, upon successful
> completion, all queues will be restarted.
>
> I did spot a bug in GT resets — if those fail, we don’t properly
> restart the queues. That should be fixed.
>
> Also, I think xe_guc_declare_wedged is incorrect now that I’m looking
> at it.
>
> It probably should be:
>
> void xe_guc_declare_wedged(struct xe_guc *guc)
> {
> 	xe_gt_assert(guc_to_gt(guc), guc_to_xe(guc)->wedged.mode);
>
> 	xe_guc_ct_stop(&guc->ct);
> 	xe_guc_submit_wedge(guc);
> 	xe_guc_sanitize(guc);
> }
It's a good point, but to be clear, I'm not doing a GT reset here. In
the case I'm testing, there is an explicit PCIe FLR, after which I
explicitly wedge the device and make sure the unbind completes
error-free. It's not something I'm necessarily planning on driving into
the tree, just something I'm doing for internal testing.
But yeah, I'll give this a try with a GT reset in the mix to make sure
that cleans up the way you're suggesting. And thanks for that wedge
revision, I'll give that a try too.
>
> > xe_exec_queue_kill
> > xe_guc_exec_queue_kill
> > xe_exec_queue_destroy
> >
> > GuC CT response missing:
> > xe_exec_queue_create
> > xe_exec_queue_register
> > xe_exec_queue_scheduling_enable
> > xe_exec_queue_scheduling_done
> > xe_exec_queue_kill
> > xe_guc_exec_queue_kill
> > xe_exec_queue_close
> > xe_exec_queue_destroy
> > xe_exec_queue_cleanup_entity
> > xe_exec_queue_scheduling_disable
>
> The ref count should be zero here — xe_exec_queue_scheduling_disable is
> called after this series [1].
I did confirm that it is (at least per the get_unless_zero() call
below).
> I think we need to get this series in before making changes to the GuC
> submission state machine. Technically, all we need are the last three
> patches from that series, as they simplify some things. I believe an
> upcoming Xe3 feature would also benefit from getting these patches in
> too.
Let me pull in that series. I'm still going through it on my side...
thanks for the link!
>
> So that means in xe_guc_submit_wedge() the below if statement is going
> to fail.
>
> 	mutex_lock(&guc->submission_state.lock);
> 	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> 		if (xe_exec_queue_get_unless_zero(q))
> 			set_exec_queue_wedged(q);
> 	mutex_unlock(&guc->submission_state.lock);
>
> I think we need...
>
> 	else if (exec_queue_registered(q))
> 		__guc_exec_queue_destroy(guc, q);
Right.. I remember you mentioning that also in a prior rev... let me
confirm here. When I was testing, this wasn't working in all cases, but
I'll double check and get back.
Also, that was the point of the pending_disable check here. We
explicitly set that in this flow, whereas registered has a bunch of
entry points, and I was trying to isolate the case of GuC dying mid-CT
send. It seems to me that if we have registered set but not
pending_disable, we have a bug in the sequence somewhere rather than an
outside error injection (like GuC dying).
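To spell out why I keyed off pending_disable: in the scheduling-disable
path it gets set right around the CT send, so it's a tight marker for
GuC dying mid-send. Roughly this ordering (a sketch, not the literal
code; action/len/g2h_len/num_g2h are placeholders):

	/* scheduling-disable path, roughly: */
	set_exec_queue_pending_disable(q);
	xe_guc_ct_send(&guc->ct, action, len, g2h_len, num_g2h);
	/*
	 * If GuC dies here, the scheduling-done G2H never arrives, so
	 * pending_disable stays set and the deregister/destroy that
	 * would normally follow never happens.
	 */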
>
> We also need to clean up suspend fences, as those could be lost under
> the right race condition.
>
> So prior to the existing if statement, we also need:
>
> 	if (q->guc->suspend_pending)
> 		suspend_fence_signal(q);
Ok
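Just to make sure I have the shape of it right, folding both of your
snippets in, xe_guc_submit_wedge() would look roughly like this
(untested sketch; I used exec_queue_registered() for the registered
check):

	mutex_lock(&guc->submission_state.lock);
	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
		/* Don't leave a suspend fence dangling across the wedge */
		if (q->guc->suspend_pending)
			suspend_fence_signal(q);

		if (xe_exec_queue_get_unless_zero(q))
			set_exec_queue_wedged(q);
		else if (exec_queue_registered(q))
			__guc_exec_queue_destroy(guc, q);
	}
	mutex_unlock(&guc->submission_state.lock);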
>
> [1] https://patchwork.freedesktop.org/series/155315/
>
> >
> > The above traces also depend on inclusion of [1].
> >
> > In the first scenario, the queue is created, but killed
> > prior to completing the message cleanup. In the second,
> > we go through a full registration before killing. The
> > CT communication happens in that last call to
> > xe_exec_queue_scheduling_disable.
> >
> > In both cases, we would then expect a call to xe_guc_exec_queue_destroy
> > if the aforementioned scheduler/GuC CT communication had happened. That
> > call is missing here, so the LRC/BO cleanup for the exec queues in
> > question is missed as well.
> >
> > Since this sequence seems specific to the wedge case
> > as described above, add a targeted scheduler start
> > and GuC deregistration handler to the wedged_fini()
> > routine.
> >
> > Without this change, if we inject wedges in the above scenarios
> > we can expect the following when the DRM memory tracking is
> > enabled (see CONFIG_DRM_DEBUG_MM):
> > [ 129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
> >    drm_mm_insert_node_in_range+0x2ec/0x4b0
> >    __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> >    __xe_bo_create_locked+0x184/0x520 [xe]
> >    xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> >    xe_bo_create_pin_map+0x13/0x20 [xe]
> >    xe_lrc_create+0x139/0x18e0 [xe]
> >    xe_exec_queue_create+0x22f/0x3e0 [xe]
> >    xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
> >    drm_ioctl_kernel+0x9f/0xf0
> >    drm_ioctl+0x20f/0x440
> >    xe_drm_ioctl+0x121/0x150 [xe]
> >    __x64_sys_ioctl+0x8c/0xe0
> >    do_syscall_64+0x4c/0x1d0
> >    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
> >    drm_mm_insert_node_in_range+0x2ec/0x4b0
> >    __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> >    __xe_bo_create_locked+0x184/0x520 [xe]
> >    xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> >    xe_bo_create_pin_map+0x13/0x20 [xe]
> >    xe_lrc_create+0x139/0x18e0 [xe]
> >    xe_exec_queue_create+0x22f/0x3e0 [xe]
> >    xe_exec_queue_create_bind+0x7f/0xd0 [xe]
> >    xe_vm_create+0x4aa/0x8b0 [xe]
> >    xe_vm_create_ioctl+0x17b/0x420 [xe]
> >    drm_ioctl_kernel+0x9f/0xf0
> >    drm_ioctl+0x20f/0x440
> >    xe_drm_ioctl+0x121/0x150 [xe]
> >    __x64_sys_ioctl+0x8c/0xe0
> >    do_syscall_64+0x4c/0x1d0
> >    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> >
> > [1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
> > ---
> > drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 5ec1e4a83d68..a11ae4e70809 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> >  	xa_destroy(&guc->submission_state.exec_queue_lookup);
> >  }
> >  
> > +static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
> > +
> >  static void guc_submit_wedged_fini(void *arg)
> >  {
> >  	struct xe_guc *guc = arg;
> > @@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
> >  			mutex_unlock(&guc->submission_state.lock);
> >  			xe_exec_queue_put(q);
> >  			mutex_lock(&guc->submission_state.lock);
>
> With everything above I don't think this new code below is needed.
>
> But to make sure we know what we are doing, how about this from [2]
> before the xe_exec_queue_put?
>
> xe_gt_assert(..., !drm_sched_is_stopped(sched));
Yeah I agree this seems like a good idea. It also follows what we're
doing in the other state changes.
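Concretely, I'm picturing something like this in guc_submit_wedged_fini()
(sketch only; I'm assuming the drm_sched_is_stopped() helper from [2]
takes the underlying drm scheduler):

	if (exec_queue_wedged(q)) {
		/* A wedged queue should never see a stopped scheduler here */
		xe_gt_assert(guc_to_gt(guc),
			     !drm_sched_is_stopped(&q->guc->sched.base));
		mutex_unlock(&guc->submission_state.lock);
		xe_exec_queue_put(q);
		mutex_lock(&guc->submission_state.lock);
	}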
>
> Wanna try out these suggestions? It is always possible I made a
> mistake here.
Really appreciate the feedback Matt. Yeah I'll take a look and get back
if it still doesn't work here.
Thanks,
Stuart
>
> Matt
>
> [2]
> https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
>
> > +		} else {
> > +			/*
> > +			 * Make sure queues which were killed as part of a
> > +			 * wedge are cleaned up properly. Clean up any
> > +			 * dangling scheduler tasks and pending exec queue
> > +			 * deregistration.
> > +			 */
> > +			xe_sched_submission_start(&q->guc->sched);
> > +			if (exec_queue_pending_disable(q))
> > +				__guc_exec_queue_destroy(guc, q);
> >  		}
> >  	}
> >  	mutex_unlock(&guc->submission_state.lock);
> > --
> > 2.34.1
> >