Intel-XE Archive on lore.kernel.org
From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Dong,  Zhanjun" <zhanjun.dong@intel.com>,
	"Lin, Shuicheng" <shuicheng.lin@intel.com>,
	"Vishwanathapura,
	Niranjana" <niranjana.vishwanathapura@intel.com>
Subject: Re: [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge
Date: Thu, 23 Oct 2025 17:43:34 +0000	[thread overview]
Message-ID: <b0ddbf8914338d94bce2ba6c092f6c3fd0635c57.camel@intel.com> (raw)
In-Reply-To: <aPlJihKuwV3A4cLP@lstrano-desk.jf.intel.com>

On Wed, 2025-10-22 at 14:15 -0700, Matthew Brost wrote:
> On Mon, Oct 20, 2025 at 09:45:28PM +0000, Stuart Summers wrote:
> > When the driver is wedged during a hardware failure, there
> > is a chance the queue kill triggered by those failure
> > events can race with either the scheduler teardown or the
> > queue deregistration with the GuC. Basically, the following
> > two scenarios can occur (from the event trace):
> > 
> > Scheduler start missing:
> >   xe_exec_queue_create
> 
> The queues should be initialized in a started state unless a GT reset
> or VF migration is in progress. In both cases, upon successful
> completion, all queues will be restarted.
> 
> I did spot a bug in GT resets — if those fail, we don’t properly
> restart the queues. That should be fixed.
> 
> Also, I think xe_guc_declare_wedged is incorrect now that I’m looking
> at it.
> 
> It probably should be:
> 
> void xe_guc_declare_wedged(struct xe_guc *guc)
> {
>         xe_gt_assert(guc_to_gt(guc), guc_to_xe(guc)->wedged.mode);
> 
>         xe_guc_ct_stop(&guc->ct);
>         xe_guc_submit_wedge(guc);
>         xe_guc_sanitize(guc);
> }

It's a good point, but to be clear, I'm not doing a GT reset here.
Actually, in the case I'm testing there is an explicit PCIe FLR, and
then I explicitly wedge afterwards and make sure the unbind completes
error-free. That's not something I'm necessarily planning on driving
into the tree here, but I'm doing it for some internal testing.

But yeah, I'll give this a try with a GT reset in the mix to make sure
that cleans up the way you're suggesting. And thanks for that wedge
revision; I'll give that a try too.

> 
> >   xe_exec_queue_kill
> >   xe_guc_exec_queue_kill
> >   xe_exec_queue_destroy
> > 
> > GuC CT response missing:
> >   xe_exec_queue_create
> >   xe_exec_queue_register
> >   xe_exec_queue_scheduling_enable
> >   xe_exec_queue_scheduling_done
> >   xe_exec_queue_kill
> >   xe_guc_exec_queue_kill
> >   xe_exec_queue_close
> >   xe_exec_queue_destroy
> >   xe_exec_queue_cleanup_entity
> >   xe_exec_queue_scheduling_disable
> 
> The ref count should be zero here — xe_exec_queue_scheduling_disable

I did confirm that it is (at least per the get_unless_zero() call
below).

> is called after this series [1]. I think we need to get this series in

Let me pull in that series. I'm still going through it on my side...
thanks for the link!

> before making changes to the GuC submission state machine.
> Technically, all we need are the last three patches from that series,
> as they simplify some things. I believe an upcoming Xe3 feature would
> also benefit from getting these patches in too.
> 
> So that means in xe_guc_submit_wedge() the below if statement is
> going to fail.
> 
>         mutex_lock(&guc->submission_state.lock);
>         xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
>                 if (xe_exec_queue_get_unless_zero(q))
>                         set_exec_queue_wedged(q);
>         mutex_unlock(&guc->submission_state.lock);
> 
> I think we need...
> 
> else if (exec_queue_registered(q))
>         __guc_exec_queue_destroy(guc, q);

Right... I remember you mentioning that in a prior rev as well; let me
confirm here. When I was testing, this wasn't working in all cases, but
I'll double check and get back.

Also, that was the point of the pending_disable check here. We
explicitly set that flag in this flow, whereas registered has a bunch
of entry points, and I was trying to isolate the case of the GuC dying
mid-CT send. It seems to me that if we have registered set but not
pending_disable, we have a bug in the sequence somewhere rather than an
outside error injection (like the GuC dying).

> 
> We also need to clean up suspend fences, as those could be lost under
> the right race condition.
> 
> So prior to the existing if statement, we also need:
> 
> if (q->guc->suspend_pending)
>         suspend_fence_signal(q);

Ok
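
Just so we're on the same page, here is a rough, untested sketch of how
I'd fold both of those into the loop in xe_guc_submit_wedge() (assuming
the helper names stay as they are today and the placement doesn't shift
once the [1] series lands):

        /*
         * Untested sketch: signal any dangling suspend fence first,
         * then either mark the queue wedged or, if the reference is
         * already gone but the queue is still registered with GuC,
         * drop the GuC-side state directly.
         */
        mutex_lock(&guc->submission_state.lock);
        xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
                if (q->guc->suspend_pending)
                        suspend_fence_signal(q);

                if (xe_exec_queue_get_unless_zero(q))
                        set_exec_queue_wedged(q);
                else if (exec_queue_registered(q))
                        __guc_exec_queue_destroy(guc, q);
        }
        mutex_unlock(&guc->submission_state.lock);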

> 
> [1] https://patchwork.freedesktop.org/series/155315/
> 
> > 
> > The above traces also depend on inclusion of [1].
> > 
> > In the first scenario, the queue is created but killed
> > before the message cleanup completes. In the second, we go
> > through a full registration before killing; the CT
> > communication happens in that last call to
> > xe_exec_queue_scheduling_disable.
> > 
> > In both cases we would then expect a call to
> > xe_guc_exec_queue_destroy once the aforementioned
> > scheduler/GuC CT communication had completed. That call is
> > missing here, so the LRC/BO cleanup for the exec queues in
> > question never happens.
> > 
> > Since this sequence seems specific to the wedge case
> > described above, add a targeted scheduler start and GuC
> > deregistration handler to the wedged_fini() routine.
> > 
> > Without this change, if we inject wedges in the above scenarios
> > we can expect the following when the DRM memory tracking is
> > enabled (see CONFIG_DRM_DEBUG_MM):
> > [  129.600285] [drm:drm_mm_takedown] *ERROR* node [00647000 + 00008000]: inserted at
> >                 drm_mm_insert_node_in_range+0x2ec/0x4b0
> >                 __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> >                 __xe_bo_create_locked+0x184/0x520 [xe]
> >                 xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> >                 xe_bo_create_pin_map+0x13/0x20 [xe]
> >                 xe_lrc_create+0x139/0x18e0 [xe]
> >                 xe_exec_queue_create+0x22f/0x3e0 [xe]
> >                 xe_exec_queue_create_ioctl+0x4e9/0xbf0 [xe]
> >                 drm_ioctl_kernel+0x9f/0xf0
> >                 drm_ioctl+0x20f/0x440
> >                 xe_drm_ioctl+0x121/0x150 [xe]
> >                 __x64_sys_ioctl+0x8c/0xe0
> >                 do_syscall_64+0x4c/0x1d0
> >                 entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  129.601966] [drm:drm_mm_takedown] *ERROR* node [0064f000 + 00008000]: inserted at
> >                 drm_mm_insert_node_in_range+0x2ec/0x4b0
> >                 __xe_ggtt_insert_bo_at+0x10f/0x360 [xe]
> >                 __xe_bo_create_locked+0x184/0x520 [xe]
> >                 xe_bo_create_pin_map_at_aligned+0x3b/0x180 [xe]
> >                 xe_bo_create_pin_map+0x13/0x20 [xe]
> >                 xe_lrc_create+0x139/0x18e0 [xe]
> >                 xe_exec_queue_create+0x22f/0x3e0 [xe]
> >                 xe_exec_queue_create_bind+0x7f/0xd0 [xe]
> >                 xe_vm_create+0x4aa/0x8b0 [xe]
> >                 xe_vm_create_ioctl+0x17b/0x420 [xe]
> >                 drm_ioctl_kernel+0x9f/0xf0
> >                 drm_ioctl+0x20f/0x440
> >                 xe_drm_ioctl+0x121/0x150 [xe]
> >                 __x64_sys_ioctl+0x8c/0xe0
> >                 do_syscall_64+0x4c/0x1d0
> >                 entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > 
> > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > 
> > [1] https://patchwork.freedesktop.org/patch/680852/?series=155352&rev=4
> > ---
> >  drivers/gpu/drm/xe/xe_guc_submit.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 5ec1e4a83d68..a11ae4e70809 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -287,6 +287,8 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> >         xa_destroy(&guc->submission_state.exec_queue_lookup);
> >  }
> >  
> > +static void __guc_exec_queue_destroy(struct xe_guc *guc, struct xe_exec_queue *q);
> > +
> >  static void guc_submit_wedged_fini(void *arg)
> >  {
> >         struct xe_guc *guc = arg;
> > @@ -299,6 +301,16 @@ static void guc_submit_wedged_fini(void *arg)
> >                         mutex_unlock(&guc->submission_state.lock);
> >                         xe_exec_queue_put(q);
> >                         mutex_lock(&guc->submission_state.lock);
> 
> With everything above I don't think this new code below is needed.
> 
> But to make sure we know what we are doing, how about adding this
> from [2] before the xe_exec_queue_put:
> 
> xe_gt_assert(..., !drm_sched_is_stopped(sched));

Yeah I agree this seems like a good idea. It also follows what we're
doing in the other state changes.
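
Assuming the drm_sched_is_stopped() helper from [2] lands, I'd expect
it to slot in right before the put, roughly like this (sketch only; the
exact scheduler handle the helper takes may differ):

        /*
         * Sketch (untested): assert the scheduler was restarted before
         * dropping the reference.
         */
        xe_gt_assert(guc_to_gt(guc), !drm_sched_is_stopped(&q->guc->sched));
        mutex_unlock(&guc->submission_state.lock);
        xe_exec_queue_put(q);
        mutex_lock(&guc->submission_state.lock);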

> 
> Wanna try out these suggestions? It is always possible I made a
> mistake here.

Really appreciate the feedback, Matt. Yeah, I'll take a look and get
back if it still doesn't work here.

Thanks,
Stuart

> 
> Matt
> 
> [2]
> https://patchwork.freedesktop.org/patch/681606/?series=155315&rev=3
> 
> > +               } else {
> > +                       /*
> > +                        * Make sure queues which were killed as part of a
> > +                        * wedge are cleaned up properly. Clean up any
> > +                        * dangling scheduler tasks and pending exec queue
> > +                        * deregistration.
> > +                        */
> > +                       xe_sched_submission_start(&q->guc->sched);
> > +                       if (exec_queue_pending_disable(q))
> > +                               __guc_exec_queue_destroy(guc, q);
> >                 }
> >         }
> >         mutex_unlock(&guc->submission_state.lock);
> > -- 
> > 2.34.1
> > 


Thread overview: 15+ messages
2025-10-20 21:45 [PATCH 0/7] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-20 21:45 ` [PATCH 1/7] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-20 21:45 ` [PATCH 2/7] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-20 21:45 ` [PATCH 3/7] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-20 21:45 ` [PATCH 4/7] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-20 21:45 ` [PATCH 5/7] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-22 20:30   ` Matthew Brost
2025-10-23 17:18     ` Summers, Stuart
2025-10-20 21:45 ` [PATCH 6/7] drm/xe: Clean up GuC software state after a wedge Stuart Summers
2025-10-22 21:15   ` Matthew Brost
2025-10-23 17:43     ` Summers, Stuart [this message]
2025-10-23 18:26       ` Matthew Brost
2025-10-20 21:45 ` [PATCH 7/7] drm/xe/doc: Add GuC submission kernel-doc Stuart Summers
2025-10-20 22:05   ` Matthew Brost
2025-10-20 22:07     ` Summers, Stuart
