From: "Summers, Stuart" <stuart.summers@intel.com>
To: "Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Lin, Shuicheng" <shuicheng.lin@intel.com>
Subject: Re: [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge
Date: Fri, 17 Oct 2025 14:48:11 +0000
Message-ID: <f213a216ae227f0119f8a50cc968c728a26bb44b.camel@intel.com>
In-Reply-To: <aPFtYIRh2Q/3Z64G@lstrano-desk.jf.intel.com>
On Thu, 2025-10-16 at 15:10 -0700, Matthew Brost wrote:
> On Thu, Oct 16, 2025 at 11:50:44AM -0600, Summers, Stuart wrote:
> > On Wed, 2025-10-15 at 20:42 -0700, Matthew Brost wrote:
> > > On Wed, Oct 15, 2025 at 01:45:35PM -0600, Summers, Stuart wrote:
> > > > On Tue, 2025-10-14 at 18:09 +0000, Stuart Summers wrote:
> > > > > Comments are also added to the code, but in the event of
> > > > > a wedge or a hardware failure while communication with
> > > > > GuC is outstanding (e.g. during a schedule disable or
> > > > > context deregistration), the driver doesn't automatically
> > > > > reset the software state as it would in a typical GT reset,
> > > > > since we are trying to save the state for debug. However,
> > > > > once the user unbinds the driver we still need to go through
> > > > > and clean everything up for these exec queues so we don't
> > > > > leak memory on the DRM side (e.g. LRC or LRC BO).
> > > > >
> > > > > Add a kick start to the DRM scheduler to handle any
> > > > > outstanding messages on hold during the wedge and go
> > > > > through the GuC stop flow to simulate that reset on
> > > > > teardown.
> > > > >
> > > > > Signed-off-by: Stuart Summers <stuart.summers@intel.com>
> > > >
> > > > Please hold on review here. I think there's still something
> > > > missing on the wedge cleanup case that causes issues after
> > > > wedging and then doing binds/unbinds in a loop. I'm working
> > > > to resolve that and I'll post a new series after.
> > > >
> > >
> > > Sure. Just a couple of thoughts below which might help.
> > >
> > > > Thanks,
> > > > Stuart
> > > >
> > > > > ---
> > > > > drivers/gpu/drm/xe/xe_guc_submit.c | 15 +++++++++++++++
> > > > > 1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > index 5ec1e4a83d68..0bbae336c722 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > > > > @@ -276,6 +276,14 @@ static void guc_submit_fini(struct drm_device *drm, void *arg)
> > > > >  	struct xe_gt *gt = guc_to_gt(guc);
> > > > >  	int ret;
> > > > >
> > > > > +	/*
> > > > > +	 * If GuC stopped responding during deregistration
> > > > > +	 * some queues can be left in a bad state. Ensure
> > > > > +	 * these are all cleaned up by going through the
> > > > > +	 * GuC software reset flow.
> > > > > +	 */
> > > > > +	xe_guc_stop(guc);
> > > > > +
> > >
> > > I think we'd need to restart here at minimum after cleanup.
> >
> > Yeah this was one of the key issues, just trying to find the right
> > pieces to insert here..
> >
> > >
> > > > >  	ret = wait_event_timeout(guc->submission_state.fini_wq,
> > > > >  				 xa_empty(&guc->submission_state.exec_queue_lookup),
> > > > >  				 HZ * 5);
> > > > > @@ -295,6 +303,13 @@ static void guc_submit_wedged_fini(void *arg)
> > > > >
> > > > >  	mutex_lock(&guc->submission_state.lock);
> > > > >  	xa_for_each(&guc->submission_state.exec_queue_lookup, index, q) {
> > > > > +		/*
> > > > > +		 * Kick start the scheduler since some messages
> > > > > +		 * might have been added while the scheduler was
> > > > > +		 * stopped during a wedge event.
> > > > > +		 */
> > > > > +		xe_sched_submission_start(&q->guc->sched);
> > > > > +
> > >
> > > I don't think it's possible to get here with a stopped queue;
> > > I think it's a bug elsewhere if we do.
> >
> > Hm.. so this would mean either we went through a GT reset prior
> > to the wedge (which should have restarted at the end of that),
> > which isn't the case here, or the job timed out and stopped the
> > scheduler for that queue. Seems like there might be a bug in
> > that path...
> >
>
> The job timeout mechanism should always restart the queue. I just
> checked, and it looks like there's a bug in the vf_recovery flows,
> but that shouldn't affect this case. I did review the GT reset
> path, and if that fails after xe_uc_stop is called, the queues are
> not restarted, which is a problem. We likely need to call something
> like xe_guc_submit_pause_abort (or a GT-layer equivalent) to
> restart the queues. If we don't, it could lead to memory-unsafe
> behavior as things will not get freed. This call should happen
> after wedging the device.
We should not be calling uc_stop() though if we're in a wedge state.
The uc_stop_prepare() and uc_stop() calls are part of the software
reset that is skipped when we're wedged. In this patch, I'm
explicitly stopping to try to reset the queue, which is why it then
needs to be restarted. The stop was intended to fix one bug and the
restart after stop fixes a second, but it still isn't the right fix
as discussed.
Thanks,
Stuart
>
> Matt
>
> > Thanks,
> > Stuart
> >
> > >
> > > Matt
> > >
> > > > > +		if (exec_queue_wedged(q)) {
> > > > > +			mutex_unlock(&guc->submission_state.lock);
> > > > > +			xe_exec_queue_put(q);
> > > >
> >
Thread overview: 20+ messages
2025-10-14 18:09 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-14 18:09 ` [PATCH 1/6] drm/xe: Add additional trace points for LRCs Stuart Summers
2025-10-14 18:09 ` [PATCH 2/6] drm/xe: Add a trace point for VM close Stuart Summers
2025-10-14 18:09 ` [PATCH 3/6] drm/xe: Add the BO pointer info to the BO trace Stuart Summers
2025-10-14 18:09 ` [PATCH 4/6] drm/xe: Add new exec queue trace points Stuart Summers
2025-10-14 18:09 ` [PATCH 5/6] drm/xe: Correct migration VM teardown order Stuart Summers
2025-10-15 0:59 ` Matthew Brost
2025-10-15 17:55 ` Summers, Stuart
2025-10-14 18:09 ` [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge Stuart Summers
2025-10-15 19:45 ` Summers, Stuart
2025-10-16 3:42 ` Matthew Brost
2025-10-16 17:50 ` Summers, Stuart
2025-10-16 22:10 ` Matthew Brost
2025-10-17 14:48 ` Summers, Stuart [this message]
2025-10-14 18:15 ` ✗ CI.checkpatch: warning for Fix a couple of wedge corner-case memory leaks (rev4) Patchwork
2025-10-14 18:16 ` ✓ CI.KUnit: success " Patchwork
2025-10-14 19:13 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-10-14 21:14 ` Summers, Stuart
2025-10-15 5:18 ` ✗ Xe.CI.Full: " Patchwork
-- strict thread matches above, loose matches on Subject: below --
2025-10-27 18:04 [PATCH 0/6] Fix a couple of wedge corner-case memory leaks Stuart Summers
2025-10-27 18:04 ` [PATCH 6/6] drm/xe: Clean up GuC software state after a wedge Stuart Summers