Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v2 3/3] drm/xe: Trigger queue cleanup if not in wedged mode 2
Date: Thu, 18 Dec 2025 17:10:25 -0800
Message-ID: <aUSmAdQ8bY3GjlFJ@lstrano-desk.jf.intel.com>
In-Reply-To: <24018a6e534cba02a43e7036eed4bf8965b30f4f.camel@intel.com>

On Thu, Dec 18, 2025 at 04:45:02PM -0700, Summers, Stuart wrote:
> On Thu, 2025-12-18 at 13:44 -0800, Matthew Brost wrote:
> > The intent of wedging a device is to allow queues to continue running
> > only in wedged mode 2. In other modes, queues should initiate cleanup
> > and signal all remaining fences. Fix xe_guc_submit_wedge to correctly
> > clean up queues when wedge mode != 2.
> 
> Yeah this makes sense. Should we use this for the fixes instead though?
> 
> commit 8ed9aaae39f39130b7a3eb2726be05d7f64b344c
> Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Date:   Tue Apr 23 18:18:16 2024 -0400
> 
>     drm/xe: Force wedged state and block GT reset upon any GPU hang
> 

Maybe both? I suspect those went into the same kernel release.

Either way, this series won’t apply cleanly to any stable tree. I’ll
follow up with Rodrigo and Thomas on a backport strategy, since this
definitely needs to land in all stable trees: wedging can take down the
machine.

We also need to expand our IGT coverage so that we run jobs (likely with
userptr), forcefully wedge, and ensure there are no hangs.

I’ll open a Jira now for this test case.

Matt

> Thanks,
> Stuart
> 
> > 
> > Fixes: 7dbe8af13c18 ("drm/xe: Wedge the entire device")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > ---
> > 
> > This fix will not apply outright to any stable kernel, as it depends
> > on functions that have been added to the KMD since the original
> > commit. We will likely have to manually send patches to stable for
> > the kernels we'd like to fix.
> > ---
> >  drivers/gpu/drm/xe/xe_guc_submit.c | 31 ++++++++++++++++++------------
> >  1 file changed, 19 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 58ec94439df1..63b984f6d78d 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -1313,6 +1313,7 @@ static void disable_scheduling_deregister(struct xe_guc *guc,
> >   */
> >  void xe_guc_submit_wedge(struct xe_guc *guc)
> >  {
> > +       struct xe_device *xe = guc_to_xe(guc);
> >         struct xe_gt *gt = guc_to_gt(guc);
> >         struct xe_exec_queue *q;
> >         unsigned long index;
> > @@ -1327,19 +1328,25 @@ void xe_guc_submit_wedge(struct xe_guc *guc)
> >         if (!guc->submission_state.initialized)
> >                 return;
> >  
> > -       err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> > -                                      guc_submit_wedged_fini, guc);
> > -       if (err) {
> > -               xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
> > -                         "Although device is wedged.\n");
> > -               return;
> > -       }
> > +       if (xe->wedged.mode == 2) {
> > +               err = devm_add_action_or_reset(guc_to_xe(guc)->drm.dev,
> > +                                              guc_submit_wedged_fini, guc);
> > +               if (err) {
> > +                       xe_gt_err(gt, "Failed to register clean-up on wedged.mode=2; "
> > +                                 "Although device is wedged.\n");
> > +                       return;
> > +               }
> >  
> > -       mutex_lock(&guc->submission_state.lock);
> > -       xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > -               if (xe_exec_queue_get_unless_zero(q))
> > -                       set_exec_queue_wedged(q);
> > -       mutex_unlock(&guc->submission_state.lock);
> > +               mutex_lock(&guc->submission_state.lock);
> > +               xa_for_each(&guc->submission_state.exec_queue_lookup, index, q)
> > +                       if (xe_exec_queue_get_unless_zero(q))
> > +                               set_exec_queue_wedged(q);
> > +               mutex_unlock(&guc->submission_state.lock);
> > +       } else {
> > +               /* Forcefully kill any remaining exec queues, signal fences */
> > +               xe_guc_submit_stop(guc);
> > +               xe_guc_submit_pause_abort(guc);
> > +       }
> >  }
> >  
> >  static bool guc_submit_hint_wedged(struct xe_guc *guc)
> 


Thread overview: 17+ messages
2025-12-18 21:44 [PATCH v2 0/3] Attempt to fixup reset, wedge, unload corner cases Matthew Brost
2025-12-18 21:44 ` [PATCH v2 1/3] drm/xe: Always kill exec queues in xe_guc_submit_pause_abort Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-18 21:44 ` [PATCH v2 2/3] drm/xe: Forcefully tear down exec queues in GuC submit fini Matthew Brost
2025-12-18 23:36   ` Summers, Stuart
2025-12-19  1:15     ` Matthew Brost
2026-01-08 19:00   ` Dong, Zhanjun
2026-01-08 19:17     ` Matthew Brost
2026-01-14 22:35       ` Dong, Zhanjun
2026-02-06  5:50         ` Matthew Brost
2026-02-06 20:29           ` Dong, Zhanjun
2025-12-18 21:44 ` [PATCH v2 3/3] drm/xe: Trigger queue cleanup if not in wedged mode 2 Matthew Brost
2025-12-18 23:45   ` Summers, Stuart
2025-12-19  1:10     ` Matthew Brost [this message]
2025-12-18 23:08 ` ✓ CI.KUnit: success for Attempt to fixup reset, wedge, unload corner cases Patchwork
2025-12-18 23:44 ` ✓ Xe.CI.BAT: " Patchwork
2025-12-20  1:22 ` ✗ Xe.CI.Full: failure " Patchwork
